No pages for pywb archive in ReplayWeb.page. Also, how to combine and name archive in ReplayWeb.page?

tsj · February 16, 2022, 2:49am

I have a small archiving use case: I just want to archive a few pages now and then.

I installed pywb and followed the docs to create a web archive like so:
wb-manager init my-web-archive

And then captured a page:
wayback --record --live -a --auto-interval 10
Navigate to http://localhost:8080/my-web-archive/record/https://github.com/webrecorder/pywb

I can replay fine with http://localhost:8080/my-web-archive/https://github.com/webrecorder/pywb. I notice that http://localhost:8080/my-web-archive does not show a list of captured pages; I have to know the URL beforehand. I guess I am meant to use ReplayWeb.page.

I have two issues at this point. First, when I capture any web page with the process above and load the warc.gz that pywb generates into ReplayWeb.page, I see no “Pages” defined. (The message is “No Pages are defined in this archive. The archive may be empty. Try browsing by URL.”) I have to browse by URL, which shows all static resources.

The second issue is that every time I capture a new page (or maybe every time I stop and restart wayback) I end up with a new warc.gz. Each one has to be loaded into ReplayWeb.page separately, so I cannot just view a list of all the pages I have recorded; I have to pick a warc.gz.

Is there some way to save all captures with pywb into the same warc.gz? Also, how can I give the WARC a nice name that will appear in ReplayWeb.page, like the “Temporary Collection” demo? As it is, I just pick loaded WARCs by filename.

ilya · February 17, 2022, 7:28am

Hi, all good questions!

Currently, the interoperability between pywb and ReplayWeb.page isn’t very far along. pywb doesn’t yet have the ability to record pages in a way that ReplayWeb.page can detect. This is something that we’d like to add in the future, along with native WACZ support for pywb. The two tools were developed at different times and for different use cases.

One way it could work in the future is that pywb could generate an ‘unzipped WACZ’, which could then be zipped and loaded in ReplayWeb.page.
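
To illustrate the "zip it afterwards" step only, here is a minimal Python sketch. It assumes a directory that has already been laid out as a WACZ package (a datapackage.json plus archive/, indexes/ and pages/ folders); pywb does not produce such a directory today, so the input is hypothetical, and the function name zip_wacz is made up for this example. Entries are stored without ZIP compression on the assumption that the WARCs inside are already gzipped and that a reader benefits from being able to seek into them directly.

```python
import zipfile
from pathlib import Path


def zip_wacz(src_dir, out_path):
    """Zip an already laid-out (hypothetical) WACZ directory into one file.

    Assumption: src_dir already contains datapackage.json, archive/,
    indexes/ and pages/. This function does not create or validate them.
    """
    src = Path(src_dir)
    out = Path(out_path)
    # ZIP_STORED = no extra compression; the WARCs are assumed to be gzipped already.
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(src.rglob("*")):
            # Skip directories, and skip the output file if it lives inside src_dir.
            if path.is_file() and path.resolve() != out.resolve():
                zf.write(path, arcname=path.relative_to(src).as_posix())
    return out
```

Something like zip_wacz("unzipped-wacz", "my-web-archive.wacz") would then give a single file to try loading in ReplayWeb.page. The sketch does not check that the directory contents are a valid WACZ.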

If you’re just trying to archive a few pages manually, you should try the ArchiveWeb.page extension or app, which should work nicely with ReplayWeb.page. If you’d like to have WARCs, you can still export a WARC (or WACZ) from ArchiveWeb.page periodically. I think that should work well for your use case of occasional archiving.
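
On the earlier question of getting all captures into a single warc.gz: gzipped WARC files can generally be concatenated byte for byte, because a file made of several gzip members is itself a valid gzip file. The following is a minimal sketch under that assumption, not a pywb feature. The function name combine_warcs is hypothetical, and the directory in the usage note assumes pywb's usual collections/<name>/archive layout. It does not add the page list that ReplayWeb.page looks for, so the combined file would still have to be browsed by URL.

```python
import shutil
from pathlib import Path


def combine_warcs(archive_dir, out_path):
    """Concatenate every *.warc.gz in archive_dir into a single file.

    Assumption: the inputs are ordinary gzipped WARCs, so appending
    them one after another yields a valid multi-member gzip file.
    """
    src = Path(archive_dir)
    out = Path(out_path)
    # Collect inputs before opening the output, and never read the output itself.
    parts = sorted(p for p in src.glob("*.warc.gz") if p.resolve() != out.resolve())
    with open(out, "wb") as dst:
        for part in parts:
            with open(part, "rb") as f:
                shutil.copyfileobj(f, dst)  # raw byte copy, no recompression
    return parts
```

For the collection above this might be called as combine_warcs("collections/my-web-archive/archive", "combined.warc.gz"), writing the output outside the collection directory so pywb does not index the same records twice.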