py-wacz to ReplayWeb.page issues

lawork · July 21, 2022, 9:58pm

Hi all,

I’m attempting to use the py-wacz command line utility to create a single WACZ file from around 70 WARC files, which were originally created with Webrecorder technology around 2020. The goal is for this WACZ to be hosted in ReplayWeb.page for access and use.

I’m running into a few issues, first in the form of a few py-wacz utility errors, and then in ReplayWeb.page playback. I wanted to present them both here in case they are connected.

I tested a handful of WARCs to start (4 or 5), going from py-wacz to ReplayWeb.page (just loading them locally from my machine). The simple py-wacz commands worked without error, and a WACZ was created that I loaded into ReplayWeb.page. Before these materials would load, I did notice a message that looked like this: [screenshot: Screen Shot 2022-07-21 at 5.46.39 PM]

However, when I tried to scale up and load 50-70 WARCs, I started to run into issues in py-wacz, which I thought might have stemmed from the WARCs themselves. The errors that were thrown typically followed this pattern (I’ve used ellipses where huge blocks of text followed, except for the Error:\n):

Error parsing: {"action_group_id":"8f1e0c32-5f55-d853-d9f2-bd7a60f15fff","version":"3.11.11-generic","xkey":"cvYGNal5xGaRZYDiS2z80aL3JULuLgSOakuDmdmt"…

Error:\n

Error: Unexpected coder setup failure:\nfunction(){var…

Even with these errors, py-wacz still created a WACZ file of over 2 GB from my 70 or so WARCs, so I tried to load it into ReplayWeb.page. There, I ran into other issues that may or may not have stemmed from the py-wacz errors. Even though most pages showed as properly indexed, clicking on them to view showed a message that the page was not part of the archive (screenshot below):

This was the case for every page I tried to navigate to within the archive. I also tried to search and load specific URLs within the WACZ, but was met with the same result.

Thanks very much for any insight you can provide.

edsu · July 24, 2022, 11:12am

Hi @lawork, is there any chance the WARC files you are bundling together are available, so I can try to reproduce the py-wacz errors? If not, you can email me at ehs@pobox.com if you want to share them that way.

lawork · July 25, 2022, 2:45pm

Thanks so much @edsu, I’ll send you an email.

edsu · July 25, 2022, 5:05pm

I saw similar errors when running wacz create -o uva.wacz UVA_UTRCR_WARConly/*.warc with the directory of WARC files you supplied.

I narrowed them down to these WARCs:

Reddit_6tj249_20200323.warc
newsweek_idahorepublican_20200325-20200408184049.warc
newsweek_alexjonescvilleviolencefalseflag_20200326-20200408192303.warc
beacon_broadside_20200406-20200413192748.warc

Once I regenerated the WACZ without these four I could view the content in ReplayWeb.page.
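For anyone wanting to repeat the narrowing-down step, one option is to run wacz create on each WARC separately and note which runs print errors. The following is only a minimal sketch, not necessarily how the four files above were found. It assumes the wacz command is on the PATH, that it accepts a single WARC in the same form as the command above, and that its errors show up in stdout or stderr containing the word "Error" (as in the messages quoted earlier in the thread). The helper names run_wacz_create and find_noisy_warcs are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def run_wacz_create(warc_path):
    """Run `wacz create` on a single WARC and return its combined output text.

    Assumption: the CLI form `wacz create -o OUT.wacz FILE.warc` from this
    thread also works with exactly one input file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "probe.wacz"  # throwaway output, deleted afterwards
        result = subprocess.run(
            ["wacz", "create", "-o", str(out), str(warc_path)],
            capture_output=True,
            text=True,
        )
    return result.stdout + result.stderr


def find_noisy_warcs(warc_paths, run=run_wacz_create, marker="Error"):
    """Return the paths whose `wacz create` output contains `marker`.

    Assumption: problem files can be recognised by the word "Error" in the
    output; the exit code is ignored because a WACZ was still produced
    despite the errors reported in this thread.
    """
    return [path for path in warc_paths if marker in run(path)]


# Example call (directory name taken from the command above):
# suspects = find_noisy_warcs(sorted(Path("UVA_UTRCR_WARConly").glob("*.warc")))
# print(suspects)
```

Running one file at a time is slow for large collections, but it keeps each error message tied to a single WARC, which is the useful part here.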

Do you remember how you created the four WARC files? I seem to remember @ilya mentioning that there was a bug in Webrecorder Desktop and/or ArchiveWeb.page that was causing incorrect Content-Length headers to be written to the WARC file. It looks like these errors concern trying to parse truncated JSON, which makes me think that could be the problem here. If you still have the archived content and can re-export the WARC files, that might fix the problem.
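To check whether the Content-Length theory holds for a given file, something along these lines could compare each response's declared HTTP Content-Length with the number of payload bytes actually stored in the record. This is a sketch under assumptions: that the bad header is the HTTP one inside the record rather than the WARC record's own Content-Length, and that warcio is installed (pip install warcio). declared_vs_actual and find_mismatches are hypothetical helper names; ArchiveIterator, rec_type, rec_headers, http_headers, raw_stream and get_header come from warcio.

```python
def declared_vs_actual(warc_path):
    """Yield (url, declared_length, stored_length) for each HTTP response record."""
    # warcio is third-party; imported here so the helper below works without it.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Skip requests, metadata, and responses without parsed HTTP headers.
            if record.rec_type != "response" or record.http_headers is None:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            declared = record.http_headers.get_header("Content-Length")
            # raw_stream yields the body bytes as stored, after the HTTP headers.
            stored = len(record.raw_stream.read())
            yield url, declared, stored


def find_mismatches(rows):
    """Return the rows whose declared Content-Length differs from the stored size."""
    bad = []
    for url, declared, stored in rows:
        if declared is None:
            continue  # no Content-Length header to compare against
        try:
            declared_len = int(declared)
        except ValueError:
            bad.append((url, declared, stored))  # header is not a number
            continue
        if declared_len != stored:
            bad.append((url, declared, stored))
    return bad


# Example call on one of the files listed above:
# for row in find_mismatches(declared_vs_actual("Reddit_6tj249_20200323.warc")):
#     print(row)
```

A mismatch is a hint rather than proof. Some capture tools may legitimately store a body whose length differs from the original header, so any hits would still need a closer look.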

lawork · July 25, 2022, 6:47pm

Thanks so much, @edsu. There is a good chance that those four WARCs you pinpointed were collected with Webrecorder Desktop, as the work was distributed across a small team several years ago, and we uploaded WARCs to a shared location. It was long enough ago that I don’t think any were collected with ArchiveWeb.page. If it turns out these files were created in someone else’s account (team members were a now-graduated student and now-retired staff member) and I can’t re-export the WARC files, are there any other steps I could potentially take to address truncated JSON? (Other than just leaving those WARCs out of the collection).

edsu · July 25, 2022, 7:09pm

I’m not sure. In theory, I think the Content-Length headers could be corrected in the WARC data. @ilya, what do you think? Would the warcio library be helpful for this?
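If the Content-Length theory is right, a repair with warcio might look roughly like the sketch below. It has not been tried against the four WARCs in this thread, and it rests on assumptions: that the wrong header is the HTTP Content-Length inside response records, and that rebuilding those records is acceptable. It can only make the header agree with the bytes that were stored; it cannot restore bytes that were never written. The names fix_content_length and rewrite_warc are hypothetical. Rebuilt records keep only their WARC-Record-ID and WARC-Date, so other WARC headers on those records are dropped and warcio recomputes lengths and digests.

```python
from io import BytesIO


def fix_content_length(http_headers, payload):
    """Make an existing Content-Length header match len(payload).

    `http_headers` needs get_header/replace_header, which warcio's
    StatusAndHeaders provides. Returns True if the header was changed.
    """
    declared = http_headers.get_header("Content-Length")
    actual = str(len(payload))
    if declared is None or declared.strip() == actual:
        return False
    http_headers.replace_header("Content-Length", actual)
    return True


def rewrite_warc(src_path, dst_path):
    """Copy a WARC, rebuilding response records with corrected headers.

    Returns the number of records whose Content-Length was changed.
    """
    # Third-party dependency, imported lazily: pip install warcio
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    fixed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=str(dst_path).endswith(".gz"))
        for record in ArchiveIterator(src):
            if record.rec_type == "response" and record.http_headers is not None:
                payload = record.raw_stream.read()
                if fix_content_length(record.http_headers, payload):
                    fixed += 1
                # Carry over only the identifying WARC headers (a simplification).
                keep = {}
                for name in ("WARC-Record-ID", "WARC-Date"):
                    value = record.rec_headers.get_header(name)
                    if value:
                        keep[name] = value
                # A fresh record lets warcio recompute length and digests.
                record = writer.create_warc_record(
                    record.rec_headers.get_header("WARC-Target-URI"),
                    "response",
                    payload=BytesIO(payload),
                    http_headers=record.http_headers,
                    warc_headers_dict=keep,
                )
            writer.write_record(record)
    return fixed


# Example call, writing to a new file and leaving the original untouched:
# print(rewrite_warc("beacon_broadside_20200406-20200413192748.warc", "beacon_fixed.warc"))
```

Writing to a new file and keeping the original WARC is the safer choice here, since the rewritten copy is a modified capture rather than the original one.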

lawork · August 3, 2022, 5:17pm

Thanks @edsu. For now, I’m going with removing the four files for our current replay needs, and it looks like everything is up and running well. Thanks again for your help!