Continued problems with warcit and wacz

DHK · September 2, 2023, 8:26pm

I’ve finally gotten warcit to create warc files. The attached Goldenage warc seems to work fine loading the URLs, although no pages seem to have been defined. The WTICAlumni warc won’t even load any of the pages it finds. Also the resultant wacz doesn’t show anything even when I used a --detect-pages option. Since I can’t upload files here, a small 780K file is at https://www.wticalumni.com/warc/Warc-Wacz_Problems.zip
Thanks.

Hank · September 20, 2023, 9:08pm

Hey! I’ve also run into this issue on a personal archiving project. The way that warcit creates records isn’t completely compatible with py-wacz or ReplayWeb.page right now. The WARC file should work in ReplayWeb.page if you manually navigate to the correct URL, but that URL won’t show up in the pages list.

Ultimately, I think warcit needs to be updated? Unfortunately this is a very low priority task for the team right now as we are focused on getting Browsertrix out the door!

I know we’ll get there eventually, but it may take us a little bit of time. :\

DHK · September 21, 2023, 12:31am

Thanks for the reply. At least it’s not me : ) I needed it for a test project with a very small wacz file, so I had been hoping to get my warcs combined into a wacz. I need a wacz because I have a combined wacz that links two related websites… and the calls from one site to the other work great… but I don’t want people to have to download a 25G warc file!

I’m working with the state of Connecticut (USA) Digital Archive who are interested in hosting the wacz file, but had hoped to test with a small file first… so I gave them a small single file to test with, instead of a small combined file. We’ll see how the project goes.

Hank · September 29, 2023, 1:26am

If you need a smaller file as a test, feel free to use this WACZ of browsertrix.cloud (192KB)

Fun fact (this isn’t well documented yet; I plan to do a pass on the ReplayWeb.page docs in the nearish future): ReplayWeb.page, when embedded rather than used through the UI on the site, supports loading multiple WACZ files! You don’t actually have to rip them apart and re-create them (though that can be nice for portability).

You can do this by passing a JSON file to the sourceurl of ReplayWeb.page using the following spec (this should work):

{
  "name": "Collection Name",
  "description": "Description of all files for curation reasons",
  "modified": "2023-09-23T04:57:33",
  "crawlCount": 2,
  "tags": ["collectiontag1", "collectiontag2"],
  "resources": [
    {
      "name": "archive1.wacz",
      "path": "path/to/archive1.wacz",
      "hash": "e511cc962b156a37d3c7546d8e0533b31e7b49f8ef902f5cdd0c0e093a10522f"
    },
    {
      "name": "archive2.wacz",
      "path": "path/to/archive2.wacz",
      "hash": "caac5f6789d7f112a2717574f276861cf583db2eb145ac67d0c1d9b4f2713b1a",
    }
  ]
}

I believe the file hash, description (though encouraged!), tags, and modified timestamp are all optional, but I haven’t confirmed that.

Note that since this is undocumented, some things are subject to change. I’ve also omitted some of the fields we fill in automatically in Browsertrix, so if you try it out with only these fields, LMK if it works for you!
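
For anyone who would rather generate that JSON than write it by hand, here is a minimal Python sketch using only the standard library. It rests on several assumptions that the thread does not confirm: that "hash" is the hex SHA-256 digest of each WACZ file (the 64-character values above are consistent with that, but it is a guess), that "crawlCount" can simply be the number of files listed, and that "modified" is a UTC timestamp in the format shown above. The function names are hypothetical and not part of Browsertrix, py-wacz, or ReplayWeb.page.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_collection(name, wacz_paths, description=None, tags=None):
    """Build a dict shaped like the multi-WACZ JSON shown above.

    Assumptions (not confirmed anywhere in this thread):
    - "hash" is the SHA-256 of the WACZ file
    - "crawlCount" equals the number of files listed
    - "path" is whatever path or URL the viewer should fetch; here it is
      just the path that was passed in
    """
    resources = []
    for p in map(Path, wacz_paths):
        resources.append(
            {"name": p.name, "path": p.as_posix(), "hash": sha256_of(p)}
        )

    collection = {
        "name": name,
        "modified": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"),
        "crawlCount": len(resources),
        "resources": resources,
    }
    # Only include the fields described as probably optional when provided.
    if description is not None:
        collection["description"] = description
    if tags:
        collection["tags"] = list(tags)
    return collection


def write_collection(collection, out_path):
    """Serialize with json.dump, which never emits trailing commas."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(collection, fh, indent=2)
```

Calling `write_collection(build_collection("Collection Name", ["path/to/archive1.wacz", "path/to/archive2.wacz"]), "collection.json")` would write a file in the shape above, assuming both WACZ files exist at those paths. Using `json.dump` instead of hand-editing also avoids syntax slips that a strict JSON parser would reject.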

DHK · September 29, 2023, 2:25am

Thanks @Hank. I solved my tiny wacz file need by just downloading a random small website. My two warc files when combined in a wacz are around 25G or so… The interesting thing is that they have calls to each other that stay within the wacz “ecosystem” and don’t go out to the real world!

Hank · September 29, 2023, 2:26am

Ah yeah, the above is meant for if you have multiple WACZ files already… But hey, bonus knowledge!

Hank · November 7, 2023, 5:12am

@DHK and for anyone who might run across this in the future… There’s an issue filed for this now!

No promises on when this will be addressed, but if you’re a GitHub user and you’d like to be notified when this issue is eventually closed, you can do so by subscribing in the sidebar.