I have been archiving a large website using Browsertrix. The job finished after 3 days and successfully created a 440 GB WACZ file from 425 WARC files. I would like to open and review it, so I tried the Webrecorder desktop app, but so far it fails to load: after about 5 minutes it just shows a blank screen.
I also tried uploading it to a locally deployed Browsertrix Cloud instance, but once the upload reaches 440 of 440 GB nothing else happens. It just gets stuck there.
Both methods worked perfectly for a 14 GB test WACZ of that website generated with the same configuration.
Any ideas how I could handle such a large file?
On our server we set the crawler to create a new WACZ file after crawling 10GB of content to avoid these issues. Multiple WACZs can then be loaded with a replay.json file in ReplayWeb.page.
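For anyone wanting to try the multi-WACZ route, here is a rough Python sketch that writes such a JSON file for a folder of WACZs. The field names (`profile`, `resources`, `name`, `path`, `bytes`) follow my understanding of the multi-WACZ package layout and should be checked against the current ReplayWeb.page documentation. `build_manifest`, the folder name, and the base URL are hypothetical placeholders.

```python
import json
from pathlib import Path


def build_manifest(wacz_dir, base_url):
    """List every .wacz file in wacz_dir as a resource (assumed multi-WACZ layout)."""
    resources = []
    for wacz in sorted(Path(wacz_dir).glob("*.wacz")):
        resources.append({
            "name": wacz.name,
            # Assumption: each WACZ is served from base_url under its file name.
            "path": base_url.rstrip("/") + "/" + wacz.name,
            "bytes": wacz.stat().st_size,
        })
    return {"profile": "multi-wacz-package", "resources": resources}


def write_manifest(wacz_dir, base_url, out_file="replay.json"):
    """Write the manifest next to the WACZ files and return its path."""
    out_path = Path(wacz_dir) / out_file
    out_path.write_text(json.dumps(build_manifest(wacz_dir, base_url), indent=2))
    return out_path
```

Calling `write_manifest("wacz-parts", "https://example.org/archive")` would produce a `replay.json` listing each part, which can then be given to ReplayWeb.page in place of a single WACZ.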
If you open the WACZ file, extract the WARCs, and repackage them with py-wacz into a number of smaller WACZ files, that may get around some of the issues here. As for exactly what those issues are, I'm unsure.
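A minimal sketch of the extraction step, assuming the WARCs sit under `archive/` inside the WACZ (which is a ZIP file) and that there is enough free disk space for a second copy of the data. The function names and the batch size are hypothetical. The copy is streamed, so the 440 GB file is never read into memory.

```python
import shutil
import zipfile
from pathlib import Path


def extract_warcs(wacz_path, out_dir):
    """Copy every WARC under archive/ out of a WACZ into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(wacz_path) as zf:
        for info in zf.infolist():
            name = info.filename
            if info.is_dir() or not name.startswith("archive/"):
                continue
            if not name.endswith((".warc", ".warc.gz")):
                continue
            target = out_dir / Path(name).name
            # Stream in 16 MiB chunks rather than loading a whole WARC at once.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)
            extracted.append(target)
    return extracted


def batch_by_size(paths, max_bytes):
    """Group files greedily, in order, starting a new batch once max_bytes would be exceeded."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = Path(path).stat().st_size
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch could then be handed to py-wacz, for example `wacz create -o part-001.wacz -f <the WARCs in that batch>`; check `wacz create --help` for the options your installed version supports. A single WARC larger than `max_bytes` ends up in a batch of its own.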
I think I’d file the uploading issue as a bug report. If files that big are out of scope, we should impose a size limit.