Hi everyone,
I have been archiving a large website using Browsertrix. It finished the job after 3 days, successfully creating a 440 GB WACZ file from 425 WARC files. I would like to open and review it somehow and tried using the Webrecorder desktop app, but so far I have failed to load it - it just gives me a blank screen when I try to load it (after about 5 minutes).
I tried to upload it to a locally deployed Browsertrix Cloud instance, but after the upload reaches 440 out of 440 GB nothing else happens - it just gets stuck there.
Both methods worked perfectly for a 14 GB test WACZ of that website generated with the same configuration.
Any ideas how I could handle such a large file?
On our server we set the crawler to create a new WACZ file after crawling 10GB of content to avoid these issues. Multiple WACZs can then be loaded with a replay.json file in ReplayWeb.page.
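For reference, a replay.json is just a JSON file listing the WACZ files to load together. A minimal sketch of what one might look like is below - the field names are from memory and the example.com paths are placeholders, so double-check against the ReplayWeb.page docs before relying on it:

```json
{
  "resources": [
    { "name": "crawl-part-1", "path": "https://example.com/warcs/crawl-part-1.wacz" },
    { "name": "crawl-part-2", "path": "https://example.com/warcs/crawl-part-2.wacz" }
  ]
}
```

You then load the replay.json in ReplayWeb.page the same way you would load a single WACZ.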
If you open the WACZ file, extract the WARCs, and repackage them with py-wacz into a set of smaller WACZ files, that might get around some of the issues here? As for what exactly those issues are, I'm not sure.
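Very roughly, and with the caveat that the filenames below are placeholders and the py-wacz options are from memory (check `wacz create --help`), the repackaging could look something like this:

```
# A WACZ is a ZIP, so the WARCs live under archive/ inside it.
unzip big-crawl.wacz -d big-crawl/

# Repackage batches of WARCs into smaller WACZ files with py-wacz.
pip install wacz
wacz create -o crawl-part-1.wacz big-crawl/archive/rec-000*.warc.gz
wacz create -o crawl-part-2.wacz big-crawl/archive/rec-001*.warc.gz
```

The resulting parts could then be listed in a replay.json as described above.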
I think I’d file the upload issue as a bug report? If files that big are out of scope, we should impose a size limit.
Thanks for the reply - I tried your method of loading a replay.json and it worked like a charm!
Hooray! I hope to have this documented better in the coming weeks
If I use Autopilot to crawl an Instagram profile that has 20K posts (30% video and 70% photo), it will definitely be more than 10 GB in size. So where does the data get saved as it crawls live? Does it get saved immediately to the chosen storage directory, or does it stay in the browser and in memory? If it stays in memory and the browser, the browser will crash because of the size. How do I solve this?
I believe that the data is saved as it is crawled. But I’ve noticed that infinite scroll interfaces do result in a large page DOM that the browser needs to maintain, which can present problems.
Hey, thanks for the response. Is it better to use it as the Google Chrome extension or as the standalone app?
Which one is better?