Hi, I saw in some posts on the forum that a WACZ file should be less than 4GB.
I have 4.15GB of archival data. I tried to delete some pages, but for some reason the data went up to 4.25GB.
I was still able to download the WACZ file with no errors. Is the WACZ file safe, or is it better to restart the archive?
Your post terrified me: I regularly combine a 28GB warc.gz file with a 9GB warc.gz file to create a 37GB WACZ file, and have never had problems. I never knew of a 4GB limit, and apparently my computers haven't either.
No idea where you found this, but it's incorrect. WACZ files can be any size. In Browsertrix they are saved in 10GB increments while crawling, but that's really just a resilience precaution / convenience (downloading one giant file in a web browser is harder than downloading many smaller files).
Interesting. I found the information here in the forum.
Maybe the information is too old and the tool has since been updated?
But as a point of fact: I archived 35–45GB, tried to download it, and the download failed several times.
Anyway, thanks for the information. It seems like 4–5GB downloads fine with no errors.
Oh yeah, that’s an old bug! Fixed long ago.
From Browsertrix? As I mentioned, downloading larger files in a web browser can have issues… If your connection drops out, web browsers aren't great at resuming. You might consider downloading the individual ~10GB WACZ files available in the Files tab of your archived item. Granted, the process for loading these into ReplayWeb.page via a JSON file is a little under-documented right now.
Oh, thanks!
I don't know anything about scripting, and the JSON file might be too clever for me, but I'll try and see if I can handle it.
While ReplayWeb.page accepts a JSON file, you may also have an easier time combining these into a multi-WACZ file.
If you download any collection containing multiple crawls/uploads and unzip the WACZ with any unarchiving program (Archive Utility, Keka, The Unarchiver, etc. on macOS, or 7-Zip on Windows), you should find a similar JSON file within that can serve as a reference! Mirror the directory structure with your downloaded files, list the new files in the JSON where the previous ones were, and zip it back up!
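If hand-editing and re-zipping feels fiddly, the same steps can be scripted. Below is only a rough Python sketch, not an official tool: the directory name (archive/), the manifest name (datapackage.json), and the resource fields (name/path/hash/bytes) are assumptions based on what the reference JSON tends to look like, so mirror whatever your unzipped collection actually contains.

```python
import hashlib
import json
import zipfile
from pathlib import Path

# Assumed layout, mirroring an unzipped collection download
# (check your own unzipped copy -- names may differ):
#   my-collection/
#       datapackage.json   <- the reference JSON manifest
#       archive/
#           crawl-1.wacz   <- the individual WACZ files you downloaded
#           crawl-2.wacz
root = Path("my-collection")
manifest_path = root / "datapackage.json"

def file_hash(path: Path) -> str:
    """sha256 digest in the "sha256:<hex>" style seen in the reference manifest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

# Rewrite the manifest's resource list so it points at the new files.
# The field names here are copied from the reference manifest; keep
# whatever fields yours actually uses.
manifest = json.loads(manifest_path.read_text())
wacz_files = sorted((root / "archive").glob("*.wacz"))
manifest["resources"] = [
    {
        "name": f.name,
        "path": f"archive/{f.name}",
        "hash": file_hash(f),
        "bytes": f.stat().st_size,
    }
    for f in wacz_files
]
manifest_path.write_text(json.dumps(manifest, indent=2))

# Zip it back up with the same relative paths. No compression, since
# the WACZ members are already compressed internally.
with zipfile.ZipFile("combined.wacz", "w", zipfile.ZIP_STORED) as zf:
    for f in [manifest_path, *wacz_files]:
        zf.write(f, arcname=f.relative_to(root))
```

Either way, load the result into ReplayWeb.page and confirm it replays before deleting anything from Browsertrix.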