Completed archive with Chrome extension

JackMc · July 5, 2021, 7:23pm

I’ve completed the capture I’ve mentioned in various other threads about a one-year course I took, and tested it in Google Chrome. It works. Along the way I saved various .wacz files. The most recent one is about 5 GB. Prior to that I had a 1 GB one (lots of multimedia added since then). The 1 GB one works. The 5 GB one lists zero URLs in https://replayweb.page/. So as far as I can tell there is no way to access the month or so of capturing I did when going back to .wacz. I do see all the data, and can access it in the Chrome extension, but would much prefer it backed up and saved to .wacz.

Any reason I see no URLs in the larger .wacz? But the older, smaller, .wacz works just fine.

JackMc · July 5, 2021, 8:07pm

In case it is not clear the archive was done with the extension ArchiveWeb.page and the version is 0.6.9.

Have tried .warc but it is taking a long time to reload (not useable at around 5 GB - on a Dell XPS from 2017 with Intel(R) Core™ i7-7700HQ CPU @ 2.80GHz and 16 GB of RAM, generally fast enough to still surf the web just fine).

ilya · July 7, 2021, 3:48am

That’s great that you were able to finish it!

Yes, so unfortunately, currently the extension doesn’t produce valid WACZ files if they exceed 4GB. Probably should make this more clear as well, or warn about this. The plan is to fix this via this issue: Switch to zip.js to have zip64 support · Issue #36 · webrecorder/archiveweb.page · GitHub

One option is to download as WARC, then use the py-wacz tools to create a WACZ, which does support WACZ files >4GB. Another option: if you have several WACZ files, let me know and I can help you combine them into one (also using py-wacz, though there is not yet an automated way to do it).
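As a rough way to tell whether a given WACZ is likely affected, the file size and ZIP directory can be checked locally. The sketch below uses only Python's standard library; the function name `inspect_wacz` and the 4 GiB constant are illustrative assumptions, not part of py-wacz or the extension, and a file that passes this check is not guaranteed to replay.

```python
import os
import zipfile

# Largest value a classic (non-zip64) ZIP size/offset field can hold, about 4 GiB.
# Assumption: used here only as a rough threshold for "may need zip64".
ZIP32_LIMIT = 0xFFFFFFFF


def inspect_wacz(path):
    """Return (size_bytes, over_limit, entry_names).

    entry_names is None when Python cannot read the ZIP directory at all.
    """
    size = os.path.getsize(path)
    over_limit = size > ZIP32_LIMIT
    try:
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        names = None
    return size, over_limit, names
```

Calling `inspect_wacz("my-archive.wacz")` (a placeholder file name) on both the small and the large file should show quickly whether the large one is over the threshold and whether its entries can be listed.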

JackMc · July 8, 2021, 6:17am

Well that is good news that it sounds like I have an archive somewhere slightly hidden on my Windows machine. I think it is in here \AppData\Local\Google\Chrome\User Data\Default\IndexedDB

Not sure if I need it, but I downloaded Anaconda (python 3.8), then added git from the Anaconda Navigator, then git cloned https://github.com/webrecorder/wacz-format.git . Then I followed the instructions here wacz-format/README.md at main · webrecorder/wacz-format · GitHub . The install seemed to work but there were a few warnings and then a bunch of ‘best matches’.

Generating the WACZ resulted in hundreds of lines that said: “Expecting value: line 1 column 1 (char 0)” and the final 3 lines said:
File "C:\Users\username\anaconda3\lib\zipfile.py", line 1556, in open
    raise BadZipFile(
zipfile.BadZipFile: File name in directory 'archive\archivename.warc' and header b'archive/archivename.warc' differ.

A 5 GB WACZ was created but it doesn’t work in ReplayWeb.page.
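The two names in that error message differ only in the direction of the slash, and ZIP entry names are expected to use forward slashes regardless of operating system. As an illustration only (this is not py-wacz code and not a confirmed fix for the error above), a Windows-style relative path can be converted to a forward-slash entry name like this; `to_zip_entry_name` is a hypothetical helper name.

```python
from pathlib import PureWindowsPath


def to_zip_entry_name(windows_style_path):
    """Convert a Windows-style relative path to a forward-slash ZIP entry name."""
    # PureWindowsPath parses backslashes on any OS; as_posix() emits forward slashes.
    return PureWindowsPath(windows_style_path).as_posix()


print(to_zip_entry_name("archive\\archivename.warc"))  # archive/archivename.warc
```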

Unfortunately my WACZs all overlap. It is simply an expanding archive over time, as I made a few backups during the process. No need to combine.

ilya · July 15, 2021, 10:11pm

Yes, the data is in indexeddb, but not easily accessible from there.

Would you mind sharing the raw WARC that you’ve downloaded?

Looking at some issues in py-wacz tools, hopefully will be a simple fix.

I was suggesting you download some pages from the extension to avoid exceeding the 4GB limit currently. We should be able to create one WACZ that has everything though!

JackMc · July 18, 2021, 2:09pm

Thanks for being willing to look at my WARC! The first GB of archiving was the most important, and I do have a separate WACZ for that. The next 4 GB was easier to archive as it was mostly media, with fewer links to click taking up that space. My plan is to redo a separate archive for that second tranche. If I really, really need to have one WACZ I’ll see about trying to get you the raw WARC; for now I’ll plan on having multiple ones and staying below the limit. Thank you, and thank you for these archiving tools and your forum.