Error Creating WACZ

apeironn · August 20, 2024, 12:51am

Hello!

I’ve downloaded a rather large website using Browsertrix, and I’m now trying to generate the WACZ with this command:
wacz create -o C:/browsertrix_data/collections/test/archive/myfile.wacz C:/browsertrix_data/collections/test/archive/rec-6967c4b1298a-20240819171531218-13.warc.gz C:/browsertrix_data/collections/test/archive/rec-6967c4b1298a-20240819171531223-2.warc.gz
and so on, with the list of all my WARC files.

At some point the generation fails with this error:
zipfile.BadZipFile: File name in directory 'archive\rec-6967c4b1298a-20240819171531218-13.warc.gz' and header b'archive/rec-6967c4b1298a-20240819171531218-13.warc.gz' differ.

I tried recrawling the website and I still get the error, but on different files.
The files extract fine with 7z.
I’m not sure why this is happening. Is there a workaround or a fix?
Thanks in advance!

tessa-webrecorder · August 26, 2024, 3:32pm

Hi, it looks like you’ve hit this known issue with our py-wacz library on Windows: zipfile.BadZipFile error during wacz creation from warc file - Windows only · Issue #18 · webrecorder/py-wacz · GitHub.

We haven’t had time to prioritize looking into this so far, but quite a few people seem to have run into it, so I’ll see if I can take a look in the coming weeks.

In the meantime, you could:

- recrawl the site using Browsertrix Crawler with the --generateWACZ flag, which will generate the WACZ within the crawling Docker container and won’t have the same issue; or
- try using the js-wacz library maintained by Harvard LIL to create your WACZ file from the existing WARCs: GitHub - harvard-lil/js-wacz: JavaScript module and CLI tool for working with web archive data using the WACZ format specification.
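
For context on the error text itself: the two names it compares differ only in the path separator ('archive\...' versus 'archive/...'). ZIP member names are expected to use forward slashes, while Windows path joining produces backslashes. The sketch below only illustrates that mismatch and one way to build a forward-slash member name. It is a reading of the error message, not a confirmed diagnosis or the actual py-wacz code, and the helper name zip_member_name is hypothetical.

```python
import io
import ntpath      # Windows path rules, importable on any OS
import posixpath   # forward-slash path rules
import zipfile


def zip_member_name(*parts: str) -> str:
    """Hypothetical helper: join parts with '/', as ZIP member names expect."""
    return posixpath.join(*parts)


warc = "rec-6967c4b1298a-20240819171531218-13.warc.gz"

# What os.path.join would return on Windows (backslash separator).
windows_style = ntpath.join("archive", warc)

# The forward-slash form that matches the header name in the error message.
zip_style = zip_member_name("archive", warc)

# Write a tiny in-memory ZIP using the forward-slash name, then read it back.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(zip_style, b"placeholder bytes, not a real WARC")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    data = zf.read(zip_style)

print(windows_style)
print(zip_style)
```

The two printed names differ only in the separator, which is the same difference the BadZipFile message reports. Both workarounds listed above avoid building the WACZ with py-wacz on Windows.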