How to extract WARC files

TripleCamera · December 30, 2024, 2:14pm

Hi. I have created a WARC file using the ArchiveWeb.page browser extension. Now I want to extract all the files while preserving the original directory structure.

I have read earlier posts and tried both unwarcit and warcat. Unwarcit categorizes the files and places them under different folders, which isn’t what I want. Warcat hasn’t been updated for a long time, and it has a strange bug: some of the files in the WARC are not extracted, and no error or warning is reported. Is there an alternative to warcat?

edsu · January 2, 2025, 1:43pm

It looks like the author of warcat has moved on to focus on a Rust version?

Depending on what you are trying to do with the extracted files, maybe it will work better, and (hopefully) the author might be more responsive to bug reports?

TripleCamera · January 5, 2025, 10:52am

Indeed. I just tried it out. It worked much better. However, the issue persists. I am investigating why…

TripleCamera · January 9, 2025, 2:55pm

I found out the reason: when there are multiple responses with the same URL path but different parameters (e.g. /path/to/api?param=1 and /path/to/api?param=2), ArchiveWeb.page saves only one response and marks the others as “revisits” to save space. Warcat (and warcat-rs) ignores revisits, so many of the duplicated responses are not extracted.

Update: I just created a feature request for warcat-rs.
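
For anyone hitting the same problem, here is a rough sketch of how revisits could be resolved during extraction. It is not how warcat or warcat-rs works internally. It assumes the records have already been parsed by some WARC reader into a simplified, hypothetical `Record` type, and that each revisit can be matched to its original response either through WARC-Refers-To-Target-URI or through a shared WARC-Payload-Digest. The names `Record` and `resolve_revisits` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass
class Record:
    """Hypothetical, simplified stand-in for a parsed WARC record."""
    rec_type: str                         # WARC-Type: "response" or "revisit"
    target_uri: str                       # WARC-Target-URI
    payload_digest: Optional[str] = None  # WARC-Payload-Digest
    payload: bytes = b""                  # body; empty for revisits
    refers_to_uri: Optional[str] = None   # WARC-Refers-To-Target-URI


def resolve_revisits(records: Iterable[Record]) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, payload) for every response and every resolvable revisit."""
    records = list(records)

    # First pass: index the payloads of real responses by URL and by digest.
    by_uri: Dict[str, bytes] = {}
    by_digest: Dict[str, bytes] = {}
    for rec in records:
        if rec.rec_type == "response":
            by_uri.setdefault(rec.target_uri, rec.payload)
            if rec.payload_digest:
                by_digest.setdefault(rec.payload_digest, rec.payload)

    # Second pass: emit responses as-is and fill in revisits from the index.
    for rec in records:
        if rec.rec_type == "response":
            yield rec.target_uri, rec.payload
        elif rec.rec_type == "revisit":
            payload = by_uri.get(rec.refers_to_uri) if rec.refers_to_uri else None
            if payload is None and rec.payload_digest:
                payload = by_digest.get(rec.payload_digest)
            if payload is not None:
                # Unresolvable revisits are skipped silently in this sketch.
                yield rec.target_uri, payload


# Example with made-up URLs and digests.
records = [
    Record("response", "https://example.com/path/to/api?param=1",
           "sha1:AAA", b'{"ok": true}'),
    Record("revisit", "https://example.com/path/to/api?param=2",
           "sha1:AAA",
           refers_to_uri="https://example.com/path/to/api?param=1"),
]
extracted = dict(resolve_revisits(records))
```

With this, both the `?param=1` and `?param=2` URLs end up in `extracted` with the same body, so each could then be written to its own file. How the query string maps to a file name is a separate decision that this sketch does not cover.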