I have a problem with the page index in a WARC file

martinus7 · January 20, 2024, 7:24pm

The WARC file is generated using “Grab-site”, which uses wpull.

Browsertrix Crawler generates a WARC file several times larger than “Grab-site”. That’s why I use “Grab-site”.

ReplayWeb.Pages reads the WARC file, but the “Pages” tab is empty.

I have three questions:

Is there any tool that will generate an appropriate page index based on the WARC file, which will be visible in ReplayWeb.Pages?
Is it possible to search the entire WARC content in full text via ReplayWeb.Pages?
Can WARC files be searched in full text mode?
How can I create a WARC file with this capability, that is, with such a built-in index?

Please advise what tools can be used for this purpose.

Regards
Martin

Hank · January 26, 2024, 6:46pm

ReplayWeb.Pages reads the WARC file, but the “Pages” tab is empty.

This is because WARC files themselves don’t contain an index of pages.

Is there any tool that will generate an appropriate page index based on the WARC file, which will be visible in ReplayWeb.Pages?

Yep! You’re looking for PY-WACZ, which can be found here and installed with pip by running pip install wacz. It lets you create a WACZ file with both a page index and a full-text index. A WACZ bundles the WARC files with some extra metadata and indexing information inside what is essentially a ZIP file, so it still ultimately contains WARC files. No graphical tools are available at this time to convert WARCs to WACZ files.
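As a rough illustration of the conversion step, here is a minimal Python sketch that assembles and runs a `wacz create` call. The helper names `build_wacz_command` and `create_wacz` are made up for this example, and the flags (`-o`, `--detect-pages`, `--text`) are taken from the py-wacz README as I understand it, so please confirm them with `wacz create --help` for your installed version.

```python
import shutil
import subprocess


def build_wacz_command(warc_paths, output_path, full_text=True):
    """Return the argument list for a `wacz create` call (hypothetical helper)."""
    cmd = ["wacz", "create", "-o", str(output_path)]
    # --detect-pages asks py-wacz to build pages.jsonl from the WARC records.
    cmd.append("--detect-pages")
    if full_text:
        # --text additionally stores extracted page text for full-text search.
        cmd.append("--text")
    # The WARC files are passed as positional inputs.
    cmd.extend(str(p) for p in warc_paths)
    return cmd


def create_wacz(warc_paths, output_path, full_text=True):
    """Run py-wacz; raises if the CLI is missing or exits with an error."""
    if shutil.which("wacz") is None:
        raise RuntimeError("wacz not found; install it with: pip install wacz")
    subprocess.run(build_wacz_command(warc_paths, output_path, full_text), check=True)
```

For example, `create_wacz(["crawl.warc.gz"], "crawl.wacz")` would try to produce a WACZ that ReplayWeb.page can load with a populated Pages tab, provided the flags above match your py-wacz version.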

Is it possible to search the entire WARC content in full text via ReplayWeb.Pages?

Yep! But again, it must be in a WACZ where that text index has been generated.
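To check whether a given WACZ actually carries page text, you can open it as a ZIP and look at the page list. The sketch below assumes the layout described in the WACZ specification (a `pages/pages.jsonl` file, optionally `pages/extraPages.jsonl`, with one JSON object per line and an optional `text` field per page). The function name `search_wacz_pages` is made up for this example, and this is a plain substring scan, not how ReplayWeb.page implements its search.

```python
import json
import zipfile


def search_wacz_pages(wacz_path, query):
    """Return (url, title) for pages whose stored text contains `query`."""
    query = query.lower()
    hits = []
    with zipfile.ZipFile(wacz_path) as wacz:
        names = set(wacz.namelist())
        for name in ("pages/pages.jsonl", "pages/extraPages.jsonl"):
            if name not in names:
                continue
            with wacz.open(name) as fh:
                for raw in fh:
                    line = raw.decode("utf-8").strip()
                    if not line:
                        continue
                    entry = json.loads(line)
                    # The first line is a header record and has no "url" key.
                    if "url" not in entry:
                        continue
                    # Pages without a text index simply have no "text" field.
                    text = entry.get("text") or ""
                    if query in text.lower():
                        hits.append((entry["url"], entry.get("title", "")))
    return hits
```

If this returns nothing for a word you know is on a page, the WACZ was probably created without the text index.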

martinus7 · January 31, 2024, 7:34pm

Hello!

Thanks for the answers.
I use Linux and I can use the command line.

I recently had a problem with Browsertrix Crawler.
It was unable to generate the WACZ file.

Errors occurred:

{"timestamp":"2024-01-26T17:25:12.206Z","logLevel":"info","context":"general","message":"Generating Combined WARCs","details":{}}
{"timestamp":"2024-01-26T17:25:52.311Z","logLevel":"info","context":"general","message":"Generating CDX","details":{}}
{"timestamp":"2024-01-26T17:26:19.730Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-01-26T17:26:19.731Z","logLevel":"info","context":"general","message":"Generating WACZ","details":{}}
{"timestamp":"2024-01-26T17:26:19.732Z","logLevel":"info","context":"general","message":"Num WARC Files: 40","details":{}}
{"timestamp":"2024-01-26T17:26:28.014Z","logLevel":"error","context":"general","message":"Error creating WACZ","details":{"status code":1}}
{"timestamp":"2024-01-26T17:26:28.014Z","logLevel":"fatal","context":"general","message":"Unable to write WACZ successfully. Quitting","details":{}}
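The log above is one JSON object per line, so the failing step can be pulled out with a few lines of Python. This is only a sketch for reading such logs; the field names (`timestamp`, `logLevel`, `message`, `details`) are the ones visible in the output above, and the function name `find_problems` is made up.

```python
import json


def find_problems(log_lines, levels=("error", "fatal")):
    """Return (timestamp, level, message, details) for the matching log entries."""
    problems = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("logLevel") in levels:
            problems.append(
                (
                    entry.get("timestamp"),
                    entry.get("logLevel"),
                    entry.get("message"),
                    entry.get("details", {}),
                )
            )
    return problems
```

Applied to the lines above, it would report the "Error creating WACZ" entry (status code 1) followed by the fatal "Unable to write WACZ successfully. Quitting" entry.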

While Browsertrix Crawler was running, there was a power outage, and I don’t have a UPS.
I suspect that one of the WARC files was written incorrectly.

I used the warc-extractor tool to resave the WARC file
and then ran “PY-WACZ” on the result.
That produced a working WACZ.

Are there any programs that repair WARC files if they contain errors?
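Before repairing anything, it helps to find out which of the WARC files is damaged. The sketch below is not a repair tool. It assumes the files are gzip-compressed (`.warc.gz`) and only checks that each one decompresses to the end, which is where a file cut off by a power loss typically fails. It does not validate the WARC records themselves, and it cannot notice a file that happens to end exactly on a gzip member boundary. The function name `gzip_reads_cleanly` is made up for this example.

```python
import gzip
import zlib


def gzip_reads_cleanly(path, chunk_size=1024 * 1024):
    """Return (True, "") if the gzip file decompresses fully, else (False, reason)."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data; only decompression errors matter here.
            while fh.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error) as exc:
        # EOFError: file ends mid-stream; OSError covers gzip.BadGzipFile.
        return False, str(exc)
    return True, ""
```

Running this over every file in the crawl's archive directory should narrow the problem down to the file or files that need to be resaved or dropped.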

PY-WACZ does not always work.
When it fails, I use JS-WACZ. However, it does not have an option to build a full-text index of a WARC file.

Is there any program that both builds a full-text index of a WARC file and creates a pages.jsonl file?