Hello!
Thanks for the answers.
I use Linux and I can use the command line.
I recently had a problem with Browsertrix Crawler.
It was unable to generate the WACZ file.
These errors occurred:
{"timestamp":"2024-01-26T17:25:12.206Z","logLevel":"info","context":"general","message":"Generating Combined WARCs","details":{}}
{"timestamp":"2024-01-26T17:25:52.311Z","logLevel":"info","context":"general","message":"Generating CDX","details":{}}
{"timestamp":"2024-01-26T17:26:19.730Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-01-26T17:26:19.731Z","logLevel":"info","context":"general","message":"Generating WACZ","details":{}}
{"timestamp":"2024-01-26T17:26:19.732Z","logLevel":"info","context":"general","message":"Num WARC Files: 40","details":{}}
{"timestamp":"2024-01-26T17:26:28.014Z","logLevel":"error","context":"general","message":"Error creating WACZ","details":{"status code":1}}
{"timestamp":"2024-01-26T17:26:28.014Z","logLevel":"fatal","context":"general","message":"Unable to write WACZ successfully. Quitting","details":{}}
While Browsertrix Crawler was running, there was a power outage. I don’t have a UPS.
I suspect that one of the WARC files was written incorrectly as a result.
I used the warc-extractor tool to re-save the WARC file
and then ran “PY-WACZ” on it.
I managed to create a working WACZ.
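
In case it helps anyone with the same problem: a damaged file among many WARCs can be located before re-saving it. Below is a minimal sketch, assuming the WARC files are gzip-compressed (.warc.gz) and that the damage is a truncated or corrupt gzip stream. It uses only the Python standard library, and the function names check_gzip_warc and find_damaged are my own, not part of Browsertrix Crawler or any WACZ tool. It only detects damage, it does not repair anything.

```python
import gzip
import sys
import zlib
from pathlib import Path


def check_gzip_warc(path, chunk_size=1024 * 1024):
    """Decompress the whole file and report whether it reads cleanly.

    Returns (ok, bytes_read). ok is False when the gzip stream ends early
    or is corrupt; bytes_read is how much data decompressed before that.
    """
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (EOFError, OSError, zlib.error):
        # EOFError: file ends before the end-of-stream marker (truncation).
        # OSError (includes gzip.BadGzipFile) / zlib.error: corrupt data.
        return False, total
    return True, total


def find_damaged(directory):
    """Return the *.warc.gz files under directory that fail the check."""
    damaged = []
    for path in sorted(Path(directory).rglob("*.warc.gz")):
        ok, _ = check_gzip_warc(path)
        if not ok:
            damaged.append(path)
    return damaged


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_warcs.py <directory containing the WARC files>
    for bad in find_damaged(sys.argv[1]):
        print("damaged:", bad)
```

A file that passes this check can still contain a malformed WARC record, so this only narrows the search to files that were cut off mid-write.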
Are there any programs that repair WARC files if they contain errors?
PY-WACZ does not always work.
When it fails, I use JS-WACZ instead. However, JS-WACZ does not have an option to build a full-text index of a WARC file.
Is there any program that also builds a full-text index of a WARC file and creates a pages.jsonl file?