It would be helpful if you could write down exactly how these WACZs are being constructed. I’m not able to follow unfortunately.
The WACZ I shared above didn’t have -t removed. I generated it using the wacz command you provided but packaged up the one warc.gz file that was created by the wget command you provided.
wget --recursive --warc-file=WTICAlumni --user-agent=Mozilla https://www.wticalumni.com
wget --recursive --warc-file=GoldenAge --user-agent=Mozilla https://www.goldenage-wtic.org
wacz create -f *.warc -o WTIC.wacz --detect-pages
or
wacz create -f *.gz -o WTIC.wacz --detect-pages
If I delete the gz from the filetype
@edsu Any further thoughts?
Sorry, I got distracted. I’m trying again to see if I can replicate.
Thanks so much, I really do appreciate your efforts. I’m trying to make the wacz file for a website to be posted in the CT Digital Archives for posterity.
I followed your steps using the WTICAlumni.warc.gz that wget created, and then packaged it up with wacz using the command you provided.
I then uploaded it to Amazon S3 and you should be able to view it at: ReplayWeb.page
Here are my wget and wacz versions:
$ wacz --version
wacz 0.4.8 -- WACZ File Format: 1.1.1
$ wget --version
GNU Wget 1.21.3 built on darwin22.1.0.
I think that maybe your web server isn’t configured for CORS correctly?
I see this error in Firefox when I tried to load your WACZ https://www.wticalumni.com/warc/2023-04-04-WTIC.wacz
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://www.wticalumni.com/warc/2023-04-04-WTIC.wacz. (Reason: CORS request did not succeed). Status code: (null).
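To check outside the browser whether the server sends the CORS header that a viewer on another origin needs, something like the following Python sketch could help. It is only an illustration: the function names are made up for this example, and it assumes the viewer's origin is https://replayweb.page and that the server answers HEAD requests. It looks only at the Access-Control-Allow-Origin header, so a pass here does not rule out other causes.

```python
import urllib.request

# Assumption: the page loading the WACZ is served from this origin.
REPLAY_ORIGIN = "https://replayweb.page"


def cors_allows(headers, origin=REPLAY_ORIGIN):
    """Return True if the response headers permit cross-origin reads from `origin`."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    return allowed in ("*", origin)


def fetch_headers(url, origin=REPLAY_ORIGIN):
    """Send a HEAD request carrying an Origin header and return the response headers."""
    req = urllib.request.Request(url, method="HEAD", headers={"Origin": origin})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return dict(resp.headers.items())


# Example (makes a network request, so it is left commented out):
# url = "https://www.wticalumni.com/warc/2023-04-04-WTIC.wacz"
# print(cors_allows(fetch_headers(url)))
```

If `cors_allows` returns False for the WACZ URL, the host would need to add an Access-Control-Allow-Origin header for that path before a viewer on another site can read the file.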
Hi @edsu … that’s exciting that you can get it to work. My wget is current, but I had wacz 0.4.6. I have just updated it and will try the procedures again.
- I didn’t know you could run ReplayWeb on Firefox… thought only Chrome.
- As for CORS, unless the host (Hostmonster) changed something it USED to work fine. I’ll keep digging. Thanks so much.
Update: Just tried accessing the wacz on my Mac and ran into a similar error of missing pages:
As for the CORS since it DID use to work on the site, I’d have to check with Hostmonster and see if they made any changes…
We just show the first 100 pages until you scroll down. We can tweak the text here to make it less confusing.
I took a brief look, and I think one of the WACZ files in question ended up being corrupt, which is why it wasn’t working. The text extraction is still a bit experimental, and my guess is that perhaps the text file got too big and something happened. We can try to repro it.
Besides our tools, you can test a wacz with a regular unzip, e.g.
unzip -t <wacz> - if it says it’s not a valid zip, then it’s also not a valid wacz.
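If unzip isn't handy, roughly the same test can be sketched with Python's standard zipfile module. This is a minimal sketch, not part of the wacz tool: `check_wacz` is a made-up helper name, and the extra check for a top-level datapackage.json assumes the file follows the WACZ layout. Passing these checks only shows the zip container is intact, not that every page will replay.

```python
import zipfile


def check_wacz(path):
    """Return a list of problems found; an empty list means the basic checks passed."""
    if not zipfile.is_zipfile(path):
        return ["not a valid zip file"]
    problems = []
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and verifies its CRC,
            # returning the first bad member name or None.
            bad = zf.testzip()
            if bad is not None:
                problems.append(f"corrupt member: {bad}")
            # Assumption: a WACZ carries datapackage.json at the top level.
            if "datapackage.json" not in zf.namelist():
                problems.append("missing datapackage.json")
    except zipfile.BadZipFile as exc:
        problems.append(f"unreadable zip: {exc}")
    return problems


# Example: print(check_wacz("WTIC.wacz"))
```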
@ilya @edsu @Hank Thank you all for your help. I’m quite confused, but my system is now working. I did a test and unzipped the wacz file on my computer and it had no errors. I then went and accessed it on the web (now named https://www.wticalumni.com/warc/WTIC.wacz) and it works fine. Thanks to Ilya who informed me that all the pages don’t load at once in the left hand panel. I did contact Hostmonster tech support about CORS, who was less than helpful. They told me it was a web developer problem and they weren’t web developers. At any rate, whatever the cause (and I do wish I knew what happened) it’s working again. I’m quite grateful to you all for your time.