One computer can open a wacz file, another cannot

DHK · March 29, 2023, 10:13pm

I’m running Macs on Ventura 13.3. Here’s my URL:
https://replayweb.page/?source=https://www.wticalumni.com/warc/WTIC.wacz&url=https://www.wticalumni.com/#view=resources&url=https://www.wticalumni.com/

One Mac can open the file; another cannot.

Any help gratefully appreciated. Using Chrome on both, of course.

DHK · April 1, 2023, 4:04pm

No one? In addition I’m finding that a newly created wacz won’t load either on any machine with the error: “An unexpected error occurred: Abort Error: The user aborted a request”

URL is: https://replayweb.page/?source=https://www.wticalumni.com/warc/2023-03-29-WTIC.wacz&url=https://www.wticalumni.com/#view=resources&url=https://www.wticalumni.com/
I’m using the same commands I always have to create the warc and wacz files…
Thanks…

Hank · April 5, 2023, 6:57am

To get to the bottom of any issue you’re having here we’d need the actual WACZ file. The WACZs aren’t actually uploaded anywhere and the URLs you have sent here don’t actually link to them on the web. In order to load the file on your other computer you’ll need to move the file over onto its storage and locate it in replayweb.page again.

DHK · April 5, 2023, 11:53am

The actual file IS uploaded to my website at:
https://www.wticalumni.com/warc/WTIC.wacz which works, and a newer test version
https://www.wticalumni.com/warc/23-04-04-WTIC.wacz which doesn’t work in https://replayweb.page

DHK · April 6, 2023, 7:37pm

@Hank This wacz is made up of two warc files. Both files can be accessed fine:
https://www.wticalumni.com/warc/WTICAlumni.warc and https://www.wticalumni.com/warc/2022-12-09-GoldenAge.warc However the wacz file at the address in the previous reply cannot be accessed.

Hank · April 6, 2023, 7:56pm

IDK how I missed that… Maybe I shouldn’t be responding to forum posts at 2 AM lol >_<

Trying your first link again seems to result in a CORS error (An unexpected error occured: TypeError: NetworkError when attempting to fetch resource.) on Firefox but works fine in Edge (Chromium). Pretty sure this is known behavior and we don’t have a workaround for it at the moment.

The link in your second post results in this error in Edge: An unexpected error occured: AbortError: The user aborted a request. That is strange because, as you mention, visiting the link embedded in the URL works fine…

https://www.wticalumni.com/warc/23-04-04-WTIC.wacz results in a 404, I can’t access it.

I’ll have a look at the WTIC.wacz file locally and see if I can figure anything out?
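For anyone debugging a similar CORS error, here is a rough Python sketch for inspecting the response headers of a hosted WACZ. It assumes the browser-based viewer needs a permissive Access-Control-Allow-Origin header and byte-range support from the host; the function names are made up for illustration, and it does not explain why one browser works while another fails.

```python
from urllib.request import Request, urlopen

# Origin the viewer runs on (taken from the URLs in this thread).
REPLAY_ORIGIN = "https://replayweb.page"


def fetch_headers(url, origin=REPLAY_ORIGIN):
    """HEAD the archive URL, sending an Origin header as a browser would.

    Raises urllib.error.HTTPError for responses such as 404.
    """
    req = Request(url, method="HEAD", headers={"Origin": origin})
    with urlopen(req, timeout=30) as resp:
        return {key.lower(): value for key, value in resp.headers.items()}


def hosting_problems(headers, origin=REPLAY_ORIGIN):
    """Return a list of likely hosting problems, given lower-cased headers."""
    problems = []
    allow = headers.get("access-control-allow-origin")
    if allow not in ("*", origin):
        problems.append("no CORS header allowing " + origin)
    if "bytes" not in headers.get("accept-ranges", "").lower():
        problems.append("server does not advertise byte-range support")
    return problems
```

Calling hosting_problems(fetch_headers(url)) on the WACZ URL gives an empty list when both assumed requirements are met.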

DHK · April 6, 2023, 9:26pm

@Hank OOps I had a typo. Correct link is https://www.wticalumni.com/warc/2023-04-04-WTIC.wacz
Thanks… I’m afraid it’s a huge file: around 25G… sorry

Hank · April 7, 2023, 12:34am

Downloaded, tried loading it, says that 0 pages are found and doesn’t seem to load properly…

Which makes sense because datapackage.json & datapackage-digest.json appear to be missing in 2023-04-04-WTIC.wacz? Unsure how you created this file but that would be a place to start.
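A quick way to check for those two files yourself, as a minimal sketch: a WACZ is a zip archive, so Python's standard zipfile module can list what is inside. The helper name missing_members is made up for this example.

```python
import zipfile

# The two files reported missing above.
REQUIRED = ("datapackage.json", "datapackage-digest.json")


def missing_members(path, required=REQUIRED):
    """Return the required file names that are absent from the archive.

    Raises zipfile.BadZipFile if the file is not a readable zip at all.
    """
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    return [name for name in required if name not in names]
```

An empty list means both files are present; anything else names what is missing.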

DHK · April 7, 2023, 1:34am

@Hank I use the command

wacz create -o WTIC.wacz *.warc

which is the same command I’ve used for all my wacz’s. Maybe something changed in my system… upgrade of macOS?

Doesn’t this use python? Could something have changed there?

DHK · April 11, 2023, 10:43am

@Hank Any more thoughts?

Hank · April 12, 2023, 6:04am

I doubt Python is the problem here. For details on the following command line flags, check out the readme for py-wacz.

You should be able to validate your wacz files with wacz validate -f path/to/file.wacz. If that doesn’t work something is wrong with the file. In the case of your file 2023-04-04-WTIC.wacz ZipFile for Python reports that it’s not a zip file which is curious… WTIC.wacz validated fine.

I’d recommend creating the file with the --detect-pages and -t flags. Not including these flags means ReplayWeb.Page may be unable to find the index of pages it’s looking for within WACZ files. AFAIK, when you load WARC files it parses the file and generates this index which is why they take longer to load.

I tried re-creating the WACZ out of 2022-12-09-GoldenAge.warc and WTICAlumni.warc. The file was indexed; however, each page I tried to click got the “Archived Page Not Found” error. When I tried omitting 2022-12-09-GoldenAge.warc the resulting wacz file had the same issue. GoldenAge.warc notably loads without issue into ReplayWeb.Page.

wacz create -f *.warc -o testfile.wacz -t --detect-pages

I’m sorry I can’t find the root cause of the issue. Possibly unrelated, but in the future I’d also recommend making smaller WARC files; ~8GB is probably ideal to avoid running into browser storage issues when loading them.
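The "not a zip file" result mentioned above can be reproduced with the standard library alone. This is only a sketch (describe_archive is a made-up name), and it is not a substitute for wacz validate, which checks more than zip integrity.

```python
import zipfile


def describe_archive(path):
    """Report whether a file is a readable zip and whether its members pass CRC checks."""
    if not zipfile.is_zipfile(path):
        return "not a zip file"
    with zipfile.ZipFile(path) as zf:
        # testzip() reads every member, so it is slow on multi-gigabyte archives.
        bad = zf.testzip()
    return "ok" if bad is None else "corrupt member: " + bad
```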

DHK · April 12, 2023, 6:39pm

@Hank I really appreciate your trying to help me here. My procedure is to use the following:

wget --recursive --warc-file=GoldenAge --user-agent=Mozilla https://www.goldenage-wtic.org

for example. Could the wget be having problems creating WTICAlumni.warc? I really don’t know how I can use smaller files, as I’m trying to download the entire websites for archival purposes. I CAN do them separately, but that doesn’t help me with WTICAlumni, the bigger one.
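On the question of smaller files: GNU Wget documents a --warc-max-size option that splits a crawl across several WARC files. The sketch below wraps the wget command from this post in Python and adds that option. The "8G" suffix form and the helper names are assumptions, so check them against the manual for your installed wget version before relying on this.

```python
import subprocess


def build_wget_command(url, warc_name, max_size="8G"):
    """Build the argument list for a recursive crawl with size-limited WARCs."""
    return [
        "wget",
        "--recursive",
        "--warc-file=" + warc_name,
        "--warc-max-size=" + max_size,  # assumed size syntax; see `man wget`
        "--user-agent=Mozilla",
        url,
    ]


def run_crawl(url, warc_name, max_size="8G"):
    """Run the crawl and return wget's exit code.

    check=False because wget can exit non-zero after a crawl that hit
    some broken links but otherwise completed.
    """
    result = subprocess.run(build_wget_command(url, warc_name, max_size), check=False)
    return result.returncode
```

For example, run_crawl("https://www.goldenage-wtic.org", "GoldenAge") would start the same crawl as above with the size limit added. The resulting files could then be packaged together with wacz create -f *.warc.gz as shown earlier in the thread.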

edsu · April 12, 2023, 7:15pm

I noticed that the wget command will create GoldenAge.warc.gz, not the GoldenAge.warc that your previous wacz command (*.warc) would pick up. Did you decompress the WARC file prior to packaging with wacz? I’m trying to replicate the problem myself by following your steps.

DHK · April 12, 2023, 9:07pm

@edsu Thanks, I’ve just been deleting the “.gz” extension, leaving the file as GoldenAge.warc or WTICAlumni.warc, and then running the command wacz create -o WTIC.wacz *.warc. It’s always worked fine until this month.

DHK · April 12, 2023, 9:17pm

Actually just added the “.gz” back on the two files and it made no difference. Still produced a wacz of 26G that wouldn’t work.
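Renaming a file does not change its contents, so a .warc.gz with the extension removed is still gzip-compressed. A small sketch for telling which kind of file you actually have, by looking at the first two bytes (is_gzip is a made-up helper name). Nothing in this thread establishes that the renaming is what broke the WACZ; this only removes the guesswork about the file format.

```python
# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC
```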

edsu · April 13, 2023, 11:35am

I did notice some warnings when generating the WACZ. They seemed to go away when removing the -t.

$ wacz create -f *.warc.gz -o testfile.wacz -t --detect-pages
Reading and Indexing All WARCs
Skipping, Text Extraction Failed For: https://www.goldenage-wtic.org/gaor-51.html
'utf-8' codec can't decode byte 0x92 in position 3258: invalid start byte
Warning: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to BoilerPy3 again. Trying to recover somehow...
Num Pages Detected: 151
Writing archives...
Generating page index...
Generating datapackage.json
Generating datapackage-digest.json

I put the WACZ on Amazon S3 and here it is in ReplayWeb.page:

https://replayweb.page/?source=https%3A%2F%2Fedsu-webarchives.s3.amazonaws.com%2Ftmp%2Ftestfile.wacz#view=pages
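About the 'utf-8' codec warning in the output above: byte 0x92 is not valid as the start of a UTF-8 character, but in Windows-1252 it is the curly apostrophe (’). That suggests, as an assumption, that the page was saved in Windows-1252. The sketch below shows the general fallback idea; it is not what py-wacz does internally, and decode_html is a made-up name.

```python
def decode_html(raw):
    """Decode bytes as UTF-8, falling back to Windows-1252 when that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A few byte values are undefined in cp1252, hence errors="replace".
        return raw.decode("cp1252", errors="replace")
```

With this, decode_html(b"It\x92s") gives "It’s" instead of raising an error.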

edsu · April 13, 2023, 11:38am

I should note, though, that my WACZ is 8.6GB. It sounds like your wget process is generating more than one warc.gz file? I’m not sure I understand what you are doing here.

edsu · April 13, 2023, 11:43am

Also, https://www.wticalumni.com/warc/23-04-04-WTIC.wacz is a 404 Not Found, which would explain why ReplayWeb.page can’t display it:

$ wget https://www.wticalumni.com/warc/23-04-04-WTIC.wacz
--2023-04-13 07:42:00--  https://www.wticalumni.com/warc/23-04-04-WTIC.wacz
Resolving www.wticalumni.com (www.wticalumni.com)... 67.20.76.220
Connecting to www.wticalumni.com (www.wticalumni.com)|67.20.76.220|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-04-13 07:42:00 ERROR 404: Not Found.

DHK · April 13, 2023, 12:31pm

I had a typo. Correct link is https://www.wticalumni.com/warc/2023-04-04-WTIC.wacz

Your Golden Age WACZ is 8.6G, but I have two WARCs combined in the WACZ, also the 16G WTICAlumni.warc

DHK · April 13, 2023, 12:42pm

@edsu Progress… I removed the -t and did get a wacz I can access (on my computer drive). BUT, it says
[Screenshot 2023-04-13 at 8.41.11 AM]
What happened to the other pages?
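One way to see exactly which pages ended up in the page index is to read it straight out of the WACZ. This sketch assumes the usual WACZ layout, where the page list lives in pages/pages.jsonl with one JSON record per line and a header line that has no "url" field. list_pages is a made-up helper name.

```python
import json
import zipfile


def list_pages(path, member="pages/pages.jsonl"):
    """Return the URLs recorded in the WACZ page index.

    Raises KeyError if the archive has no such member.
    """
    urls = []
    with zipfile.ZipFile(path) as zf:
        with zf.open(member) as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                # The header line describes the format and carries no URL.
                if "url" in record:
                    urls.append(record["url"])
    return urls
```

Comparing len(list_pages("WTIC.wacz")) with the "Num Pages Detected" figure printed by wacz create would show whether pages were lost at packaging time or are just not being displayed.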