Wget does not download files, although log says it did

DHK · May 24, 2023, 8:09pm

I’ve started having problems with a warc.gz download. I checked the live website and the file is there, but when I look for it in replayweb.page, using the warc.gz file of the complete website, it can’t be found. More interestingly, the log shows it was saved. I’ve successfully downloaded this site for months, and the problem has only started occurring recently. This is only one example of many:

Website https://www.wticalumni.com

wget --recursive --warc-file=WTICAlumni --output-file=logfile.txt --user-agent=Mozilla https://www.wticalumni.com

wget Version: GNU Wget 1.21.3 built on darwin21.3.0.

Page address: WTIC Alumni Site

Image Address: https://www.wticalumni.com/images/Bob-Steele-with-BC-House-Proposed-Drawing.jpg

Logfile Entry:
--2023-05-24 15:33:36-- https://www.wticalumni.com/images/Bob-Steele-with-BC-House-Proposed-Drawing.jpg
Reusing existing connection to www.wticalumni.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 137450 (134K) [image/jpeg]
Saving to: ‘www.wticalumni.com/images/Bob-Steele-with-BC-House-Proposed-Drawing.jpg’

 0K .......... .......... .......... .......... .......... 37% 38.2M 0s
50K .......... .......... .......... .......... .......... 74% 46.2M 0s
100K .......... .......... .......... .... 100% 44.1M=0.003s

2023-05-24 15:33:36 (42.4 MB/s) - ‘www.wticalumni.com/images/Bob-Steele-with-BC-House-Proposed-Drawing.jpg’ saved [137450/137450]

Any help gratefully appreciated.
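
For checking many files rather than one, the "saved" lines in the log can be pulled out with a short script. This is only a sketch, assuming the log lines look like the entry quoted above; the function name `saved_files` is made up for illustration and is not part of wget.

```python
import re

# Matches wget's "saved" summary lines, for example:
#   2023-05-24 15:33:36 (42.4 MB/s) - ‘path/to/file.jpg’ saved [137450/137450]
# The quote characters vary with the locale, so several are accepted.
# The "/total" part is optional in case the server sent no length.
SAVED_RE = re.compile(
    r"[‘'`](?P<path>.+?)[’'] saved \[(?P<got>\d+)(?:/(?P<total>\d+))?\]"
)


def saved_files(log_text):
    """Return (path, bytes_received, bytes_expected_or_None) per saved line."""
    results = []
    for line in log_text.splitlines():
        match = SAVED_RE.search(line)
        if match:
            total = match.group("total")
            results.append(
                (
                    match.group("path"),
                    int(match.group("got")),
                    int(total) if total else None,
                )
            )
    return results
```

Reading logfile.txt as UTF-8 text and passing its contents to `saved_files` gives the list of paths wget reported as saved, which can then be compared with what the replay tool finds.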

DHK · May 26, 2023, 1:57am

@edsu @hank You both were kind enough to work with me on an earlier problem, maybe you can shed some light on my current problem. Thanks.

edsu · May 26, 2023, 2:18pm

Hi @DHK you mentioned that you are crawling it over time. Are you replacing the WTICAlumni.warc.gz file each time? I wonder if ReplayWebPage may have some data cached that is interfering with the lookup? Does it help if you delete the site data for www.wticalumni.com in your browser and try again?

DHK · May 26, 2023, 9:00pm

@edsu I’ve cleared my Chrome cache, and purged and reloaded the file in ReplayWeb.page, and the problem still occurs.

Even more interesting, I unzipped the gz file to get the plain warc, and then used “The Unarchiver” to unpack that back into its original file structure, and the missing jpgs are there.
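
Another way to confirm that a capture is in the WARC itself, independent of any replay tool, is to walk the record headers and look for the image URL in WARC-Target-URI. A minimal sketch using only the Python standard library follows. The function names are hypothetical, and it assumes every record carries a Content-Length header. It strips angle brackets around the URI because some WARC writers emit them.

```python
import gzip


def iter_warc_headers(path):
    """Yield a dict of WARC headers (lower-cased names) for each record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            headers = {}
            for raw in iter(f.readline, b""):
                raw = raw.strip()
                if not raw:
                    break  # blank line ends the header block
                name, _, value = raw.partition(b":")
                key = name.strip().lower().decode("ascii", "replace")
                headers[key] = value.strip().decode("utf-8", "replace")
            # Skip the record body so payload bytes are never parsed as headers.
            f.seek(int(headers.get("content-length", "0")), 1)
            yield headers


def find_record_types(path, url):
    """Return the WARC-Type of every record whose target URI equals url."""
    return [
        h.get("warc-type")
        for h in iter_warc_headers(path)
        if h.get("warc-target-uri", "").strip("<>") == url
    ]
```

Calling `find_record_types("WTICAlumni.warc.gz", "https://www.wticalumni.com/images/Bob-Steele-with-BC-House-Proposed-Drawing.jpg")` and getting a list that includes "response" would suggest the image was captured and that the problem is on the indexing or replay side rather than in the crawl.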

edsu · May 26, 2023, 10:04pm

Can you share the URL for the WACZ file?

DHK · May 28, 2023, 11:37am

@edsu I can upload the wacz file, but before I upload a 16G file, I just discovered that my warc.gz file does not show a picture, but the wacz file I made DOES show it. Unfortunately, I don’t have enough storage to upload both wacz and warc.gz at the same time. What would you like my next step to be?

edsu · May 31, 2023, 7:49pm

I’m confused, isn’t the WACZ or WARC data on the web already so you can view it with ReplayWebPage?

DHK · May 31, 2023, 10:31pm

It was, until I discovered that the website I thought had no quota really did have one… I did some housecleaning… My current status is that the warc.gz can’t seem to find pages, but when I make the wacz it does… It’s driving me crazy. Tell me what you want and I’ll upload it. Thanks.