Warcit to Replay web.page

lawork · June 11, 2021, 8:22pm

Hi all,

I’m working on a 10-year-old batch of downloaded websites that have been living on a local server. They consist of directories of .htm or .html files, images, and other web data that existed at that time, so I wanted to try warcit to see if I could package them into a warc.gz file for replay and use, since this seemed to fit one of warcit’s use cases. I’m running macOS 10.15.7 and the latest version of Chrome, if it helps.

After getting warcit up and running, I got success messages when packaging the web files from a local test directory I created into warc.gz files (e.g. “Wrote 12 resources to blogtest.warc.gz”).

I’m guessing some of this is user error, and these files are fairly old. When I loaded the warc.gz into replay web.page (the online browser version), the URLs seemed to index correctly, and the MIME types and dates looked accurate. But when I clicked on any of them to view it, I got the following error for every URL:

“The webpage at https://replayweb.page/w/id-dcbbecbfcedd/20120620152050mp_//Users/lw2cd/Desktop/Working/Blog_Examples/ameliaAbreuim-blowing-up-serious-ladies-with-uva-folks-its.htm might be temporarily down or it may have moved permanently to a new web address.”

I also attempted to load the same warc.gz package into WR player, just to see if it would load, but none of the pages even indexed there (probably not a big surprise).

My questions are:

Is this user error? Am I attempting to use warcit or replay web.page in a way that wasn’t intended (web files that were downloaded, have been sitting offline for years, and are now repackaged into a warc.gz to be read by replay web.page)?
If it’s not user error, what might be missing in the exchange of information between the warcit warc.gz package and replay web.page? Where is the likely source of the error: warcit or replay web.page?

Thanks so much for taking a look. I’m happy to supply more details if needed.

ilya · June 15, 2021, 1:28am

@lawork Yes, in general warcit → replayweb.page should work, and hopefully can be made easier as well.

It’s hard to tell exactly what went wrong. It could be that the prefix was not passed to warcit correctly.
Do you have the command line that you used to run warcit?

Another issue, which may be a bit more difficult to fix, could be that the HTML files have absolute links to files on local disk, e.g. (/Users/…), and when converted into a WARC, the links remain. warcit does not currently rewrite links in that case. You can also look at the HTML file that has the link to see if it’s pointing to an absolute path. I assume ideally you want it to be loaded from something like: https://myblog.example.com/ameliaAbreuim-blowing-up-serious-ladies-with-uva-folks-its.htm
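
If it helps, here is a rough Python sketch for checking whether the HTML files contain links to absolute local paths. It is not part of warcit; the names LOCAL_PREFIXES, find_local_links and scan_directory are made up for illustration, and the prefixes it looks for ("/Users/" and "file://") are assumptions based on the path in the error message.

```python
from html.parser import HTMLParser
from pathlib import Path

# Prefixes treated as "local disk" links. These are assumptions for
# illustration; adjust them to match the machine the sites were saved on.
LOCAL_PREFIXES = ("/Users/", "file://")


class LinkCollector(HTMLParser):
    """Collects every href/src attribute value in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_local_links(html_text, prefixes=LOCAL_PREFIXES):
    """Return the href/src values that start with one of the local prefixes."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [link for link in collector.links if link.startswith(prefixes)]


def scan_directory(root):
    """Map each .htm/.html file under root to the local links it contains."""
    report = {}
    # "*.htm*" matches both .htm and .html (and any other .htm* suffix).
    for path in Path(root).rglob("*.htm*"):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = find_local_links(text)
        if hits:
            report[str(path)] = hits
    return report
```

Calling scan_directory("/Users/lw2cd/Desktop/Working/Blog_Examples/") would return a dict of files to offending links. An empty dict suggests the links are relative and the problem is more likely the prefix.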

For example, running warcit -n myblog.warc https://myblog.example.com/ /Users/lw2cd/Desktop/Working/Blog_Examples/ should ideally create myblog.warc with that structure.
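
To check whether the prefix made it into the WARC, one option is to list the WARC-Target-URI headers and see which ones do not start with the expected prefix. Below is a minimal sketch using only the Python standard library. The function names are hypothetical, and the line scan is naive (it could also match a payload line that happens to start with the same header name), so a proper WARC reader would be more robust.

```python
import gzip


def list_target_uris(warc_path):
    """Return the WARC-Target-URI header values found in a .warc.gz file.

    Naive approach: decompress the file and look at every line that starts
    with the header name, without parsing record boundaries.
    """
    uris = []
    with gzip.open(warc_path, "rb") as stream:
        for line in stream:
            if line.lower().startswith(b"warc-target-uri:"):
                value = line.split(b":", 1)[1].strip()
                # Some WARC writers wrap the URI in angle brackets.
                uris.append(value.decode("utf-8", errors="replace").strip("<>"))
    return uris


def uris_outside_prefix(uris, prefix):
    """Return the URIs that do not start with the expected URL prefix."""
    return [uri for uri in uris if not uri.startswith(prefix)]
```

For example, uris_outside_prefix(list_target_uris("blogtest.warc.gz"), "https://myblog.example.com/") would list any records that still carry a local path such as /Users/… instead of the intended URL.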

Hope this helps!