Creating a WACZ from an offline page to view in ReplayWeb.page

Hi,

I'm trying to archive and replay old websites that we've developed over the years. The problem is that these sites are mostly made in Flash, so I can't capture them with the Chrome extension. On top of that, they are all stored locally.

Here's what I've tried so far. I uploaded the old sites and tried to open them with the web app, which didn't load the site because of SSL.

After that I downloaded warcit-master and tried to make an archive with Python 3.11, but when I run setup.py I get the error: no commands supplied.

I hope somebody here can help me put the old websites into an archive so that I can play them with the ReplayWeb.page tool and make some screen recordings. I believe I'm not far from a solution, and from what I've seen your tools are the best option. Respect for this great work and for helping to archive these cool old websites.

One step further: I tried to install it with pip and got an error:
Could not build wheels for cchardet
Greetings,
Markus

Hey! I have some experience with archiving and replaying Flash content! A few things to know:

  1. ReplayWeb.page has Ruffle built into it! I haven't been able to get this working for my personal archiving project just yet, but your mileage may vary! My use case involves Flash content that needs to contact a server running a PHP script; we haven't had the chance to debug it for embeds, but it works properly when loaded into ArchiveWeb.page. Pretty specific! If you get this running after archiving your content, I'd like to know about it! :slight_smile:

  2. Warcit is currently not quite up to speed with the rest of our tools. It is not yet compatible with how we display pages in ReplayWeb.page, and while it does produce a valid WARC file, we need to update it a bit. It's hard to find the time while we're launching Browsertrix!

  3. That being the case… Using ArchiveWeb.page may be your best bet here. Install Ruffle to regain Flash compatibility and browse the site as you normally would. It should save everything required and should be viewable once embedded with the useRuffle flag (see first link).

  4. (This is a bit of a hack) If you need to host your files locally, you can run a web server on your machine (I like http-server), run it with the -p 80 flag so it listens on port 80, and add the following to your hosts file (see the sketch below):

127.0.0.1  www.yourwebsite.com
127.0.0.1  yourwebsite.com

It won't have HTTPS (which may matter if you're mixing your locally archived Flash content with content from other sources), but now you can archive your local content as if it came from that site! I'd also recommend removing the entries from your hosts file once you're done!

This approach also isn’t ideal if you have many pages… For that you might have to wait for warcit to get an update!
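
For reference, here's a rough sketch of that setup, assuming Node's http-server and a Linux/macOS hosts file at /etc/hosts (substitute your own site folder and domain; ports below 1024 usually need sudo):

# serve the folder containing the old site on port 80
npm install -g http-server
sudo http-server /path/to/old-site -p 80

# point the original domain at your own machine (remove these lines once you're done)
echo "127.0.0.1  www.yourwebsite.com" | sudo tee -a /etc/hosts
echo "127.0.0.1  yourwebsite.com" | sudo tee -a /etc/hosts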


(This is a bit of a hack) If you need to host your files locally, you can run a web server on your machine (I like http-server), run it with the -p 80 flag so it listens on port 80

To further complicate this hack: the /etc/hosts trick won't work if you crawl using the browsertrix-crawler Docker container, because the container cannot see your machine's actual /etc/hosts file, at least not by default. To fix this, append --network=host to your docker run command.
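
For example, roughly (the URL and collection name here are placeholders; the other flags mirror a typical crawl invocation):

sudo docker run --network=host -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url http://www.yourwebsite.com/ --generateWACZ --collection local_site_test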


@Hank I've been trying the little hack you graciously provided, that is, locally hosting a folder with some HTML+CSS files in order to harvest them using browsertrix-crawler. My reason for going with this approach is to work around an issue with warcit: as you know, archives created with warcit won't populate the 'pages' list on ReplayWeb.page. Your hack does a bit better here.

This approach somewhat works. I get a .wacz that is roughly the same size as the .warc created through warcit (both are >200 MB), and if I unpack the .wacz resulting from my crawl, the contents look fine. Finally, my logs also look good (46/46, no failures):

"Crawl statistics","details":{"crawled":46,"total":46,"pending":0,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}

Where this approach fails me is that when I load the .wacz in question into ReplayWeb.page and then turn off my local webserver, most of the archived pages fail to display. Instead, I'm greeted with messages such as the following:

Archived Page Not Found
Sorry, this page was not found in this archive:
http://192.168.0.70/r_NL.IMRO.0281.BP00003-on01_0003Wijzevanmeten.html

This mostly (or perhaps only) happens on pages I hadn't yet visited in ReplayWeb.page while my local webserver was still running. In other words, it's almost as though ReplayWeb.page is only able to load "already cached" results.

Have you seen this before? I'm pretty perplexed by the fact that the harvested webserver needs to be live in order for ReplayWeb.page to be able to show all my archived pages. Kinda defeats the purpose, doesn't it? :slightly_smiling_face:

I’m sorry for asking here (maybe an issue over at the browsertrix crawler github is more appropriate?), but I thought maybe you had some first-hand experience with this approach.

In terms of software I’ve tried both http-server and darkhttpd, and I’m crawling using the latest docker image. I tried multiple different combinations of parameters for my crawl, but to no avail.

That's not the behavior I would expect from this… ReplayWeb.page shouldn't ever be fetching anything from outside of the archive, and it shouldn't fail with content that should already be present in the archive! I'm going to forward this along to the rest of the team… though it sounds like a ReplayWeb.page issue to me. I'd also be interested to know whether it's related to the local IP being used. Did you try the hosts file edit method?

I'd also be interested to know whether it's related to the local IP being used. Did you try the hosts file edit method?

I did, but to no avail. As in, your /etc/hosts trick gives me a prettier URL, but still results in the weird behavior I described (i.e. ReplayWeb.page failing to load pages after I Ctrl-c/kill my webserver, be it http-server or darkhttpd).

I would like to see whether this also happens if I use 127.0.0.1 instead of 192.168.0.70 (i.e. my host's IP address on my home LAN), but I'm hitting a Docker wall when I try to do that:

$ sudo docker run --network=host  -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url http://127.0.0.1 --generateWACZ  --collection localhost_test

{"timestamp":"2024-10-29T15:56:14.079Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.2.3 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-10-29T15:56:14.081Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"http://127.0.0.1/","scopeType":"prefix","include":["/^https?:\\/\\/127\\.0\\.0\\.1\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"auth":null,"_authEncoded":null,"maxExtraHops":0,"maxDepth":1000000}]}
{"timestamp":"2024-10-29T15:56:14.155Z","logLevel":"warn","context":"redis","message":"ioredis error","details":{"error":"[ioredis] Unhandled error event:"}}
{"timestamp":"2024-10-29T15:56:14.156Z","logLevel":"warn","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
{"timestamp":"2024-10-29T15:56:15.221Z","logLevel":"warn","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
{"timestamp":"2024-10-29T15:56:16.241Z","logLevel":"warn","context":"state","message":"Waiting for redis at redis://localhost:6379/0","details":{}}
^C{"timestamp":"2024-10-29T15:56:16.616Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2024-10-29T15:56:16.616Z","logLevel":"error","context":"general","message":"error: no crawler running, exiting

The same thing happens if I add 127.0.0.1 www.bestemmingsplannen.archive to my /etc/hosts and then run the crawler with --url http://bestemmingsplannen.archive/: I get stuck at Waiting for redis at redis://localhost:6379/0

Crawling websites from my home LAN IP works fine with /etc/hosts modifications though, as long as I make sure to include --network=host.
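
Concretely, the combination that does work for me is roughly this (the collection name is just an example):

# /etc/hosts, mapping the hostname to my LAN IP instead of 127.0.0.1
192.168.0.70  www.bestemmingsplannen.archive
192.168.0.70  bestemmingsplannen.archive

sudo docker run --network=host -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url http://bestemmingsplannen.archive/ --generateWACZ --collection lan_ip_test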

I’ve reproduced this behavior on multiple LANs, btw.

As one last thing to try, could you hit the Purge Cache + Full Reload button with the web server off and try to load the archive again?

Weird. I'm suddenly not able to reproduce my own problem. I certainly didn't dream it, though; it happened on multiple machines today, yesterday, and on Friday.

If I find out more I will get back to you.
