Archive tools struggle archiving web.archive.org

I’m trying to archive some webpages that no longer exist but are archived on web.archive.org. I’ve tried both the ArchiveWeb.page AppImage with replayweb.page and a self-hosted Browsertrix cluster running on Kubernetes on my laptop.

There are two issues with this. The first is that the Browsertrix “one hop out” feature seems to strip the https://web.archive.org/web/DATETIME/ prefix from https://web.archive.org/web/DATETIME/BASE_URL for the one-hop URLs (the ones that are not in scope) and instead tries to archive just BASE_URL. This is fine for webpages that still exist, but it fails for websites that only exist on archive.org.
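To make the URL shape concrete, here is a minimal Python sketch of how a Wayback Machine URL splits into its DATETIME and BASE_URL parts. It is only an illustration of the structure described above, not Browsertrix's actual code; the function name `split_wayback_url`, the `WaybackParts` type, and the regular expression are my own assumptions.

```python
import re
from typing import NamedTuple, Optional


class WaybackParts(NamedTuple):
    """Hypothetical container for the two parts of a Wayback URL."""
    timestamp: str      # the DATETIME segment, e.g. 20200101000000
    original_url: str   # the BASE_URL that was archived


# Assumed pattern: https://web.archive.org/web/<digits>[<modifier>]/<original URL>
# The optional two-letter modifier (such as "id_" or "if_") is allowed but ignored.
_WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d+)(?:[a-z]{2}_)?/(.+)$"
)


def split_wayback_url(url: str) -> Optional[WaybackParts]:
    """Return the timestamp and original URL, or None if url is not a Wayback URL."""
    match = _WAYBACK_RE.match(url)
    if match is None:
        return None
    return WaybackParts(timestamp=match.group(1), original_url=match.group(2))


if __name__ == "__main__":
    archived = "https://web.archive.org/web/20200101000000/https://example.com/page?x=1"
    parts = split_wayback_url(archived)
    # A crawler that keeps only parts.original_url would request the live site,
    # which is the behaviour I seem to be seeing for one-hop URLs.
    print(parts)
```

A crawl that is meant to stay inside the archive would need to queue the full archived URL rather than only the `original_url` part.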

The second is that a WACZ containing web.archive.org links fails to display in both Browsertrix replay and replayweb.page. When I open the page from the WACZ in replayweb.page, it loads perfectly, but then the replayweb.page browser immediately redirects to https://web.archive.org/replay/index.html?source=local://RANDOMSTRING#url=WEB_ARCHIVE_ORG_URL, which is clearly wrong. In Browsertrix a similar thing happens, except that the URL looks fine and the page just says “An unexpected error occured: TypeError: NetworkError when attempting to fetch resource”.

It looks to me like the tools are trying to do some magic that’s getting confused when URLs (both relative and absolute) contain full absolute URLs.


Yes, this is a known issue. Archiving from an existing archive can be done, but it isn’t currently supported because we haven’t had the time to implement it. You can track support for this feature in the GitHub issue “[Bug]: web.archive.org not archiving or playing properly” (Issue #307 in webrecorder/archiveweb.page).