Archive tools struggle to archive pages from web.archive.org

I’m trying to archive some webpages that no longer exist but are archived on web.archive.org. I’ve tried both the ArchiveWeb.page AppImage with replayweb.page and a local self-hosted Kubernetes Browsertrix cluster on my laptop.

There are two issues with this. The first is that the Browsertrix “one hop out” feature seems to strip the https://web.archive.org/web/DATETIME/ prefix from the https://web.archive.org/web/DATETIME/BASE_URL URL for the one-hop URLs (the ones that are not in scope) and instead tries to archive just BASE_URL. This is fine for webpages that still exist, but it fails for websites that only exist on archive.org.
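To make the URL shape concrete, here is a minimal Python sketch of how a Wayback URL splits into DATETIME and BASE_URL, and how a stripped one-hop URL could be wrapped again. This is only an illustration, not how Browsertrix works internally; the helper names split_wayback and rewrap are made up for this post, and the regex assumes the usual numeric timestamp with an optional two-letter modifier such as im_.

```python
import re

# Assumed shape: https://web.archive.org/web/<timestamp><optional modifier>/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(?P<ts>\d{1,14})(?P<mod>[a-z]{2}_)?/(?P<orig>.+)$"
)


def split_wayback(url):
    """Return (timestamp, original_url), or None if url is not a Wayback URL."""
    m = WAYBACK_RE.match(url)
    if m is None:
        return None
    return m.group("ts"), m.group("orig")


def rewrap(timestamp, original_url):
    """Put the Wayback prefix back in front of a bare original URL."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


seed = "https://web.archive.org/web/20200101000000/http://example.com/page"
ts, base = split_wayback(seed)

# A one-hop link that lost its prefix could be re-wrapped with the seed's timestamp.
one_hop = rewrap(ts, "http://example.com/other")
```

With something like this, the one-hop URL would keep pointing at the archived copy instead of the dead live site.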

The second is that WACZ files containing web.archive.org links fail to display in both Browsertrix replay and replayweb.page. When I open the page from the WACZ in replayweb.page, it loads perfectly, but then the replayweb.page browser immediately redirects to https://web.archive.org/replay/index.html?source=local://RANDOMSTRING#url=WEB_ARCHIVE_ORG_URL, which is clearly wrong. In Browsertrix a similar thing happens: the URL looks fine, but the page just says “An unexpected error occured: TypeError: NetworkError when attempting to fetch resource”.

It looks to me like the tools are doing some URL-handling magic that gets confused when URLs (both relative and absolute) contain another full absolute URL inside them.
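As a rough illustration of that guess (I have not checked how either tool actually handles URLs), here is a small Python sketch. A standard URL parser sees the embedded URL as nothing more than part of the path, while a hypothetical "find the real URL" heuristic, which I made up here as naive_inner_url, would throw away exactly the Wayback prefix.

```python
from urllib.parse import urlsplit


def naive_inner_url(url):
    """Hypothetical heuristic: treat the last embedded http(s):// as the real target."""
    idx = max(url.rfind("http://"), url.rfind("https://"))
    # idx == 0 means the only scheme is the one at the start, so nothing is embedded.
    return url[idx:] if idx > 0 else url


wayback = "https://web.archive.org/web/20200101000000/http://example.com/page?x=1"

parts = urlsplit(wayback)
# The host is web.archive.org; the embedded URL is just path text to the parser,
# and the query string is attributed to the outer URL.
host, path, query = parts.netloc, parts.path, parts.query

# The naive heuristic drops the Wayback prefix and points at the dead live site.
stripped = naive_inner_url(wayback)
```

If something along these lines is happening, it would explain both the stripped one-hop URLs and the replay redirect ending up on the wrong host.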