Capture page modified by userscript

I am using the ArchiveWeb.page Chrome extension.
I want to capture webpages after they have been modified by my userscripts.
For example, this userscript uses the GitHub API to add some info to the repo page:
https://i.imgur.com/Xv6gW5d.png
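For reference, the userscript does roughly the following (the selector and the output format are simplified placeholders I'm using for illustration, not the real script):

```ts
// ==UserScript==
// @name   Repo info (illustrative sketch)
// @match  https://github.com/*/*
// @grant  none
// ==/UserScript==

// Fetch repo metadata from the GitHub API and prepend a summary line
// to the repo page. Selector and formatting are placeholders.
(async () => {
  const [, owner, repo] = location.pathname.split("/");
  if (!owner || !repo) return;

  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}`);
  if (!res.ok) return;
  const info = await res.json();

  const line = document.createElement("div");
  line.textContent =
    `/:hourglass:${info.created_at.slice(0, 10)}` +
    `/:star:${info.stargazers_count}` +
    `/:fork_and_knife:${info.forks_count}` +
    `/:file_folder:${(info.size / 1024).toFixed(2)}MB`;

  document.querySelector("main")?.prepend(line);
})();
```
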
Originally, I didn’t think this was possible to capture. I thought the extension simply records network traffic, cannot see DOM modifications made by userscripts, and therefore cannot capture the modified page.

However, surprisingly, I was able to search the archive for the text added by the userscript (/:hourglass:2020-02-10/:star:584/:fork_and_knife:41/:file_folder:44.69MB).

It seems that the extension indexed the text content after the userscript ran? However, this text is missing from the actual archived page.

How can I modify the extension to capture the page after userscripts run, so that “/:hourglass:2020-02-10/:star:584/:fork_and_knife:41/:file_folder:44.69MB” will be visible on the archived page?
Also, some userscripts add interactive elements to the page (e.g. clicking on “:fork_and_knife:41” opens the fork list). Is it possible to capture these as well?

Note: The GitHub userscript is just an example, I don’t actually archive GitHub repos like this.

As another example, I have a userscript that replaces thumbnails with high-quality original images by modifying the “src” attribute. When the browser loads the original images, the extension records the network traffic and stores them in the archive. However, the archived page still uses the thumbnail src and does not display the original images.
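
A minimal sketch of that kind of userscript, assuming an invented “/thumb/” to “/original/” URL pattern (the real rewrite rule depends on the site):

```ts
// Rewrite thumbnail URLs so the browser fetches the full-size originals.
// The /thumb/ -> /original/ pattern is made up for illustration.
for (const img of Array.from(document.images)) {
  if (img.src.includes("/thumb/")) {
    // The browser now loads the original (which the extension records),
    // but the archived HTML still carries the old thumbnail src.
    img.src = img.src.replace("/thumb/", "/original/");
  }
}
```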

I believe that there is some value in capturing these “non-original / altered” pages, because these userscripts add additional information to the page that would otherwise be lost.

I thought the extension simply records network traffic, cannot see DOM modifications made by userscripts, and therefore cannot capture the modified page.

AFAIK, you’re correct! What’s happening here is that the extension does exactly that, but it then takes a snapshot of the DOM in the browser to extract text for search. That extracted text is stored separately from the HTML in the WACZ file. When it takes the snapshot of the page, the injected assets are present, so they remain in the extracted text. A similar reason explains why your high-quality thumbnail replacements aren’t shown.
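
For context, that extracted text ends up in the pages listing inside the WACZ (pages/pages.jsonl once the .wacz is unzipped), so you can search it directly. A rough sketch, assuming Node; the path and field names (url, ts, text) follow my reading of the WACZ spec, so treat the details as an assumption:

```ts
// Search the extracted per-page text stored in an unzipped WACZ's pages.jsonl.
// Each page line should carry a "text" field with the DOM-snapshot text,
// stored separately from the archived HTML records.
import { readFileSync } from "node:fs";

const [jsonlPath, needle] = process.argv.slice(2);
if (!jsonlPath || !needle) {
  console.error("usage: search-pages <pages.jsonl> <text>");
  process.exit(1);
}

for (const line of readFileSync(jsonlPath, "utf8").split("\n")) {
  if (!line.trim()) continue;
  const record = JSON.parse(line);
  if (typeof record.text === "string" && record.text.includes(needle)) {
    console.log(record.url, record.ts);
  }
}
```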

Thanks for confirming my thoughts.

Are you aware of any tools that save pages as they are shown in the browser rather than recording network traffic, and that also allow running userscripts before saving? Extensions such as SingleFile can do this; however, the interactive elements won’t work and you can’t navigate between pages, so it’s more like a screenshot than an actual archive.

I ended up doing this: write the API response (JSON) onto the page directly, so that it gets indexed by the extension and I can retrieve the full API response from the pages.jsonl file by searching for the URL. It still won’t show on the archived page, but at least all the data is preserved.
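
A minimal sketch of what that looks like (the element and function name here are just illustrative choices, nothing required):

```ts
// Dump the raw API response into the page as text so the extension's
// text extraction indexes it along with the rest of the page.
async function embedApiResponse(apiUrl: string): Promise<void> {
  const res = await fetch(apiUrl);
  const body = await res.text();

  const dump = document.createElement("pre");
  // Prefix with the URL so the entry can be found later by searching
  // pages.jsonl for that URL.
  dump.textContent = `${apiUrl}\n${body}`;
  document.body.append(dump);
}
```

Searching the extracted text in pages.jsonl for the API URL then brings back the full response.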