How to capture content that loads after button press?

Polar · March 19, 2023, 10:14am

Hello!

I’ve been using the Docker-based Browsertrix Crawler to try to capture a fairly complicated website: ハニプレ official website. It has done an amazing job and the replay works fine, but there are 9 different videos (GIFs?) in the circle at the bottom of the screen (with a large ‘1/9’ next to it), and only the first one loads in the replay. The replay even shows exactly the same loading symbol as the website itself, which is impressive, but ideally I would like to be able to view the different videos just as on the original webpage.

To switch between the 9 videos, you press arrow buttons, and each video seems to be downloaded when you press them. I think the videos must then be cached, as loading is faster on subsequent presses.

I have tried the --waitUntil networkidle0 flag on its own, and then together with the --behaviorTimeout 0 flag. The file size changes (more explanation below), but the videos at the bottom are still not present.

(File sizes:
Test command from the GitHub README: 20,922 KB
Test command with --waitUntil networkidle0 added: 20,925 KB
Test command with --waitUntil networkidle0 and --behaviorTimeout 0 added: 28,210 KB)
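
To make the comparison between the three runs easier to read, here is a minimal Python sketch that works out how much each run grew relative to the plain test command. The sizes are the ones quoted above; the dictionary labels and the growth helper are names made up for this illustration and are not part of Browsertrix Crawler.

```python
# Archive sizes in KB, copied from the figures quoted above.
# The labels are shorthand for this sketch, not crawler terminology.
sizes_kb = {
    "test command": 20_922,
    "+ --waitUntil networkidle0": 20_925,
    "+ --waitUntil networkidle0 --behaviorTimeout 0": 28_210,
}

baseline = sizes_kb["test command"]


def growth(size_kb: int, base_kb: int) -> tuple[int, float]:
    """Return (extra KB, percent increase) of size_kb relative to base_kb."""
    extra = size_kb - base_kb
    return extra, 100 * extra / base_kb


for label, size in sizes_kb.items():
    extra, pct = growth(size, baseline)
    print(f"{label}: {size} KB ({extra:+d} KB, {pct:+.1f}%)")
```

Adding --waitUntil networkidle0 alone changes the size by only 3 KB, while the run that also uses --behaviorTimeout 0 is 7,288 KB (about 35%) larger. These numbers cannot show what the extra data contains, only that something more was captured.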

The URL does not change (no hashes are added) when you switch between the videos at the bottom. As a backup option, I would be happy to have 9 versions of the website, each with a different video playing, but I cannot figure out how to do that. I am a beginner, so I would be happy to hear any suggestions.

Thank you for your help with this! If there’s any more detail needed, please let me know.

Hank · April 12, 2023, 6:14am

It’s possible to do this with Browsertrix Crawler. However, this site doesn’t seem very large, so IMO it wouldn’t really be worth automating. A better tool to check out would be the ArchiveWeb.page Chrome extension, which lets you generate a web archive of the page as you browse it. It is a manual tool, but it can capture things that more automated tools like Browsertrix might miss because they cannot interact with the page. Improving this is on our eventual roadmap!

1. Install the extension from the link above.
2. Navigate to https://honeyworks-game.com/
3. Once the extension is installed, you should see its icon (a small blue square with a white cloud and an arrow) in the top right corner of your Chrome browser.
4. Click on the ArchiveWeb.page icon in your browser to open its panel. In the panel, click the “Start” button to begin archiving the current website. The extension will start capturing the content, including text, images, and other resources loaded by the page, like the videos you mention.
5. Click through all 9 videos, ideally letting each one play all the way through so that its resources are fully downloaded.