Web Scraper Regression Testing

Context:
I have web scrapers built with Puppeteer whose flows include the following (a simplified sketch follows the list):

  1. Logging in
  2. Clicking buttons
  3. Querying a search engine within the site
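
For reference, a simplified sketch of one such flow (the URLs, selectors, and credentials below are placeholders, not the real site):

```js
// One static test flow: login -> click buttons -> query the site's search.
// All URLs, selectors, and credentials are placeholders.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // 1. Login with a fixed set of credentials
  await page.goto('https://example.com/login', { waitUntil: 'networkidle0' });
  await page.type('#username', 'test-user');
  await page.type('#password', 'test-pass');
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
    page.click('#login-button'),
  ]);

  // 2. Click buttons in a specific order
  await page.click('#open-dashboard');

  // 3. Query the site's internal search engine
  await page.type('#search-box', 'specific query string');
  await page.keyboard.press('Enter');
  await page.waitForSelector('.search-results');

  await browser.close();
})();
```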

I’m trying to use an offline archived version of the websites for regression testing.

  1. I only need to test static flows (e.g. log in with a specific set of credentials, click buttons in a specific order, query specific strings in the search engine, etc.)
  2. Ideally, I want to be able to record a WACZ file for a running scraper (one possible setup is sketched after this list)
  3. The idea is to have separate WACZ files for tests 1, 2, 3 …
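
One setup I imagine for recording alongside the scraper (rather than browsing manually) is to route the scraper's traffic through a local capture proxy and package the proxy's output into a WACZ afterwards. Below is a minimal sketch of the Puppeteer side only; it assumes some recording proxy (for example pywb in record mode) is already listening on localhost:8080, which I have not verified against these sites:

```js
// Sketch: run the existing scraper through a local capture proxy so the
// proxy, not the browser, records all traffic, including the login requests.
// Assumes a recording proxy is already listening on localhost:8080 (not shown).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--proxy-server=localhost:8080',
      // The proxy re-signs HTTPS traffic, so for a test run either trust its
      // certificate or ignore certificate errors.
      '--ignore-certificate-errors',
    ],
  });

  const page = await browser.newPage();
  // ...same login / click / search flow as the sketch above...
  await page.goto('https://example.com/login', { waitUntil: 'networkidle0' });

  await browser.close();
})();
```

Each test flow would record into its own collection, so the resulting archives stay separate per test (point 3 above).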

Problem: In my experience so far, ArchiveWeb.page (both the desktop and extension versions) fails to record the login authentication. The authentication is the most important part: the post-login pages check whether the user is authenticated both client-side and server-side, so unless the login is recorded, nothing after it can be recorded either.

How I’ve been using ArchiveWeb.page: I use the desktop app on Windows 11, and after each new page loads during archiving, I wait for the status “Idle, Continue Browsing.”

Questions

  1. Is it possible to record alongside my own running web scraper (not Browsertrix Crawler)?
  2. Am I using the tool incorrectly?
  3. Is archiving the wrong approach? Any suggestions?

Reply:

Do you have the “Archive Cookies” and “Archive local storage” options checked in ArchiveWeb.page? They are generally needed to archive logins.

In general, we aim to archive a site after login rather than the actual login process, since we want to avoid archiving credentials, but it is possible, depending on the site. It’s very hard to say what’s going wrong without looking at the exact example.