Problem archiving a media-rich website using the ArchiveWeb.page Chrome plugin

I’m trying to help archive Ukrainian-based websites, and started using the ArchiveWeb.page Chrome browser plugin to do so. I’ve run into an issue where I see the message “‘Webrecorder ArchiveWeb.page’ started debugging this browser” while the website is trying to load, and at the same time the recording status shows “Idle, Continue Browsing”. The first time I started recording, the status updated with the number of URLs, but now there’s nothing.
In short, I figure I’m doing something wrong, but I can’t find any documentation to help solve my issue. How can I capture a whole website using the ArchiveWeb.page plugin?

Screenshot: Cosmonaut Museum website issue

ArchiveWeb.page is designed to capture web pages as you browse to them. The number of URLs you noticed is the number of URLs referenced in the HTML of the page you are looking at. This includes the URLs for embedded images, video, JavaScript, and CSS that are needed to render the page. As they are collected, the number goes down. If you want to collect another page, you click a link and ArchiveWeb.page will collect that page (and all the other URLs needed to render it). You keep collecting pages like that and stop when you have collected what you wanted to archive.

ArchiveWeb.page doesn’t collect a “whole website”; it is meant for curators to collect very specific regions of the web. For automatically crawling a lot of pages, you will probably want to look at browsertrix-crawler or the hosted browsertrix-cloud service instead; there’s a rough sketch of a crawl below. Does this help at all?
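
If your group does decide to try browsertrix-crawler, a whole-domain crawl can be driven with a few lines of Python. This is a minimal sketch, assuming Docker is installed; the seed URL and the collection name are placeholders, and the crawl flags (`--url`, `--scopeType`, `--generateWACZ`, `--collection`) come from the browsertrix-crawler documentation:

```python
# Minimal sketch: run a browsertrix-crawler whole-domain crawl via Docker.
# Assumes Docker is installed; the URL and collection name are placeholders.
import subprocess
from pathlib import Path

crawl_dir = Path.cwd() / "crawls"   # crawl output (WACZ file) lands here
crawl_dir.mkdir(exist_ok=True)

subprocess.run(
    [
        "docker", "run",
        "-v", f"{crawl_dir}:/crawls/",       # mount the output directory
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", "https://example.com/",     # placeholder seed URL
        "--scopeType", "domain",             # follow links across the whole domain
        "--generateWACZ",                    # package the crawl as a WACZ archive
        "--collection", "my-site",           # placeholder collection name
    ],
    check=True,
)
```

The resulting WACZ file can then be opened in ReplayWeb.page, the same way archives made with the ArchiveWeb.page extension can.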


Thank you, that is helpful! As I understand it, Browsertrix requires more technical knowledge than I currently have, so I’ll let the group I’m working with know that.
