Follow-up crawl for pages that have not yet been captured?

Hello everyone.

I have a large PHP-based forum here (approx. 50,000 pages) that I would like to crawl with Browsertrix Cloud.

Max Pages cannot be increased beyond 20,000, and in any case I would prefer to start with smaller, manageable crawls.

How can I crawl, in a follow-up run, the pages that have not yet been captured?

TIA
Heinz

Hi Heinz

Currently this isn’t really possible, but I’m happy you highlighted this use case. We have two open issues that you may be interested in following.

  1. [Feature]: Resume Stopped Crawl · Issue #1753 · webrecorder/browsertrix · GitHub
    This one would allow you to do exactly what you are asking for here.

  2. [Feature]: Only Archive New URLs · Issue #1372 · webrecorder/browsertrix · GitHub
    A similar request, relevant when re-crawling the same site frequently and wanting to capture only new content.

Both of these are high-priority crawling features for us to add, though as usual I can’t offer timelines or specifics as to when they’ll be addressed. Also, as we haven’t started development on either yet, things are subject to change!
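In the meantime, one manual workaround (a sketch, not an official Browsertrix feature) is to diff what you have already captured against your full list of target URLs, then feed the remainder into the next crawl as a seed list. Browsertrix crawls can be downloaded as WACZ files, which include a `pages/pages.jsonl` listing of captured pages. Assuming you can produce a text file of all the forum URLs you want (e.g. from the site's sitemap), something like this would give you the leftover seeds:

```python
import json
import zipfile

def uncaptured_urls(wacz_path, target_urls):
    """Return the target URLs not listed in the WACZ's page index."""
    captured = set()
    with zipfile.ZipFile(wacz_path) as z:
        with z.open("pages/pages.jsonl") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue
                # The first line of pages.jsonl is a format header
                # without a "url" key, so this skips it naturally.
                url = rec.get("url")
                if url:
                    captured.add(url)
    # Preserve the original ordering of the target list.
    return [u for u in target_urls if u not in captured]

# Example: write the remaining seeds out for the follow-up crawl.
# targets = open("all_forum_urls.txt").read().splitlines()
# remaining = uncaptured_urls("crawl.wacz", targets)
# open("next_seeds.txt", "w").write("\n".join(remaining))
```

Note this only dedupes by exact URL, so normalize things like trailing slashes or session parameters in your target list first — PHP forums in particular often expose the same page under several URLs.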

Sorry I don’t have a better answer for you at this time. Hopefully we can have this solved for you in the future :slight_smile: