Follow-up crawl for pages that have not yet been captured?

Hello everyone.

I have a large PHP-based forum here (approx. 50,000 pages) that I would like to crawl with Browsertrix Cloud.

Max Pages cannot be increased beyond 20,000, and in any case I would prefer to start with smaller, manageable crawls.

How can I crawl, in a follow-up run, the pages that have not yet been captured?

TIA
Heinz

Hi Heinz

Currently this isn’t really possible, but I’m happy you highlighted this use case. We have two open issues that you may be interested in following.

  1. [Feature]: Resume Stopped Crawl · Issue #1753 · webrecorder/browsertrix · GitHub
    This one would allow you to do exactly what you are asking for here.

  2. [Feature]: Only Archive New URLs · Issue #1372 · webrecorder/browsertrix · GitHub
    A similar request, relevant when re-crawling the same site frequently and wanting to capture only new content.

Both of these are high-priority crawling features for us to add, though as usual I can’t offer timelines or specifics as to when they’ll be addressed. Also, as we haven’t started development on either yet, things are subject to change!
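In the meantime, one manual workaround (a sketch, not an official Browsertrix feature) is to diff what you have already captured against your full list of target URLs, then feed the remainder into the next crawl as a seed list. Browsertrix crawls can be downloaded as WACZ files, which include a `pages/pages.jsonl` listing of captured pages. Assuming you can produce a text file of all the forum URLs you want (e.g. from the site's sitemap), something like this would give you the leftover seeds:

```python
import json
import zipfile

def uncaptured_urls(wacz_path, target_urls):
    """Return the target URLs not listed in the WACZ's page index."""
    captured = set()
    with zipfile.ZipFile(wacz_path) as z:
        with z.open("pages/pages.jsonl") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue
                # The first line of pages.jsonl is a format header
                # without a "url" key, so this skips it naturally.
                url = rec.get("url")
                if url:
                    captured.add(url)
    # Preserve the original ordering of the target list.
    return [u for u in target_urls if u not in captured]

# Example: write the remaining seeds out for the follow-up crawl.
# targets = open("all_forum_urls.txt").read().splitlines()
# remaining = uncaptured_urls("crawl.wacz", targets)
# open("next_seeds.txt", "w").write("\n".join(remaining))
```

Note this only dedupes by exact URL, so normalize things like trailing slashes or session parameters in your target list first — PHP forums in particular often expose the same page under several URLs.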

Sorry I don’t have a better answer for you at this time. Hopefully we can have this solved for you in the future :slight_smile: