Ability to Retry Errors?

Hi all,

I’m a new user of the Browsertrix Crawler project - and it is incredibly impressive!

I’m wondering if there is a way to retry errors that occur during the crawl? For example, during a recent crawl of a website with 815 pages, I got one error:

Load timeout for https://www.examplewebsite.com/fatigued-driving.html TimeoutError: Navigation timeout of 90000 ms exceeded
    at /app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
URL Load Failed: https://www.examplewebsite.com/fatigued-driving.html, Reason: Error: Timeout hit: 180000

There’s no issue with the page itself: I’m able to access it, and it loads in a reasonable amount of time. For whatever reason, though, the crawler’s request timed out. So that actually leads me to two questions:

  1. Is there a mechanism for retrying pages?
  2. If not, is there a way to update an archive? That is, to recrawl failed pages separately and then add them to the archive?

Any guidance would be appreciated, and thanks for the amazing project!


I second this!

I have been interested in patching some of my Browsertrix Crawler crawls too. One idea I’ve had so far is to record the URLs I want to redo with ArchiveWeb.page, import the original Browsertrix WACZ into ArchiveWeb.page, and then merge those later recordings back into the original crawl.

However, I don’t think this will update the CDXJ and pages.json files, which is a bit of a bummer.

So what I’ve settled on doing, at least when it’s reasonable to do manually, is creating a whole new, separate patch crawl containing only the URLs Browsertrix missed, and keeping that together with the original crawl files.
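
For reference, here’s roughly how I assemble the seed list for that patch crawl. This is just a sketch of my own approach, not anything built into Browsertrix Crawler: it assumes I’ve saved the crawler’s console output to a file and pulls the URLs out of “URL Load Failed: <url>, Reason: …” lines like the one quoted above. The filenames are just examples.

import re
import sys

# Rough sketch: collect failed URLs from a saved copy of the crawler's
# console output (lines like "URL Load Failed: <url>, Reason: ...")
# and write them out as a seed list for a separate patch crawl.
# "crawl.log" and "patch-seeds.txt" are example filenames, not files
# Browsertrix Crawler produces or expects by default.

FAILED_RE = re.compile(r"URL Load Failed: (\S+?),")

def extract_failed_urls(log_path):
    urls = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            match = FAILED_RE.search(line)
            if match:
                urls.append(match.group(1))
    # De-duplicate while keeping the original order.
    return list(dict.fromkeys(urls))

if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else "crawl.log"
    failed = extract_failed_urls(log_file)
    with open("patch-seeds.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(failed) + "\n")
    print(f"Wrote {len(failed)} failed URL(s) to patch-seeds.txt")

I then feed that list to a new, page-scoped crawl (if I remember the flags right, --urlFile, --scopeType page and --generateWACZ do the job) and keep the resulting WACZ next to the original one.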

Would love to hear others’ experiences or ideas on this!