Ability to Retry Errors?

Hi all,

I’m a new user of the Browsertrix Crawler project - and it is incredibly impressive!

I’m wondering if there is a way to retry errors that occur during the crawl? For example, during a recent crawl of a website with 815 pages, I got one error:

Load timeout for https://www.examplewebsite.com/fatigued-driving.html TimeoutError: Navigation timeout of 90000 ms exceeded
    at /app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:106:111
URL Load Failed: https://www.examplewebsite.com/fatigued-driving.html, Reason: Error: Timeout hit: 180000

There’s no issue with the page itself: I’m able to access it, and it loads in a reasonable amount of time. For whatever reason, though, the crawler’s request timed out. So that actually leads me to two questions:

  1. Is there a mechanism for retrying pages?
  2. If not, is there a way to update an archive? That is, to recrawl failed pages separately and then add them to the archive?

Any guidance would be appreciated, and thanks for the amazing project!


I second this!

I have been interested in patching some of my Browsertrix Crawler crawls too. One idea I’ve had so far is to record the URLs I want to redo with ArchiveWeb.page, import the original Browsertrix WACZ into ArchiveWeb.page, and then merge those later recordings back into the original crawl.

However, I don’t think this will update the CDXJ and pages.json files, which is a bit of a bummer.

So what I’ve settled on doing, at least when it’s reasonable to do manually, is creating a whole new, separate patch crawl containing only the URLs Browsertrix missed, and keeping that together with the original crawl files.
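
For reference, here’s roughly how I assemble the seed list for that patch crawl. This is just a sketch of my own approach, not anything built into Browsertrix Crawler: it assumes I’ve saved the crawler’s console output to a file and pulls the URLs out of “URL Load Failed: <url>, Reason: …” lines like the one quoted above. The filenames are just examples.

import re
import sys

# Rough sketch: collect failed URLs from a saved copy of the crawler's
# console output (lines like "URL Load Failed: <url>, Reason: ...")
# and write them out as a seed list for a separate patch crawl.
# "crawl.log" and "patch-seeds.txt" are example filenames, not files
# Browsertrix Crawler produces or expects by default.

FAILED_RE = re.compile(r"URL Load Failed: (\S+?),")

def extract_failed_urls(log_path):
    urls = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            match = FAILED_RE.search(line)
            if match:
                urls.append(match.group(1))
    # De-duplicate while keeping the original order.
    return list(dict.fromkeys(urls))

if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else "crawl.log"
    failed = extract_failed_urls(log_file)
    with open("patch-seeds.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(failed) + "\n")
    print(f"Wrote {len(failed)} failed URL(s) to patch-seeds.txt")

I then feed that list to a new, page-scoped crawl (if I remember the flags right, --urlFile, --scopeType page and --generateWACZ do the job) and keep the resulting WACZ next to the original one.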

Would love to hear others’ experiences or ideas on this!