Restarting the browser or clearing cookies every N page loads or after N timeouts

I’m currently crawling a large site (https://www.wilsoncenter.org, ~70k pages found so far when I’m ~17k pages in) and have noticed that it starts timing out after ~12k page loads or so. At first I suspected my IP was getting blocked, but subsequent cURL requests and restarts of the crawl all work fine, so I’m guessing there is a cookie or something similar being used to block a client after a certain number of requests per day, or something along those lines.

In trying to think of ways to circumvent this (other than adding a large pageExtraDelay value), I’m wondering if there are ways to:

  • Add some delay every N pages or N minutes instead of after every page load. (It’s possible taking a break for a bit might reset things? Not sure since I am not 100% clear on the blocking mechanism.)
  • Delete cookies/storage/etc. every N pages
  • Relaunch the browser every N pages (kind of like there’s a hardcoded max # of reuses of a browser page/tab)

I suspect the third isn’t something there’s currently a way to do, but I’m wondering if there’s a straightforward way to do 1 or 2 using a behavior (I haven’t really dug into custom behaviors yet, but it seems like they’re designed to run in the browser, which might make some of these ideas more challenging; see the rough sketch of idea 2 below). Would love any pointers here if others have found solutions for similar problems.
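
For concreteness, something like the following page-side JavaScript is roughly what I have in mind for idea 2. This is only a sketch: I haven’t confirmed how (or whether) it could be wired into a custom behavior, and it can only touch what page JS can see, so HttpOnly cookies wouldn’t be cleared this way.

```ts
// Sketch only: best-effort client-side state clearing, to be run every N pages.
// Assumes this can be injected into the page context (e.g. from a custom behavior);
// the exact wiring into browsertrix-crawler is not shown here.
function clearClientState(): void {
  // Expire every cookie visible to page JS (HttpOnly cookies are not reachable here;
  // cookies scoped to other paths/domains may also survive this).
  for (const cookie of document.cookie.split(";")) {
    const name = cookie.split("=")[0].trim();
    if (name) {
      document.cookie = `${name}=; expires=Thu, 01 Jan 1970 00:00:00 GMT; path=/`;
    }
  }
  // Clear Web Storage for the current origin
  localStorage.clear();
  sessionStorage.clear();
}

clearClientState();
```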

This is actually almost supported in the full Browsertrix app that runs in k8s. There, you can configure the crawler to restart after N minutes or X bytes; I suppose we could add something for pages, but it would be good to figure out what the issue is exactly. The restart also clears the cookies.

More generally, I’ve been trying to consolidate rate-limit detection / retry mitigation in this issue: Retry Improvements + Rate Limit Support · Issue #758 · webrecorder/browsertrix-crawler · GitHub

Perhaps the crawler could be paused and restarted when a rate limit / some error condition is reached. However, we’ll probably implement this with k8s support, as we have a system for managing crawler container lifecycle there already.

Have you tried running the full Browsertrix app in a k8s environment, perhaps with microk8s or k3d?

Ah, is this basically deleting the pod after a certain amount of time? (via a CronJob or sidecar, I guess?)

I am running through Docker and have wrapper tooling that restarts on failure/interrupt, but I’m currently interrupting by hand, checking the job logs every few hours. I suppose if the idea is just “send a SIGINT to the container every N minutes or after I see certain logs repeated N times,” that’s pretty doable (rough sketch below). I think it would be nice if there were something more integrated, but I’ll head over to that GH issue for further discussion on that. :slight_smile:
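
To make that concrete, this is the kind of thing I’d probably do on my end: a small Node script that delivers SIGINT to the crawler container on a timer and relies on my existing restart-on-interrupt tooling to bring it back up. The container name, the interval, and the idea of also polling `docker logs` for repeated timeouts are all placeholders/assumptions.

```ts
// Sketch only: periodically interrupt the crawler container so it exits gracefully
// and my existing wrapper restarts it. Container name and interval are placeholders.
import { execSync } from "node:child_process";

const CONTAINER = "btrix-crawl";   // placeholder container name
const INTERVAL_MIN = 120;          // e.g. interrupt every 2 hours

function interruptCrawler(): void {
  try {
    // Deliver SIGINT so the crawler can shut down cleanly rather than being
    // killed outright; `docker kill --signal` is the standard way to do this.
    execSync(`docker kill --signal=SIGINT ${CONTAINER}`, { stdio: "inherit" });
    console.log(`[${new Date().toISOString()}] sent SIGINT to ${CONTAINER}`);
  } catch (err) {
    console.error("kill failed (container not running?):", err);
  }
}

// Could instead (or additionally) poll `docker logs --since ...` here and trigger
// interruptCrawler() once N timeout lines are seen, rather than on a fixed timer.
setInterval(interruptCrawler, INTERVAL_MIN * 60 * 1000);
```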

TBH I really want to avoid that. My experience over several years with k8s is that it can work fairly well for a team that can afford a dedicated devops/SRE role (or at least someone who spends a significant amount of their time focused on that), but it’s pretty much pure overhead for smaller teams. I’m actively trying to get other pieces of this project’s infra off k8s.

At any rate, I would love any pointers on whether this is what you mean for k8s or if you’re doing something else. It doesn’t seem like there are really any docs about Kubernetes usage at all. I’m also curious, if this just causes the job to be restarted / a new pod to be created, how it picks up the latest statefile to use as the crawl config (does this require a separate, persistent Redis instance that it winds up resuming from instead?).