So, I’m trying to support very long exclude lists in Browsertrix. Right now, if your exclusion list is too long, the resulting config file is too big for k8s (annotations are capped at 256KiB) and the workflow fails to start.
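For a sense of scale, here’s a quick back-of-the-envelope check (made-up URLs, not actual Browsertrix code) showing how fast a literal exclude list eats that budget:

```python
import json

# Annotations are capped at 256KiB; ConfigMap data at 1MiB total.
ANNOTATION_LIMIT = 256 * 1024

# Hypothetical exclude list: 20k literal URLs.
excludes = [f"https://example.com/page/{i}" for i in range(20_000)]
payload = json.dumps({"exclude": excludes})

print(f"serialized excludes: {len(payload)} bytes "
      f"({len(payload) / ANNOTATION_LIMIT:.1f}x the 256KiB annotation limit)")
```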
My first instinct was to skip inserting excludes at creation time and instead push them into Redis. But the Redis container doesn’t start fast enough for my initial approach of modifying _load_crawl_configmap.
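The Redis route still seems plausible if the push happens after the pod is up rather than during configmap creation. A rough sketch of what I had in mind (the key name is made up - whatever key the crawler actually reads would go there):

```python
import time

import redis

def push_excludes(redis_url: str, crawl_id: str, excludes: list[str],
                  timeout: float = 60.0) -> None:
    """Wait for the Redis container to accept connections, then push excludes.

    Sketch only, not Browsertrix code; "crawl:{id}:excludes" is a
    hypothetical key layout.
    """
    r = redis.Redis.from_url(redis_url)
    deadline = time.monotonic() + timeout
    while True:
        try:
            r.ping()  # succeeds once the container is actually up
            break
        except redis.ConnectionError:
            if time.monotonic() > deadline:
                raise
            time.sleep(1)
    if excludes:
        r.rpush(f"crawl:{crawl_id}:excludes", *excludes)
```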
Can I get an explanation of how the sync_crawls function works? Is there a simple design doc, or just a writeup of what it does, somewhere? And perhaps some direction on how to support large config files - might it make more sense to pass the file through directly with a k8s hostPath mount?
We are considering supporting URL seed lists that would just be files loaded from the S3 bucket, instead of a configmap. I suppose it would be possible to do that for exclusion lists as well, but we haven’t run into this issue before…
Can you say more about your use case? I’m curious why you have such a large exclude list - is it something that can’t be handled with regexes that match multiple URLs? I believe the configmap size limit is actually 1MiB, which seems fairly large for exclusions. Or, once we support seed lists of unlimited size, perhaps that could be an alternative?
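To illustrate the regex idea (the URLs here are made up), a single pattern can often stand in for thousands of literal entries:

```python
import re

# One pattern excludes every comments/share/print page across all comic IDs.
pattern = re.compile(r"https://example\.com/comic/\d+/(comments|share|print)")

assert pattern.match("https://example.com/comic/1042/comments")
assert not pattern.match("https://example.com/comic/1042")
```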
So I wanted to back up existing webcomics that are being updated. I wrote a script that inspects all visited URLs in a WACZ file and inserts them into the URL blacklist, but I’m coming to the conclusion that it makes more sense to use btrix just for the actual backing up and not for any sort of site analysis. I’ll probably do what you suggested: write a script that does a very simple/lazy crawl with just bs4/requests, send the URL list to btrix, and maintain the list of visited URLs separately.
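For reference, the extraction side of that script is roughly this (simplified sketch; a WACZ is a ZIP archive, and pages/pages.jsonl is the page index inside it):

```python
import json
import zipfile

def visited_urls(wacz_path: str) -> set[str]:
    """Collect the URL of every captured page recorded in a WACZ file."""
    urls = set()
    with zipfile.ZipFile(wacz_path) as wacz:
        with wacz.open("pages/pages.jsonl") as pages:
            for line in pages:
                if not line.strip():
                    continue
                record = json.loads(line)
                if "url" in record:  # the first line is a header record
                    urls.add(record["url"])
    return urls
```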