Using Browsertrix Crawler with docker-compose

Hi,

I have recently been testing Browsertrix for some quite tailored crawling jobs. In this case, I am producing data for a corpus of text from different health websites, and the ease of configuring these jobs in yaml with designated scopes has been compelling.

Some of the jobs are taking a long time to complete, and I want to try running them “in parallel” with --workers 3. I have a powerful computer with a CPU that should be able to handle it, but I have a very naive question: Do I run it as a docker-compose or a docker command? Will the rest of the command be the same as if I were using the “docker run” command?

When running in Docker (one container only), this is the command I use:

docker run -v $PWD/crawl-config-jobName.yaml:/app/crawl-config-jobName.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config-jobName.yaml --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.49 Safari/537.36\r\n botname/0.1(+url)"
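
For context, the config files follow the crawler’s yaml format, roughly like this (the URLs and scopes here are just placeholders, not my actual sites):

# crawl-config-jobName.yaml (placeholder values)
seeds:
  - url: https://example-health-site.org/articles/
    scopeType: prefix   # only crawl pages under /articles/
  - url: https://another-health-site.org/
    scopeType: host     # stay within this host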

I solved this once I understood that I could simply run several workers in one container. --workers 2 reduced crawling time by almost 50% and ran very well on my desktop computer with 64 GB of RAM, which was also running other applications and processing.
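
Concretely, the only change was appending the workers flag to the crawl command above (user-agent flag omitted here for brevity):

docker run -v $PWD/crawl-config-jobName.yaml:/app/crawl-config-jobName.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config-jobName.yaml --workers 2

I believe the same setting can also go into the yaml config as workers: 2 instead of passing the flag.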


Yes, I believe you can just pass the same command-line options to either; the README provides examples of using Docker Compose.
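
As a minimal sketch (the service name and volume paths here are illustrative, see the repo’s docker-compose.yml for the real file), the equivalent setup would be roughly:

services:
  crawler:
    image: webrecorder/browsertrix-crawler
    volumes:
      # mount the config and output directory, same as with docker run
      - ./crawl-config-jobName.yaml:/app/crawl-config-jobName.yaml
      - ./crawls:/crawls

and then the crawl command, including --workers, is passed at run time just as before:

docker-compose run crawler crawl --config /app/crawl-config-jobName.yaml --workers 3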