I have recently been testing Browsertrix for some fairly tailored crawling jobs. In this case, I am producing data for a corpus of text from different health websites, and the ease of configuring these jobs in YAML with designated scopes has been compelling.
Some of the jobs are taking a long time to complete, and I want to try running them “in parallel” with
--workers 3. I have a powerful computer with a CPU that should be able to handle it, but I have a very naive question: do I run this as a
docker command? Will the rest of the command be the same as when using plain “docker run”?
When running in Docker (a single container), this is the command I currently use:
docker run -v $PWD/crawl-config-jobName.yaml:/app/crawl-config-jobName.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config-jobName.yaml --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.49 Safari/537.36\r\n botname/0.1(+url)"
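If appending the flag is all it takes, I assume the parallel version would look like this (same mounts, image, and config as above, just with `--workers 3` added; line continuations added only for readability):

```shell
# Same single-container invocation as before, with --workers 3 appended.
# Assumption: --workers controls the number of parallel browser workers
# inside one container, so no extra containers are needed.
docker run \
  -v $PWD/crawl-config-jobName.yaml:/app/crawl-config-jobName.yaml \
  -v $PWD/crawls:/crawls/ \
  webrecorder/browsertrix-crawler crawl \
  --config /app/crawl-config-jobName.yaml \
  --workers 3 \
  --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.49 Safari/537.36\r\n botname/0.1(+url)"
```

Is that the right way to do it, or does running multiple workers require a different setup (e.g. docker-compose or multiple containers)?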