I’m trying to archive relatively large sites (approx. 30k pages). When I run the crawl, it usually gets through around 7,000 pages and then suddenly marks all the remaining URLs as failed without even trying to load them: the verbose output scrolls by much faster and the screencast output freezes. For context, I’m running the crawl on a 13th-gen i7 with 32 GB of DDR5 RAM.
In terms of storage, I only have around 200 GB for the crawl files, but when the crawl begins to fail, the hard disk still has 60 GB available, with the vhdx taking up around 110 GB. I’ve exhausted most of the usual options, and I know the crawl can run with less free storage than that. So is there some sort of cache that slowly eats up my memory and stops the crawl, or is there a system-wide or WSL limit on vhdx size? I’ve looked in the log file, but no clear error message shows up before the issue starts. If I don’t SIGINT it, the crawl does complete, but with statistics like this: (crawled: 7498, pending: 0, failed: 21674, total: 30546).
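Since nothing obvious stands out when reading the log by eye, here is a minimal sketch of how it could be scanned for the first error-level entry and a per-level count. It assumes the log is written as one JSON object per line with a `logLevel` and a `timestamp` field; those field names, the level names `error` and `fatal`, and the function name `summarize_log` are assumptions for illustration and may need adjusting to the actual log format.

```python
import json
from collections import Counter


def summarize_log(lines, level_key="logLevel", error_levels=("error", "fatal")):
    """Count log entries per level and return the first error-level entry.

    `level_key` and `error_levels` are assumed names, not confirmed ones.
    """
    counts = Counter()
    first_error = None
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        level = entry.get(level_key, "unknown")
        counts[level] += 1
        if first_error is None and level in error_levels:
            first_error = entry
    return counts, first_error


# Hypothetical sample lines, only to show the shape of the result.
sample = [
    '{"logLevel": "info", "timestamp": "t1", "message": "page loaded"}',
    "not a json line",
    '{"logLevel": "error", "timestamp": "t2", "message": "page failed"}',
    '{"logLevel": "error", "timestamp": "t3", "message": "page failed"}',
]
counts, first_error = summarize_log(sample)
print(dict(counts))
print(first_error)
```

Passing an open file handle for the crawl log instead of `sample` would give the timestamp of the first error, which can then be compared against memory and disk usage at that moment.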
If needed, here is the docker run command:
docker run -p 9037:9037 -v D:/archives:/crawls --tmpfs /home.browser/.config -it webrecorder/browsertrix-crawler crawl --url https://gouletpens.com/ --generateWACZ --collection goulet --workers 6 --screencastPort 9037 --saveState --rm --behaviors autoplay,autofetch,siteSpecific --timeout 900 --waitUntil networkidle0
TIA!