As a crawler operator I want to be respectful of rate limits and of the operators of the websites I'm crawling. I recently came across a limiting approach that's new to me: a maximum number of pages per day. What is the best way to achieve this?
Would setting the page delay to ~150 s achieve it? That works out to 86,400 s / 150 s = 576 requests, keeping the crawl just under 600 pages per rolling 24-hour period.
I think you've come up with a good workaround! It will make the crawl take a long time, but perhaps that isn't an issue for you since you run your own infrastructure?
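For anyone else wanting to try this, the arithmetic behind the fixed-delay workaround can be sketched as below. This is only an illustration, not any crawler's real API; `delay_for_daily_budget` and `crawl` are hypothetical names, and `fetch` stands in for whatever fetch routine your crawler uses.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def delay_for_daily_budget(pages_per_day: int) -> float:
    """Per-page delay (seconds) that spreads a daily page budget
    evenly across a 24-hour period."""
    return SECONDS_PER_DAY / pages_per_day

# A fixed 150 s delay stays safely under 600 pages per day:
# 86400 / 150 = 576 pages in any rolling 24-hour window.
pages_at_150s = SECONDS_PER_DAY // 150  # 576

def crawl(urls, pages_per_day=600, fetch=print):
    """Fetch each URL with a fixed sleep between requests so the
    crawl never exceeds the daily budget (hypothetical sketch)."""
    delay = delay_for_daily_budget(pages_per_day)
    for url in urls:
        fetch(url)
        time.sleep(delay)
```

Because the delay is fixed rather than tracked against a rolling window, the rate is conservative: the crawler can never burst above the budget, at the cost of never using idle capacity either.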