Request for help to overcome blockage by Cloudflare when crawling site

Hi,

I set up a crawl workflow for a site using the Default Crawler Channel and was blocked by Cloudflare when I ran the crawl. I would appreciate any advice on how to get past this Cloudflare block. Thank you.

My crawl settings are as follows:

Crawl Scope : Single Page
Include Any Linked Page : No
Max Pages : 2000 pages
Crawl Time Limit : Unlimited (default)
Crawl Size Limit : Unlimited (default)
Page Load Timeout : 2 minutes (default)
Delay After Page Load : 1 minute
Behaviour Timeout : 5 minutes (default)
Auto-Scroll Behavior : Enabled (default)
Delay Before Next Page : 1 minute
Browser Windows: 4
Crawler Channel : Default (docker.io/webrecorder/browsertrix-crawler:1.5.1)
Block Ads by Domain : Yes
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36 orgname.browsertrix
Language : English
Crawl Schedule Type : No Schedule

Cloudflare can be fickle, and fully getting around it is often not currently possible. Whether or not there is a captcha, you may want to set up a browser profile and use it to visit the site, so that you can accept any verification and prove your humanity before crawling.

Was this on our instance or do you self host? Can you share the site you are trying to crawl?
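
If it helps to confirm that the block is coming from Cloudflare rather than from the site itself, here is a minimal sketch of a check you could run from your own machine. It is not part of Browsertrix; the function names `looks_like_cloudflare_block` and `check` are made up for illustration, and the rule it applies (a `cf-mitigated: challenge` header, or a `server: cloudflare` header together with a 403, 429 or 503 status) is only a heuristic.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def looks_like_cloudflare_block(status, headers):
    """Heuristic (assumption): True if a response looks like a Cloudflare challenge or block."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    # Cloudflare marks challenge pages with "cf-mitigated: challenge".
    if h.get("cf-mitigated") == "challenge":
        return True
    # Otherwise, treat an error status served by Cloudflare as a likely block.
    served_by_cloudflare = "cloudflare" in h.get("server", "")
    return served_by_cloudflare and status in (403, 429, 503)


def check(url, user_agent):
    """Fetch the URL once and return (status, looks_blocked)."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=30) as resp:
            status, headers = resp.status, dict(resp.headers.items())
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the status and headers are still available.
        status, headers = err.code, dict(err.headers.items())
    return status, looks_like_cloudflare_block(status, headers)
```

You would call `check(url, user_agent)` with the page you are crawling and the user agent from your workflow. Keep in mind that a plain HTTP request is not a browser, so a clean result here does not guarantee the crawler will get through, and a blocked result does not say why Cloudflare made that decision.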

Hi

I do not self-host; the crawl workflow was deployed on your instance. The site that I was trying to crawl is https://www.sgcarmart.com/directory/index.php. When I first deployed the crawl, I did not have a browser profile set up, and was hit with the Cloudflare block. Subsequently, I set up a browser profile and even logged in with my personal Google account, but that did not resolve the issue. I am also unable to visit the sgcarmart site in the Brave browser itself.

Looking forward to further advice, if any.

Thank you.

I don’t think there’s much we can do at this time if Cloudflare has blocked our crawler. Proxies can help with this (currently in beta on higher-tier plans), but eventually the proxies will get blocked as well. Sorry we don’t have a better answer for you.

Okay, anyway, thank you for trying to help.