403 Forbidden on web-accessible content

Hello!
I'm a new Browsertrix user, though I've read through the documentation pretty thoroughly, and I really appreciate the work the team has accomplished over the years.

I’m attempting to archive a Washington state government website [ https://lcb.wa.gov/ ] which is publicly accessible via the web. I’ve configured a crawl of the site from the hosted Browsertrix app, but the crawler appears to be blocked: it halts almost immediately and the replay shows a 403 Forbidden screen.

I don’t know how frequently this occurs and couldn’t find any troubleshooting suggestions.

  • I’ve tried setting a commonly seen User Agent string.
  • The robots.txt file doesn’t appear to block crawlers in general, nor does it disallow the site’s landing page.
  • There don’t appear to be any server redirects that might confuse the crawler.
  • There is no sitemap.xml file to leverage. (A rough sketch of how I re-ran some of these checks by hand is included below.)
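
For reference, here’s a rough Python sketch of how the first two checks could be re-run by hand, outside of Browsertrix. The User-Agent value is just an example of a commonly seen desktop browser string:

```python
# Sketch only: re-check the landing page and robots.txt by hand.
# The User-Agent string is just a common desktop browser example.
import requests

SITE = "https://lcb.wa.gov/"
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/120.0.0.0 Safari/537.36")

for path in ("robots.txt", ""):
    resp = requests.get(SITE + path, headers={"User-Agent": UA}, timeout=30)
    print(resp.url, "->", resp.status_code)
    # For robots.txt, look for Disallow rules that would cover the crawl.
    if path == "robots.txt" and resp.ok:
        print(resp.text)
```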

I’d appreciate any pointers. There’s a bit of a time crunch, as I’m trying to archive the entire site before they transition to a new web framework on Tuesday, December 17th.

Thank you!
gf

I’m attempting to archive a Washington state government website [ https://lcb.wa.gov/ ] which is publicly accessible via the web.

Is it publicly accessible? I’m getting the same 403 error when I try to visit the site from my PC. The site appears to be geo-blocked outside of the United States: I’m in Canada, and our crawling servers are in Amsterdam. When I visit the URL through a US-based VPN, it works fine.

Also, a few notes for the future!

  • Because all crawling is initiated by a human, Browsertrix does not currently adhere to robots.txt.
  • Browsertrix’s default user agent string is that of the Brave Browser instances it uses. Sometimes it can be helpful to work with site owners to allowlist a certain custom user agent string, but we’ve yet to see sites blocking Brave specifically.
  • Server redirects can cause issues with crawling, but usually not like this. If a server does redirect to a different domain that you want to capture, it may be useful to use the Custom Page Prefix scope type to add those additional domains (a conceptual sketch of prefix scoping follows this list).
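
To illustrate the idea behind prefix scoping (this is only a conceptual sketch, not Browsertrix’s actual implementation): a discovered URL stays in scope when it starts with any of the configured prefixes, so adding a redirect target’s prefix keeps those pages in the crawl. The second prefix below is a hypothetical example domain.

```python
# Conceptual sketch of prefix scoping; not Browsertrix's actual code.
ALLOWED_PREFIXES = [
    "https://lcb.wa.gov/",
    "https://redirect-target.example/",  # hypothetical extra domain
]

def in_scope(url: str) -> bool:
    # A URL is in scope if it starts with any allowed prefix.
    return any(url.startswith(prefix) for prefix in ALLOWED_PREFIXES)

print(in_scope("https://lcb.wa.gov/laws-rules"))         # True
print(in_scope("https://redirect-target.example/page"))  # True
print(in_scope("https://example.com/unrelated"))         # False
```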

Thank you, this is extremely helpful - and potentially problematic for the state. I can’t see any reason why this resource should be blocked for requests arriving from outside the US.

I’ll see if I can get a proxy set up within the US for now. Thank you for your help!
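
For anyone following along later, here’s a rough sketch of how I plan to sanity-check the geo-blocking theory once a US proxy is available. The proxy address below is a placeholder, not a real service:

```python
# Sketch: compare the site's response directly vs. through a US-based proxy.
import requests

SITE = "https://lcb.wa.gov/"
US_PROXY = "http://user:pass@us-proxy.example:8080"  # placeholder address

direct = requests.get(SITE, timeout=30)
print("direct:", direct.status_code)  # 403 expected from outside the US

proxied = requests.get(SITE, timeout=30,
                       proxies={"http": US_PROXY, "https": US_PROXY})
print("via US proxy:", proxied.status_code)  # 200 expected if geo-blocked
```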
gf

Given that this is time-sensitive, we’ll try to set up a proxy in the US as well and let you know if we’re able to do that.