Capturing protocol-neutral links (e.g. "//site.com")?

Hi all, firstly I love everything webrecorder is doing :slight_smile:

I’ve been crawling a site that includes links of the form `href="//domain/"`, i.e. the protocol (`http:` or `https:`) is omitted, letting the browser use whichever protocol the current page was loaded over. I believe these protocol-relative URLs are compliant with modern web standards, although they're not the norm.

browsertrix-crawler rejects these URLs (see below for my theory as to why) but I think it should crawl them.

I wanted to ask: is there a command-line option, behavior, or other way to work around this, e.g. to force a base protocol or enable resolution of `//` links (other than manually seeding the problem URLs)?

Or should I report this as a bug/feature request?

My thoughts on the cause of this: browsertrix-crawler parses each URL with the `new URL()` constructor and checks that it has an `http` or `https` protocol before crawling, which makes complete sense. However, for `//domain` URLs, the `new URL()` constructor requires a base URL to be passed as a second argument - otherwise it throws a `TypeError`. This results in the crawler not seeing an `http(s)` protocol, and thus rejecting the URL.

My thoughts on a possible solution: in `browsertrix-crawler/src/util/seeds.ts`, in the function `parseUrl`, we have `parsedUrl = new URL(url.trim());`, but I wonder if this could take the URL of the current page as a second parameter? (E.g. when `ScopedSeed` is initialised in `src/crawler.ts`, maybe `page.url()` could be passed to it too, or something along those lines?)

My system: I’m using Browsertrix-Crawler 1.6.1 (with warcio.js 2.4.4) via the Docker image on Ubuntu 24.04.2 LTS, running under WSL on Windows 11.

The crawler should already be getting the fully resolved URL, so if it's `<a href="//domain/">`, it should be passed to the crawler as either `http://domain/` or `https://domain/`.

If it's a seed, then of course the crawler doesn't know which scheme to use, so seed URLs should be absolute URLs.

Perhaps the URL is being rejected for other reasons - do you have an example where this is happening?

Thanks Ilya, appreciate your help. I’ve just noticed that the first URL it did not capture actually redirects outside of the prefix scope, so I'm thinking that might be the cause. I’ll investigate and let you know the outcome, but it feels like I was wrong about it being protocol-related.

Yeah, on closer inspection it looks like I was wrong and was overlooking the fact that these `//` links were redirecting off-domain. Sorry for the false alarm!