Hi all, firstly I love everything webrecorder is doing
I’ve been crawling a site which includes links of the form href="//domain/", i.e. the protocol (http: or https:) is omitted so that the browser uses whichever protocol the current page was loaded over. I believe these protocol-relative URLs are compliant with modern web standards, although they’re not the norm.
browsertrix-crawler rejects these URLs (see below for my theory as to why) but I think it should crawl them.
I wanted to ask: is there a command-line option, behaviour, or other way to get around this, e.g. to force a base protocol or to enable resolving of protocol-relative ("//") URLs (other than manually seeding the problem URLs)?
Or should I report this as a bug/feature request?
My thoughts on the cause: browsertrix-crawler parses each URL with the new URL() constructor and checks that it has an http or https protocol before crawling, which makes complete sense. However, for "//domain" URLs, the new URL() constructor requires a base URL to be passed as the second argument; otherwise it throws a TypeError. As a result the crawler never sees an http(s) protocol and rejects the URL.
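To illustrate what I mean (a minimal sketch of the WHATWG URL behaviour in Node/TypeScript; example.org and current-page.example are just placeholders):

```ts
// A protocol-relative URL with no base: the constructor throws a TypeError.
try {
  new URL("//example.org/page");
} catch (e) {
  console.log((e as Error).name); // "TypeError" (Invalid URL)
}

// The same URL with a base resolves, inheriting the base's protocol.
const resolved = new URL("//example.org/page", "https://current-page.example/");
console.log(resolved.href); // "https://example.org/page"
```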
My thoughts on a possible solution: in browsertrix-crawler/src/util/seeds.ts, in the function parseUrl, we have parsedUrl = new URL(url.trim()); but I wonder if this could take the URL of the current page as a second parameter? (e.g. when scopedSeed is initialised in src/crawler.ts, maybe page.url() could be passed to it too, or something along those lines - see the sketch below.)
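Roughly what I had in mind (just a sketch, not the actual seeds.ts code; the simplified signature and the optional baseUrl parameter are my own invention here):

```ts
// Hypothetical, simplified version of parseUrl from src/util/seeds.ts:
// accept an optional base URL (e.g. the URL of the page the link was found on)
// so protocol-relative URLs like "//domain/path" resolve instead of throwing.
function parseUrl(url: string, baseUrl?: string): URL | null {
  let parsedUrl: URL;
  try {
    parsedUrl = new URL(url.trim(), baseUrl);
  } catch (e) {
    return null; // invalid URL, rejected as before
  }
  // keep the existing http/https protocol check
  if (parsedUrl.protocol !== "http:" && parsedUrl.protocol !== "https:") {
    return null;
  }
  return parsedUrl;
}

// parseUrl("//example.org/page", "https://current-page.example/")
//   -> URL for https://example.org/page
// parseUrl("//example.org/page")
//   -> null (no base, the constructor throws), i.e. the current behaviour
```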
My system: I’m using Browsertrix-Crawler 1.6.1 (with warcio.js 2.4.4) via the Docker image on Ubuntu 24.04.2 LTS under WSL on Win11.