5,476 pages missing from a large crawl

Last year, we crawled a large PHP forum as a PoC and retrieved 67,135 pages.

A recent crawl (docker.io/webrecorder/browsertrix-crawler:1.5.9) with identical settings and exclusions retrieved only 62,328 pages, 4,807 fewer than in 2024. That is implausible, because the forum should have grown, not shrunk drastically.

By comparing */pages/*.jsonl and */logs/*.log, we were able to identify 5,476 URLs that were “Queued [as] new page url” but were actually missing from the crawl.

The missing URLs look just as unremarkable as many others that were crawled successfully, and they are accessible interactively (although behind a login, so posting examples here would not help).

  1. Why doesn’t Browsertrix itself compare the list of queued URLs with the list of URLs actually crawled at the end of a crawl and issue an appropriate warning if there are discrepancies?

  2. Is it possible to feed 5,476 URLs into a “List of Pages” scope of a follow-up crawl without polluting the replay with a long random page list?

Here is what we did to find the missing URLs (it may be useful to others as well):

Download a crawl (an archived item) to a working directory with lots of free space. (In our case: manual-20250402070107-74e57fc0-17f.wacz, 16 GB.)

Set a variable for later use:

$ export MyID=20250402070107-74e57fc0-17f

Unzip the downloaded .wacz-file:

$ time unzip -d unzipped-$MyID manual-20250402070107-74e57fc0-17f.wacz

We have a multi-wacz .wacz-file. Unzip the inner .wacz-files:

$ time for f in unzipped-$MyID/*wacz; do unzip -d "${f%%.*}" "$f"; done

List crawled pages. We are interested in unzipped-$MyID/*/pages/extraPages.jsonl (non-seed pages) and unzipped-$MyID/*/pages/pages.jsonl. Note that the line counts below include the TSV header line:

$ echo -e "id\turl\ttitle\tloadState\tts\tmime\tstatus\tseed\tdepth\tfilename" > crawledpages-$MyID.tsv
$ time cat unzipped-$MyID/*/pages/*.jsonl | jq -r '[.id, .url, .title, .loadState, .ts, .mime, .status, .seed, .depth, .filename] | @tsv' | egrep -v "^pages" | sort -t$'\t' -k2 >> crawledpages-$MyID.tsv
$ wc -l crawledpages-$MyID.tsv 
62329 crawledpages-20250402070107-74e57fc0-17f.tsv

List queued pages:

$ echo -e "timestamp\tlogLevel\tcontext\tmessage\turl" > queuedpages-$MyID.tsv
$ time cat unzipped-$MyID/*/logs/*.log | jq -r 'select(.message | test("Queued new page url")) | [.timestamp, .logLevel, .context, .message, .details.url] | @tsv' | sort -t$'\t' -k5 >> queuedpages-$MyID.tsv
$ wc -l queuedpages-$MyID.tsv
67805 queuedpages-20250402070107-74e57fc0-17f.tsv

List missing URLs:

$ cut -d$'\t' -f2 crawledpages-$MyID.tsv | sort > crawledurls-$MyID.txt
$ cut -d$'\t' -f5 queuedpages-$MyID.tsv | sort > queuedurls-$MyID.txt
$ comm -13 crawledurls-$MyID.txt queuedurls-$MyID.txt > missingurls-$MyID.txt
$ wc -l missingurls-$MyID.txt
5476 missingurls-20250402070107-74e57fc0-17f.txt

Regards,
Heinz

We appreciate your taking a closer look at this issue - and the steps here can be helpful to others looking to analyze their crawls.

There is usually a reason why pages are not included - they don’t just disappear from the crawl queue - and the logs usually explain what happened to each URL. For example, looking at the logs for this crawl I see quite a few warnings like this:

{"timestamp": "2025-04-03T05:55:03.369Z", "logLevel": "warn", "context": "recorder", "message": "Skipping page that redirects to excluded URL", "details": {"newUrl": "https
://www.hoeflichepaparazzi.de/forum/showthread.php?13308-Das-kleine-Glck/page477&p=474419#post474419", "origUrl": "https://www.hoeflichepaparazzi.de/forum/showthread.php?p=4
74419"}}
{"timestamp": "2025-04-03T05:55:03.370Z", "logLevel": "warn", "context": "recorder", "message": "Request failed", "details": {"url": "https://www.hoeflichepaparazzi.de/foru
m/showthread.php?p=474419", "errorText": "net::ERR_BLOCKED_BY_RESPONSE", "type": "Document", "status": 301, "page": "https://www.hoeflichepaparazzi.de/forum/showthread.php?
p=474419", "workerid": 1}}
{"timestamp": "2025-04-03T05:55:03.374Z", "logLevel": "warn", "context": "general", "message": "Page Load Blocked, skipping", "details": {"msg": "net::ERR_BLOCKED_BY_RESPON
SE at https://www.hoeflichepaparazzi.de/forum/showthread.php?p=474419", "loadState": 0}}
{"timestamp": "2025-04-03T05:55:03.374Z", "logLevel": "warn", "context": "pageStatus", "message": "Page Load Failed, no response: will retry", "details": {"retry": 0, "retr
ies": 2, "url": "https://www.hoeflichepaparazzi.de/forum/showthread.php?p=474419", "page": "https://www.hoeflichepaparazzi.de/forum/showthread.php?p=474419", "workerid": 1}
}

What this means is that certain pages redirect to other pages that are actually excluded – in this case, the URL https://www.hoeflichepaparazzi.de/forum/showthread.php?p=474419 redirects to https://www.hoeflichepaparazzi.de/forum/showthread.php?13308-Das-kleine-Glck/page477&p=474419#post474419. However, the crawl has an exclusion for URLs that contain #post, so this page is not captured.
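
To see how many of the missing URLs fall into this category, you can pull the origUrl out of these warnings and intersect the result with your missing-URL list, for example (assuming the same unzipped layout and file names as in your steps above; redirectexcluded-$MyID.txt is just a suggested name):

$ cat unzipped-$MyID/*/logs/*.log | jq -r 'select(.message == "Skipping page that redirects to excluded URL") | .details.origUrl' | sort -u > redirectexcluded-$MyID.txt
$ comm -12 redirectexcluded-$MyID.txt missingurls-$MyID.txt | wc -l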

This behavior was requested here: Ensure exclusions apply to pages that redirect · Issue #744 · webrecorder/browsertrix-crawler · GitHub and implemented soon after. Of course, it can still be confusing if a page you don’t expect redirects to a page that is excluded - we’re open to exploring other ideas, but in general it seems to make sense: if you don’t want the excluded page crawled directly, it should not be crawled via a redirect either. The page is still added to the queue because the crawler doesn’t know it will redirect until it is actually loaded in the browser.

Yes, this can be done with a list of pages, though for UI reasons it is currently limited to 100 URLs at a time; we plan to expand that limit. E-mail us and we can help sort this out, though you’ll need to make sure the exclusions are not set if you want to do that.

It also appears that the crawl was stopped because the time limit was reached - it was set to 20 hours initially (though the crawl went over). This can be seen by the presence of:

{"timestamp": "2025-04-03T06:17:36.383Z", "logLevel": "info", "context": "general", "message": "Crawler is stopped", "details": {}}

in the logs. We can probably improve this to include a reason as well; currently we don’t report the time limit and just list the crawl as ‘Complete’, since hitting the limit is not treated as an error, but perhaps it should be reported in the logs. We will take that into consideration.
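
In the meantime, the runtime can be read from the logs yourself, for example by printing the earliest and latest timestamps (assuming the same unzipped layout as in the steps above):

$ jq -r '.timestamp' unzipped-$MyID/*/logs/*.log | sort | sed -n '1p;$p'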

Thank you for your time and thoughtful analysis. Redirection to an excluded address is something I wasn’t expecting. I’ll have to look into whether this applies to all cases.

I will try to set up a helper page that lists all URLs that may still be missing as links.

With scope “single page” and the “one hop out” setting, this should give the desired result.

And yes, with a crawl time of 23h and the normal status “Complete” I didn’t even notice that the crawl ran into a time limit.

Thanks again for the analysis.

The main reason for the missing pages compared to last year may be something else:

For most forums, the maximum age for posts to be displayed is set to “from the beginning”. However, some (less popular?) forums set this to 365, 100 or even 30 days. As time goes on, fewer and fewer posts are displayed in subforums without recent activity.

Here we have probably crawled incompletely.

Fortunately, the maximum age of the posts can also be set in the user settings, which overrides the forum-specific setting. In any case, we have to crawl again with these settings!

A short summary:

  • The main reason for fewer pages in 2025 compared to 2024 was the relatively hidden setting for the maximum age of threads in some subforums, so this has no technical cause.
  • Regardless of this, the 2025-04-03 crawl is missing many pages compared to the URL queue in the log files. Many of these can be explained (as Ilya pointed out) by the fact that they redirect to URLs that are excluded. The cause of the others remains unclear.
  • In a later crawl from 2025-05-02, many pages are again missing compared to the URL queue. These are all dead links, exclusions that were added interactively later, or URLs that redirect to excluded pages (a sketch for checking each missing URL against the logs follows this list).
  • There is therefore no technical problem with Browsertrix.
  • Irrespective of this, I still think that Browsertrix should point out such a discrepancy between the queue and the page list in the logs at the end of a crawl so that users can investigate.
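
A rough sketch of the check mentioned above: for each missing URL, look up the warn messages recorded for it in the logs (slow for thousands of URLs; file names follow the convention used earlier, missingreasons-$MyID.txt is just an example):

$ while read -r url; do echo "== $url"; grep -hF "\"$url\"" unzipped-$MyID/*/logs/*.log | jq -r 'select(.logLevel == "warn") | [.context, .message] | @tsv' | sort -u; done < missingurls-$MyID.txt > missingreasons-$MyID.txt

Messages like “Skipping page that redirects to excluded URL”, “Page Load Blocked, skipping” or “Page Load Failed, no response: will retry” then show up per URL; URLs without any matching warn line are the ones whose cause remains unclear.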

P.S.

The limit of 100 URLs in the “List of Pages” scope can easily be worked around: provide a page on your own web server that contains all (more than 100) URLs as a link list, and crawl that page as a “Single Page” with “One Hop Out”. This also looks better in the list of seed URLs.
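
For example, a minimal link list can be generated from the missing-URL file like this (the output file name is only an example; URLs containing & should strictly be escaped as &amp;, but browsers generally tolerate raw ampersands in href attributes):

$ { printf '<!doctype html>\n<title>Missing URLs %s</title>\n' "$MyID"; while read -r url; do printf '<p><a href="%s">%s</a></p>\n' "$url" "$url"; done < missingurls-$MyID.txt; } > missingurls-$MyID.html

Upload missingurls-$MyID.html somewhere the crawler can reach it and use that page as the “Single Page” seed.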
