Last year, we crawled a large PHP forum as a proof of concept and retrieved 67,135 pages.
A recent crawl (docker.io/webrecorder/browsertrix-crawler:1.5.9) with identical settings and exclusions retrieved only 62,328 pages, 4,807 fewer than in 2024. That seems quite unlikely, because the forum should have grown, not shrunk drastically.
By comparing */pages/*.jsonl and */logs/*.log, we were able to identify 5,476 URLs that were “Queued [as] new page url” but were actually missing from the crawl.
The missing URLs look unremarkable, like many others that were crawled successfully, and they are accessible interactively (although behind a login, so posting examples here would not be meaningful).
- Why doesn’t Browsertrix itself compare the list of queued URLs with the list of URLs actually crawled at the end of a crawl, and issue a warning if there are discrepancies?
- Is it possible to feed these 5,476 URLs into a “List of Pages” scope of a follow-up crawl without polluting the replay with a long random page list?
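For the second question, a possible workaround might be to feed the list to a standalone browsertrix-crawler run instead. A minimal sketch, assuming the working directory is mounted into the container; --seedFile and --scopeType are documented crawler flags, while the collection name here is made up, and since the forum is behind a login a browser --profile would be needed as well:
$ docker run -v "$PWD:/crawls/" docker.io/webrecorder/browsertrix-crawler:1.5.9 crawl --seedFile /crawls/missingurls-$MyID.txt --scopeType page --collection missing-$MyID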
Here is what we did to find the missing URLs (it may be useful to others as well):
Download a crawl (an archived item) to a working directory with lots of free space. (In our case: manual-20250402070107-74e57fc0-17f.wacz, 16 GB.)
Set a variable for later use:
$ export MyID=20250402070107-74e57fc0-17f
Unzip the downloaded .wacz file:
$ time unzip -d unzipped-$MyID manual-20250402070107-74e57fc0-17f.wacz
We have a multi-WACZ .wacz file. Unzip the inner .wacz files:
$ time for f in unzipped-$MyID/*.wacz; do unzip -d "${f%.wacz}" "$f"; done
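Each inner archive should now be unpacked into the usual WACZ layout (archive/, indexes/, logs/, pages/, datapackage.json, datapackage-digest.json), which the following steps rely on; a quick sanity check:
$ ls unzipped-$MyID/*/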
List crawled pages. We are interested in unzipped-$MyID/*/pages/extraPages.jsonl (non-seed pages) and unzipped-$MyID/*/pages/pages.jsonl:
$ echo -e "id\turl\ttitle\tloadState\tts\tmime\tstatus\tseed\tdepth\tfilename" > crawledpages-$MyID.tsv
$ time cat unzipped-$MyID/*/pages/*.jsonl | jq -r '[.id, .url, .title, .loadState, .ts, .mime, .status, .seed, .depth, .filename] | @tsv' | egrep -v "^pages" | sort -t$'\t' -k2 >> crawledpages-$MyID.tsv
$ wc -l crawledpages-$MyID.tsv
62329 crawledpages-20250402070107-74e57fc0-17f.tsv
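A note on the egrep -v "^pages" above: the first line of each .jsonl file is a header record describing the file format, not a crawled page, and the filter drops it via the id column. A variant that does not depend on the header's exact id value would be to keep only records that actually carry a URL, which should otherwise produce the same output:
$ cat unzipped-$MyID/*/pages/*.jsonl | jq -r 'select(.url != null) | [.id, .url, .title, .loadState, .ts, .mime, .status, .seed, .depth, .filename] | @tsv' | sort -t$'\t' -k2 >> crawledpages-$MyID.tsv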
List queued pages:
$ echo -e "timestamp\tlogLevel\tcontext\tmessage\turl" > queuedpages-$MyID.tsv
$ time cat unzipped-$MyID/*/logs/*.log | jq -r 'select(.message | test("Queued new page url")) | [.timestamp, .logLevel, .context, .message, .details.url] | @tsv' | sort -t$'\t' -k5 >> queuedpages-$MyID.tsv
$ wc -l queuedpages-$MyID.tsv
67805 queuedpages-20250402070107-74e57fc0-17f.tsv
List missing URLs (comm -13 prints only the lines unique to the second file, i.e. URLs that were queued but never crawled):
$ cut -d$'\t' -f2 crawledpages-$MyID.tsv | sort > crawledurls-$MyID.txt
$ cut -d$'\t' -f5 queuedpages-$MyID.tsv | sort > queuedurls-$MyID.txt
$ comm -13 crawledurls-$MyID.txt queuedurls-$MyID.txt > missingurls-$MyID.txt
$ wc -l missingurls-$MyID.txt
5476 missingurls-20250402070107-74e57fc0-17f.txt
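The counts are consistent: 67,805 - 5,476 = 62,329 (the two header lines match each other), so every crawled URL also appears in the queued list. The reverse check (URLs crawled but never logged as queued) should therefore print nothing:
$ comm -23 crawledurls-$MyID.txt queuedurls-$MyID.txt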
Regards,
Heinz