A short summary:
- The main reason for fewer pages in 2025 than in 2024 was the relatively hidden setting for the maximum age of threads in some subforums, so this has no technical cause.
- Regardless of this, many pages that appear in the URL queue in the log files are missing from the 2025-04-03 crawl. Many of these can be explained (as Ilya pointed out) by the fact that they redirect to URLs that are excluded. The cause of the others remains unclear.
- In a later crawl from 2025-05-02, many pages are again missing compared with the URL queue. These are all dead links, URLs matching exclusions that were added interactively later, or URLs that redirect to excluded pages.
- There is therefore no technical problem with Browsertrix.
- Irrespective of this, I still think that Browsertrix should point out such a discrepancy between the queue and the page list in the logs at the end of a crawl so that users can investigate.
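
To illustrate the kind of check I mean, here is a minimal sketch in Python. It assumes the queued URLs and the URLs from the page list have already been extracted into two plain lists; how to get them out of the logs depends on the log format and is not shown. The function name `queue_page_discrepancy` is made up for this example and is not part of Browsertrix.

```python
def queue_page_discrepancy(queued_urls, crawled_urls):
    """Return the queued URLs that never show up in the page list, sorted.

    Both arguments are plain iterables of URL strings (hypothetical inputs,
    extracted from the logs beforehand). Blank entries are ignored.
    """
    queued = {u.strip() for u in queued_urls if u.strip()}
    crawled = {u.strip() for u in crawled_urls if u.strip()}
    # Set difference: everything that was queued but has no page entry.
    return sorted(queued - crawled)


if __name__ == "__main__":
    queued = ["https://example.org/a", "https://example.org/b"]
    crawled = ["https://example.org/a"]
    for url in queue_page_discrepancy(queued, crawled):
        print("queued but not in page list:", url)
```

The comparison is an exact string match, so a page that was reached under a different URL after a redirect would still be reported. That is intended here: the report is only a starting point for investigating.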
P.S.
The limit of 100 URLs in the “List of Pages” scope can easily be circumvented: put a page on your own web server that contains all (more than 100) URLs as a link list, and then crawl that page as a “Single Page” with “One Hop Out”. This also looks better in the list of seed URLs.
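
Such a link-list page can be generated with a few lines of Python. This is only a sketch: the function name `build_link_list`, the page title and the output file name are my own choices, and you still have to upload the resulting file to your web server yourself.

```python
from html import escape


def build_link_list(urls, title="Seed list"):
    """Build a minimal HTML page with one link per URL."""
    items = "\n".join(
        # escape() also escapes quotes by default, so the href stays valid.
        f'<li><a href="{escape(u)}">{escape(u)}</a></li>'
        for u in urls
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body><ul>\n{items}\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    seeds = ["https://example.org/thread/1", "https://example.org/thread/2"]
    with open("seeds.html", "w", encoding="utf-8") as f:
        f.write(build_link_list(seeds))
```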