Crawl Scope "Custom Page Prefix" not working as expected

pchan3 · July 23, 2025, 9:44pm

I’m trying to crawl a page (Stanford's Pacific links; summer programs; what we want) and a few links within it (URLs start with Stanford Report…), which are embedded from a different domain. I have 1,221 similar URLs to crawl.

I thought I could use the Crawl Scope “Custom Page Prefix” with:

Crawl Start URL: https://us5.campaign-archive.com/?u=a8e6569da943904e9ac369cde&id=a8083d20bb

URL Prefixes in Scope: https://news.stanford.edu/stories/2017/

However, Browsertrix Cloud v1.17.4 only captured the Crawl Start URL.
Do you have any suggestions on what might be going wrong?

ilya · July 23, 2025, 11:07pm

Hi Peter,
Yes, we actually just deployed an update to 1.17.5, which should fix the ‘^’ appearing. Also just tested it on your account for an existing workflow that you had.
Also, it may be that the scope should be https://news.stanford.edu/2017/ and not https://news.stanford.edu/stories/2017/ looking at some of the links.
Please try again on the latest version and hopefully it works as expected now!
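To illustrate why the exact prefix matters, here is a minimal Python sketch of plain prefix matching. It is a simplified model, not Browsertrix's actual scope implementation, and the `in_scope` helper and the example link are hypothetical, made up only to show the difference between the two prefixes.

```python
def in_scope(url: str, prefixes: list[str]) -> bool:
    """Return True if url starts with any of the given prefixes."""
    return any(url.startswith(p) for p in prefixes)


# Hypothetical link, invented for illustration only.
link = "https://news.stanford.edu/2017/06/01/example-story/"

configured = ["https://news.stanford.edu/stories/2017/"]  # prefix from the original workflow
suggested = ["https://news.stanford.edu/2017/"]           # prefix suggested above

print(in_scope(link, configured))  # False: the link has no /stories/ segment
print(in_scope(link, suggested))   # True
```

If the links on the start page look like the hypothetical one, the first prefix would never match, so only the Crawl Start URL would be captured.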

pchan3 · July 24, 2025, 4:25pm

Thanks, Ilya, for the quick response. Version 1.17.5 works fine in this case. I have 1,221 similar crawls to work on.

ilya · July 26, 2025, 1:46am

Hm, perhaps there is an easier way to do this. Are you trying to crawl 1,221 pages plus all links from them that point to https://news.stanford.edu/, or do you need a different scope per page?

We will be adding support for large seed lists in the next release, hopefully next week.
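In the meantime, a rough Python sketch for preparing such a list, assuming all 1,221 start pages can share a single scope. The `build_seed_list` helper is hypothetical, and the one-URL-per-line output is an assumption about the eventual seed list format, so check the release notes before relying on it.

```python
from urllib.parse import urlsplit


def build_seed_list(urls):
    """Strip whitespace, drop non-HTTP(S) entries and duplicates, keep order."""
    seen = set()
    seeds = []
    for raw in urls:
        url = raw.strip()
        # Skip blank lines and anything that is not an http(s) URL.
        if urlsplit(url).scheme not in ("http", "https"):
            continue
        if url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


raw_urls = [
    "https://us5.campaign-archive.com/?u=a8e6569da943904e9ac369cde&id=a8083d20bb",
    "  https://us5.campaign-archive.com/?u=a8e6569da943904e9ac369cde&id=a8083d20bb ",
    "",
    "not a url",
]

seeds = build_seed_list(raw_urls)
print("\n".join(seeds))  # one URL per line (assumed format)
```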
