Crawl Scope "Custom Page Prefix" not working as expected

I’m trying to crawl a page (Stanford's Pacific links; summer programs; what we want) and a few of the links within it (URLs start with Stanford Report…), which point to a different domain. I have 1,221 similar URLs to crawl.

I thought I could use the Crawl Scope “Custom Page Prefix” with:

Crawl Start URL: https://us5.campaign-archive.com/?u=a8e6569da943904e9ac369cde&id=a8083d20bb

URL Prefixes in Scope: https://news.stanford.edu/stories/2017/

However, Browsertrix Cloud v1.17.4 only captured the Crawl Start URL.
Do you have any suggestions on what might be going wrong?

Hi Peter,
Yes, we actually just deployed an update to 1.17.5, which should fix the issue with ‘^’ appearing. I also just tested it on your account with one of your existing workflows.
Also, looking at some of the links, the scope may need to be https://news.stanford.edu/2017/ rather than https://news.stanford.edu/stories/2017/.
Please try again on the latest version and hopefully it works as expected now!
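
For anyone checking which prefix a page's links actually use, here is a minimal Python sketch. It assumes the scope behaves like a plain "URL starts with prefix" test, which may not be exactly how Browsertrix evaluates it. The names `LinkCollector`, `in_scope`, `links_matching` and the sample HTML are made up for illustration and are not taken from the real newsletter page. It only looks at absolute `href` values.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url: str, prefix: str) -> bool:
    """Plain string-prefix test (an assumption about how the scope works)."""
    return url.startswith(prefix)


def links_matching(html: str, prefix: str) -> list[str]:
    """Return the links in `html` whose URL starts with `prefix`."""
    collector = LinkCollector()
    collector.feed(html)
    return [u for u in collector.links if in_scope(u, prefix)]


# Hypothetical page snippet; the links on the real page may differ.
SAMPLE_HTML = """
<a href="https://news.stanford.edu/2017/06/01/example-story/">Stanford Report</a>
<a href="https://example.com/other">Other</a>
"""

if __name__ == "__main__":
    for prefix in (
        "https://news.stanford.edu/2017/",
        "https://news.stanford.edu/stories/2017/",
    ):
        # Print how many links each candidate prefix would keep in scope
        print(prefix, len(links_matching(SAMPLE_HTML, prefix)))
```

If a candidate prefix matches none of the links on the start page, a crawl scoped to that prefix would be expected to capture only the Crawl Start URL.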

Thanks, Illy, for the quick response. Version 1.17.5 works fine in this case. I have 1,221 similar crawls to work on.

Hm, perhaps there is an easier way to do this. Are you trying to crawl 1,221 pages and all links from them that point to https://news.stanford.edu/, or do you need a different scope per page?

We will be adding support for large seed lists in the next release, hopefully next week.
