URL-list crawl failing for anchor links

I’m having trouble getting a URL-list crawl to work with a short list of anchor-style URLs. Here are a couple of examples:

The full page where these anchors appear crawls and replays fine, but in replay, if a user clicks a top-of-page link to any of the lower sections (like those listed above), they get a “Sorry, this page was not found in this archive” error.

Should crawling of these kinds of URLs work? Or is there something that needs to be configured on the replay side to make them work?

Any help or suggestions are appreciated. Thanks!

Clarification: I’m using the Browsertrix Cloud in-app replay right now. I haven’t yet tried loading the WACZ into my hosted page where ReplayWeb.page is embedded. But either way, the crawl of these URLs is failing.

What you seem to be describing here is the URL List working as expected. URL List Workflows will only download the pages specified in the URL List. Optionally, they can visit every link on those pages if Include Any Linked Page is turned on.

It sounds like you might want to try a seeded crawl of https://countryofwords.supdigital.org/ instead? I think the default settings should capture what you’re after.


As bonus info: for most sites, ID links like the examples here don’t need to be specifically captured as discrete pages, because they usually just jump to content that is already present in the webpage and doesn’t need to be loaded from an external source. That isn’t always the case, though: “single page applications” sometimes do load new content and surface it at the same URL with only the # fragment changed. For those cases, we offer the Hashtag Links Only scope option.
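For anyone curious why plain anchor links usually replay fine without being captured separately: the # fragment never leaves the browser, so both URLs point at the same archived resource. A minimal illustration in Python (the URLs below are just placeholders for this example, not from the site in question):

```python
from urllib.parse import urldefrag

# Hypothetical anchor-style URLs; the fragment is only used by the browser
# to scroll to an element -- it is never sent to the server, so it isn't
# stored as a separate page in the archive.
links = [
    "https://example.org/exhibit/#introduction",
    "https://example.org/exhibit/#section-2",
]

for link in links:
    url, fragment = urldefrag(link)
    # Both entries print the same base URL; that single page capture is what
    # a replay system looks up when the anchor link is clicked.
    print(f"{link!r} -> request URL {url!r}, fragment {fragment!r}")
```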

Thank you, Hank. My first pass at archiving this site was actually to do the seeded crawl of https://countryofwords.supdigital.org/. It didn’t capture the # links on the homepage (which, you’re absolutely right, are just sections of content on the page). This was not consistent with other pages in the project that also contain anchor links. But whereas the anchor links on those other pages are linked to from a list in the body of the page, the ones on this homepage are linked to from a graphic header that’s been giving me some other problems. It may very well be acting as a single page application (but only on this one page?). In any case, I tried the Hashtag Links Only option (thank you for pointing that out!), but I’m still getting the same results.

Since the issue is limited to this one page, and a user could still access the content by simply scrolling down the page, I’m inclined to just put a note on the HTML page I’m embedding the player into, informing users to scroll that page rather than access the content via the links.

I appreciate the help, though, and the suggestions. I had completely overlooked the hashtags option in the seeded crawl!


Curious… Either way, if you’re finding different levels of success with different scopes and haven’t already, try adding all of the crawls to a collection! It will effectively merge them and let each crawl draw on data from the others, patching the parts that don’t work on their own.

If you’re unable to capture the content within Browsertrix, you could also try manually capturing the interactions on the page with the ArchiveWeb.page extension, uploading that, and adding it to the same collection to patch the crawl.
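If it helps to confirm exactly what a given crawl recorded as discrete pages before deciding what to patch, a WACZ is just a ZIP that, in the standard layout, includes a pages/pages.jsonl listing. A rough sketch, assuming that layout and a hypothetical local file name:

```python
import json
import zipfile

# Hypothetical file name; substitute the WACZ downloaded from Browsertrix.
WACZ_PATH = "my-crawl.wacz"

with zipfile.ZipFile(WACZ_PATH) as wacz:
    with wacz.open("pages/pages.jsonl") as pages:
        for line in pages:
            record = json.loads(line)
            # The first line is a format header; actual page entries carry a "url".
            if "url" in record:
                print(record["url"])
```

If the anchor-style URLs don’t show up in that list, the crawl didn’t treat them as separate pages, which would match the “page not found” behavior on replay.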