URL-list crawl failing for anchor links

jmulliken · July 8, 2024, 4:43pm

I’m having trouble getting a URL-list crawl to work with a short list of anchor-style URLs. Here are a couple of examples:

The full page where these anchors appear is crawled and replays fine, but in replay, if a user clicks on a top-of-page link to any of the lower sections like those listed above, they get the message “Sorry, this page was not found in this archive.”

Should crawling of these kinds of URLs work? Or is there something that needs to be configured on the replay side to make them work?

Any help or suggestions are appreciated. Thanks!

jmulliken · July 8, 2024, 4:45pm

Clarification: I’m using the Browsertrix Cloud in-app replay right now. I haven’t yet tried loading the WACZ into my hosted page where ReplayWeb.page is embedded. But either way, the crawl of these URLs is failing.

Hank · July 9, 2024, 4:31pm

What you seem to be describing here is the URL List working as expected. URL List Workflows will only download the pages specified in the URL List. Optionally, they can visit every link on those pages if Include Any Linked Page is turned on.

It sounds like you might want to try a seeded crawl for https://countryofwords.supdigital.org/ instead? I think the default settings should capture what you seem to be after?

As bonus info: for most sites, ID links like the examples here don’t need to be captured as discrete pages, because they usually just jump to content that is already present in the webpage and doesn’t need to be loaded from an external source. That isn’t always the case, though: some “single page applications” load new content and surface it at the same URL with a different # link. For these cases, we offer the Hashtag Links Only scope option.
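To illustrate why ID links usually don't need their own capture, here is a minimal Python sketch using the standard library's `urllib.parse`. It is not part of Browsertrix; the URL is one of the anchor links from this thread, and `same_document` is a hypothetical helper name. The fragment is split off from the part of the URL that identifies the document being fetched.

```python
from urllib.parse import urldefrag

# One of the anchor URLs from this thread.
anchor_url = (
    "https://countryofwords.supdigital.org/timeline/"
    "#literary-diasporas-post-nakba-scattering"
)

# urldefrag separates the document URL from the '#fragment'.
page_url, fragment = urldefrag(anchor_url)


def same_document(url_a: str, url_b: str) -> bool:
    """Hypothetical helper: True when two URLs differ only by fragment."""
    return urldefrag(url_a).url == urldefrag(url_b).url


print(page_url)
print(fragment)
print(same_document(anchor_url, page_url))
```

On an ordinary page, every anchor URL that passes `same_document` against the page URL is already covered by a single capture of that page. Single page applications are the exception, since they may load new content when the fragment changes.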

jmulliken · July 9, 2024, 7:19pm

Thank you, Hank. My first pass at archiving this site was actually to do the seeded crawl of https://countryofwords.supdigital.org/. It didn’t capture the # links on the homepage (which, you’re absolutely right, are just sections of content on the page). This was not consistent with other pages in the project that also contain anchor links. But whereas the anchor links on those other pages are linked to from a list in the body of the page, the ones on this homepage are linked to from a graphic header that’s been giving me some other problems. It may very well be acting as a single page application (but only on this one page?). In any case, I tried the Hashtag Links Only option (thank you for pointing that out!), but I’m still getting the same results.

Since the issue is limited to this one page, and a user could still access the content by simply scrolling down the page, I’m inclined to just put a note on the HTML page I’m embedding the player into, telling users to scroll that page rather than access the content via the links.

I appreciate the help, though, and the suggestions. I had completely overlooked the hashtags option in the seeded crawl!

Hank · July 10, 2024, 1:50am

Curious… Either way, if you’re finding different levels of success with multiple different scopes, if you haven’t already, try adding them all to a collection! It will effectively merge them and allow each crawl to access data from the others, patching the parts that don’t work!

If you’re unable to capture the content within Browsertrix, you could also try manually capturing the interactions on the page with the ArchiveWeb.page extension, uploading that, and adding it to the same collection to patch the crawl.

Hank · October 4, 2024, 4:35pm

It appears that I should have had a deeper look at your site before making suggestions…

I have some thoughts, but will send them to the team first. It is looking more like a replay issue to me. Do you know if there is a specific reason your timeline links are prefaced with /timeline/? This may be the cause of some of the behavior.

jmulliken · October 4, 2024, 6:03pm

Thanks for the further investigation! The /timeline/ situation is due to that being the page (the Timeline page) the anchors are on. For instance, the other pages (Visualizations, Network, and Audio Interviews) are /visualisations/ [sic], /network/, and /audio-interviews/, respectively. I did crawl the full anchor links (e.g. countryofwords dot supdigital dot org/timeline/#literary-diasporas-post-nakba-scattering), but you’re right that on replay, everything after the /timeline/ disappears. This is not the case, however, on the individual pages like countryofwords dot supdigital dot org/periods/literary-diasporas-the-mahjar, where anchor links like countryofwords dot supdigital dot org/periods/literary-diasporas-the-mahjar/#heading-332b5ff37526 are indeed working and showing correctly.

On the public-facing page where I’ve embedded the replay, I’ve simply made a note that these links do not work and that the user should scroll down the page to access each section. Since the issue is so minor for this particular archive and limited to this one page, I ended up going with this (non)solution.

Hank · October 4, 2024, 6:47pm

This is not the case, however on the individual pages […] where anchor links […] are indeed working and showing correctly.

I would hazard a guess that this is because they don’t include the page prefix. I’ve never seen in-text links include that on other sites, and while browsers seem to deal with it properly, ReplayWeb.page is having trouble rewriting those links correctly.

Any chance you can remove the page prefix for them on the site? Prefixing ID links with a path should only be required if the path navigates to a different page.

If it helps, here’s the exact same SVG code the site has for the timeline but with the /timeline/ prefixes removed. Nothing should change for users, but it will hopefully archive correctly!

<svg width="1676" height="176"><g transform="translate(16, 22)"><g transform="translate(0, 0)"><a href="#literary-diasporas-the-mahjar" aria-label="Literary Diasporas: The Mahjar"><rect x="1.536150144362423" y="0" height="20" width="823.7041895678257" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ"></rect></a><text x="1.536150144362423" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ">Literary Diasporas: The Mahjar</text></g><g transform="translate(0, 22)"><a href="#literature-under-british-occupation" aria-label="Literature under British Occupation"><rect x="235.85137784031534" y="0" height="20" width="563.9576576584399" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ"></rect></a><text x="235.85137784031534" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ">Literature under British Occupation</text></g><g transform="translate(0, 44)"><a href="#literature-under-triple-occupation-post-nakba" aria-label="Literature under Triple Occupation Post-Nakba"><rect x="802.7347607622474" y="0" height="20" width="219.87950634091965" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ active"></rect></a><text x="802.7347607622474" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ active">Literature under Triple Occupation Post-Nakba</text></g><g transform="translate(0, 66)"><a href="#literary-diasporas-post-nakba-scattering" aria-label="Literary Diasporas: Post-Nakba Scattering"><rect x="807.6538373041631" y="0" height="20" width="218.04690919785298" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ"></rect></a><text x="807.6538373041631" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ">Literary Diasporas: Post-Nakba Scattering</text></g><g transform="translate(0, 88)"><a href="#literary-diasporas-a-golden-age-in-exile" aria-label="Literary Diasporas: A Golden Age in Exile "><rect x="1026.633120487085" y="0" 
height="20" width="173.2929579145425" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ"></rect></a><text x="1026.633120487085" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ">Literary Diasporas: A Golden Age in Exile </text></g><g transform="translate(0, 110)"><a href="#literature-under-israeli-occupation" aria-label="Literature under Israeli Occupation"><rect x="1032.323816878713" y="0" height="20" width="311.89517341910505" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ"></rect></a><text x="1032.323816878713" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ">Literature under Israeli Occupation</text></g><g transform="translate(0, 132)"><a href="#literary-diasporas-post-beirut-fragmentation" aria-label="Literary Diasporas: Post-Beirut Fragmentation"><rect x="1208.5103492296762" y="0" height="20" width="131.17537445108246" rx="2" ry="2" class="rect s-vIPuLVaSiKkQ"></rect></a><text x="1208.5103492296762" y="10" dx="4" dy="1" dominant-baseline="middle" class="rectText font-sans-serif fs-7 s-vIPuLVaSiKkQ">Literary Diasporas: Post-Beirut Fragmentation</text></g></g><g class="axis s-vIPuLVaSiKkQ" transform="translate(16,0)" fill="none" font-size="10" font-family="sans-serif" text-anchor="middle"><path class="domain" stroke="currentColor" d="M0.5,0.5H1644.5"></path><g class="tick" opacity="1" transform="translate(0.5,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1880</text></g><g class="tick" opacity="1" transform="translate(117.94697129161854,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1890</text></g><g class="tick" opacity="1" transform="translate(235.36140029007686,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1900</text></g><g class="tick" opacity="1" transform="translate(352.7762207546241,0)"><line stroke="currentColor" y2="0"></line><text 
fill="currentColor" y="3" dy="0.71em">1910</text></g><g class="tick" opacity="1" transform="translate(470.1910412191713,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1920</text></g><g class="tick" opacity="1" transform="translate(587.6380125107898,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1930</text></g><g class="tick" opacity="1" transform="translate(705.052832975337,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1940</text></g><g class="tick" opacity="1" transform="translate(822.4998042669556,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1950</text></g><g class="tick" opacity="1" transform="translate(939.9146247315028,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1960</text></g><g class="tick" opacity="1" transform="translate(1057.3615960231214,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1970</text></g><g class="tick" opacity="1" transform="translate(1174.7764164876685,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1980</text></g><g class="tick" opacity="1" transform="translate(1292.223387779287,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">1990</text></g><g class="tick" opacity="1" transform="translate(1409.6382082438342,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">2000</text></g><g class="tick" opacity="1" transform="translate(1527.0851795354529,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">2010</text></g><g class="tick" opacity="1" transform="translate(1644.5,0)"><line stroke="currentColor" y2="0"></line><text fill="currentColor" y="3" dy="0.71em">2020</text></g></g></svg>
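As a rough illustration of why the prefix is redundant on the live site, here is a minimal Python sketch using `urllib.parse`. It assumes the page containing the timeline is served at `https://countryofwords.supdigital.org/timeline/`; `strip_page_prefix` is a hypothetical helper name, not something from Browsertrix or ReplayWeb.page, and `/network/#some-id` is a made-up link to another page.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: the page containing the timeline is served at this URL.
PAGE_URL = "https://countryofwords.supdigital.org/timeline/"


def strip_page_prefix(href: str, page_url: str = PAGE_URL) -> str:
    """Hypothetical helper: reduce a same-page link to a bare '#fragment'."""
    target = urlsplit(urljoin(page_url, href))
    current = urlsplit(page_url)
    # Compare scheme, netloc, path and query; ignore the fragment.
    same_page = target[:4] == current[:4]
    if same_page and target.fragment:
        return "#" + target.fragment
    return href  # links to other pages are left untouched


prefixed = "/timeline/#literary-diasporas-the-mahjar"
bare = "#literary-diasporas-the-mahjar"

# Both forms resolve to the same absolute URL when resolved against the page.
print(urljoin(PAGE_URL, prefixed) == urljoin(PAGE_URL, bare))
print(strip_page_prefix(prefixed))
print(strip_page_prefix("/network/#some-id"))
```

Under that assumption, the prefixed and bare links point at the same place, which is why removing the prefix should not change anything for users.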

jmulliken · October 4, 2024, 10:24pm

Well, it’s all a bit complicated. I’m not the site developer, so I can only relay what I can observe in the live hosted site files (which I do maintain). It appears the timeline is being generated by JavaScript (“svelte” if that makes sense). So the svg is not actually in that index page code. Instead there’s simply a div that’s pulling in further JS files, including the svelte. I was able to locate the file where the links are being generated, and I was able to change them (removing the “/timeline” so it was just “/#”). I confirmed they were relinked, but after that change the functionality of the timeline was no longer the same and the new anchor links did not actually resolve to the section headings as expected. Since I’m not the developer, I don’t feel comfortable digging too much further into the source files to try and track down all the various pieces that are coming together to create this functionality. I think the statement on the replay page will have to do for this particular archive. It definitely seems like a pretty fringe case to dedicate too much time to on y’all’s end. Nevertheless, I do appreciate the digging! It’s helped me to get a better understanding of the site itself and certain red flags for archivability in the future. So thank you!

Hank · October 4, 2024, 10:53pm

…removing the “/timeline” so it was just “/#”

It should just start with the #. Typically in-page ID links look something like this:

<a href="#in-page-link">visible link text</a>
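The leading slash matters because it makes the link root-relative. Here is a small Python sketch of how each form resolves, using `urllib.parse.urljoin`. It assumes the timeline is on a page served at `/timeline/`, and `in-page-link` is the same placeholder ID as in the HTML example above.

```python
from urllib.parse import urljoin

# Assumption: the page that holds the timeline is served at this URL.
page_url = "https://countryofwords.supdigital.org/timeline/"

# Resolve each href form against the page, roughly as a browser would.
for href in ("#in-page-link", "/#in-page-link", "/timeline/#in-page-link"):
    print(f"{href!r:28} -> {urljoin(page_url, href)}")
```

Under that assumption, `/#in-page-link` resolves against the site root instead of the current page, which would be consistent with the links no longer landing on the section headings after the change.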

I don’t feel comfortable digging too much further into the source files to try and track down all the various pieces that are coming together to create this functionality

Absolutely fair enough! I was hoping this would be an easy asset swap but that may not be the case. Generally it’s best practice not to edit the generated files from Svelte so that’s the right move.

We’ll see what can be done on our end at some point. I’ll try to file issues on Monday.