Error 500 (internal server error) while browsing in replay

Hi
I am doing my first test crawls and am running into a problem:

  • I display / review the crawl in replay
  • I click on sub-pages in the menu to access another page
  • and get an error “500 internal server error”
  • there is a “technical details” display, which reads like this:

Error: Error in Route Query
    at Pa (http://localhost:5471/w/id-5b2107bb837a/:37a8eec1ce19687d132fe29051dca629d164e2c4958ba141d5f4133a33f0688f/20241128123521esm_/https://www.bs.ch/_nuxt/CMh-WLqc.js:22:7297)
    at _i (http://localhost:5471/w/id-5b2107bb837a/:37a8eec1ce19687d132fe29051dca629d164e2c4958ba141d5f4133a33f0688f/20241128123521esm_/https://www.bs.ch/_nuxt/CMh-WLqc.js:22:9809)
    at X (http://localhost:5471/w/id-5b2107bb837a/:37a8eec1ce19687d132fe29051dca629d164e2c4958ba141d5f4133a33f0688f/20241128123521esm_/https://www.bs.ch/_nuxt/CJ3gAP8Z.js:2:3875)
    at async setup (http://localhost:5471/w/id-5b2107bb837a/:37a8eec1ce19687d132fe29051dca629d164e2c4958ba141d5f4133a33f0688f/20241128123521esm_/https://www.bs.ch/_nuxt/BlLpqCQd.js:3:1484)

This happens both in the integrated replay app in Browsertrix and in the local app after downloading the WACZ.
The pages themselves are there and archived correctly; I can access them through the URL menu of replay.

Any ideas?
Thanks for your help, Oliver

Unfortunately, the issue is that this site’s navigation is not crawler-friendly: it behaves differently when a link is clicked in the page than when a page is loaded directly via its URL.

There is no 500 error on the server; it is actually a 404, but the page frontend reports it as a 500. It looks like the site performs a different GraphQL query on every link click (to speed up loading), which the crawler never does, since it follows links by loading each page directly.

Here’s an example: the 404 (routeNodePage?..) is displayed as a 500.
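
To make the failure mode concrete, here’s a hypothetical sketch of what the frontend may be doing. Only the query name routeNodePage comes from the failing request above; the endpoint, fields, and error handling are assumptions for illustration, not the site’s actual code:

```ts
// Hypothetical sketch of the click-time code path. Only the name
// "routeNodePage" is taken from the failing request; the /graphql
// endpoint and the query shape are assumed for illustration.
async function routeNodePage(path: string): Promise<unknown> {
  // Fired only when a link is clicked and the client-side router
  // handles the navigation -- a direct URL load renders server-side
  // and never sends this request.
  const res = await fetch("/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query:
        "query routeNodePage($path: String!) { node(path: $path) { title content } }",
      variables: { path },
    }),
  });
  if (!res.ok) {
    // The crawler only ever loads pages directly, so this response was
    // never archived; in replay the fetch comes back 404, and the
    // frontend's error handler surfaces it as a 500.
    throw new Error(`Error in Route Query (${res.status})`);
  }
  return (await res.json()).data;
}
```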

This is something that could be fixed with a custom behavior, which we are adding support for soon, but it requires some custom work because of how the site is built. When you see this error, refreshing the page will make it load, because the crawler loads each page directly.
We’ll consider if there’s any other way to address this.
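
For a sense of what such a custom behavior would have to do, here is a minimal sketch in plain in-page TypeScript. The nav a[href] selector and the timings are assumptions, and the actual custom-behavior interface may look different:

```ts
// Minimal sketch of click-through logic for a custom behavior; the
// selector and delays are assumptions, not a final behavior API.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function clickMenuLinks(selector = "nav a[href]"): Promise<void> {
  const links = Array.from(
    document.querySelectorAll<HTMLAnchorElement>(selector),
  );
  for (const link of links) {
    // A real click runs the site's client-side router, which issues the
    // per-link GraphQL query, so its response gets captured.
    link.click();
    // Crude settle delay; a real behavior would wait for network idle.
    await sleep(2000);
    // Go back so the next menu link is clicked from the same page.
    history.back();
    await sleep(1000);
  }
}
```

Clicking links (rather than only loading URLs) makes the site issue the same per-link queries a visitor would, so their responses end up in the archive alongside the direct page loads.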

Hi Ilya,
Thanks for the quick response. There’s the workaround of reloading, but still, for a public web archive, usability is very limited. You mentioned custom behaviours as a solution: do you have a timeframe for when a fix for this could be tested and used?

I will test the page with our Heritrix instance, but I guess it won’t behave any better. So for the moment, this site (the brand-new main site of our state administration) will not be archived. We can surely wait a few months.

Thanks for keeping me up to date.
Oliver

We are actually testing a version that will be able to click links, and we are using your site as one of the test cases; so far it looks promising! We hope to make it available as a ‘Beta’ crawler channel at some point soon.

Hi Oliver,

We deployed a test version of the crawler, available under the ‘Dev’ channel, and ran a quick test crawl on your account to check the links; it looks promising. Will reach out with more over email.