I’m trying to archive a site that uses a select list and a “go” button as part of the navigation. When the form is submitted, a POST request is made and the end result is a 302 redirect to the URL of the item you’ve selected.
Is this something that can be achieved via browsertrix behaviors?
I tried patching my initial crawl with a manual crawl from the browser plugin. I tried two items in the dropdown as a test. Although the form doesn’t 404, it only sends me to the last item I tried.
Thanks in advance,
Heya Brian! It seems like ArchiveWeb.page’s support for POST canonicalization should result in the response getting saved and indexed for playback? Did you do the first crawl using ArchiveWeb.page? Is it possible to share the URL for the page in question?
URL in question: Provincial Archives of New Brunswick
It’s the “View newspaper” drop-down giving me issues.
The initial crawl was with Browsertrix (Docker). I only tried patching the crawl with the ArchiveWeb.page Chrome plugin when I noticed the drop-down wasn’t working.
It looks like there is some funky ASP form stuff going on where the form initially posts to
and then gets 302 redirected to a URL like:
I noticed that the first POST has over 50KB of POST data. Maybe this is hitting some internal limit in ArchiveWeb.page on how much POST data it will allow to be rewritten into a GET URL? If the data is getting truncated in some way, that could be causing the matching to fail. @ilya is there any truncation like this going on when mapping POSTs to GETs during recording? Alternatively, maybe fuzzy matching is picking the wrong entry in the index? I guess some analysis of the CDX file might be needed, here’s my WACZ if it’s helpful…
Sadly it looks like the correct response is there if you enter a URL directly, for example:
I wonder, is there any chance the app can be modified not to do this form handling prior to the archive being created? Also, this seems worthy of a ticket in the ArchiveWeb.page repo.
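To make the truncation hypothesis a bit more concrete, here is a rough Python sketch of what could go wrong if a POST body is folded into a GET-style lookup key with a single overall length cap. This is not ArchiveWeb.page's actual code: the function `post_to_get_url`, the `__method` marker, the `MAX_QUERY_LEN` value, the `paper` field name, and the example.org URL are all made up for illustration.

```python
from urllib.parse import urlencode, parse_qsl

MAX_QUERY_LEN = 1024  # hypothetical overall cap, not a real ArchiveWeb.page value


def post_to_get_url(url: str, post_body: str, max_len: int = MAX_QUERY_LEN) -> str:
    """Fold a form-encoded POST body into a GET-style lookup key, capped overall."""
    args = parse_qsl(post_body, keep_blank_values=True)
    query = urlencode([("__method", "POST")] + args)
    # One cap on the whole query string: anything after max_len is dropped.
    return url + "?" + query[:max_len]


# Two submissions that differ only in the selected item ("paper" is a
# hypothetical field name), both preceded by a very large __VIEWSTATE value.
viewstate = "A" * 50_000
body_a = urlencode([("__VIEWSTATE", viewstate), ("paper", "1")])
body_b = urlencode([("__VIEWSTATE", viewstate), ("paper", "2")])

key_a = post_to_get_url("https://example.org/form.aspx", body_a)
key_b = post_to_get_url("https://example.org/form.aspx", body_b)

# The distinguishing argument falls past the cap, so both keys are identical.
print(key_a == key_b)  # True
```

If something like this is happening, every selection would map to the same index key, which would be consistent with playback always landing on the last item recorded.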
Thanks for giving it a shot Ed. I’ll submit an issue to the repo (Update: GH Issue).
I hadn’t noticed that initially, but you’re right, the __VIEWSTATE POST arg is a massive blob.
I will also enquire upstream and see if that dropdown can be replaced with something more straightforward. I assume a simple JS hook would do the trick?
Since the destination URL is actually already archived via a different path, I wouldn’t even have to bother trying to automate clicking on each of those items in the list either.
FYI, they’ve updated the form upstream to a simple JS redirect and it works perfectly now.
That is a beautiful thing! It still would be good to fix that issue, so thanks for documenting it.
I took a look at the time and that was indeed the issue: one very long POST query argument can result in all the others being truncated. I’m adding truncation of individual POST query arguments in the latest ArchiveWeb.page release, 0.11.3, to avoid this issue in the future, as we’re more likely to get an accurate match from the smaller arguments (imo).
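For anyone reading later, the per-argument idea can be sketched roughly like this in Python. It is only an illustration under assumptions, not the code that shipped in 0.11.3: `post_to_get_url_per_arg`, the `__method` marker, the `MAX_ARG_LEN` value, the `paper` field name, and the example.org URL are hypothetical.

```python
from urllib.parse import urlencode, parse_qsl

MAX_ARG_LEN = 512  # hypothetical per-argument cap, not the real 0.11.3 value


def post_to_get_url_per_arg(url: str, post_body: str,
                            max_arg_len: int = MAX_ARG_LEN) -> str:
    """Fold a form-encoded POST body into a GET-style key, capping each value."""
    args = parse_qsl(post_body, keep_blank_values=True)
    # Trim each value on its own, so one huge argument cannot crowd out the rest.
    trimmed = [(name, value[:max_arg_len]) for name, value in args]
    return url + "?" + urlencode([("__method", "POST")] + trimmed)


# Same scenario as in the thread: a huge __VIEWSTATE plus a small argument
# ("paper" is a hypothetical field name) that identifies the selection.
viewstate = "A" * 50_000
body_a = urlencode([("__VIEWSTATE", viewstate), ("paper", "1")])
body_b = urlencode([("__VIEWSTATE", viewstate), ("paper", "2")])

key_a = post_to_get_url_per_arg("https://example.org/form.aspx", body_a)
key_b = post_to_get_url_per_arg("https://example.org/form.aspx", body_b)

# The small argument survives, so the two selections get distinct keys.
print(key_a != key_b)  # True
```

The trade-off is that two requests differing only deep inside a very long value would still collide, but the short arguments are usually the ones that identify the selection.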