Several identical pages at different times in one collection

If I crawl a URL, say the homepage of a news portal, at different times with Browsertrix Cloud and merge the crawls into a collection, ReplayWeb.page shows me the seed URL several times with different timestamps (good).

But regardless of which entry point I choose, I always see the oldest content version of the page.

A simple example would be https://replayweb.page/?source=https://app.browsertrix.com/api/orgs/01e6c292-95fb-4a85-b3f1-73dfa16132a7/collections/25600540-762c-44fd-a8f0-c7818fdb5946/public/replay.json#view=pages:

All three identical seed URLs, including the one from 7/9/2024, lead to the oldest content version, from 4/3/2024, which says: “The next meeting will take place on Thursday, March 28, 2024”, while the crawl from 7/9/2024 should show: “The next meeting will take place on Friday, June 14, 2024”.

In https://forum.webrecorder.net/t/follow-up-crawl-for-pages-that-have-not-yet-been-captured/574 I learned that there is no “update mode” for crawls (unfortunately!).

I’m all the more surprised that the oldest, not the most recent, version is displayed.

Or is there a way to give ReplayWeb.page a temporal context in which to navigate within a collection? That would of course be a good thing for regular crawls of pages whose development over time you want to document.

At the moment, my understanding is that I have to create a separate collection each time for referenceable temporal versions of a web page.

TIA
Heinz

Hmm, a few things are going on here!

  1. Currently, ReplayWeb.page’s index views show multiple pages, but in my opinion this isn’t the most helpful way to browse archives with many captures at different times, because it creates a lot of duplication. We are moving towards displaying a single entry per URL with temporal information in the list… but that’s not quite implemented yet. See Update resource (URLs) browser · Issue #241 · webrecorder/replayweb.page · GitHub for some of the details there.

  2. The way things are currently implemented, this feels like a bug! I can reproduce it: no matter which entry in this list I click, I am always directed to the oldest entry, as you say. You should be able to switch between these timestamps in the URL bar, but that doesn’t seem to be available here… Something is up!

> At the moment, my understanding is that I have to create a separate collection each time for referenceable temporal versions of a web page.

No, this isn’t the way it should work… Ideally they should all be referenceable within the collection. As for why that isn’t happening, we’ll have to investigate further.

> Or is there a way to give ReplayWeb.page a temporal context in which one wants to navigate within a collection? That would of course be a good thing for regular crawls of pages whose development over time you want to document.

This is a good observation. You mention being surprised that it didn’t navigate to the most recent page. If you are looking at an archive and you click a link that has been archived many times, would you expect to go to the closest timestamp based on the previous page, the most recent, or the oldest?
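One common answer to this question is the Wayback Machine’s behavior: resolve each link to the capture closest in time to the page you navigated from. As a minimal sketch (this is not how ReplayWeb.page is actually implemented; the function name and the example timestamps, based on the crawl dates mentioned above, are purely illustrative):

```python
from datetime import datetime

def closest_capture(captures, anchor):
    """Return the capture timestamp closest to the anchor time.

    `captures` are the datetimes at which a URL was archived;
    `anchor` is the timestamp of the page the user navigated from.
    """
    return min(captures, key=lambda ts: abs((ts - anchor).total_seconds()))

captures = [datetime(2024, 3, 4), datetime(2024, 5, 10), datetime(2024, 9, 7)]

# Navigating from a page captured in early September resolves to the
# September capture, not the oldest one.
print(closest_capture(captures, datetime(2024, 9, 1)))  # 2024-09-07 00:00:00
```

Under this policy, the current behavior (always landing on the oldest capture) would be surprising regardless of which default one prefers.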

From an archive perspective, it is messy to mix different crawl time periods in the replay (as I believe the Wayback Machine also does).

Navigation from link to link should take place in a user-definable time frame.

The default could be to start from the timestamp of the first link of the index view, or of the previous link. However, a certain adjustable tolerance would be appropriate, since complex crawls can extend over a period of time even though they logically belong together.

This tolerance range could include everything by default, so that users who simply want to see all web pages are satisfied.

Archive users who want to examine and compare defined time slices could set the tolerance as they need it.
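The tolerance idea above could be sketched as a simple filter over capture timestamps, where `None` means “accept everything”, the proposed default (again, a hypothetical sketch, not an existing ReplayWeb.page feature; all names are illustrative):

```python
from datetime import datetime, timedelta

def captures_within_tolerance(captures, anchor, tolerance=None):
    """Keep only captures within `tolerance` of the anchor timestamp.

    A tolerance of None accepts every capture, matching the proposed
    default for users who just want to see all pages.
    """
    if tolerance is None:
        return list(captures)
    return [ts for ts in captures if abs(ts - anchor) <= tolerance]

captures = [datetime(2024, 3, 4), datetime(2024, 9, 7)]

# With a one-week tolerance around the September crawl, only the
# September capture remains an eligible navigation target.
eligible = captures_within_tolerance(captures, datetime(2024, 9, 7),
                                     tolerance=timedelta(weeks=1))
```

Navigation would then pick the closest capture from the filtered set instead of from all captures in the collection.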

What is the preferred way of reporting issues like that? A post here? An email to support@webrecorder.net? Filing a Github issue?

> From an archive perspective, it is messy to mix different crawl time periods in the replay (as I believe the Wayback Machine also does).

This can indeed result in some issues, but it also has benefits - namely the ability to patch crawls and restore broken links. If you don’t want to mix content from different time periods, we would suggest not adding the crawls to one collection, and instead creating multiple collections and linking to them individually through an external source.

Your suggestion of a temporal tolerance range in ReplayWeb.page is an interesting one. I think it could be a good addition, but it likely won’t be a priority for us for quite some time. Feel free to file a GitHub issue in the ReplayWeb.page repo!


For the current issue, absolutely fire off an email to the support address - the forum isn’t an official support channel for customers with dedicated support plans (though we’re happy to answer questions here if we’re able)! A GitHub issue can optionally be created, but things like this only get tracked for our customers once there’s an email associated with them. We’ll also end up making a GitHub issue if you don’t, and in scenarios like this that might be preferable, as this could be a ReplayWeb.page issue or a Browsertrix collections issue… hard for me to say which one right now! If you’d like to be added or tagged over in GitHub, we can do that too 🙂

Yes, we did actually have replay separate out this way, but this prevented the ability to patch missing content with additional content (from a new crawl, or an uploaded WACZ file).

An adjustable tolerance is interesting, but I think it will be quite tricky, because it means users have to decide what is acceptable.

However, this particular issue is definitely a bug of some sort: the distinct crawls should all load within one collection, so we’ll take a look.