Are shared (public) URLs pointing to replayweb.page indexed by Google or other search engines?

kramski · April 30, 2024, 9:57am

If we share/publish collections created with Browsertrix Cloud at ReplayWeb.page…, can the public URLs be indexed by Google or other search engines?

For some web resources we have the rights to archive and publish them, but the owners don’t want the archive version to overshadow the original source in Google results.

A parameter that could be used to optionally prohibit or allow crawling by other search bots would be ideal.

Hank · April 30, 2024, 3:24pm

As with all things SEO, it’s hard to say exactly… But the likely answer is that it won’t overshadow the original site.

Some articles claim that Google doesn’t index iframes, but the docs links they point to are broken. Others say it likely does, because it actually uses a headless browser to interpret the content when crawling.

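Your most surefire bet is probably to keep the page containing the link out of the index, either with a robots.txt rule or with a noindex tag? (Pick one or the other: Google only sees a noindex tag on pages it is allowed to crawl.) Here’s a link to Google’s docs for removing pages.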
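As a rough illustration of the noindex option, here is a minimal sketch of a page that links to an archived copy and asks search engines not to index it. It assumes you host that linking page yourself; the page content, the link target, and the `app` function are all made up for the example. It sets the directive twice, as a `<meta name="robots">` tag and as an `X-Robots-Tag` response header, though either one alone should be enough.

```python
# Hypothetical landing page that links to an archived copy.
# The href is a placeholder, not a real shared collection URL.
PAGE = b"""<!doctype html>
<html>
  <head>
    <meta name="robots" content="noindex">
    <title>Archived copy</title>
  </head>
  <body>
    <a href="https://replayweb.page/">Open the archived copy in ReplayWeb.page</a>
  </body>
</html>
"""


def app(environ, start_response):
    """Minimal WSGI app that serves PAGE with a noindex response header."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
        # Header form of the same directive as the <meta> tag in PAGE.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [PAGE]
```

Any WSGI server can serve `app`, for example the standard library's `wsgiref.simple_server.make_server`. The page must stay crawlable (not blocked in robots.txt), otherwise the directive is never read.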

This topic actually came up at the recent IIPC web archiving conference, and I don’t think anyone was able to point to any instances of it happening. I’ve personally never run into it in search with the variety of web archives that exist out there today… There are other factors, such as update frequency and crosslinks, that Google uses when deciding how to score a page. I highly doubt more websites will link to your archive than to the original site.

TL;DR you’re probably good. They’ll tell you if that’s ever not the case, and I wouldn’t put too much effort into worrying about it. It’s actually somewhat difficult to rocket up to the first spot on Google!

kramski · May 2, 2024, 1:46pm

Thank you for the detailed answer.

But wouldn’t a simple https://replayweb.page/robots.txt be enough to safely lock out all (benign) bots for all archived items?

Hank · May 2, 2024, 3:09pm

Admittedly I’m not that familiar with setting up robots.txt — here at Webrecorder we’re pretty crawler friendly heh — but we can’t just block anything past / because our docs exist at /docs and we want those crawled. Because this has never been a problem in practice, we haven’t done this, but feel free to submit a PR or suggest a robots.txt config if it ever becomes one? Again, I’m not entirely convinced Google can even interpret content in our replay engine.
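For anyone who wants to suggest a config, here is one possible shape for it: a sketch that keeps /docs crawlable and disallows everything else. This is not the site's actual robots.txt. The rules and the `is_allowed` helper are assumptions for illustration, checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for replayweb.page, not the site's real config.
# Allow comes first because Python's parser applies the first matching rule;
# Google instead applies the most specific (longest) matching rule.
ROBOTS_TXT = """\
User-agent: *
Allow: /docs
Disallow: /
"""


def is_allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the sketch above lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://replayweb.page/docs/",
        "https://replayweb.page/",
        "https://replayweb.page/?source=example.wacz",  # made-up replay URL
    ):
        print(url, is_allowed(url))
```

Two caveats: as written, this would also block the landing page itself, and robots.txt only stops well-behaved bots from crawling. A blocked URL that is linked from elsewhere can still show up in results without a snippet.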