Problems archiving Google Sites

I am using Browsertrix v1.13.2 to archive a Google Site, Storytelling With AI.

Somehow, the images on some pages are not captured. Any help would be very much appreciated.

I took a look at this page in my browser. When I opened the network tab of the developer tools and force-reloaded the page without caching (Cmd + Shift + R), I found that it took roughly 40 seconds for the page to fully finish loading, with the images taking especially long and only arriving near the end of that 40 seconds.
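
If you want to reproduce that timing check without clicking around in DevTools, here is a minimal sketch using Playwright (my own tooling choice, not something from this thread); the URL is a hypothetical placeholder for the affected page, and "network idle" is only a rough stand-in for "fully finished loading."

```python
# Minimal sketch: time how long a page takes to reach network idle with Playwright.
# Assumes `pip install playwright` and `playwright install chromium` have been run.
import time
from playwright.sync_api import sync_playwright

PAGE_URL = "https://sites.google.com/view/your-site/your-page"  # hypothetical placeholder

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    start = time.monotonic()
    # Wait until there have been no network connections for ~500 ms,
    # which roughly approximates "everything, including images, has loaded".
    page.goto(PAGE_URL, wait_until="networkidle", timeout=120_000)
    print(f"Reached network idle after {time.monotonic() - start:.1f}s")
    browser.close()
```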

You could try increasing the Page Load Timeout value to force the crawler to wait on the page for longer before capturing content. Browsertrix does try to intelligently detect whether there are still assets that need to be loaded, but this is a way of brute-forcing it if it doesn't. Give it a shot and watch the crawl for the first batch of pages to make sure it's working properly. It will increase the amount of time a crawl takes, so it's good to confirm that this is the issue before spending a lot of execution time on the site!

Thanks, Hank. I increased the Page Load Timeout from 120 seconds to 1200 seconds and got the same result.

I tried capturing the page with ArchiveWeb.page as well and hit the same issue, so it's not a timeout problem!

This seems like a replay issue to me, which means that your archives may very well have been captured correctly!

To test this yourself, try looking for images in the resources tab of ReplayWeb.page within Browsertrix. I was able to find the original images as seen on the site, meaning that they’re in the archive.
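
If you'd rather confirm this outside the UI, here is a minimal sketch (again my own tooling, not something Browsertrix ships) that scans a downloaded WACZ for image responses using warcio. A WACZ is a ZIP whose `archive/` directory holds the WARC data; the filename below is a hypothetical placeholder.

```python
# Minimal sketch: list image responses captured in a WACZ file.
# Assumes `pip install warcio` and a crawl WACZ downloaded locally.
import zipfile
from warcio.archiveiterator import ArchiveIterator

WACZ_PATH = "my-crawl.wacz"  # hypothetical placeholder

with zipfile.ZipFile(WACZ_PATH) as wacz:
    for name in wacz.namelist():
        # WARC data lives under archive/ inside the WACZ container.
        if not (name.startswith("archive/") and name.endswith(".warc.gz")):
            continue
        with wacz.open(name) as warc:
            for record in ArchiveIterator(warc):
                if record.rec_type != "response":
                    continue
                ctype = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
                if ctype.startswith("image/"):
                    print(record.rec_headers.get_header("WARC-Target-URI"), ctype)
```

If the images you're missing on replay show up in that listing, that's further evidence the capture is fine and the problem is on the replay side.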

Looking in the developer tools, the images appear to be loaded within a bunch of `<iframe>` elements… Not exactly the simplest way of making a grid of images :upside_down_face:

Because the data exists in the archive, I’ve filed a replay bug that you may wish to follow on GitHub. Can’t promise a quick resolution for this, but it’s on our radar!

Thank you so much, Hank, for your invaluable help. I truly appreciate your effort.
