Web archives for data journalism: animation recording issue

jwalt · May 7, 2024, 7:48pm

Hi! I work for a museum working to collect and present works of interactive data journalism. We have used archiveweb.page to successfully record some of the sites we are collecting (with permission and cooperation of the publisher) but we are having issues with animation files in one article in particular.

I’m not sure whether simply trying a different recording tool can solve this issue: the animations render just fine while recording, but they aren’t actually being captured in the resulting WARC/WACZ file. The specific article of interest is “How the Virus Got Out” (The New York Times). I have tried recording the site with archiveweb.page and Conifer, both of which I believed were the tools currently best able to capture complex animated content.

Any suggestions on alternate ways to capture the site would be appreciated, along with a clearer description of what is going wrong in a more technical sense. Is this a problem with this site in particular or a more generalizable issue? In the meantime we’ll present the work through video documentation, but the goal is to replay a warc/wacz to enable a more interactive (self-paced) viewing experience, as in the current website experience. Thank you for any help explaining or troubleshooting the issue.

Hank · May 7, 2024, 9:22pm

Hey there! I don’t have a NY Times account, so I can’t replicate the issues for myself, but I can offer a few thoughts:

The problem you seem to be encountering, “was a piece of content captured correctly, or are there replay bugs?”, is one of the harder problems to solve in web archiving, because the content might have been captured fine! Generally it’s not something we’re equipped to solve quickly either. Outside of Browsertrix, where we have some QA features that can help get a better idea of what’s going wrong, one of the better ways for average folks to nail this down is to archive the page with other tools and replay the files they create in ReplayWeb.page, so it’s good that you tried Conifer as well.

If you find that specific resources are missing on replay, checking the resources tab can be a good way of nailing down what was captured and what isn’t in the archive.
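If you’d rather compare two captures outside the viewer, here is a minimal Python sketch of that idea. It uses only the standard library and a deliberately simple reader of WARC record headers, not a full parser. The function names (`captured_urls`, `missing_urls`) and the file paths are made up for illustration, and it assumes you have plain `.warc` or `.warc.gz` files on disk.

```python
import gzip


def iter_warc_headers(stream):
    """Yield a dict of WARC headers for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body so its bytes are never mistaken for headers.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def captured_urls(stream):
    """Return the set of URLs that have a 'response' record in the stream."""
    return {
        h["warc-target-uri"].strip("<>")
        for h in iter_warc_headers(stream)
        if h.get("warc-type") == "response" and "warc-target-uri" in h
    }


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def missing_urls(reference_path, candidate_path):
    """URLs captured in the reference archive but absent from the candidate."""
    with open_warc(reference_path) as ref, open_warc(candidate_path) as cand:
        return sorted(captured_urls(ref) - captured_urls(cand))
```

Calling `missing_urls("working-capture.warc.gz", "my-capture.warc.gz")` (hypothetical file names) would list URLs present in the first file but not the second, which is a reasonable starting point for spotting animation data files that never made it into your capture. A WACZ is a ZIP package, so as far as I know you would need to unzip it first and point this at the WARC files inside its `archive/` folder.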

Ghostarchive, a third-party site that we are not affiliated with, does a decent job of getting around paywalls, crawls to WARC files, and also uses ReplayWeb.page for playback! I have no idea what they use for crawling, but it seems to have captured this content correctly, and it is replaying in their embedded ReplayWeb.page viewer. Click here to check out their archive of this page. This could indicate a possible capture issue with ArchiveWeb.page.

You can also download their WARC file for this page if you’d like.

Once again, we can take absolutely no responsibility for the validity of files created with Ghostarchive. I have no idea who runs it or how their capture process works. Spooky!