Impact of accessing wacz on a website

DHK · January 28, 2023, 2:51pm

I’ve been investigating saving a website in a wacz file for archival preservation. I’ve successfully tested my wacz build and have set up a test version on a website where it seems to work flawlessly. I’m now trying to find sites to host this file and have been asked about the impact of viewers accessing it on the website operation. Does accessing a wacz file put any more or less strain on a site than a normal user? Thanks.

Hank · January 29, 2023, 5:21am

When replaying a WACZ file you are only viewing the archived copy of the site. The actual website that has been archived won’t receive any traffic from users viewing the WACZ file.

As for the effect on the server hosting WACZ files, I’ll refer you to the Processing Model section of the spec. Even when individual WACZ files are very large, the whole file does not need to be downloaded by users to browse the archived contents. The list of pages is downloaded first, and ReplayWebPage then uses that list to request the other relevant parts of the WACZ file from the hosting web server on the fly, as the user asks for them.
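
To make the "only part of the file is fetched" point concrete, here is a minimal local sketch. It is not ReplayWebPage's code. It assumes only that a WACZ is a ZIP file, and it uses a hypothetical `CountingReader` class to stand in for ranged reads against the hosted file. The archive contents are toy placeholders, not a real crawl.

```python
import io
import zipfile


class CountingReader:
    """Seekable view over bytes that tallies how many bytes are actually read.

    Hypothetical helper: each read stands in for a ranged fetch from the host.
    """

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0
        self.bytes_read = 0

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = len(self._data) + offset
        return self._pos

    def read(self, size: int = -1) -> bytes:
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        self.bytes_read += len(chunk)
        return chunk


def build_toy_archive() -> bytes:
    """Build a ZIP with one large member and one small page list (toy data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("archive/data.warc.gz", bytes(1_000_000))  # placeholder bulk
        zf.writestr("pages/pages.jsonl", '{"url": "https://example.com/"}\n')
    return buf.getvalue()


def read_one_member(archive: bytes, name: str) -> tuple[bytes, int]:
    """Read a single member and report how many bytes had to be touched."""
    reader = CountingReader(archive)
    with zipfile.ZipFile(reader) as zf:
        content = zf.read(name)
    return content, reader.bytes_read


if __name__ == "__main__":
    archive = build_toy_archive()
    _, touched = read_one_member(archive, "pages/pages.jsonl")
    print(f"archive size: {len(archive)} bytes, bytes read: {touched}")
```

Reading the small page list only touches the ZIP directory at the end of the file plus that one member, so the byte count stays far below the archive size. A client using HTTP Range requests can work on the same principle.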

DHK · January 29, 2023, 11:47am

@Hank Thanks, would you say there is a negligible difference between a normal user of a website and a user accessing a wacz file? A potential host is concerned about possible hits to their server.

Hank · January 30, 2023, 2:12am

I think my above comment mostly answers this? I don’t think the amount of data transferred from a WACZ file vs a live site would be different enough to be concerned about. Overall, if I had to give a vague ballpark response, I would say that yes, the difference should be negligible. That said, I don’t know what kind of data you’re planning on archiving or what your hosting or bandwidth fees look like, so it’s difficult to give a comprehensive answer. Those things are also independent of whether you are viewing the data through a web archive or not, though, so it’s probably a moot point.

Try it out and run some experiments? See what happens! If viewing the files through ReplayWebPage is resulting in much larger data transfers than you’d expect, we’d like to know!

edsu · January 31, 2023, 10:27am

Can you say more about where the potential host is planning on publishing the WACZ files? As Hank said, ReplayWebPage works with large WACZ files because HTTP Range requests are used to retrieve portions of the WACZ file on demand. This means transferring the complete WACZ isn’t necessary for accessing the web archive! Depending on the archived content (how many resources did the archived page require for rendering) and how much the archive is viewed (is it part of a news story that just went viral?), this could result in an increased number of HTTP requests to wherever the WACZ is stored, which may have some costs associated with it?
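
For anyone who wants to show the host's IT department what a Range request looks like, here is a small sketch of the standard HTTP pieces involved. The function names are made up for illustration and no network calls are made. A server that honors a byte range answers with status 206 and a `Content-Range` header, while one that ignores it sends the whole file with status 200.

```python
def byte_range_header(start: int, length: int) -> dict[str, str]:
    """Header asking for `length` bytes beginning at offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return {"Range": f"bytes={start}-{start + length - 1}"}


def suffix_range_header(length: int) -> dict[str, str]:
    """Header asking for the last `length` bytes, e.g. the end of a ZIP file."""
    if length <= 0:
        raise ValueError("length must be > 0")
    return {"Range": f"bytes=-{length}"}


def is_partial_response(status: int, headers: dict[str, str]) -> bool:
    """True if the server honored the range (206 plus a Content-Range header)."""
    names = {name.lower() for name in headers}
    return status == 206 and "content-range" in names


if __name__ == "__main__":
    print(byte_range_header(0, 1024))
    print(suffix_range_header(65536))
```

If the host's server or storage service does not support Range requests, a viewer would likely need to fetch the whole file, so this is worth checking before publishing a large WACZ.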

ReplayWebPage does also make the URL for the WACZ file discoverable when viewing metadata, which could lead to some full downloads by users who want the complete dataset. Depending on where the WACZ is hosted that might be preventable, and perhaps there could be some way of disabling or hiding the WACZ download in ReplayWebPage. But that would be kind of sad, and probably should be avoided until it becomes a known problem? Doing the research ahead of time to figure out how it might be done could be beneficial depending on how risk averse your client is.

If it’s helpful we could set up a call to discuss: info@webrecorder.net

DHK · January 31, 2023, 12:41pm

@edsu Thanks for the reply. I’m still waiting to hear back from the potential host’s IT department, but will be back here if I do need more info. This is not a high volume site… In all of January as of a day or so ago we only had 255 users according to Google Analytics… we’re a memorial/history site for a radio station that was sold in 1974!