Best way to later remove individual resources or pages from a crawl?

We once had a situation where we needed to quickly remove a single copyrighted photo from a larger crawl.

What is the best way to do this with Browsertrix Cloud? Is there a (standalone) WACZ editor to modify the archive files afterwards?

There is currently no way to do this within Browsertrix, and it's a little more involved to add than one might initially think!

Because the archived data inside WACZ files is hashed and signed to verify capture authenticity, removing or modifying any part of the file will invalidate the hash.
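As a quick illustration of why (plain Python hashlib with a made-up payload, not the actual WACZ signing code), any byte-level change produces a completely different digest, so a previously recorded hash no longer verifies:

```python
import hashlib

# Hypothetical page payload before and after removing an image reference.
original = b"<html><body><img src='photo.jpg'></body></html>"
modified = b"<html><body></body></html>"

# A signed hash computed over the original bytes will not match the modified bytes.
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(modified).hexdigest())  # False
```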

ArchiveWeb.page works a little differently from Browsertrix: it stores all of its data in a database and creates WACZ or WARC files on export. If you load a WACZ file into ArchiveWeb.page, you can remove pages from an archived item by clicking the X button in the list of pages. This will also remove any resources associated only with that page.

Doing the above will destroy any signing data associated with the page.


Alternatively, WACZ files are actually just ZIP files containing WARCs alongside some extra metadata (such as the signing info and, more importantly, an index of the content that enables fast serverless replay). If you previously removed specific resources from WARC files, you can do the same here: unzip the WACZ, extract the WARCs, delete the records in question, and re-create a WACZ file from your WARCs with py-wacz. Doing this will also destroy the signing data.
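As a rough sketch of the unzip step (stdlib Python only; the file names are made up, and a tiny stand-in WACZ is built first so the snippet runs on its own — with a real crawl you would open your downloaded .wacz instead):

```python
import zipfile

# Build a tiny stand-in WACZ: just a ZIP with a WARC under archive/.
with zipfile.ZipFile("demo.wacz", "w") as z:
    z.writestr("archive/data-0.warc.gz", b"...warc bytes...")
    z.writestr("datapackage.json", "{}")

# A WACZ is an ordinary ZIP; list and extract the WARC files for editing.
with zipfile.ZipFile("demo.wacz") as wacz:
    warcs = [n for n in wacz.namelist() if n.startswith("archive/")]
    wacz.extractall("unpacked")

print(warcs)  # ['archive/data-0.warc.gz']
# After deleting the unwanted records from unpacked/archive/ (e.g. with a
# WARC tool such as warcio), rebuild with py-wacz: wacz create unpacked/archive/*
```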


I'd be interested to hear how you went about this previously with the image you mention!


This was long ago, I think we were using HTTrack back then and simply deleted the file.


Good advice, thank you! Do we need the standalone app for that? The browser extension (v0.11.3) throws an “Unexpected Loading Error: TypeError: this.store.loadWACZFiles is not a function” on loading a .wacz file made with Browsertrix Cloud (beta crawler).

Nope, the browser extension should work for this. I can't replicate the error; it seems to work fine with files created from our instance, tested with two WACZ files.

Dragging and dropping might be broken? Be sure to use the “Import Archive” button on the home screen.

Did that. Same error for me on Edge (Windows) and Chromium (Linux) with two different .wacz files. Console shows

Uncaught (in promise) TypeError: Cannot destructure property 'url' of 'object null' as it is null.
    at Aa.login (ui.js:3745:1142)
    at new wr (ui.js:3745:14084)
Service worker is registered
attempt to disable CSP to ensure replay works
w/api/c/id-c22434363e3f?all=1:1 Failed to load resource: the server responded with a status of 404 ()
sw.js:121 TypeError: this.store.loadWACZFiles is not a function
    at Rb.loadMultiWACZPackage (sw.js:121:36532)
    at Rb.loadPackage (sw.js:121:36360)
    at async Rb.load (sw.js:121:35309)
    at async Nb.load (sw.js:121:38482)
    at async Qb.addCollection (sw.js:121:62479)
    at async Qb._handleMessage (sw.js:121:57877)
addCollection @ sw.js:121

(the "attempt to disable CSP to ensure replay works" and 404 lines then repeat several times)

The .exe app also fails with the identical error message.

Could you perhaps try to import a minimal example which also fails here: DLA-Cloud?

TIA
Heinz

That WACZ seems to load just fine for me in ReplayWeb.page (macOS/Firefox). If you're getting a repeatable error on your platform, I think that's definitely worth ticketing at https://github.com/webrecorder/replay-web-page/issues/

Update: sorry, I missed that you were using ArchiveWeb.page. I actually didn't realize the replay mechanism diverged much between the two, but I guess importing data could take different code paths. This has been an interesting discussion!

@edsu This is an ArchiveWeb.page issue :wink: (but also thanks for testing! <3 )

@kramski Thanks for the example file! Found the issue: this WACZ uses the multi-WACZ spec, which ArchiveWeb.page isn't yet set up to open. If you unzip the WACZ and open the WACZ located inside of it, it will work. Complicated! This will likely be a problem for all WACZ files downloaded from a collection, but it shouldn't be an issue for WACZ files downloaded from the Files tab of an archived item.

Behind the scenes, this is actually how collections work! They load multiple WACZ files using the datapackage.json file that you’ll find inside that WACZ file. It looks like this:

{
  "profile": "multi-wacz-package",
  "resources": [
    {
      "name": "01e6c292-95fb-4a85-b3f1-73dfa16132a7/20240408083603494-6de5935a-ea7-0.wacz",
      "path": "01e6c292-95fb-4a85-b3f1-73dfa16132a7/20240408083603494-6de5935a-ea7-0.wacz",
      "hash": "5bbc74deb49717fcef99127e7bed27167ff8b3269064c1e9cab5a03c56b30bca",
      "crc32": 1722297469,
      "size": 1929734,
      "crawlId": "manual-20240408083534-6de5935a-ea7",
      "numReplicas": 1,
      "expireAt": "2024-05-04T15:48:44"
    }
  ]
}

In this file, 01e6c292-95fb-4a85-b3f1-73dfa16132a7/20240408083603494-6de5935a-ea7-0.wacz is the one we actually want to load into ArchiveWeb.page. Ideally, ArchiveWeb.page would handle this by reading datapackage.json and importing all the nested WACZs inside the top-level WACZ, but it looks like it hasn't been updated to do that yet!
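For anyone who wants to script the workaround in the meantime, here's a minimal sketch (stdlib Python only; the file and path names are invented, and a stand-in multi-WACZ is built first so the snippet is self-contained) of reading datapackage.json and extracting the nested WACZs:

```python
import json
import zipfile

# Build a stand-in multi-WACZ: a ZIP whose datapackage.json lists
# nested WACZ files as resources (paths here are made up for the demo).
inner_path = "crawl-id/inner-0.wacz"
with zipfile.ZipFile("collection.wacz", "w") as z:
    z.writestr("datapackage.json", json.dumps({
        "profile": "multi-wacz-package",
        "resources": [{"name": inner_path, "path": inner_path}],
    }))
    z.writestr(inner_path, b"...inner wacz bytes...")

# Read datapackage.json and pull out each nested WACZ, which can then
# be imported into ArchiveWeb.page individually.
with zipfile.ZipFile("collection.wacz") as z:
    pkg = json.loads(z.read("datapackage.json"))
    if pkg.get("profile") == "multi-wacz-package":
        for res in pkg["resources"]:
            z.extract(res["path"], "extracted")
            print("extracted", res["path"])
```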

I have filed an issue for this here: [Feature]: Support importing multi-wacz WACZ files · Issue #210 · webrecorder/archiveweb.page · GitHub


Thanks! So you have been able to import it with “Import Archive” into ArchiveWeb.page? Strange, as my Windows PC at work and my Linux box at home are very different…

Never mind, I should have read your earlier comment.

I’m curious, given

R1: an HTML page resource in the archive
R2: an image resource in the archive, that is referenced via an <img> tag from R1

If I remove R1 from the archive, will it also remove R2?

I guess the motivation here is whether this approach could work when trying to remove PII from a page. If the PII was in both the HTML and the image, you would need to remove both R1 and R2?

AFAIK yes, though I haven't looked at the code or inspected the exported WARCs; the only way I've tested this is by looking at the size of the archive. I have no idea how this is handled if the image is used on multiple pages.

Also you can download individual pages in the pages list!

How ArchiveWeb.page handles deletion isn't really documented at the moment; in the future I'll be converting that Jekyll site into an MkDocs site to match the rest of our docs. It would be good to add this after I do that, if you end up investigating :slight_smile: