Saving time and space when repeatedly archiving large responses

I mainly use Browsertrix Crawler to repeatedly archive the same sites and URLs every few days. There are a few files on those sites that are pretty big and rarely change (extreme case: this 457 MB video at NOAA.gov), so one thing I’m thinking about is how I might be able to save time and overhead in both crawling and storing them.

Has anyone else here dealt with this sort of thing and come up with useful tools or techniques?

Things I’ve been thinking of:

  • When it comes to re-requesting these while crawling, I’m imagining some way of seeding the crawler or the browser cache with ETags and dates for various URLs. After a crawl, I could easily gather the ETag and Last-Modified headers for responses larger than 10 MB and store them along with a digest or a pointer to the raw bytes in the crawl’s WARC, so that next time the crawler could request the data from a local location or even record a revisit record. Or maybe this could be integrated somehow with the Redis server that the crawler uses (I currently use the one embedded in the official Docker image, but it would be worth setting up a long-lived external one if it could help with this kind of thing).
  • Another option that I think could work today (but haven’t tried) might be writing a proxy that does the above, although it wouldn’t be able to record the revisit record; it would just have to feed the old saved bytes to the crawler instead of passing the request through to the live server.
  • Short of writing revisit records, are there alternative storage formats people have used that address the storage duplication? It seems easy to imagine breaking up a WARC into its records and payloads and putting them in some sort of hash-addressed storage, then re-assembling them on demand. I’ve seen this Webrecorder design document about using IPFS to do the job, but the linked implementations all seem to have been untouched for a few years (Webrecorder staff: do you use this or something similar in production anywhere?).
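To make the first bullet concrete, here’s a minimal sketch of turning saved ETag/Last-Modified metadata into conditional request headers. The index shape ({url: {"etag": ..., "last_modified": ..., "digest": ...}}) is a hypothetical format I’m inventing for illustration, not anything Browsertrix Crawler supports today:

```python
# Sketch: seed conditional requests from metadata gathered after a previous
# crawl. The index format here is hypothetical, not a real crawler feature.

def conditional_headers(index, url):
    """Build If-None-Match / If-Modified-Since headers for a URL, if known."""
    meta = index.get(url)
    if not meta:
        return {}
    headers = {}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers

# Example index entry for one large, rarely-changing file:
index = {
    "https://example.gov/big-video.mp4": {
        "etag": '"abc123"',
        "last_modified": "Tue, 01 Oct 2024 00:00:00 GMT",
        "digest": "sha256:...",
    }
}
```

On a 304 Not Modified, the crawler (or a proxy in front of it) could then serve the payload bytes referenced by `digest` from local storage instead of re-downloading them.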

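And for the hash-addressed storage idea in the last bullet, a toy illustration of the core mechanic: payloads are keyed by their digest, so identical bytes captured in different crawls are stored once. (This is a standalone sketch of the general technique, not the IPFS design from the linked document.)

```python
import hashlib

class PayloadStore:
    """Toy content-addressed store: payload bytes keyed by their SHA-256."""

    def __init__(self):
        self.objects = {}

    def put(self, payload: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(payload).hexdigest()
        # Identical payloads from different crawls collapse to one object.
        self.objects.setdefault(digest, payload)
        return digest

    def get(self, digest: str) -> bytes:
        return self.objects[digest]

store = PayloadStore()
d1 = store.put(b"457 MB of video bytes...")  # first crawl
d2 = store.put(b"457 MB of video bytes...")  # later crawl, file unchanged
```

A WARC broken up this way would keep only digests in its records; re-assembly on demand just means looking each digest back up in the store.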
Anyway, I’m just curious if anybody here has explored these or other approaches to similar problems and has experience to share before I dive in too deep!

Hi @mr0grog,

We’ve actually been working on deduplication in Browsertrix and Browsertrix Crawler as our main focus for the last few months, and we expect to begin testing it with clients soon. I believe the way we’ve implemented it will work well for your use case. We use collections as flexible deduplication sources and write revisit records for page resources whose contents were already captured in previous crawls, as determined by comparing each resource’s hash against known hashes in a deduplication index.
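A rough sketch of that decision logic, assuming a deduplication index mapping payload hashes to the original capture. The record shapes here are simplified stand-ins for illustration, not Browsertrix’s actual implementation:

```python
import hashlib
from datetime import datetime, timezone

def capture(dedup_index, uri, payload: bytes):
    """Write a full response record for new content, a revisit otherwise."""
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    original = dedup_index.get(digest)
    if original:
        # Hash already known: write a revisit record pointing at the
        # original capture instead of duplicating the payload bytes.
        return {
            "type": "revisit",
            "uri": uri,
            "digest": digest,
            "refers_to_uri": original["uri"],
            "refers_to_date": original["date"],
        }
    # New content: record it and remember its hash for future crawls.
    dedup_index[digest] = {
        "uri": uri,
        "date": datetime.now(timezone.utc).isoformat(),
    }
    return {"type": "response", "uri": uri, "digest": digest,
            "payload": payload}

index = {}
first = capture(index, "https://example.gov/big-video.mp4", b"...")
second = capture(index, "https://example.gov/big-video.mp4", b"...")
```

Here `first` comes out as a full response record and `second` as a revisit, since the second crawl’s payload hashes to a digest already in the index.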

Stay tuned!

I’ve seen this Webrecorder design document about using IPFS to do the job, but the linked implementations all seem to have been untouched for a few years (Webrecorder staff: do you use this or something similar in production anywhere?).

Forgot to answer this part. We are not using this or anything like it in production, that was more of a research project. If anyone does go further down that road we would be very interested to hear and learn from their experiences.

@tessa-webrecorder Awesome! I saw the beta release with a brief note about this a day or two after I posted, but wasn’t sure what the approach/goals were since there aren’t any docs in that change. Looking forward to trying it out when it’s more widely available. :smiley:

Also, good to know about the IPFS spec. Thanks!