I mainly use Browsertrix Crawler to repeatedly archive the same sites and URLs every few days. There are a few files on those sites that are pretty big and rarely change (extreme case: this 457 MB video at NOAA.gov), so one thing I’m thinking about is how I might be able to save time and overhead in both crawling and storing them.
Has anyone else here dealt with this sort of thing and come up with useful tools or techniques?
Things I’ve been thinking of:
- When it comes to re-requesting these while crawling, I’m imagining some way of seeding the crawler or the browser cache with ETags and dates for various URLs. I could easily gather the `etag` and `last-modified` headers for responses larger than 10 MB after a crawl and store them along with a digest or a pointer to the raw bytes in the crawl’s WARC, so that next time the crawler could request the data from a local location or even record a revisit record. Or maybe this could be integrated somehow with the Redis server that the crawler uses (I currently use the one embedded in the official Docker image, but it would be worth setting up a long-lived external one if it could help with this kind of thing).
- Another option that I think could work today (but haven’t tried) might be writing a proxy that does the above (although it wouldn’t be able to record the revisit record; it would just have to feed the old saved bytes to the crawler instead of passing the request through to the live server).
- Short of writing revisit records, are there alternative storage formats people have used that address the storage duplication? It seems easy to imagine breaking up a WARC into its records and payloads and putting them in some sort of hash-addressed storage, then re-assembling them on demand. I’ve seen this Webrecorder design document about using IPFS to do the job, but the linked implementations all seem to have been untouched for a few years (Webrecorder staff: do you use this or something similar in production anywhere?).
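To make the hash-addressed storage idea concrete, here’s a rough sketch of what I have in mind (the `PayloadStore` name and layout are just made up for illustration; this isn’t an existing tool): payloads are stored once under their SHA-256 digest, and WARC records would keep only a pointer, so identical bytes across crawls dedupe automatically.

```python
# Hypothetical sketch of hash-addressed payload storage: each payload is
# stored once under its SHA-256 digest, and records keep only a reference.
# The class name, methods, and on-disk layout are illustrative, not a spec.
import hashlib
from pathlib import Path


class PayloadStore:
    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, payload: bytes) -> str:
        """Store a payload under its digest; identical bytes dedupe to one file."""
        digest = hashlib.sha256(payload).hexdigest()
        path = self.root / digest[:2] / digest  # shard by digest prefix
        if not path.exists():  # skip the write if we already have these bytes
            path.parent.mkdir(exist_ok=True)
            path.write_bytes(payload)
        return f"sha256:{digest}"

    def get(self, ref: str) -> bytes:
        """Re-assemble on demand: fetch the payload bytes for a stored reference."""
        digest = ref.split(":", 1)[1]
        return (self.root / digest[:2] / digest).read_bytes()
```

Re-assembling a full WARC would then just mean walking the record metadata and pulling each payload back out by reference.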
Anyway, I’m just curious if anybody here has explored these or other approaches to similar problems and has experience to share before I dive in too deep!
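For concreteness, the header-harvesting step from my first bullet might look something like this. It’s purely a sketch: it assumes the response records have already been pulled out of the WARC (e.g. with warcio), and the manifest shape is something I invented, not a format any tool consumes today.

```python
# Sketch: build a seed manifest of HTTP validators (etag / last-modified)
# for large responses, keyed by URL. Assumes records were already parsed
# out of a WARC; the record-dict and manifest shapes are invented here.
THRESHOLD = 10 * 1024 * 1024  # only bother with responses over 10 MB


def build_manifest(records):
    """records: iterable of dicts with url, size, headers, payload_digest."""
    manifest = {}
    for rec in records:
        if rec["size"] < THRESHOLD:
            continue
        # header names are case-insensitive, so normalize before lookup
        headers = {k.lower(): v for k, v in rec["headers"].items()}
        entry = {"digest": rec["payload_digest"]}
        if "etag" in headers:
            entry["etag"] = headers["etag"]
        if "last-modified" in headers:
            entry["last_modified"] = headers["last-modified"]
        if len(entry) > 1:  # only keep URLs that have at least one validator
            manifest[rec["url"]] = entry
    return manifest
```

The next crawl (or a proxy in front of it) could then turn each entry into a conditional request (`If-None-Match` / `If-Modified-Since`) and, on a 304, either serve the saved bytes by digest or write a revisit record.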