Adding fields to collinfo.json

This question is about a modification I’m planning for Common Crawl’s pywb instance… and I’m just asking for advice.

We make each time-based crawl a separate collection in pywb, and by now we have nearly 100 of them. To make it easier for cdx clients to fetch by date, I’d like to add a start/end timestamp to each collection in collinfo.json. This addition ought not to break any cdx client software, but who knows.

So:

  • Is changing collinfo.json a good idea?
  • Should I make a new .json file instead? (I already have a graphinfo.json for our web graph.)
  • Are there other pywb installs that might want to add this feature, or that have already done this?

I’ve also asked on the IIPC slack. Thanks in advance for your replies! – greg

Having not heard feedback, I’m going to put this in collinfo.json, and I am going to call the two timestamps “from” and “to” to match the pywb cdxserver API.
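
For illustration, here is a rough sketch of what entries with the two new fields might look like and how a client could read them. The ids, URLs, and the ISO 8601 timestamp format are assumptions made up for this example, not the final layout, and `parse_collinfo` is a hypothetical helper. An entry without the new fields is tolerated, as an older file would require.

```python
import json
from datetime import datetime

# Hypothetical collinfo.json content; ids, URLs and the timestamp
# format are placeholders, not real Common Crawl values.
SAMPLE = """
[
  {"id": "EXAMPLE-2024-10",
   "name": "Example crawl A",
   "cdx-api": "https://example.org/EXAMPLE-2024-10-index",
   "from": "2024-02-20T00:00:00",
   "to": "2024-03-05T23:59:59"},
  {"id": "EXAMPLE-2023-50",
   "name": "Example crawl B (no timestamps)",
   "cdx-api": "https://example.org/EXAMPLE-2023-50-index"}
]
"""

def parse_collinfo(text):
    """Return one dict per collection, with 'from'/'to' as datetime or None."""
    collections = []
    for entry in json.loads(text):
        start = entry.get("from")  # missing key -> None, so old files still load
        end = entry.get("to")
        collections.append({
            "id": entry["id"],
            "from": datetime.fromisoformat(start) if start else None,
            "to": datetime.fromisoformat(end) if end else None,
        })
    return collections

for coll in parse_collinfo(SAMPLE):
    print(coll["id"], coll["from"], coll["to"])
```

A client that only reads the keys it already knows about would not notice the extra fields, which is why I expect the change to be safe.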


Sorry to not have provided any meaningful feedback. I’m curious about what software will read and write the new metadata.

The additional fields are generated by our crawl software, and tools like my cdx_toolkit will use them to stitch together Common Crawl’s 100 collections into a time-based cdx index.
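
The stitching step could look roughly like the sketch below: given a requested time range, keep only the collections whose from/to interval overlaps it, in chronological order, and then query just those. This is not the actual cdx_toolkit code; `select_collections`, the collection ids, and the ISO 8601 timestamp format are all assumptions for the example.

```python
from datetime import datetime

# Hypothetical, already-loaded collinfo entries (placeholder ids and dates).
COLLECTIONS = [
    {"id": "EXAMPLE-2024-18", "from": "2024-04-12T00:00:00", "to": "2024-04-25T23:59:59"},
    {"id": "EXAMPLE-2024-10", "from": "2024-02-20T00:00:00", "to": "2024-03-05T23:59:59"},
    {"id": "EXAMPLE-2023-50", "from": "2023-11-28T00:00:00", "to": "2023-12-12T23:59:59"},
]

def select_collections(collections, start, end):
    """Return collections overlapping [start, end], oldest first."""
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    picked = []
    for coll in collections:
        coll_from = datetime.fromisoformat(coll["from"])
        coll_to = datetime.fromisoformat(coll["to"])
        # Two intervals overlap when each one starts before the other ends.
        if coll_from <= end_dt and coll_to >= start_dt:
            picked.append(coll)
    return sorted(picked, key=lambda c: datetime.fromisoformat(c["from"]))

wanted = select_collections(COLLECTIONS, "2024-03-01T00:00:00", "2024-04-15T00:00:00")
print([c["id"] for c in wanted])
```

Without the timestamps, a client has to guess the date range from the collection name or query every collection.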

Also anyone using our web graph will now be able to understand what the fetch dates are for the multiple crawls involved in a given web graph.