Use of Custom Fields Defined in warcinfo in browsertrix-crawler


I’m trying to understand what happens, or should happen, with the custom fields you define in the warcinfo property of a browsertrix-crawler config. Should this information be written to the WACZ (in a file like datapackage.json) or to its embedded WARCs? I see it’s included in each of the crawl-******.yml files, but does it also get written to the final generated WACZ? From reading the WACZ spec, it seems that contextual information like this belongs in datapackage.json.

Also, from reading about contextual information in WACZ, I’m wondering how you set title or description.

Thanks in advance for any insights here.



It looks like the warcInfo fields are only included when you also use combineWARC, which writes a single *-warc.gz file to your collection directory. As far as I can tell from testing with browsertrix 0.9.0-beta, the warcInfo fields are not currently persisted to the WACZ’s datapackage.json or to the WARC files contained in it.
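
If anyone wants to repeat this check on their own crawl output, here is a minimal sketch. It relies only on the fact that a WACZ is a ZIP archive with datapackage.json at its root, so Python's standard library can read it. The function names and the example path in the comment are hypothetical, not part of browsertrix-crawler, and checking for title and description is just my assumption about which contextual fields you care about.

```python
import json
import zipfile


def read_wacz_metadata(wacz_path):
    """Return the parsed datapackage.json from a WACZ (ZIP) file.

    wacz_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_path) as wacz:
        with wacz.open("datapackage.json") as fh:
            return json.load(fh)


def missing_context_fields(metadata, fields=("title", "description")):
    """List the requested top-level fields that are absent from the metadata."""
    return [name for name in fields if name not in metadata]


# Example usage (hypothetical path, adjust to your own collection):
#   meta = read_wacz_metadata("collections/my-crawl/my-crawl.wacz")
#   print(missing_context_fields(meta))
```

An empty list from missing_context_fields means both fields are present at the top level of datapackage.json. It does not tell you where the values came from.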

I agree that it is really important to be able to embed contextual information into the datapackage.json when running browsertrix-crawler. It might be good to have an issue for this?

@edsu thanks for the reply and taking a look at this. I’ll open an issue about this tonight.

Just adding that I’ve submitted an issue about this: “Support Contextual Information in datapackage.json for WACZ” (Issue #268, webrecorder/browsertrix-crawler on GitHub).
