Where in browsertrix-crawler code is the WARC stored?

I’ve been looking through the browsertrix-crawler code to find where the WARC files are stored, but I’m having a hard time finding it. All the references I could find are about storing screenshots, generating combined WARCs, and generating the WACZ. Where in the code are the WARC files for the browser’s requests and responses written, or where is it configured that another component (like puppeteer) writes them?

Hi @wvengen, it is difficult to see at first because the recording is done by pywb, which is installed in the Docker image. The crawler accesses the web via a proxy provided by pywb running in recording mode.
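If it helps to see where the files end up on disk: pywb by default keeps each collection under collections/<name>/archive/. Here is a minimal sketch for listing the recorded WARCs, assuming that default layout is in use. The function name, the crawl directory, and the collection name are placeholders of mine, not anything taken from the crawler code.

```python
from pathlib import Path


def find_warcs(crawl_root, collection):
    """List WARC files for one collection, assuming pywb's default layout.

    crawl_root and collection are hypothetical inputs: the directory that
    contains "collections/" and the collection name passed to wb-manage.
    """
    archive_dir = Path(crawl_root) / "collections" / collection / "archive"
    if not archive_dir.is_dir():
        # Nothing recorded yet, or the layout differs from the assumed default.
        return []
    return sorted(
        p for p in archive_dir.iterdir()
        if p.is_file() and p.name.endswith((".warc", ".warc.gz"))
    )
```

Calling find_warcs("/crawls", "my-collection") would return the paths found under /crawls/collections/my-collection/archive/, or an empty list if that directory does not exist. Both arguments are made up for illustration.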

I’m actually having a bit of trouble quickly seeing where pywb is started up. You can find some system calls to wb-manage for setting up the collection. The call to run the process must be in there somewhere!
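One way to narrow it down is to scan a local checkout for the places that mention pywb tooling. This is only a rough sketch: "wb-manage" is the call mentioned above, while the other search strings and the file suffixes are guesses of mine about what the startup code might reference.

```python
from pathlib import Path


def find_mentions(source_root,
                  needles=("wb-manage", "pywb", "uwsgi"),
                  suffixes=(".js", ".ts", ".py", ".sh", ".ini", ".yaml")):
    """Return (path, line number, needle) for each match in a source tree.

    Only "wb-manage" comes from the thread; the remaining needles and the
    suffix list are assumptions and may need adjusting.
    """
    hits = []
    for path in sorted(Path(source_root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # Skip unreadable files rather than abort the scan.
        for lineno, line in enumerate(text.splitlines(), start=1):
            for needle in needles:
                if needle in line:
                    hits.append((str(path), lineno, needle))
    return hits
```

Running it on the repository root and reading the lines around each hit should show which calls set up the collection and which one actually launches the recording process.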

Ah, thank you! I was assuming this was more of a regular proxy, but it is indeed pywb here, and I can find the WARC writing code there. Thanks a lot!
