Where in browsertrix-crawler code is the WARC stored?

wvengen · June 15, 2023, 9:54am

I’ve been looking into the browsertrix-crawler code to find where the WARC files are stored, but I’m having a hard time finding it. All the references I could find are about storing screenshots, generating combined WARCs and generating the WACZ. But where in the code are the WARC files from the browser requests and responses stored, or where is it configured that another component (like puppeteer) stores them?

edsu · June 15, 2023, 10:27am

Hi @wvengen, it is difficult to see at first because the recording is done by pywb, which is installed in the Docker image. The crawler accesses the web via a proxy provided by pywb running in recording mode:

https://github.com/webrecorder/browsertrix-crawler/blob/main/crawler.js#L98

    this.saveStateFiles = [];
    this.lastSaveTime = 0;

    // sum of page load + behavior timeouts + 2 x fetch + cloudflare + link extraction timeouts + extra page delay
    // if exceeded, will interrupt and move on to next page (likely behaviors or some other operation is stuck)
    this.maxPageTime = this.params.pageLoadTimeout + this.params.behaviorTimeout +
                       FETCH_TIMEOUT_SECS*2 + PAGE_OP_TIMEOUT_SECS*2 + this.params.pageExtraDelay;

    this.emulateDevice = this.params.emulateDevice || {};

    this.captureBasePrefix = `http://${process.env.PROXY_HOST}:${process.env.PROXY_PORT}/${this.params.collection}/record`;
    this.capturePrefix = process.env.NO_PROXY ? "" : this.captureBasePrefix + "/id_/";

    this.gotoOpts = {
      waitUntil: this.params.waitUntil,
      timeout: this.params.pageLoadTimeout * 1000
    };

    // pages directory
    this.pagesDir = path.join(this.collDir, "pages");
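
To make the two prefix lines in that excerpt concrete, here is a minimal standalone sketch of what they compute. The helper names `buildCapturePrefix` and `toCaptureUrl`, the host and port values, and the collection name are all hypothetical and only for illustration. It also assumes that a target URL is simply appended to the prefix, which is how pywb record endpoints are normally addressed but may not match every place the crawler uses it.

```javascript
// Hypothetical helper mirroring the captureBasePrefix / capturePrefix lines.
// `env` stands in for process.env, `collection` for this.params.collection.
function buildCapturePrefix(env, collection) {
  const base = `http://${env.PROXY_HOST}:${env.PROXY_PORT}/${collection}/record`;
  // When NO_PROXY is set the prefix is empty, so URLs are used unchanged.
  // "id_" is pywb's identity modifier: content is returned without rewriting.
  return env.NO_PROXY ? "" : base + "/id_/";
}

// Assumption: the target URL is appended directly to the prefix.
function toCaptureUrl(prefix, url) {
  return prefix + url;
}

// Illustrative values only, not taken from the crawler's Docker setup.
const env = { PROXY_HOST: "localhost", PROXY_PORT: "8080" };
const prefix = buildCapturePrefix(env, "my-crawl");

console.log(toCaptureUrl(prefix, "https://example.com/"));
// prints "http://localhost:8080/my-crawl/record/id_/https://example.com/"
```

A request to a URL of that shape goes to pywb's recording endpoint for the collection, which is why the WARC writing itself does not show up in the crawler code.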

I’m actually having a bit of trouble quickly seeing where pywb is started up. You can find some system calls to wb-manage for setting up the collection. The call to run the process must be in there somewhere!

wvengen · June 15, 2023, 10:55am

Ah, thank you! I was assuming this was more of a regular proxy, but it is indeed pywb here, and I can find the WARC writing code there. Thanks a lot!