Replay a crawl that consists of multiple resources in an embed

robbin · October 19, 2023, 1:00pm

I’d like to replay a single crawl via the <replay-web-page> embed. Most crawls are contained in a single .wacz file, which can be loaded via the source attribute. For large websites the crawl is split into separate .wacz files with a maximum size of around 7 GB. One crawl contains 3 files (6.92 GB + 6.93 GB + 709 MB; the crawl total is 14.6 GB).

When I set the first file in the source attribute, the site loads but not all pages are available (they exist in one of the two other files). I can’t load multiple files in the source attribute. How can I embed the entire crawl?

wvengen · October 24, 2023, 9:13am

Hi Robin, I don’t think this is documented that extensively. The feature is called MultiWACZ, where multiple WACZ files are loaded from a JSON file. See for example this issue pointing to Continued problems with warcit and wacz - #4 by Hank.
For a starting point reading the source code, see wabac.js’s loader.

I don’t think MultiWACZ has been integrated in the WACZ spec yet, but that seems to be in progress. Note that browsertrix(-cloud) uses this already, so it is something that should work.

robbin · October 24, 2023, 9:26am

Thnx @wvengen! I’ll look into it.

robbin · October 25, 2023, 7:20am

Thnx @wvengen for the information. I’ve made it work via a separate JSON file per crawl, which lists the crawl’s WACZ files under resources. The JSON file is set in the ‘source’ attribute, so the embed loads the entire crawl instead of a single WACZ.
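As a minimal sketch of that setup, the snippet below builds the embed tag with the per-crawl JSON as its source. The file path and start URL are hypothetical placeholders, and the page is assumed to already load the ReplayWeb.page script and service worker as in a normal single-WACZ embed.

```python
from html import escape


def render_embed(source_url: str, start_url: str) -> str:
    """Build a <replay-web-page> tag whose source is a per-crawl JSON file."""
    # Escape both values so they are safe inside double-quoted attributes.
    return (
        f'<replay-web-page source="{escape(source_url, quote=True)}" '
        f'url="{escape(start_url, quote=True)}"></replay-web-page>'
    )


# Hypothetical paths: replace with the location of your own JSON file
# and the page you want the replay to open on.
print(render_embed("/archives/crawl-1.json", "https://example.com/"))
```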

Browsertrix Cloud supplies a JSON archive containing all the crawls in the collection. From this source I’ve created separate JSON files that each contain only one crawl. This is done via a script that splits the data per crawl while keeping the original resources entries from the Browsertrix Cloud JSON unchanged.

JSON from Browsertrix Cloud:

{
  "id": (string) "[collectionId]",
  "name": (string) "[collectionName]",
  "oid": (string) "[collectionOid]",
  "description": (string) "[collectionDescription]",
  "modified": (string) "[collectionModifiedDate]",
  "crawlCount": (int) [CollectionCrawlsCount],
  "pageCount": (int) [CollectionCrawlsTotalPageCount],
  "totalSize": (int) [CollectionCrawlsTotalSize],
  "tags": (array) [],
  "isPublic": (bool) true,
  "resources": [
    {
      "name": (string) "[collectionOid]/[fileName].wacz",
      "path": (string) "[filePath]",
      "hash": (string) "[fileHash]",
      "size": (int) [fileSize],
      "crawlId": (string) "[crawlId 1]"
    },
    ...multiple items...
    {
      "name": (string) "[collectionOid]/[fileName].wacz",
      "path": (string) "[filePath]",
      "hash": (string) "[fileHash]",
      "size": (int) [fileSize],
      "crawlId": (string) "[crawlId 2]"
    }
  ]
}

JSON generated for single crawl:

{
  "name": (string) "[collectionName]",
  "description": (string) "[collectionDescription]",
  "modified": (string) "[collectionModifiedDate]",
  "crawlCount": (int) [CollectionCrawlsCount],
  "tags": (array) [],
  "resources": [
    {
      "name": (string) "[collectionOid]/[fileName].wacz",
      "path": (string) "[filePath]",
      "hash": (string) "[fileHash]",
      "size": (int) [fileSize],
      "crawlId": (string) "[crawlId 1]"
    },
    {
      "name": (string) "[collectionOid]/[fileName].wacz",
      "path": (string) "[filePath]",
      "hash": (string) "[fileHash]",
      "size": (int) [fileSize],
      "crawlId": (string) "[crawlId 1]"
    },
    {
      "name": (string) "[collectionOid]/[fileName].wacz",
      "path": (string) "[filePath]",
      "hash": (string) "[fileHash]",
      "size": (int) [fileSize],
      "crawlId": (string) "[crawlId 1]"
    }
  ]
}
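The split itself can be sketched roughly as follows. This is not the exact script I used, just a minimal version assuming the collection JSON has the shape shown above: it groups the resources entries by crawlId and copies the same top-level keys that appear in the per-crawl JSON. The function name and the output file naming are made up for illustration.

```python
import json
import sys
from collections import defaultdict

# Top-level keys carried over into each per-crawl JSON file.
KEPT_KEYS = ("name", "description", "modified", "crawlCount", "tags")


def split_by_crawl(collection: dict) -> dict:
    """Return {crawlId: per-crawl document}, leaving each resource entry unchanged."""
    grouped = defaultdict(list)
    for resource in collection.get("resources", []):
        grouped[resource["crawlId"]].append(resource)

    result = {}
    for crawl_id, resources in grouped.items():
        doc = {key: collection[key] for key in KEPT_KEYS if key in collection}
        doc["resources"] = resources
        result[crawl_id] = doc
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python split_crawls.py collection.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        collection = json.load(fh)
    for crawl_id, doc in split_by_crawl(collection).items():
        # Writes one file per crawl, named after its crawlId.
        with open(f"{crawl_id}.json", "w", encoding="utf-8") as fh:
            json.dump(doc, fh, indent=2)
```

Each generated file can then be used as the value of the source attribute for that crawl.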

Hopefully this is helpful for anyone looking to load multiple WACZ files into the <replay-web-page> embed.