Using Wabac/Wombat to replay from custom WARC server implementation

vcavallo · March 5, 2025, 10:01pm

Hello!

I have a novel WARC server with an API that differs from pywb and other implementations. Briefly, it provides a list of CDX entries from /index and serves WARC records from /read?digest=<digest from cdx>.
I am hoping to replay these WARCs client-side and am looking to use wabac.js/wombat.js to achieve this.

My initial plan was to build an adapter layer that would translate wabac’s WARC-serving API expectations into our implementation’s spec. Like any plucky programmer I went into this thinking “this should be simple!”. Narrator: It was not simple.
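
To make the adapter idea concrete, here is a minimal sketch of the lookup half only: find the newest capture of a URL in /index, then fetch its record from /read. It is written under assumptions the thread does not confirm. The index format (one JSON object per line with url, timestamp, and digest fields) and all of the names (CdxEntry, parseIndex, latestCapture, readUrl, fetchRecord) are hypothetical. Wiring the result into wabac.js is deliberately left out, since that depends on wabac's loader interface.

```typescript
// Hypothetical shape of one /index entry; the real CDX fields may differ.
interface CdxEntry {
  url: string;
  timestamp: string; // assumed to be a 14-digit YYYYMMDDhhmmss string
  digest: string;
}

// Minimal subset of the fetch API used below, so a stub can be injected.
type FetchLike = (input: string) => Promise<{
  ok: boolean;
  status: number;
  text(): Promise<string>;
  arrayBuffer(): Promise<ArrayBuffer>;
}>;

// Parse an index body, assuming one JSON object per line (an assumption).
function parseIndex(body: string): CdxEntry[] {
  return body
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as CdxEntry);
}

// Return the most recent capture of `url`, or undefined if there is none.
function latestCapture(entries: CdxEntry[], url: string): CdxEntry | undefined {
  return entries
    .filter((e) => e.url === url)
    .sort((a, b) => b.timestamp.localeCompare(a.timestamp))[0];
}

// Build the /read URL for a digest taken from a CDX entry.
function readUrl(base: string, digest: string): string {
  return `${base.replace(/\/+$/, "")}/read?digest=${encodeURIComponent(digest)}`;
}

// Fetch the raw WARC record bytes for the newest capture of `url`.
async function fetchRecord(
  base: string,
  url: string,
  fetchFn: FetchLike,
): Promise<ArrayBuffer | undefined> {
  const indexRes = await fetchFn(`${base.replace(/\/+$/, "")}/index`);
  if (!indexRes.ok) throw new Error(`index request failed: ${indexRes.status}`);
  const hit = latestCapture(parseIndex(await indexRes.text()), url);
  if (!hit) return undefined;
  const recordRes = await fetchFn(readUrl(base, hit.digest));
  if (!recordRes.ok) throw new Error(`read request failed: ${recordRes.status}`);
  return recordRes.arrayBuffer();
}
```

In a browser or service worker, the global fetch can be passed as fetchFn. Downloading the whole index on every lookup is only reasonable for small collections; a real adapter would presumably cache or query it.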

If anyone has general guidance from the outset on why this might be simply impossible, that would be great to hear as soon as possible. Otherwise, any other guidance on how to approach this problem, or even examples of other projects that have done the same, would be appreciated.

At the moment, I am knee-deep in attempting to start from a modified version of the /examples/live-proxy mini-project, and debugging lots of issues with service workers, CORS, and the (expected) kinks in the proxy service that I’m temporarily including directly in the wabac source code and importing in src/index.ts.

I realize this is a bit vague, but I’m not sure of the best way to ask about this in order to get some help. Let me know if there is more specific information I can give.

Thanks for any time and attention you lend!
Vinney

ilya · March 13, 2025, 6:56am

Hi Vinney,

This is an interesting question - we actually would like to support something like this as there are many use cases for this, but haven’t had time to do that yet! It shouldn’t be too hard to implement as a new loader, though I know the library is not as well documented as it could be. Yes, it should be possible to do this, probably easier to discuss in examples.

vcavallo · March 14, 2025, 3:55pm

Hi Ilya, thanks for getting back to me!

Since posting this, our investigation has pushed us in a slightly different direction. We are working on adjusting our backend to implement the pywb external API. If we’re successful, client-side systems won’t know that they’re not talking to a pywb implementation.
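
As a rough sketch of what a pywb-speaking client would request, the two helpers below build a CDX query URL and a replay URL. This reflects pywb's route layout as best understood here (a per-collection /cdx endpoint taking url and output=json, and replay at /<collection>/<timestamp><modifier>/<url>); it should be checked against the pywb documentation for the version being targeted. The helper names, host, and collection name are hypothetical.

```typescript
// Strip trailing slashes so paths can be appended safely.
function trimBase(base: string): string {
  return base.replace(/\/+$/, "");
}

// CDX query for all captures of `url` in a collection, as JSON lines.
// Assumes pywb's CDX server API parameters `url` and `output`.
function cdxQueryUrl(base: string, collection: string, url: string): string {
  return `${trimBase(base)}/${collection}/cdx?url=${encodeURIComponent(url)}&output=json`;
}

// Replay URL for one capture. `modifier` is a pywb-style suffix on the
// timestamp, e.g. "id_" for the unrewritten original or "" for the default.
function replayUrl(
  base: string,
  collection: string,
  timestamp: string,
  url: string,
  modifier = "",
): string {
  // The target URL is appended as-is, matching how pywb replay URLs look.
  return `${trimBase(base)}/${collection}/${timestamp}${modifier}/${url}`;
}
```

For example, replayUrl("http://localhost:8080", "my-coll", "20250305000000", "http://example.com/", "id_") yields http://localhost:8080/my-coll/20250305000000id_/http://example.com/.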

Given this, we’re expecting to use some combination of the constellation of existing JavaScript libraries (warcio, wabac, archiveweb.page, etc.) to create a client-side record/replay experience that works with our remote WARC server and adds in some of the p2p sharing elements at the core of our project. Forking the archiveweb.page Chrome extension is my default starting point, as far as “getting something functional without re-inventing the wheel” goes.

In that context, I have a much more narrowly-scoped question:

What’s the best way to connect archiveweb.page to a remote, vanilla pywb-compliant WARC server?

In my exploration thus far, I’ve noticed that archiveweb.page defaults to a local IndexedDB and has options for IPFS and Browsertrix. I’m assuming Browsertrix is ultimately backed by pywb, but that the Btrix layer in archiveweb.page is more complex than just connecting the extension to a pywb server.
It seemed like the best areas to look into at first would be either building an alternative client, based on BTrixClient but modified to remove any btrix-specific features and provide a minimal pywb API connection, or forking archiveweb.page’s wabac dependency to hijack the “localDB” collection it thinks it’s using and re-route that, MitM-style, to our API. Or maybe something else I haven’t considered yet?

Thanks again for your attention! It’s wonderful to have such a mature set of tools from which to start.