Pywb with s3 support

Hi,

I run pywb on a DigitalOcean droplet and want to use S3-compatible storage (DigitalOcean Spaces) to store my WARC files. Is that possible?

Each WARC file contains one web page, and I create the WARC files with Wget.

I found ReplayWeb.page, which has S3 support, but is it possible to create collections automatically for all of my WARC files?

Thanks,

Ozan.

In your config.yaml you can define archive_paths with the public HTTP URL of the bucket:

archive_paths:
  - 'https://s3-server/bucket/'

Yes, for pywb this is the simplest option: you can just use the DigitalOcean Spaces Edge URL as the prefix.

The only complication is if you need the bucket to be private. That is not possible yet, but it could be supported by adding an option to specify a custom endpoint for DO. I created an issue to track this: https://github.com/webrecorder/pywb/issues/579

However, if you don’t mind having the archive be in a public bucket, that should work already.

For replayweb.page, the suggested format is to put all the WARCs in a WACZ file.
This can be done by using the brand new tools here: https://github.com/webrecorder/wacz-format/tree/master/py-wacz
Packaging up multiple WARCs is a great use case and I believe it should work already, but is very new. Let me know if you try it.

Another option is to concatenate all the WARCs into one file and then load all.warc.gz in replayweb.page:

cat *.warc.gz > all.warc.gz
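A slightly more defensive version of the concatenation, as a rough sketch. It relies on the fact that several gzip streams written back to back still form a valid gzip stream, and it runs gzip -t on the result as a sanity check. The function name combine_warcs and the file names in the usage line are made up for illustration.

```shell
# combine_warcs OUT IN1 [IN2 ...]
# Concatenate gzipped WARCs into a single file, then verify the
# result is still a readable gzip stream.
combine_warcs() {
    out="$1"
    shift
    cat "$@" > "$out" || return 1
    gzip -t "$out"
}
```

Usage would look like: combine_warcs ../all.warc.gz *.warc.gz. Writing the output outside the directory that the glob matches keeps a previous all.warc.gz from being picked up as an input on a rerun.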

To host it publicly, the WARC or WACZ will need to be publicly accessible, with CORS enabled on DigitalOcean.

Hope this helps! I would definitely like to make all of this even simpler. Let me know if you have any thoughts or feedback, and which approach makes the most sense for you.
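One way to check from the command line whether CORS is actually enabled, as a sketch only: request a single byte of the file with an Origin header and look for Access-Control-Allow-Origin in the response headers. The function names are invented here, and the origin https://replayweb.page is an assumption about where the viewer is loaded from; substitute your own if you self-host it.

```shell
# Succeeds if the HTTP headers on stdin include a CORS allow-origin header.
has_cors_header() {
    grep -qi '^access-control-allow-origin:'
}

# check_cors URL
# Fetch only the first byte (-r 0-0), dump the response headers to stdout
# (-D -), discard the body, and test the headers for CORS.
check_cors() {
    curl -s -o /dev/null -D - -r 0-0 \
        -H "Origin: https://replayweb.page" "$1" | has_cors_header
}
```

For example: check_cors "https://<name>.<location>.digitaloceanspaces.com/all.warc.gz" && echo "CORS looks enabled". A passing check only shows that the header is present for that origin; it does not prove every request the viewer makes will be allowed.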

Thanks a lot for your replies.

First, I am trying the simplest option, using config.yaml with pywb, but I could not get it to work:

  1. I created config.yaml at the root of my archive, called “webarchive”, and it has only one line:
    archive_paths: https://<name>.<location>.digitaloceanspaces.com/

  2. I ran the reindex command:
    sudo docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager reindex my-collection

  3. It gives the info message below and indexes the local directory; it does not index the WARC files from DigitalOcean Spaces.
    2020-08-15 15:12:42,863: [INFO]: Indexing /webarchive/collections/my-collection/archive to /webarchive/collections/my-collection/indexes/index.cdxj

DigitalOcean Space settings:

  • File Listing: Enabled
  • Files: Public

What is missing?

The CDX indexer cannot read a remote location, so you need to create the index up front from a WARC on your local filesystem (unless you mount the S3 bucket on the local filesystem with e.g. s3fs, but that’s kind of slow).

If you want to do it by hand, you can use cdxj-indexer:

pip3 install cdxj-indexer
cdxj-indexer -s local.warc > $YOUR_ARCHIVE/collections/$COLLECTION_NAME/indexes/local.warc.cdxj

Once you have created the index, you can remove the WARC from the local filesystem, and pywb will use the remote file from the S3 bucket for replay.

Yes, as @raffaele mentions, unfortunately the indexing tools do not yet support indexing directly from a remote location, which would probably involve a streaming download to index. There is an issue for that: https://github.com/webrecorder/pywb/issues/182 but I haven’t had a chance to address it.

For now, it has to be done manually: download the WARC, index it, and delete it, something like this.

(Note the -s flag on cdxj-indexer, which is needed to sort the index.)

s3cmd get s3://path/to/myfile.warc myfile.warc
cdxj-indexer -s myfile.warc > $YOUR_ARCHIVE/collections/$COLLECTION_NAME/indexes/index.cdxj
rm myfile.warc

Or you can even stream it, like this:

s3cmd get -q s3://bucket/path/to/mywarc.warc - | cdxj-indexer -s - > ./$YOUR_ARCHIVE/collections/$COLLECTION_NAME/indexes/mywarc.cdxj

If you have multiple WARCs, you can repeat that for each one, or merge the results into a single index.

Sorry that it’s not simpler than this right now.
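For the merge option, a minimal sketch, assuming you already have one .cdxj file per WARC. The index needs to stay sorted in plain byte order, which is why the sort is run under LC_ALL=C; -u drops lines that are exact duplicates. The function name merge_cdxj and the file names in the usage line are hypothetical.

```shell
# merge_cdxj OUT IN1 [IN2 ...]
# Combine several CDXJ files into one index, sorted in byte order.
# sort -o writes the result to OUT after reading every input, so OUT
# may also be one of the inputs.
merge_cdxj() {
    out="$1"
    shift
    LC_ALL=C sort -u -o "$out" "$@"
}
```

For example: merge_cdxj index.cdxj page1.cdxj page2.cdxj, with index.cdxj placed in the collection's indexes directory.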

Thanks a lot @raffaele and @ilya.