Yes, for pywb, this is the simplest option: you can just use the DigitalOcean Spaces Edge URL as the prefix.
The only complication is if you need to have the bucket be private. As of now, that’s not yet possible, but could be supported by adding an option to specify a custom endpoint for DO. I created an issue to track this: https://github.com/webrecorder/pywb/issues/579
However, if you don’t mind having the archive be in a public bucket, that should work already.
For replayweb.page, the suggested format is to put all the WARCs in a WACZ file.
This can be done by using the brand new tools here: https://github.com/webrecorder/wacz-format/tree/master/py-wacz
Packaging up multiple WARCs is a great use case and I believe it should work already, but it is very new. Let me know if you try it.
Another option is to concatenate all the WARCs into one file with `cat *.warc.gz > all.warc.gz` and then load all.warc.gz in replayweb.page.
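A minimal sketch of that concatenation step, wrapped in a small helper that also checks the result. The function name `combine_warcs` is hypothetical and not part of any Webrecorder tool; it only relies on `cat` and `gzip`.

```shell
# combine_warcs OUT IN... : concatenate gzipped WARCs into OUT.
# Hypothetical helper name, used here for illustration only.
combine_warcs() {
    out=$1
    shift
    # Concatenated gzip members still form a valid gzip stream,
    # so plain cat is enough to join the files.
    cat "$@" > "$out" || return 1
    # Sanity check: gzip -t reads the whole file and fails on corruption.
    gzip -t "$out"
}
```

For example, `combine_warcs ../all.warc.gz *.warc.gz` writes the combined file outside the current directory, so that a later re-run does not pick up an earlier all.warc.gz through the `*.warc.gz` glob.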
To host it publicly, the WARC or WACZ will need to be publicly accessible, with CORS enabled on DigitalOcean.
Hope this helps! I would definitely like to make all of this even simpler, so let me know if you have any thoughts or feedback, and which approach makes the most sense for you.
First, I am trying the simplest option, using config.yaml with pywb, but I could not get it to work:
I created config.yaml at the root of my archive, called “webarchive”, and it has only one line: `archive_paths: https://<name>.<location>.digitaloceanspaces.com/`
I ran the reindex command: `sudo docker run --rm -v ~/webarchive:/webarchive webrecorder/pywb wb-manager reindex my-collection`
It prints the info message below and indexes the local directory, not the WARC files in DigitalOcean Spaces: `2020-08-15 15:12:42,863: [INFO]: Indexing /webarchive/collections/my-collection/archive to /webarchive/collections/my-collection/indexes/index.cdxj`
The CDX indexer cannot read a remote location, so you need to create the index up front from a WARC on your local filesystem (unless you mount the S3 bucket on the local filesystem with e.g. s3fs, but that is kinda slow).
If you want to do it by hand, you can use cdxj-indexer.
Yes, as @raffaele mentions, unfortunately the indexing tools do not yet support indexing directly from a remote location, which would probably involve a streaming download to index. There is an issue for that: https://github.com/webrecorder/pywb/issues/182 but I haven’t had a chance to address it.
For now, it has to be done manually, something like this: download the WARC, index it, then delete it.
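A minimal sketch of such a download, index, delete loop. It assumes things not stated in the thread: the bucket is public, `curl` and `cdxj-indexer` are installed, and a text file lists one WARC URL per line. The function name `index_remote_warcs` and the file names are hypothetical.

```shell
# index_remote_warcs URL_LIST OUT_INDEX
# Hypothetical helper: URL_LIST holds one WARC URL per line,
# OUT_INDEX is the combined CDXJ index to write.
index_remote_warcs() {
    url_list=$1
    out_index=$2
    work_dir=$(mktemp -d) || return 1

    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        name=$(basename "$url")
        # Download one WARC at a time to keep local disk usage low.
        curl -fsSL -o "$work_dir/$name" "$url" || return 1
        # Index from inside the work dir so only the bare file name is passed.
        (cd "$work_dir" && cdxj-indexer -s "$name") \
            >> "$work_dir/unsorted.cdxj" || return 1
        # Delete the WARC once it has been indexed.
        rm -f "$work_dir/$name"
    done < "$url_list"

    # Each per-WARC index is sorted, but the appended file is not,
    # so do a final byte-order sort over the whole thing.
    LC_ALL=C sort "$work_dir/unsorted.cdxj" > "$out_index" || return 1
    rm -rf "$work_dir"
}
```

It could be called as `index_remote_warcs urls.txt index.cdxj`, with the resulting index.cdxj then placed in the collection's indexes directory. Whether the file names recorded in the index resolve against the remote prefix depends on the pywb configuration, so treat this as a starting point only.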
(note the -s flag on the cdxj-indexer, which is needed to do sorting)