Yes, as @raffaele mentions, unfortunately the indexing tools do not yet support indexing directly from a remote location, which would probably involve a streaming download to index. There is an issue for that, https://github.com/webrecorder/pywb/issues/182, but I haven't had a chance to address it.
For now, it has to be done manually by downloading the WARC, indexing it, and then deleting the local copy, something like this.
(Note the -s flag on cdxj-indexer, which is needed to sort the index.)
s3cmd get s3://path/to/myfile.warc myfile.warc
cdxj-indexer -s myfile.warc > $YOUR-ARCHIVE/collections/$COLLECTION-NAME/indexes/index.cdxj
rm myfile.warc
Or you can even do it streaming, without keeping a local copy, like this:
s3cmd get -q s3://bucket/path/to/mywarc.warc - | cdxj-indexer -s - > ./$YOUR-ARCHIVE/collections/$COLLECTION-NAME/indexes/mywarc.cdxj
If you have multiple WARCs, you can repeat that for each one, or merge the results into a single index.
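For the multiple-WARC case, here is a rough sketch of how the two options could be scripted. The function names index_remote_warcs and merge_cdxj are made up for illustration and are not part of pywb or cdxj-indexer. The merge step assumes the index should be in plain byte order (hence LC_ALL=C), and that dropping exact duplicate lines is fine. Treat it as a starting point, not a tested tool.

```shell
# Hypothetical helper: stream each remote WARC through cdxj-indexer,
# writing one sorted .cdxj file per WARC into the given directory.
# Usage: index_remote_warcs OUTDIR s3://bucket/a.warc s3://bucket/b.warc.gz
index_remote_warcs() {
    outdir="$1"; shift
    for url in "$@"; do
        name=$(basename "$url")
        # ${name%.warc*} strips the ".warc" / ".warc.gz" suffix
        s3cmd get -q "$url" - | cdxj-indexer -s - > "$outdir/${name%.warc*}.cdxj"
    done
}

# Hypothetical helper: merge several .cdxj files into one sorted index.
# Usage: merge_cdxj OUTPUT.cdxj INPUT1.cdxj INPUT2.cdxj ...
merge_cdxj() {
    out="$1"; shift
    # LC_ALL=C forces byte-order sorting (assumed here to be what the index needs);
    # -u drops exact duplicate lines; -o is safe even if OUTPUT is also an input.
    LC_ALL=C sort -u -o "$out" "$@"
}
```

If you go the merge route, it is probably easiest to write the per-WARC files to a scratch directory and put only the merged file into the collection's indexes directory, so the same records are not indexed twice.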
Sorry that it's not simpler than this right now.