Hi @pchameleon – thanks for finding and using the forum for your question!
I think that you were on the right track by creating a WACZ file for the WARC file, since that will index the WARC data (create a CDX file) which ReplayWebPage will use to find content.
However, I downloaded the WARC file and, on further inspection, it looks like it is full of Resource records for JSON files that were collected somehow. Usually WARC files for web archives contain Response records that hold the HTTP responses from web servers, which can be replayed later by software like ReplayWebPage.
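If you want to check what a WARC contains yourself, one way is to tally its record types. This is only a minimal sketch, assuming warcio is installed; the helper names `count_record_types` and `count_record_types_in_file` are made up for illustration:

```python
from collections import Counter


def count_record_types(records):
    """Tally records by their WARC record type (e.g. 'response', 'resource')."""
    return Counter(record.rec_type for record in records)


def count_record_types_in_file(warc_path):
    """Open a WARC file and tally the record types it contains."""
    # imported here so the counting helper above can be used without warcio
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, 'rb') as stream:
        return count_record_types(ArchiveIterator(stream))


# hypothetical usage:
# print(count_record_types_in_file('yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz'))
```

A typical crawl would be expected to show mostly `response` and `request` records, whereas a file like this one should show mostly `resource`.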
Perhaps the creator of the archive (if you can track them down) might be able to share why they did it this way. I suspect it was because this ArchiveTeam member was working quickly before Yahoo Groups went down and found a way to traverse the JSON files programmatically without using web archiving software. But for some reason they chose to still use WARC to package them up. This isn’t technically wrong; it’s just a bit weird.
I don’t know if it’s helpful, but I wrote a small Python program that uses Webrecorder’s warcio library to extract the JSON files to the filesystem:
#!/usr/bin/env python3
import json
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator

with open('yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # unlike most WARC files we're only interested in the resource records
        if record.rec_type == 'resource':
            # use the target URI to create filename (no colon for windows users)
            uri = record.rec_headers['WARC-Target-URI'].replace(':', '-') + '.json'
            path = Path(uri)
            print(path)
            # create the directory if it's not there already
            path.parent.mkdir(parents=True, exist_ok=True)
            # parse and pretty print the JSON with indent of 2 to make it more readable
            data = json.load(record.raw_stream)
            # write_text closes the file once the JSON has been written
            path.write_text(json.dumps(data, indent=2))
This writes out the individual JSON Resources to the filesystem, which amounts to 455,194 files. Each one looks something like:
{
  "numRecords": 0,
  "recFirstNextTopic": 0,
  "recFirstLastPosted": 0,
  "digestNum": 0,
  "recFirstTopicStatus": 0,
  "subject": "Re: just joined",
  "yahooAlias": "jbtseti",
  "author": "jbtseti",
  "topicLastRecord": 0,
  "topicInfoStatus": 0,
  "recFirstTopicFirstRecord": 0,
  "topicStatus": 0,
  "length": 669,
  "email": "jbtseti",
  "firstRecInfoStatus": 2,
  "parent": 1,
  "prevTopic": 0,
  "recFirstDigestNum": 0,
  "nextTopic": 0,
  "lastPosted": 0,
  "date": 942510188,
  "recFirstPrevTopic": 0,
  "topicNextRecord": 24,
  "recFirstTopicLastRecord": 0,
  "hasAttachments": 0,
  "threadLevel": 0,
  "topicPrevRecord": 22,
  "recFirstTopicPrevRecord": 0,
  "summary": "actually i am not a member of any groups, just looking for interesting discussions that dont revolve around set@home and science fiction.",
  "recFirstTopicNextRecord": 0,
  "messageId": 23,
  "recFirstNumRecords": 0,
  "topicFirstRecord": 1
}
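If you end up working with the extracted files, something like the following could turn each one into a one-line summary. It's only a sketch: I'm assuming the `date` field is a Unix timestamp in seconds (which would put the example above in November 1999), and `summarize_message` and `summarize_directory` are names I made up for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def summarize_message(message):
    """Return a one-line summary of a parsed message dict."""
    # assumption: "date" is seconds since the Unix epoch, interpreted as UTC
    posted = datetime.fromtimestamp(message['date'], tz=timezone.utc)
    return f"{posted:%Y-%m-%d} #{message['messageId']} {message['author']}: {message['subject']}"


def summarize_directory(root):
    """Yield a summary for every extracted .json file under root."""
    required = {'date', 'messageId', 'author', 'subject'}
    for path in sorted(Path(root).rglob('*.json')):
        message = json.loads(path.read_text())
        # skip any file that isn't a message record with the fields used above
        if isinstance(message, dict) and required <= message.keys():
            yield summarize_message(message)
```

The check for the required fields is there because I haven't confirmed that every one of the extracted files has the same shape as the example.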
If this would be helpful I could zip up the directory and make it available to you. Let me know!