Warc file too large --- out of memory

pchameleon · July 12, 2023, 7:57pm

Hello. I’m not a programmer, and so I’m not sure this forum is for me. I will explain my problem anyway.
I am trying to read an old Yahoo Group that was archived on the Wayback Machine. I downloaded a large (>2 GB) WARC file from that site and loaded it in replayweb.page. Aside from taking forever to scroll down to near the part I want, the computer runs out of memory before I get to the exact point I am after, and I have to reload the page, losing all my progress. I tried to convert the WARC file to WACZ using py-wacz (assuming that would solve the problem – would it?), but when I load the resulting WACZ file in replayweb.page, no URL is found. py-wacz produced warning/error messages during the conversion, which leads me to suspect it was not successful.
I am grateful for any help on this. The warc file I am dealing with is the one which can be downloaded from the following url:

The command line I issued when trying to convert to the wacz format was:
wacz create -o myfile.wacz current.warc

edsu · July 13, 2023, 10:16pm

Hi @pchameleon – thanks for finding and using the forum for your question!

I think that you were on the right track by creating a WACZ file from the WARC file, since that indexes the WARC data (creates a CDX file), which ReplayWeb.page uses to find content.

However, I downloaded the WARC file, and on further inspection it looks like it is full of Resource records for JSON files that were collected somehow. Usually WARC files for web archives contain Response records, which hold the HTTP responses from web servers and can be replayed later by software like ReplayWeb.page.
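
A quick way to check which record types a WARC file holds is to tally them. The following is only a minimal sketch, assuming the warcio library is installed; the function names `tally` and `count_record_types` are made up for this illustration.

```python
from collections import Counter


def tally(records):
    """Count records by their WARC-Type (e.g. 'response', 'resource')."""
    return Counter(record.rec_type for record in records)


def count_record_types(warc_path):
    """Open a WARC file (gzipped or not) and tally its record types."""
    # Imported here so tally() can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, 'rb') as stream:
        return tally(ArchiveIterator(stream))


# Usage, with the filename from the original question:
#   print(count_record_types('current.warc'))
```

A file made mostly of `resource` records, like this one, is unlikely to be replayable as web pages in the usual way.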

Perhaps the creator of the archive (if you can track them down) might be able to share why they did it this way. I suspect this ArchiveTeam member was working quickly before Yahoo Groups went down and found a way to traverse the JSON files programmatically without using web archiving software, but for some reason still chose WARC to package them up. This isn’t technically wrong; it’s just a bit weird.

I don’t know if it’s helpful but I wrote a small Python program that uses Webrecorder’s warcio library to extract the JSON files to the filesystem:

#!/usr/bin/env python3

import json

from pathlib import Path
from warcio.archiveiterator import ArchiveIterator

with open('yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):

        # unlike most WARC files we're only interested in the resource records 
        if record.rec_type == 'resource':

            # use the target URI to create filename (no colon for windows users)
            uri = record.rec_headers['WARC-Target-URI'].replace(':', '-') + '.json'
            path = Path(uri)
            print(path)

            # create the directory if it's not there already
            if not path.parent.is_dir():
                path.parent.mkdir(parents=True)
            
            # parse and pretty print the JSON with indent of 2 to make it more readable
            data = json.load(record.raw_stream)
            # write with a context manager so each file is closed right away
            with path.open('w') as output:
                output.write(json.dumps(data, indent=2))

This writes out the individual JSON Resources to the filesystem, which amounts to 455,194 files. Each one looks something like:

{
  "numRecords": 0,
  "recFirstNextTopic": 0,
  "recFirstLastPosted": 0,
  "digestNum": 0,
  "recFirstTopicStatus": 0,
  "subject": "Re: just joined",
  "yahooAlias": "jbtseti",
  "author": "jbtseti",
  "topicLastRecord": 0,
  "topicInfoStatus": 0,
  "recFirstTopicFirstRecord": 0,
  "topicStatus": 0,
  "length": 669,
  "email": "jbtseti",
  "firstRecInfoStatus": 2,
  "parent": 1,
  "prevTopic": 0,
  "recFirstDigestNum": 0,
  "nextTopic": 0,
  "lastPosted": 0,
  "date": 942510188,
  "recFirstPrevTopic": 0,
  "topicNextRecord": 24,
  "recFirstTopicLastRecord": 0,
  "hasAttachments": 0,
  "threadLevel": 0,
  "topicPrevRecord": 22,
  "recFirstTopicPrevRecord": 0,
  "summary": "actually i am not a member of any groups, just looking for interesting discussions that dont revolve around set@home and science fiction.",
  "recFirstTopicNextRecord": 0,
  "messageId": 23,
  "recFirstNumRecords": 0,
  "topicFirstRecord": 1
}
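
Once the files are extracted, a few lines of Python can turn one of these records into something easier to scan. This is just a sketch: it assumes that `date` is a Unix timestamp in seconds, which the archive does not document here, and `describe_message` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone


def describe_message(info):
    """Return a one-line summary of a parsed info.json dictionary."""
    # Assumption: "date" is a Unix timestamp in seconds (UTC).
    posted = datetime.fromtimestamp(info['date'], tz=timezone.utc)
    return f"#{info['messageId']} {posted.isoformat()} {info['author']}: {info['subject']}"


# A cut-down copy of the record shown above.
sample = json.loads('''{
  "subject": "Re: just joined",
  "author": "jbtseti",
  "date": 942510188,
  "messageId": 23
}''')

print(describe_message(sample))
```

If the timestamp assumption holds, the sample message dates from November 1999.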

If this would be helpful I could zip up the directory and make it available to you. Let me know!

pchameleon · July 13, 2023, 10:42pm

Yes, that would be helpful, thanks a lot.

pchameleon · July 14, 2023, 2:22pm

Hello again. I went ahead and ran your program, but it didn’t work. There was a message about a syntax error in the warc file at message 10017. I tried it with another warc file, and a similar message appeared. Then I remembered that while I was scouring the internet for solutions I came upon a program that splits warc files, and decided to try it instead. It didn’t issue any message, but didn’t produce any file either. So I inspected it and noticed that it looked for response-type records. Since you told me that my file had resource-type records instead of response-type ones, I replaced the word ‘response’ with the word ‘resource’ in the script code. This time it worked. It produced one warc file for every group message. Not exactly wieldy, but certainly a lot better than what I had before. I obtained a slightly smaller number of records than you (454,897), for some reason.
Anyway, without your explanation I wouldn’t have been able to make the other program work, so, again, many thanks. I probably will no longer need the files you obtained, except perhaps for checking missing entries in mine. The webpage with the program I used is at:
Python: How to split WARC file? - Stack Overflow

edsu · July 14, 2023, 2:57pm

Interesting. It sounds like you are on your way to figuring out how to use the data and don’t need this ZIP file I created of the extracted JSON. What error did you get from the code I posted?

pchameleon · July 14, 2023, 3:00pm

Here’s the message (sorry, it’s in Portuguese, since I live in Brazil; the translation should be more or less ‘The syntax of the file name, folder name, or volume label is incorrect’):
C:\Users\User\Downloads\yahoo-groups-2016-03-20T12-45-19Z-nyzp9w>python warcsplitter.py
org.archive.yahoogroups:v1/group/psychiatry-research/message/10017/info.json
Traceback (most recent call last):
  File "C:\Users\User\Downloads\yahoo-groups-2016-03-20T12-45-19Z-nyzp9w\warcsplitter.py", line 15, in <module>
    path.parent.mkdir(parents=True)
  File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\pathlib.py", line 1116, in mkdir
    os.mkdir(self, mode)
OSError: [WinError 123] A sintaxe do nome do arquivo, do nome do diretório ou do rótulo do volume está incorreta: 'org.archive.yahoogroups:v1\group\psychiatry-research\message\10017'

edsu · July 14, 2023, 3:48pm

Ah I think I see. I think the : in the filename might be causing a problem on Windows (I ran it on a Mac).

I’ve updated the code snippet above to replace the : with a - in the filename. Can you see if using that version helps?

pchameleon · July 14, 2023, 3:54pm

No, I actually had already substituted an underscore symbol for the colon in your code. My file name has no colon in it. As downloaded from the archive.org site, it comes with underscore instead of the colon.

pchameleon · July 14, 2023, 4:10pm

I don’t know whether it’s important, but I miscopied the error message (last line): the directory, as it appears on the command window, has double backslashes instead of single ones. I can’t copy-paste it correctly, for some reason.

pchameleon · July 14, 2023, 4:37pm

I did what you suggested anyway. It’s working. I can’t say I understand what is going on, though. Is it colons internal to the warc file? Well, anyway, it’s working. (It hasn’t finished yet, but I suppose it’s working.)
Thanks again!

pchameleon · July 15, 2023, 1:48pm

Hello, Ed. After running your program I noticed that it produced the info.json files, but not the raw.json ones, which are precisely the ones with the message bodies I wanted. After a few more tests and an inspection of the code, and using what’s left of my earlier knowledge of programming, it appeared that you placed the writing instruction under the if clause, which caused writing in a specific directory to be performed only if that directory didn’t already exist. So, I took the writing instructions from under the if clause, which solved the problem. Below is the corrected code.

#!/usr/bin/env python3

import json

from pathlib import Path
from warcio.archiveiterator import ArchiveIterator

with open('yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'resource':
            uri = record.rec_headers['WARC-Target-URI'].replace(':', '-') + '.json'
            print(uri)
            path = Path(uri)
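            # create the directory only if it's not there already
            if not path.parent.is_dir():
                path.parent.mkdir(parents=True)
            # parse and pretty print the JSON with indent of 2 to make it more readable
            # (outside the if, so every record is written, not just the first per directory)
            data = json.load(record.raw_stream)
            with path.open('w') as output:
                output.write(json.dumps(data, indent=2))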

I think my problem is now totally solved. For my purpose, the other program I had found at Stack Overflow is quite impractical, as it produces an immensity of warc files with totally random names. Yours in turn produces one Windows-searchable text file per message, and organizes messages in directories named after the respective group names and message numbers. A lot better!
I hope others find out about this program – and this site in general. Probably many people have been discouraged from researching those archived Yahoo Groups because they couldn’t handle those files.
Thanks for everything!

edsu · July 15, 2023, 2:04pm

You are totally right about the indentation problem, thanks for catching that! I will update the example so it no longer has the bug in case someone copy/pastes without reading this discussion.

I did think that it might be a fun experiment to create a simple web application for viewing messages in time as threads. I wonder if anyone has done that?

On second thought, you could probably extract the rawEmail from the raw.json files, and then feed them into an open source mailing list tool like ezmlm or Mailman.
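
As a rough sketch of that idea, the snippet below turns a parsed raw.json dictionary into an email message and appends it to an MBOX file using only the standard library. It assumes that `rawEmail` holds the complete message text and that it is HTML-escaped; both are assumptions to verify against the actual files. The helper names are hypothetical.

```python
import email
import html
import mailbox


def raw_json_to_message(data):
    """Turn a parsed raw.json dictionary into an email message object."""
    # Assumption: "rawEmail" holds the full RFC 822 message text, with
    # HTML entities (e.g. &lt;) that need unescaping first.
    text = html.unescape(data['rawEmail'])
    return email.message_from_string(text)


def append_to_mbox(mbox_path, messages):
    """Append messages to an MBOX file, creating the file if needed."""
    box = mailbox.mbox(mbox_path)
    try:
        for message in messages:
            box.add(message)
        box.flush()
    finally:
        box.close()
```

If the text turns out not to be HTML-escaped, the `html.unescape` call should be dropped, since it would alter any literal entities in message bodies.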

edsu · July 15, 2023, 2:15pm

Yes, the colons were appearing in the WARC-Target-URI values in each WARC record. They needed to be replaced before they were used to construct the path for the JSON file.
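
For anyone adapting the script, here is one possible way to make the conversion a little more defensive. It is a sketch rather than what the script above does: it replaces every character that Windows rejects in file names, not just the colon, and `uri_to_relative_path` is a name invented for this example.

```python
import re
from pathlib import PurePosixPath

# Characters that Windows does not allow in file or directory names.
WINDOWS_FORBIDDEN = re.compile(r'[<>:"|?*\\]')


def uri_to_relative_path(uri, suffix='.json'):
    """Map a WARC-Target-URI to a relative path that is safe on Windows."""
    # Split on '/', dropping empty segments and '.'/'..' so the result
    # always stays below the current directory.
    parts = [WINDOWS_FORBIDDEN.sub('-', part)
             for part in uri.split('/')
             if part not in ('', '.', '..')]
    if not parts:
        raise ValueError(f'cannot build a path from {uri!r}')
    parts[-1] += suffix
    return PurePosixPath(*parts)


print(uri_to_relative_path(
    'org.archive.yahoogroups:v1/group/psychiatry-research/message/10017/info'))
```

The result can be wrapped in `pathlib.Path(...)` before creating directories and writing files.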

edsu · July 15, 2023, 7:02pm

I don’t know if this approach is helpful, but here’s another program that reads the WARC file, extracts the raw email messages, and writes them out as MBOX files (one MBOX file per mailing list):

https://gist.github.com/edsu/760ff538274756b6a793e1982a1f2084

warc2mbox.py

#!/usr/bin/env python3

import sys
import json
import pathlib
import mailbox

from warcio.archiveiterator import ArchiveIterator

warc_file = sys.argv[1]

(The preview above is truncated; the full script is in the gist.)

You can run it by giving it the WARC filename, and it will write one or more MBOX files to an mboxes directory:

$ python3 warc2mbox.py yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz
$ ls -l mboxes
-rw-r--r--  1 edsummers  staff    12522488 Jul 15 14:14 amicigranata.mbox
-rw-r--r--  1 edsummers  staff     6377115 Jul 15 14:14 black-white_a.mbox
-rw-r--r--  1 edsummers  staff     2207823 Jul 15 14:14 boukman.mbox
-rw-r--r--  1 edsummers  staff      781270 Jul 15 14:14 deardavidbeckham.mbox
-rw-r--r--  1 edsummers  staff       95302 Jul 15 14:14 drawingroom2.mbox
-rw-r--r--  1 edsummers  staff     3048044 Jul 15 14:14 dreamwavescomics.mbox
-rw-r--r--  1 edsummers  staff  1962908885 Jul 15 14:14 evolutionary-psychology.mbox
-rw-r--r--  1 edsummers  staff     1102962 Jul 15 14:14 fractalsinnature.mbox
-rw-r--r--  1 edsummers  staff    10656615 Jul 15 14:14 genealogie_from_france.mbox
-rw-r--r--  1 edsummers  staff     1691232 Jul 15 14:14 gillianandersonfanclub.mbox
-rw-r--r--  1 edsummers  staff    11291371 Jul 15 14:14 limp-bizkit.mbox
-rw-r--r--  1 edsummers  staff    93975052 Jul 15 14:14 mzmoudarres.mbox
-rw-r--r--  1 edsummers  staff    83145966 Jul 15 14:13 psychiatry-research.mbox
-rw-r--r--  1 edsummers  staff      524803 Jul 15 14:14 setiproject.mbox
-rw-r--r--  1 edsummers  staff      828719 Jul 15 14:14 tahoetruckee87.mbox
-rw-r--r--  1 edsummers  staff      788032 Jul 15 14:14 thefacultyklub.mbox
-rw-r--r--  1 edsummers  staff    12860765 Jul 15 14:14 tuskegeeuniversity.mbox

I tried importing one of the mboxes with the venerable hypermail software and it generates some pretty basic (but functional) HTML pages for browsing and viewing the messages.

$ hypermail -m mboxes/psychiatry-research.mbox -d psychiatry-research
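
Before handing a large MBOX file to another tool, it can help to peek at its first few messages. The sketch below uses Python's standard `mailbox` module; `list_subjects` is a hypothetical helper, not part of warc2mbox.py.

```python
import mailbox
from itertools import islice


def list_subjects(mbox_path, limit=10):
    """Return the Subject headers of the first `limit` messages in an MBOX file."""
    # create=False makes a wrong path raise an error instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [message['Subject'] for message in islice(box, limit)]
    finally:
        box.close()


# Usage, with one of the files listed above:
#   print(list_subjects('mboxes/setiproject.mbox'))
```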

pchameleon · July 15, 2023, 7:57pm

Excellent! That will certainly be of great help. I will try it later on. Thanks again, Ed.