WARC Text Analysis

aykutkaya · July 27, 2023, 6:24am

Hello all,
I created a WARC/WACZ collection of a Twitter account using ArchiveWeb.page. How can I analyze the text or topics in this collection?
Thanks

edsu · July 27, 2023, 2:11pm

Hi @aykutkaya thanks for this question! I think the answer might depend a bit on what type of analysis you want to do.

One thing that’s nice about the WACZ file created by ArchiveWeb.page is that it includes a pages.jsonl file, which lists the pages in the archive along with the text found on each page. This is the file that is used when you search the collection.

If you unzip your WACZ file (yes, it’s just a ZIP file), you should find a pages directory containing pages.jsonl. That file may be suitable for the type of analysis you are doing. I’m happy to continue the conversation here if you have questions.
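
If you would rather look at the page list without unpacking anything by hand, here is a minimal Python sketch. It assumes the page lists are stored as .jsonl files under pages/ inside the WACZ, that each line is a JSON object, and that the first line of each file is a header with no url field. The function name read_pages is made up for illustration.

```python
import io
import json
import zipfile


def read_pages(wacz_path):
    """Return the page records from every pages/*.jsonl file in a WACZ."""
    pages = []
    # A WACZ is a ZIP file, so the standard zipfile module can open it.
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            if not (name.startswith("pages/") and name.endswith(".jsonl")):
                continue
            with wacz.open(name) as fh:
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    # Assumption: header lines have no "url" key, so skip them.
                    if "url" in record:
                        pages.append(record)
    return pages
```

Calling read_pages("my-collection.wacz") (a placeholder filename) should return a list of dicts, one per page. Where text was extracted it should appear under a text key, but that key may be missing for some pages.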

aykutkaya · July 28, 2023, 5:30am

Hello @edsu
Thank you for your answer. I want to find the most used words in the pages I archived. I searched in many places but couldn’t find the right method. I opened the WACZ file with ReplayWeb.page, but I could not find a pages.jsonl file. Is it possible that my archive was created without the file you mentioned?

edsu · July 30, 2023, 8:09am

Can you remind me what tool you used to create the WACZ file?

aykutkaya · August 1, 2023, 6:28am

I used ArchiveWeb.page.

edsu · August 1, 2023, 4:28pm

Ah ok. To open or “unpack” the WACZ file you need to unzip it. The easiest way to do this is probably to rename the file so it has a .zip extension and then double-click it to launch your operating system’s ZIP utility. Then if you look in the extracted files you should find a pages.jsonl?
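
Once you have pages.jsonl extracted, counting the most used words could look roughly like the Python sketch below. It assumes each line of the file is a JSON object and that the page text, when present, is stored under a text key. The function name top_words and the min_length cutoff are my own choices for illustration, and no stop-word filtering is done.

```python
import json
import re
from collections import Counter


def top_words(jsonl_lines, n=20, min_length=3):
    """Count words in the "text" field of each JSON line; return the n most common."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: page text lives under "text"; header lines and pages
        # without extracted text simply contribute nothing.
        text = record.get("text") or ""
        # \w+ is Unicode-aware in Python 3, so non-English words are kept whole.
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return counts.most_common(n)
```

To try it, open the extracted file with open("pages.jsonl", encoding="utf-8") and pass the file object to top_words. The result is a list of (word, count) pairs, which you would probably want to filter further, for example with a stop-word list for the language of the tweets.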

aykutkaya · August 4, 2023, 12:15pm

I did as you said and reached the pages.jsonl file, but there is not enough data for analysis.

edsu · August 4, 2023, 1:45pm

@aykutkaya interesting – are you able to share the WACZ file with me, either here or at ehs@pobox.com?