Extract social media posts from WARC/WACZ-file of ArchiveWeb.page

Sophie · March 14, 2025, 3:18pm

Dear all,
I would like to ask whether any of you have experience extracting Twitter/Facebook/Instagram posts from an ArchiveWeb.page WARC or WACZ file.
I know the post text is stored in the "full_text" fields, but I would like to export it to a CSV or JSON file.
Thank you for your help,
Sophie Bossaert

Hank · March 15, 2025, 3:18am

I don’t know of any data extraction tools that are built to get this type of structured data out of WARCs / WACZs. Would love to see it done!

ilya · March 15, 2025, 7:55pm

What exactly are you looking to extract? The text from the posts in a particular format? Comments? Media? etc… there’s a lot of data encoded in the posts of course, so would be helpful if you specify what sort of data you’re looking for.

pchan3 · March 18, 2025, 3:52pm

I would like to extract text from posts and comments into a text file. How can I do that?

Sophie · March 25, 2025, 5:26pm

Dear all, thank you for your quick responses. I would like to extract the text from posts (tweets and retweets) and export it to a JSON or CSV file. For media, I was able to extract the WARCs with warcit. I’m asking because we want to use this data for a generative AI project, and WARC is not a format that our developer can use.
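
For reference, here is a minimal sketch of one way this could be approached in Python. It is not an existing tool. It assumes the post text sits in JSON response bodies under a "full_text" key, as described above for Twitter; Facebook and Instagram captures likely use different key names. It also assumes the third-party warcio library is installed for reading WARC records, and that a WACZ file is a ZIP package with its WARCs under archive/. All function names are hypothetical.

```python
import csv
import io
import json
import zipfile


def find_full_text(node):
    """Recursively yield every string stored under a "full_text" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "full_text" and isinstance(value, str):
                yield value
            else:
                yield from find_full_text(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_full_text(item)


def texts_from_payload(payload):
    """Parse one response body as JSON and return its full_text values."""
    try:
        data = json.loads(payload)
    except ValueError:
        return []  # body is not JSON (HTML, images, scripts, ...)
    return list(find_full_text(data))


def iter_warc_texts(stream):
    """Yield (url, text) pairs from an open WARC stream (needs warcio)."""
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        for text in texts_from_payload(record.content_stream().read()):
            yield url, text


def iter_archive_texts(path):
    """Read either a plain WARC file or the WARCs inside a WACZ package."""
    if path.endswith(".wacz"):
        with zipfile.ZipFile(path) as wacz:
            for name in wacz.namelist():
                if name.startswith("archive/") and name.endswith((".warc", ".warc.gz")):
                    with wacz.open(name) as stream:
                        yield from iter_warc_texts(stream)
    else:
        with open(path, "rb") as stream:
            yield from iter_warc_texts(stream)


def write_csv(rows, out):
    """Write (url, text) rows to an open text file as CSV with a header."""
    writer = csv.writer(out)
    writer.writerow(["url", "full_text"])
    writer.writerows(rows)
```

Usage would look something like `write_csv(iter_archive_texts("my-archive.wacz"), open("posts.csv", "w", newline="", encoding="utf-8"))`, where the file name is a placeholder. The same post can appear in several captured responses, so the output may need de-duplicating, and retweets may show up both as the retweet and as the nested original.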