Dear all,
I would like to ask whether any of you have experience extracting Twitter/Facebook/Instagram posts from an ArchiveWeb.page WARC or WACZ file.
I know they are all embedded between "full_text" tags, but I would like to export them to a CSV or JSON file.
Thank you for your help,
Sophie Bossaert
I don’t know of any data extraction tools that are built to get this type of structured data out of WARCs / WACZs. Would love to see it done!
What exactly are you looking to extract? The text from the posts in a particular format? Comments? Media? etc… there’s a lot of data encoded in the posts of course, so would be helpful if you specify what sort of data you’re looking for.
I would like to extract text from posts and comments into a text file. How can I do that?
Dear all, thank you for your quick responses. I would like to extract the text from posts (tweets and retweets) and export it to a JSON or CSV file. For media, I was able to extract the WARCs with warcit. I'm asking because we want to use this data for a generative AI project, and WARC is not a format that our developer can use.
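One possible starting point, offered as a rough sketch rather than a tested tool: since the tweet text sits in "full_text" fields, a small Python script could walk every JSON response in the archive and collect those fields into a CSV. This assumes the third-party warcio library is installed, that the WACZ file is a ZIP holding the WARC files, and that the captured API responses are valid JSON. The function names and the output columns are made up for illustration, and Facebook or Instagram captures will most likely use different field names.

```python
import csv
import io
import json
import zipfile


def find_full_text(node, found=None):
    """Recursively collect every string stored under a "full_text" key."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "full_text" and isinstance(value, str):
                found.append(value)
            else:
                find_full_text(value, found)
    elif isinstance(node, list):
        for item in node:
            find_full_text(item, found)
    return found


def texts_from_warc_stream(stream):
    """Yield (url, text) pairs from the JSON responses in one WARC stream."""
    # Imported here so the helpers above also work without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    for record in ArchiveIterator(stream):
        if record.rec_type != "response" or record.http_headers is None:
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "json" not in content_type:
            continue
        try:
            payload = json.loads(record.content_stream().read())
        except ValueError:
            continue  # truncated or non-JSON body, skip it
        url = record.rec_headers.get_header("WARC-Target-URI")
        for text in find_full_text(payload):
            yield url, text


def texts_from_archive(path):
    """Accept a .wacz (assumed to be a ZIP of WARCs) or a plain WARC file."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as wacz:
            for name in wacz.namelist():
                if name.endswith((".warc", ".warc.gz")):
                    with wacz.open(name) as stream:
                        yield from texts_from_warc_stream(stream)
    else:
        with open(path, "rb") as stream:
            yield from texts_from_warc_stream(stream)


def write_csv(rows, out):
    """Write (url, text) rows to an open text file, skipping repeated texts."""
    writer = csv.writer(out)
    writer.writerow(["source_url", "full_text"])
    seen = set()
    for url, text in rows:
        if text not in seen:
            seen.add(text)
            writer.writerow([url, text])


# Example call (the file names are placeholders):
# with open("posts.csv", "w", newline="", encoding="utf-8") as out:
#     write_csv(texts_from_archive("my-archive.wacz"), out)
```

The same tweet usually shows up in several captured responses, which is why the sketch drops repeated texts. If JSON output is preferred, the rows could be passed to json.dump instead of csv.writer.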