In 2022 we had a student spend a few months capturing the Facebook page of the Saskatchewan premier during the pandemic. The files are large, a couple of GB, but when I use the Webrecorder tool or warcio to try to retrieve the posts and comments, it seems like only the first two posts and a single comment were captured.

I've since spent some time exploring the source code behind a Facebook page, so I have some understanding of why this is so difficult, particularly as the web address stays the same while you scroll through dynamically loaded content. Still, when I use the browser extension tool and select audio/video content, I find hundreds of meme videos that must have been embedded in the comments. So this suggests the data is all in the WARC somewhere.

Can anyone point me to a solution that would let me parse the WACZ/WARC files and extract the posts, the comments on each post, and basic metadata like dates? Or is it impossible to reassemble the data given the dynamic nature of Facebook pages?
At the end of the day, I’m wondering if we should keep this data in our archive or write it off as an unsuccessful attempt to capture this essential public square discussion during the pandemic.