Archiving Facebook

nmarton · August 4, 2020, 1:22pm

The recent changes to the Facebook interface seem to have had a pretty negative impact on the latest Webrecorder Desktop version. Autopilot is not working, and if you try to capture content manually, the recording starts to freeze after a few threads. The problem appears in the latest Webrecorder Desktop version on all operating systems (Windows, Linux, Mac). I hope some solution can be found to keep up with the FB interface changes.

edsu · August 15, 2020, 2:05am

I was unable to get past the FB login page in Webrecorder Desktop. I had more luck logging into FB with conifer.rhizome.org. Autoscroll worked for a bit but then seemed to get jammed. Unfortunately the scrolled content doesn’t seem to play back in either conifer.rhizome.org or Webrecorder Desktop (2.0.1). Also, the WARC file I downloaded doesn’t even load in replayweb.page.

ilya · September 25, 2020, 2:56am

To make sure this isn’t lost, here’s a short answer on why replaying Facebook is so hard… There are basically no unique URLs used, it’s just POST requests.

The only good news is that if any content is captured after logging in, hopefully with improvements to replay we can make it replayable in the future, maybe.

zefik · September 26, 2020, 2:35pm

Thanks for this clarification @ilya, it seems quite daunting to have to chase around API changes all the time…
You say that if the content is captured after logging in then maybe it can become replayable in the future. Does that mean that content captured without logging in is unlikely to be replayable? What are the differences between the two in terms of matching the different responses to the POST requests?

edsu · September 26, 2020, 4:46pm

Yes, thank you for this pointer Ilya! Can you say a little bit more, or point me at a description elsewhere, about what fuzzy matching is doing on replay? Since there are so many records in the archive with the same URL, does that make it difficult to look up and find the correct record that matches the request? This seems like a really interesting edge case that could extend into other areas of the web, as GraphQL is used in places other than FB.

ilya · September 27, 2020, 9:33pm

Oh, I only meant that in contrast to the case where you weren’t able to capture anything at all due to login issues.
The non-logged in version of Facebook for public pages would have the same issues, though it appears to be using the older version of the UI.

Yes, the only way to identify these responses is by looking at the corresponding POST request data. pywb has a way of adding the POST query to the URL (this still needs to be added in ReplayWeb.page). There is also a system for fuzzy matching on query arguments, which is done via regex, and that can work if the POST data is itself form-encoded.
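As a rough sketch of the idea (this is not pywb’s actual code; the `_method` marker, the function names, and the `doc_id`, `cursor` and `__req` argument names are all made up for illustration), folding a form-encoded POST body into the URL and then fuzzy matching on a few stable arguments could look something like this:

```python
import re
from urllib.parse import parse_qsl, urlencode

# Hypothetical rule: only these arguments are assumed to identify a response.
# Everything else (e.g. a per-request counter) is ignored when matching.
STABLE_ARGS = re.compile(r"[?&]((?:_method|doc_id|cursor)=[^&]*)")


def post_to_lookup_url(url, body):
    """Append a form-encoded POST body to the URL as extra query arguments."""
    params = parse_qsl(body, keep_blank_values=True)
    sep = "&" if "?" in url else "?"
    # "_method" is an invented marker so POST lookups don't collide with GETs.
    return url + sep + urlencode([("_method", "post")] + params)


def fuzzy_key(lookup_url):
    """Reduce a lookup URL to its base plus the stable arguments, sorted."""
    base = lookup_url.split("?", 1)[0]
    kept = sorted(STABLE_ARGS.findall(lookup_url))
    return base + "?" + "&".join(kept)


if __name__ == "__main__":
    endpoint = "https://example.com/api/graphql/"
    archived = post_to_lookup_url(endpoint, "doc_id=123&cursor=abc&__req=1f")
    live = post_to_lookup_url(endpoint, "doc_id=123&cursor=abc&__req=9z")
    # The exact URLs differ, but the fuzzy keys are the same.
    print(archived == live)
    print(fuzzy_key(archived) == fuzzy_key(live))
```

The point is only that the request on replay and the archived request rarely have identical bodies, so the lookup has to ignore the arguments that change every time.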

But a different system is needed if it’s actually JSON. This can probably be done by specifying JSON keys that can be matched, though I haven’t tried it yet. Basically, we need to compare the incoming POSTs to the archived POSTs and find the best match… but the rules will continue to change, unfortunately.
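Purely as an untested illustration of what key-based matching for JSON bodies might look like (the key names in `MATCH_KEYS`, the dotted-path convention, and all function names are assumptions, not anything that exists in pywb or ReplayWeb.page):

```python
import json

# Hypothetical keys assumed to identify a request; dots walk into nested objects.
MATCH_KEYS = ("doc_id", "variables.id")


def pick(data, dotted):
    """Return the value at a dotted path, or None if any part is missing."""
    cur = data
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def score(live_body, archived_body, keys=MATCH_KEYS):
    """Count how many of the chosen keys have equal values in both bodies."""
    live = json.loads(live_body)
    archived = json.loads(archived_body)
    total = 0
    for key in keys:
        value = pick(live, key)
        if value is not None and value == pick(archived, key):
            total += 1
    return total


def best_match(live_body, archived_bodies, keys=MATCH_KEYS):
    """Pick the archived POST body that shares the most key values, if any."""
    if not archived_bodies:
        return None
    best = max(archived_bodies, key=lambda body: score(live_body, body, keys))
    return best if score(live_body, best, keys) > 0 else None
```

Here `best_match` would be called with the JSON body of the request made during replay and the list of archived bodies recorded for the same URL. The set of keys is exactly the part that would need to be updated whenever the site changes its requests.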