Archiving Facebook

The recent changes to the Facebook interface seem to have had a pretty negative impact on the latest Webrecorder Desktop version. Autopilot is not working, and if you try to capture content manually, the recording starts to freeze after just a few threads. The problem appears in the latest Webrecorder Desktop version on all operating systems (Windows, Linux, Mac). I hope some solution can be found to keep up with the FB interface changes.

I was unable to get past the FB login page in Webrecorder Desktop. I had more luck logging into FB with conifer.rhizome.org. Autoscroll worked for a bit but then seemed to get jammed. Unfortunately the scrolled content doesn't seem to play back in either conifer.rhizome.org or Webrecorder Desktop (2.0.1). Also, the WARC file I downloaded doesn't even load in replayweb.page. :man_shrugging:

To make sure this isn't lost, here's a short answer on why replaying Facebook is so hard… There are basically no unique URLs used, it's just POST requests.

The only good news is that if any content is captured after logging in, hopefully with improvements to replay we can make it replayable in the future, maybe.

Thanks for this clarification @ilya, it seems quite daunting to have to chase API changes all the time…
You say that if content is captured after logging in, then maybe it can become replayable in the future. Does that mean that content captured without logging in is likely not to be replayable? What are the differences between the two in terms of matching the different responses to the POST requests?

Yes, thank you for this pointer Ilya! Can you say a little bit more, or point me at a description elsewhere, about what fuzzy matching is doing on replay? Since there are so many records in the archive with the same URL, does that make it difficult to look up and find the correct record that matches the request? This seems like a really interesting edge case that could extend into other areas of the web, as GraphQL is used in places other than FB.

Oh, I just meant that assuming you weren't able to capture anything due to login issues.
The non-logged-in version of Facebook for public pages would have the same issues, though it appears to be using the older version of the UI.

Yes, the only way to identify these responses is by looking at the corresponding POST request data. pywb has a way of adding the POST query to the URL (this still needs to be added in ReplayWeb.page). There is a system for fuzzy matching query arguments, which is done via regex, and that can work if the POST data is itself form-encoded.
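
To make the idea a bit more concrete, here is a rough sketch of the general approach, not pywb's actual code. It assumes the POST body is form-encoded, appends it to the URL as query arguments, and then drops arguments that change on every request before comparing. The function names and the parameter names (`doc_id`, `variables`, `__req`) are made up for illustration.

```python
import re
from urllib.parse import parse_qsl, urlencode

# Hypothetical pattern for argument names that change on every request
# and so should be ignored when matching. Real rules would differ.
VOLATILE = re.compile(r"^(__req|__rev|timestamp)$")


def post_to_lookup_url(url: str, body: str) -> str:
    """Append a form-encoded POST body to the URL as query arguments."""
    params = parse_qsl(body, keep_blank_values=True)
    sep = "&" if "?" in url else "?"
    return url + sep + urlencode(params)


def fuzzy_key(lookup_url: str) -> str:
    """Drop volatile arguments and sort the rest to get a comparable key."""
    base, _, query = lookup_url.partition("?")
    kept = [
        (k, v)
        for k, v in parse_qsl(query, keep_blank_values=True)
        if not VOLATILE.match(k)
    ]
    return base + "?" + urlencode(sorted(kept))


# Two POSTs to the same endpoint that differ only in a volatile argument
# and in argument order end up with the same key.
recorded = post_to_lookup_url(
    "https://example.com/api/graphql/", "doc_id=123&__req=a1&variables=x"
)
replayed = post_to_lookup_url(
    "https://example.com/api/graphql/", "variables=x&__req=zz&doc_id=123"
)
print(fuzzy_key(recorded) == fuzzy_key(replayed))
```

With something like this, every POST to the same endpoint gets its own lookup URL, so the archive no longer has many indistinguishable records under one URL.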

But a different system is needed if it's actually JSON. This can probably be done by specifying JSON keys that can be matched, though I haven't tried it yet. Basically, we need to compare the POSTs to the archived POSTs and find the best match… but the rules will continue to change, unfortunately.
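
As an untested illustration of what matching on JSON keys might look like, the sketch below scores each archived POST body by how many chosen keys have the same value as in the incoming request, and picks the highest score. The key names in `MATCH_KEYS` and both function names are assumptions for the example, not anything taken from pywb or from Facebook's API.

```python
import json

# Hypothetical keys assumed to identify a request. Other keys
# (nonces, counters, and so on) are ignored.
MATCH_KEYS = ("doc_id", "variables")


def score(request_body: str, archived_body: str, keys=MATCH_KEYS) -> int:
    """Count how many of the chosen keys have equal values in both bodies."""
    try:
        req = json.loads(request_body)
        arc = json.loads(archived_body)
    except json.JSONDecodeError:
        return 0
    if not isinstance(req, dict) or not isinstance(arc, dict):
        return 0
    return sum(1 for k in keys if k in req and k in arc and req[k] == arc[k])


def best_match(request_body: str, archived_bodies, keys=MATCH_KEYS):
    """Return the index of the best-scoring archived body, or None."""
    best_index, best_score = None, 0
    for i, archived in enumerate(archived_bodies):
        s = score(request_body, archived, keys)
        if s > best_score:
            best_index, best_score = i, s
    return best_index


archived = [
    '{"doc_id": "1", "variables": {"id": "a"}, "nonce": "x"}',
    '{"doc_id": "2", "variables": {"id": "b"}, "nonce": "y"}',
]
request = '{"doc_id": "2", "variables": {"id": "b"}, "nonce": "zzz"}'
print(best_match(request, archived))
```

The list of keys is the part that would keep changing whenever the site changes its requests, which is the maintenance problem mentioned above.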