Archiving Twitter Suggestions/Feedback

Archiving Twitter has been a key issue for many folks.
Due to the latest rate limits, it has become more difficult to do so from centralized services such as Conifer. Recommendations for getting the best results:

  • Use Webrecorder Desktop (or pywb) running locally.
  • In Webrecorder Desktop, you can log in in Preview mode, then start capturing.
  • Autopilot should work in the desktop app when logged in.

This should lead to much better results.

Another option, as shared by Kritika Garg and Himarsha Jayanetti from ODU-WS DL, is to set the User-Agent header to Googlebot to get the old Twitter UI, which is still available for crawlers. The old UI does not use the Twitter API as much, and therefore is less prone to rate limiting.
More on these findings in their blog post:

Webrecorder Desktop does not yet have an option for setting the User-Agent, but it is possible to override the User-Agent via DevTools.
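For anyone scripting this rather than using DevTools, here is a minimal sketch of the same idea in Python: build a request that identifies itself as Googlebot. The user-agent string is the commonly published Googlebot one; the function name and the example URL are hypothetical, and whether Twitter keeps serving the old UI for this header is an assumption that could stop holding at any time.

```python
import urllib.request

# Commonly published Googlebot user-agent string. That Twitter serves the old
# UI for it is an assumption taken from the discussion above, not a guarantee.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_googlebot_request(url: str) -> urllib.request.Request:
    """Return a Request for `url` whose User-Agent header is Googlebot's."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


if __name__ == "__main__":
    # Hypothetical URL, for illustration only. Nothing is fetched here;
    # pass the request to urllib.request.urlopen() to actually send it.
    req = build_googlebot_request("https://twitter.com/example")
    print(req.full_url)
```

This only sets the header on a plain HTTP request; it does not capture anything into an archive on its own, so it is mainly useful for checking which version of a page is returned.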

I’m very interested in answers here as well for ArchiveBox. I have some friends with custom hand-written Twitter archiving bots that work very well, but they’re afraid to share them for fear of Twitter shutting down the private APIs they’re using.

It’s a tough situation :confused:

Thanks, yes, more is likely possible with the API, but it is a bit tricky.

I should mention that twarc and Social Feed Manager are some other tools that archive via the API. I’m not sure if they’ve been affected; since they use the API and not the web, maybe less so?

The Webrecorder approach has always been to use the web interface though, archiving exactly what is available on the web.

Thank you, Ilya! I will investigate this option to change Webrecorder Desktop’s UA in DevTools… I read Kritika’s and Himarsha’s excellent blog post too! :sparkles: I then set out to test their method on the Conifer instance, changing my UA to Googlebot. I have been successful in capturing the full scroll from the latest post to the earliest. However, in my experiments so far, this method does not enable the capture of individual Tweets. So posts can be viewed in the context of the feed, but individual Tweets cannot be clicked on or expanded to view replies.

Twitter is an interesting one because normally people try to use only the API, or only “HTML crawls”. I think for an effective archive you should use both - the API contains a lot of metadata on tweets that isn’t visible in HTML, and the HTML pages of tweets contain a lot that isn’t easily obtained with the API. The user details for a tweet, for example, are always the latest ones in the API, whereas HTML crawls preserve user details as they were at the time the tweet was archived.

I still haven’t settled on a good HTML crawler for Twitter - as the new interface makes it somewhat awkward - there are usually things that are missed by crawlers (sometimes you need to manually “expand” some conversations, for example). A Twitter-specific crawler that captures both the API and the HTML of a tweet or profile or other page on Twitter would be great - but I don’t think it exists yet!

The goal of the Autopilot/behavior system is to provide such an “HTML crawler”. Currently working on making this more easily accessible.
It should work in the latest Webrecorder Desktop and on Conifer, and tries to expand tweets/conversations.

The tricky part is maintaining these, as Twitter updates their UI frequently. We’ve gone through several iterations of such a script; the latest one is here, built with XPaths. Of course, it could break any time the UI is changed…

I’d be curious as to how it could be augmented with data from the API, though!
Perhaps it could cross-reference the encountered tweet id with results of the API response?
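As a rough sketch of what that cross-referencing might look like: pull the numeric ID out of each captured tweet URL (tweet permalinks look like `https://twitter.com/<user>/status/<id>`) and look it up in a set of API results. Everything here is an assumption for illustration: the function names are made up, and the API records are assumed to be plain dicts with an `id` key, which may not match what a given API client actually returns.

```python
import re
from typing import Dict, Iterable, List, Optional

# Tweet permalinks have the form https://twitter.com/<user>/status/<id>.
STATUS_URL = re.compile(
    r"https?://(?:mobile\.|www\.)?twitter\.com/[^/]+/status/(\d+)"
)


def tweet_id_from_url(url: str) -> Optional[str]:
    """Return the tweet ID embedded in a status URL, or None if there is none."""
    match = STATUS_URL.match(url)
    return match.group(1) if match else None


def cross_reference(
    captured_urls: Iterable[str], api_tweets: List[dict]
) -> Dict[str, Optional[dict]]:
    """Map each captured tweet URL to its API record (None if not in the API data).

    `api_tweets` is assumed to be a list of dicts with an "id" key.
    URLs that are not tweet permalinks are skipped.
    """
    by_id = {str(tweet["id"]): tweet for tweet in api_tweets}
    matched: Dict[str, Optional[dict]] = {}
    for url in captured_urls:
        tweet_id = tweet_id_from_url(url)
        if tweet_id is not None:
            matched[url] = by_id.get(tweet_id)
    return matched
```

A URL that maps to `None` would be a tweet that was captured on the web but is missing from the API data, which is one place where the two sources could complement each other.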

Hello, I’ve been unable to capture Twitter pages on Webrecorder Desktop recently (the last time I did it successfully was 5 Aug, but I didn’t try again until a few days ago), so I’m guessing that Twitter has updated their UI. Is anyone else having the same problem?

I’ve tried re-installing and clearing cookies and cache on a couple of machines, and I’m still getting the same issue.

Do you mean that you’re unable to capture any Twitter pages, or only that the Autopilot automation isn’t working?
What happens when you try to load Twitter?

Hi, yes unable to capture any Twitter pages. When I try to load Twitter, in both preview and capture sessions, there’s just a blank white page. The filesize goes up to about 900KB as if it is capturing something, but there’s nothing on the screen while capturing, or in playback on browsing mode.

Hi, yes, there indeed seems to be an issue with Twitter loading in Electron!
I think this is a recent issue (last time I checked was a few weeks ago and it was fine).
I’m not sure exactly what’s changed, but it’s happening even in current versions of Electron!
Here’s a bug where this was reported: https://github.com/electron/electron/issues/25421
It seems that there may be a workaround.
Thanks for reporting this!

Thank you, @eyan and thank you @ilya!

Hello there @eyan. I’ve had some (limited) success using the method @Himarsha suggested in her recent blog post. Perhaps worth testing in the interim? Try capturing Twitter using the Webrecorder service hosted by Conifer and changing your browser’s User Agent to Googlebot. Working in this way, I found I was able to capture the ‘surface’ of a feed, although not individual Tweets. I haven’t had any luck using Autopilot’s scripted behaviours in this setup, but using manual scrolling I could at least capture something.

Hi @Anisa, thanks so much for your reply, and thanks also to @Himarsha - I’ve given it a go and changing the User Agent has worked as a workaround. Thanks again!

@Anisa @eyan, good to hear that the UA changing trick worked, although we don’t know for how long!

Reviving this thread because yesterday I noticed that a similar/the same issue has come up again with Twitter.

I don’t know if it has been reported again in the previous days, but Webrecorder Desktop does not load Twitter at all for me.
I thought it might be an instance of the same issue that was solved by overriding the user agent string, so I attempted to do this on Webrecorder Desktop.

I entered preview mode in order to be able to open DevTools, and as you can see in the screenshot, even after switching to Googlebot, Twitter refuses to load.

I did test this on Conifer btw, both using Firefox and Chrome, and the technique described by @Himarsha worked. But on Webrecorder Desktop I haven’t had the same luck.

In fact, I experimented with using the Googlebot Desktop UA as well, and this caused Webrecorder Desktop to crash repeatedly.

Any ideas about what could be causing this issue?

Hi, I’ve just released Webrecorder Desktop v2.0.3 that should fix the issue.

Unfortunately, it is a bug in Electron, and the only way to make it work was to downgrade to an older version. Luckily, that seems to be working fine for now.

The Twitter Autopilot behavior has also been updated and should work much better.

You can get the latest release at:
