Archiving Twitter has been a key issue for many folks.
Due to the latest rate limits, it has become more difficult to do so from centralized services, such as Conifer. Recommendations for getting the best results are:
Use Webrecorder Desktop (or pywb) running locally.
In Webrecorder Desktop, you can log in in Preview mode, then start capturing.
Autopilot should work in the desktop app when logged in.
Another option, as shared by Kritika Garg and Himarsha Jayanetti from ODU WS-DL, is to set the User-Agent header to Googlebot to get the old Twitter UI, which is still available for crawlers. The old UI does not use the Twitter API as much, and is therefore less prone to rate limiting.
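As a rough illustration of the User-Agent idea outside the browser, here is a minimal Python sketch that builds a request carrying a Googlebot User-Agent header. The UA string is the commonly published Googlebot one; the helper name `build_googlebot_request` and the example profile URL are made up for illustration. Nothing here guarantees Twitter will serve the old UI, and this only prepares a plain HTTP request, it does not produce a WARC.

```python
from urllib.request import Request

# Commonly published Googlebot User-Agent string (assumed to still be
# recognized by the site; that may change at any time).
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def build_googlebot_request(url: str) -> Request:
    """Return a Request with the Googlebot User-Agent set.

    Hypothetical helper: it only constructs the request object,
    no network traffic happens here.
    """
    return Request(url, headers={"User-Agent": GOOGLEBOT_UA})


# Example profile URL is a placeholder, not a real account.
req = build_googlebot_request("https://twitter.com/example_user")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

To actually fetch the page you would pass `req` to `urllib.request.urlopen`. For capture in Conifer or Webrecorder Desktop, the equivalent step is overriding the UA in the browser's developer tools, as described in this thread.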
More on these findings in their blog post:
I'm very interested in answers here as well for ArchiveBox. I have some friends with custom hand-written Twitter archiving bots that work very well, but they're afraid to share them for fear of Twitter shutting down the private APIs they're using.
Thanks, yes, more is likely possible with the API, but it is a bit tricky.
I should mention twarc and Social Feed Manager are some other tools that archive via the API. I'm not sure if they've been affected. I guess since they use the API and not the web, maybe less so?
The Webrecorder approach has always been to use the web interface though, archiving exactly what is available on the web.
Thank you, Ilya! I will investigate this option to change Webrecorder Desktop's UA in DevTools… I read Kritika's and Himarsha's excellent blog post too! I then set out to test their method on the Conifer instance, changing my UA to Googlebot. I have been successful in capturing the full scroll from the latest post to the earliest. However, in my experiments so far, this method does not enable the capture of individual Tweets. So posts can be viewed in the context of the feed, but individual Tweets cannot be clicked on or expanded to view replies.
Twitter is an interesting one because normally people try to use only the API, or only "HTML crawls". I think for an effective archive you should use both - the API contains a lot of metadata on tweets that isn't visible in HTML, and the HTML pages of tweets contain a lot that isn't easily obtained with the API. The user details for a tweet, for example, are always the latest in the API - HTML crawls preserve user details as they were at the time of the tweet archive.
I still haven't settled on a good HTML crawler for Twitter, as the new interface makes it somewhat awkward - there are usually things that are missed by crawlers (sometimes you need to manually "expand" some conversations, for example). A Twitter-specific crawler that captures both the API and the HTML of a tweet or profile or other page on Twitter would be great - but I don't think it exists yet!
The goal of the Autopilot/behavior system is to provide such an "HTML crawler". Currently working on making this more easily accessible.
It should work in the latest Webrecorder Desktop and on Conifer, and tries to expand tweets/conversations.
The tricky part is maintaining these, as Twitter updates their UI frequently. We've gone through several iterations of such a script, latest one is here, built with xpaths. Of course, it could break any time the UI is changed…
I'd be curious as to how it could be augmented with data from the API, though!
Perhaps it could cross-reference the encountered tweet id with results of the API response?
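To make the cross-referencing idea a bit more concrete, here is a minimal Python sketch, not anything that exists in Autopilot today. It pulls the tweet ID out of each captured status URL and looks it up in a set of API results. The function names and the shape of `api_records` (a dict keyed by tweet ID string) are assumptions made for illustration only; real API responses would need to be reshaped into something like this first.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Matches URLs like https://twitter.com/<user>/status/<numeric id>
STATUS_URL_RE = re.compile(
    r"^https?://(?:www\.|mobile\.)?twitter\.com/[^/]+/status/(\d+)"
)


def tweet_id_from_url(url: str) -> Optional[str]:
    """Return the numeric tweet ID from a status URL, or None."""
    match = STATUS_URL_RE.match(url)
    return match.group(1) if match else None


def cross_reference(
    captured_urls: Iterable[str],
    api_records: Dict[str, dict],
) -> Tuple[Dict[str, dict], List[str]]:
    """Pair captured tweet pages with API metadata by tweet ID.

    api_records is a hypothetical mapping of tweet ID -> metadata.
    Returns (matched, missing): matched maps each ID to its URL and
    API record; missing lists IDs seen in the crawl but not the API.
    """
    matched: Dict[str, dict] = {}
    missing: List[str] = []
    for url in captured_urls:
        tweet_id = tweet_id_from_url(url)
        if tweet_id is None:
            continue  # not a tweet page (profile, search, etc.)
        if tweet_id in api_records:
            matched[tweet_id] = {"url": url, "api": api_records[tweet_id]}
        else:
            missing.append(tweet_id)
    return matched, missing
```

The `missing` list would be the interesting part in practice: those are tweets the crawl encountered that still need an API lookup, or that the API no longer returns.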
Hello, I've been unable to capture Twitter pages on Webrecorder Desktop recently (the last time I did it successfully was 5 Aug, but I didn't try again until a few days ago), so I'm guessing that Twitter has updated their UI. Is anyone else having the same problem?
I've tried re-installing and clearing cookies and cache on a couple of machines, and am getting the same issue.
Hi, yes, unable to capture any Twitter pages. When I try to load Twitter, in both preview and capture sessions, there's just a blank white page. The filesize goes up to about 900KB as if it is capturing something, but there's nothing on the screen while capturing, or in playback in browsing mode.
Hi, yes, there indeed seems to be an issue with Twitter loading in Electron!
I think this is a recent issue (last time I checked was a few weeks ago and it was fine).
I'm not sure exactly what's changed, but it's happening even in current versions of Electron!
Here's a bug where this was reported: Twitter does not load inside Webview · Issue #25421 · electron/electron · GitHub
It seems that there may be a workaround.
Thanks for reporting this!
Hello there @eyan. I've had some (limited) success using the method @Himarsha suggested in her recent blog. Perhaps worth testing in the interim? Try capturing Twitter using the Webrecorder service hosted by Conifer and changing your browser's User Agent to Googlebot. Working in this way, I found I was able to capture the "surface" of a feed, although not individual Tweets. I haven't had any luck using Autopilot's scripted behaviours in this setup, but using manual scrolling I could at least capture something.
Hi @Anisa, thanks so much for your reply, and thanks also to @Himarsha - I've given it a go and changing the User Agent has worked as a workaround. Thanks again!
Reviving this thread because yesterday I noticed that a similar/the same issue has come up again with Twitter.
I don't know if it has been reported again in the previous days, but Webrecorder Desktop does not load Twitter at all for me.
I thought it might be an instance of the same issue that was solved by overriding the user agent string, so I attempted to do this on Webrecorder Desktop.
I entered into preview mode in order to be able to open Developer tools, and as you can see in the screenshot, even after switching to Googlebot, Twitter refuses to show itself.
I did test this on Conifer btw, both using Firefox and Chrome, and the technique described by @Himarsha worked. But on Webrecorder Desktop I haven't had the same luck.
In fact, I experimented with using the Googlebot Desktop UA as well, and this caused Webrecorder Desktop to crash repeatedly.
Hi, I've just released Webrecorder Desktop v2.0.3 that should fix the issue.
Unfortunately, it is a bug in Electron, and the only way to make it work was to downgrade to an older version. Luckily, that seems to be working fine for now.
The Twitter Autopilot behavior has also been updated and should work much better.