Browsertrix configuration for a list of Twitter profiles

Hi, I wonder what the command would look like if I want to crawl a list of Twitter profiles and capture the whole profiles without external links.

When I run the command below, the Twitter profile is crawled only partially. The autoscroll stops too soon.

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --seedFile /crawls/twitter.txt --profile /crawls/profiles/profile.tar.gz --behaviors autoscroll --text --generateWACZ --collection twitter

Thank you!

PS. And thank you for your work on Browsertrix and WR, it's great!

Sorry if this is more confusing than it should be. The autoscroll behavior is for other sites that don't have a specific custom behavior, so you actually don't want that behavior for Twitter!
The flag to use is --behaviors siteSpecific, which currently covers Twitter, Instagram and Facebook.
But the default value for behaviors is --behaviors siteSpecific,autofetch,autoplay, so it is included automatically if no other value is specified!

If you run the command you already have, just without specifying behaviors, it should do the right thing by default:

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --seedFile /crawls/twitter.txt --profile /crawls/profiles/profile.tar.gz --text --generateWACZ --collection twitter
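
Spelling the behaviors out explicitly should be equivalent to leaving the flag off, assuming the default list quoted above still matches the crawler version in use. The sketch below only builds and prints the command so it can be reviewed first; the variable names BEHAVIORS and CRAWL_CMD are made up for illustration.

```shell
# Assumption: this list is the default quoted in the reply above;
# it may differ in other crawler versions.
BEHAVIORS="siteSpecific,autofetch,autoplay"

# \$PWD is kept literal here so it is expanded only when the command runs.
CRAWL_CMD="docker run -v \$PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --seedFile /crawls/twitter.txt --profile /crawls/profiles/profile.tar.gz --behaviors $BEHAVIORS --text --generateWACZ --collection twitter"

# Print the command for review; to actually run it: eval "$CRAWL_CMD"
echo "$CRAWL_CMD"
```

To try a narrower set, for example only the site-specific behavior, set BEHAVIORS="siteSpecific" before building the command.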

If you want to watch the crawl, you can also do:

docker run -p 9037:9037 -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --seedFile /crawls/twitter.txt --profile /crawls/profiles/profile.tar.gz --text --generateWACZ --collection twitter --screencastPort 9037

and then load http://localhost:9037/ in the browser to watch as it's going!

Hope this helps!

Thanks for your reply, Ilya!

My goal is to crawl the whole Twitter profile in a space-efficient way, so I have to limit the crawl so that it doesn't crawl all of Twitter, right? :) But how? I struggle with the parameters described on GitHub, because the crawl always finishes without scrolling to the end (start) of the profile, especially if I use the --limit 1 parameter.

What command should I use to crawl the whole profile in a space-efficient way, that is, to crawl just the feed of a specific profile without external links?

Is the only way to do it through --config with a .yaml file and the scopeType: "page" parameter?
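
For reference, a config of the kind described in the question might look like the sketch below. It is an illustration under stated assumptions, not a confirmed answer: the file name twitter-config.yaml and the two profile URLs are placeholders, and the thread does not confirm whether the site-specific behavior then scrolls the full timeline. Each seed carries its own scopeType of "page", which restricts the crawl to the listed URLs without following links from them.

```shell
# Sketch only: the file name and the profile URLs are placeholders.
mkdir -p crawls

# Each seed gets scopeType "page", so only that URL is loaded
# and links found on it are not queued.
cat > crawls/twitter-config.yaml <<'EOF'
seeds:
  - url: https://twitter.com/example_profile_1
    scopeType: "page"
  - url: https://twitter.com/example_profile_2
    scopeType: "page"
EOF

# Then point the crawler at the config instead of --seedFile
# (not executed here; remove the leading "#" to run it):
# docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --config /crawls/twitter-config.yaml --profile /crawls/profiles/profile.tar.gz --text --generateWACZ --collection twitter
```

Keeping the seeds in the config and the remaining options on the command line leaves the rest of the original command unchanged.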