Replay Instagram account

mona · October 3, 2022, 11:55am

Hello

I’ve captured an Instagram Account with browsertrix-crawler using a browser profile.

The WACZ is 1 GB in size.

I’ve added the WACZ to replayweb.page. When I click on a post to open it, I get the Login pop-up instead of the post. But if I use pywb, everything works fine - all the data is there and I can view the posts - without getting the Login pop-up.

Is this a known issue? Is there something wrong with my files?

THX
mona

edsu · October 3, 2022, 6:17pm

Thanks for raising this, @mona. I can confirm that I see this issue as well when using a logged-in Instagram user profile with the latest browsertrix-crawler and replayweb.page.

ilya · October 3, 2022, 11:43pm

Hi, would you be able to share the WACZ file in question? I haven’t been able to repro the issue where it is logged in with pywb but not logged in with replayweb.page.

Does the pywb collection have any other data in it? If so, perhaps you could try a new collection, just in case data is being loaded from a previous capture.

edsu · October 4, 2022, 1:41am

Actually I was mistaken, I guess my Instagram profile needed to be refreshed. I crawled the Instagram page with the latest browsertrix-crawler with the following configuration:

collection: ichbinsophiescholl
generateWACZ: true
text: true
behaviors:
  - autoscroll
  - siteSpecific
  - autoplay
  - autofetch
behaviorTimeout: 0
timeout: 36000
profile: /crawls/profiles/instagram-edsuarchivist.tar.gz
screencastPort: 9037
scopeType: page
seeds:
  - url: https://www.instagram.com/ichbinsophiescholl/

And you can see that it seems to play back OK with the latest ReplayWeb.page:

https://inkdroid.org/web-archives/archive/?source=https%3A%2F%2Fedsu-webarchives.s3.amazonaws.com%2Fichbinsophiescholl.wacz

If you like you can download the WACZ from https://edsu-webarchives.s3.amazonaws.com/ichbinsophiescholl.wacz and try yourself. Please let me know when you would like me to delete it.

mona · October 5, 2022, 6:05pm

Hello Ed,
Hello Ilya,

thank you for your quick replies.
Here is my WACZ file: http://monaulrich.online/web_archives/ichbinsophiescholl_account.wacz

I’ve created it with the following command:
docker run -p 9037:9037 -v $PWD:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [url] --limit 1 --generateWACZ --text --collection ichbinsophiescholl _20221005 --behaviors autoscroll,siteSpecific --profile /crawls/profiles/profile_insta.tar.gz --screencastPort 9037 --timeout 1000000 --behaviorTimeout 0 --scopeType page --saveState

Here are some Screenshots:

@Ed: The WACZ you’ve created also works in my environment. Thank you very much.
I captured the page again with the config you posted, and the replay with replayweb.page works fine too.

So it seems the error is in my WACZ file. I tried to reproduce the error with the command above, but there is a problem during the crawl. Once I have reproduced it, I will post it.

THX
Mona
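
When a WACZ is suspected of being the problem, one quick check is to look inside it. A WACZ is a ZIP archive that normally holds a datapackage.json, the WARC data under archive/, and a page list in pages/pages.jsonl. The following is only a minimal sketch under those assumptions: the function name summarize_wacz is made up for illustration, and it will not explain a login pop-up by itself. It only shows whether the expected files and page entries are present.

```python
import json
import zipfile


def summarize_wacz(path):
    """Return a small summary of a WACZ file (hypothetical helper, not part of any tool)."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        # WARC data is expected under archive/ in the WACZ layout.
        warcs = [n for n in names if n.startswith("archive/") and not n.endswith("/")]
        page_urls = []
        if "pages/pages.jsonl" in names:
            with zf.open("pages/pages.jsonl") as fh:
                for raw in fh:
                    line = raw.strip()
                    if not line:
                        continue
                    entry = json.loads(line)
                    # The first line is a header record and has no "url" key.
                    if "url" in entry:
                        page_urls.append(entry["url"])
        return {
            "has_datapackage": "datapackage.json" in names,
            "warc_files": warcs,
            "page_urls": page_urls,
        }


# Example call (the file name is taken from the post above):
# print(summarize_wacz("ichbinsophiescholl_account.wacz"))
```

Running this on both a working and a non-working WACZ and comparing the summaries may help narrow down whether the difference is in the captured data or in the replay tool.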

hamouda · April 26, 2024, 10:24pm

This WACZ won’t work in the ReplayWeb.page app version 2.0.0, and I don’t know why. It does work with ReplayWeb.page 1.8.14 and 1.8.17, and it also works with ArchiveWeb.page. You can convert it to ZIM format for future-proof, long-term preservation with this tool: openzim/warc2zim on GitHub, a command-line tool to convert a file in the WARC format to a file in the ZIM format.
Installation was difficult for me on a Windows machine, and I have only a little knowledge of libzim, so maybe you can do the conversion. I do my WARC work manually, and it has worked very well with all the versions I have tested. I came to this post to find out what the ideal browsertrix options are and what results they give, and to learn from the posted issues. So you could archive your account once again to see if it works with version 2.0.0, or convert your file to ZIM.