Important data size limit, data storage option, autopilot behaviour, custom limits and settings

For: instagram.com, x.com, tiktok.com, reddit.com, facebook.com, and popular forums

"Which settings should I use for best results?"

  1. Between the browser extension and the standalone app, which is better? And which browser works better on Mac?
  2. Does data get saved automatically to the selected storage location, or is it only saved in the browser itself? If it is held in memory, the browser will surely crash once the size grows beyond 5-10 GB. How can I fix that?
  3. Does autopilot scroll all the way to the end of a page with infinite scroll, such as an Instagram or X profile? Also, how can I change autopilot's behaviour so that it only scrolls continuously and does not open each post, which slows the process and increases the size?
  4. Can the data be saved in different formats instead of only one?
  5. Can we replay the archived page exactly as we interact with it online, or will live links not work? Will it be like an archive.org page, which has no live interaction, for example where clicking a link does not behave as it does on the original live page?

If you mean ArchiveWeb.page for archiving, it depends on your preference really - the app provides an isolated browser for archiving, but works the same way.

The data is stored in IndexedDB in the browser, which can definitely handle archives >10GB, since not everything is loaded into memory. Data is loaded for each page, though if a single page has a very large DOM, there’s not much we can do. You can also export WACZ files to disk at any time.

You can also try the crawler if you’re archiving more than one page at a time.

Unfortunately, we don’t yet have a way to add custom behaviors in ArchiveWeb.page, but it’s something we’d like to support. We do have this support in the crawler.
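As a rough illustration of choosing behaviors in the crawler, here is a minimal sketch that only builds the command line. It assumes "the crawler" means Browsertrix Crawler run from its `webrecorder/browsertrix-crawler` Docker image, and that enabling only `autoscroll` (leaving out the site-specific behaviors) is what keeps it from opening each post. Both are assumptions to check against `crawl --help`. The helper name `build_crawl_command` and the `./crawls` output directory are made up for this example.

```python
import os
import shlex


def build_crawl_command(url, collection, behaviors=("autoscroll",), out_dir="./crawls"):
    """Build a docker argv list for a single Browsertrix Crawler run.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # Docker bind mounts are safest with an absolute host path.
    host_dir = os.path.abspath(out_dir)
    return [
        "docker", "run",
        "-v", f"{host_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--collection", collection,
        # Comma-separated list of behaviors to enable on each page.
        "--behaviors", ",".join(behaviors),
        # Package the resulting WARC data as a WACZ file.
        "--generateWACZ",
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("https://example.com/", "test-profile")
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

Printing the command rather than running it makes it easy to review the flags before starting a long crawl.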

We save data in WARC format and export as WACZ, but from there it’s possible to extract data to other formats. We also save screenshots and PDFs in the latest ArchiveWeb.page.
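To give an idea of what extracting from an export can look like: a WACZ file is a ZIP package, so the Python standard library can open it. The sketch below lists the WARC files and the archived page URLs. It assumes the usual WACZ layout, with WARC data under `archive/` and a page list at `pages/pages.jsonl`. The function name `summarize_wacz` is made up for this example, and a real export may contain more than this reads.

```python
import io
import json
import zipfile


def summarize_wacz(wacz_file):
    """Return (warc_names, page_urls) for a WACZ path or file-like object."""
    with zipfile.ZipFile(wacz_file) as zf:
        names = zf.namelist()
        # Assumed layout: raw WARC data is stored under archive/.
        warcs = [
            n for n in names
            if n.startswith("archive/") and n.endswith((".warc", ".warc.gz"))
        ]
        pages = []
        if "pages/pages.jsonl" in names:
            with zf.open("pages/pages.jsonl") as fh:
                for raw in io.TextIOWrapper(fh, encoding="utf-8"):
                    raw = raw.strip()
                    if not raw:
                        continue
                    record = json.loads(raw)
                    # Skip the header line, which has no "url" key.
                    if "url" in record:
                        pages.append(record["url"])
    return warcs, pages
```

Calling `summarize_wacz("my-archive.wacz")` (a placeholder file name) returns the two lists. From there, the WARC files can be extracted with `zipfile` and handed to a WARC-reading tool for conversion to other formats.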

Yes, that’s the idea - if you hit a link that’s not in the archive, it’ll give you an option to load the live page. The idea is to be able to archive content at high-fidelity with all interactions present, but of course, it can be tricky and may not always work.

What are you trying to archive exactly?

Thank you so much for responding.
I am trying to archive a few public profiles from Instagram, X, and Reddit that have 30-50k posts, making sure the archive is as perfect as possible and looks exactly as it does online.

  1. I want to skip video files because they will increase the archive size. Is there a way I can do that?
  2. By crawler, do you mean Browsertrix (Webrecorder)? This is a paid service that does everything for you, right? What do you think are the major pros and cons of this crawler versus the free extension/app version?
  3. In autopilot mode, if I am correct, on an Instagram profile for example, it opens each post one by one and scrolls, right? Is it reliable and accurate? Also, can we speed up the process by controlling the speed and other settings, if available?