ArchiveWeb.page: make Autopilot go to the next page automatically.
By default, Autopilot scrolls the page but doesn't navigate to the next page automatically. How can I add this function?
Please help.
Hello. ArchiveWeb.page is designed to record traffic as the user manually navigates between pages.
If you’re looking for a more automated solution that can crawl entire sites or at least multiple pages automatically, I’d recommend you look into Browsertrix, which has a nice friendly UI and loads of additional features such as curation and sharing, or Browsertrix Crawler, the open source crawler which can be run via the command line.
It's a fully paid service, right, which does the crawling for you and saves to cloud storage? But the plans are too limited in how many crawls and pages we can do, and the pricing/plans aren't designed for larger projects where a single website has millions of posts. For example:
I want to archive Instagram profiles and posts from my timeline, my liked posts, saved posts, and following/follower lists; the posts number more than 1 million. Which plan covers that?
(The free ArchiveWeb.page extension crashes after thousands of posts are archived, in both Autopilot and non-Autopilot mode. I guess it has to do with the browser/RAM limitation, i.e. the DOM size limitation.)
Same with X, TikTok, Bluesky, Facebook. Are there any plans that cover that without overpaying?
Also large forum sites that have millions of pages.
Millions of pages is definitely well beyond the scope of what can be reasonably done with our (or any other) browser extension, due to the browser’s own constraints. So you’ll definitely want to look into automated solutions. You should also anticipate that crawling millions of pages with all of their resources will have a large storage footprint, likely in the TBs.
We offer Browsertrix as a hosted service, but the software is also free and open source and can be self-deployed if you have a working knowledge of Kubernetes, as documented here. For our hosted Browsertrix service, we also offer Pro plans, at higher prices than what's listed on the website, which can be configured with much higher limits. Feel free to reach out to sales@webrecorder.net for pricing and other information.
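As a rough illustration of the self-deployment route mentioned above, a self-hosted install is driven by Helm against an existing Kubernetes cluster. This is only a sketch: the chart location, release name, and values file below are placeholders, so check the Browsertrix deployment docs for the current chart URL and required settings.

```
# Assumes kubectl is already pointed at a working Kubernetes cluster.
# <browsertrix-chart> and values.yaml are placeholders; see the official
# Browsertrix deployment docs for the real chart URL and the values
# (storage, superadmin credentials, etc.) you need to override.
helm upgrade --install btrix <browsertrix-chart> -f values.yaml

# Confirm the pods come up before logging into the web UI.
kubectl get pods
```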
Browsertrix Crawler, the actual crawler that Browsertrix uses, can also be run locally on your computer in a Docker image, and is fully documented at https://crawler.docs.browsertrix.com/. With the right scoping set, you could do these crawls via the command line on your own computer.
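For example, a local run could look roughly like this. This is a sketch, not a definitive invocation: the target URL, scope type, page limit, and worker count are illustrative values, and flag names should be checked against the crawler docs for the version you pull.

```
# Pull Browsertrix Crawler and run a crawl in Docker, writing output to ./crawls.
docker pull webrecorder/browsertrix-crawler

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --scopeType prefix \
  --pageLimit 10000 \
  --workers 4 \
  --collection example-crawl \
  --generateWACZ
```

The `--scopeType` and `--pageLimit` settings are what keep a crawl of a very large site bounded to the part you actually want, and `--generateWACZ` packages the result as a WACZ file you can load into ReplayWeb.page.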
With either Browsertrix or Browsertrix Crawler, you also have the ability to create, save, and apply browser profiles to your crawl, which allow you to do automated crawls as a logged-in user.
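A sketch of that profile workflow with Browsertrix Crawler, again illustrative rather than definitive: the port mappings and flags are as I recall them from the profile docs, so verify them there, and the Instagram URLs are just the example from this thread.

```
# Step 1: interactively log in and save the browser profile to a tarball.
# The published ports expose the embedded browser so you can complete the login;
# see the profile docs for the exact ports and URL to open locally.
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ \
  -it webrecorder/browsertrix-crawler create-login-profile \
  --url https://www.instagram.com/ \
  --filename /crawls/profiles/instagram.tar.gz

# Step 2: crawl as the logged-in user by passing the saved profile.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://www.instagram.com/some-profile/ \
  --profile /crawls/profiles/instagram.tar.gz \
  --generateWACZ
```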
Thank you for responding. So what are the options (I'm fine with TBs of data),
since you said millions of posts is beyond the capacity of any extension or standalone app to archive?
I bet the majority are running Browsertrix on their own computers for free? What's the advantage of paying, apart from cloud storage (to share, store, and manage in the cloud)?
So you're saying it's impossible to archive millions of posts from Instagram using the ArchiveWeb.page Chrome extension or standalone app because of what, exactly? (The limitations being DOM size and RAM usage.) Can't it be done in batches, e.g. Autopilot archives the first 2000 posts, saves the DOM, releases the RAM and DOM space it occupied, then runs again from post 2000 to 4000 and repeats the process? But the problem is that it refreshes the page every time it reruns, right? So is there a way to archive only pages 100-200 using Autopilot?
Can Browsertrix running on my own computer archive all the millions of posts into a single DOM, or will it work in batches, saving a few thousand, saving the DOM, and then saving the rest likewise?
How do Browsertrix, the ArchiveWeb.page extension/app, and the command-line Docker crawler save from sites that have millions of posts? Do they do it in one go, or in batches resulting in many DOM files (WACZ)? Please help me understand.