We’re surveying options for our institution, including both self-hosted and cloud deployments, and I’m unfamiliar with the self-hosted infrastructure specs for Browsertrix. What are the current minimums and the optimal standards for running Browsertrix locally?
I’m assuming I’d follow the recommended requirements for a Kubernetes cluster with Helm 3, which seem to be 8 CPU cores, 32 GB of RAM, and enough storage to handle whatever it is we’re trying to archive. For those who are self-hosting, or the Webrecorder engineers, any insight would be really helpful.
We recently deployed a self-hosted instance of Browsertrix on our own server. I’m not sure if there are any official or recommended minimums, but we’re working with 4 CPU cores and 24 GB of RAM.
This amount of RAM seems to give the deployment plenty of headroom; you can adjust just how much of that RAM you want to allocate to your crawlers based on your workload. So far, I’ve found this setup adequate to run three simultaneous crawls with 6 windows active.
I would suspect that 8 CPU cores and 32 GB of RAM would be more than sufficient for a single-machine deployment doing a typical amount of crawling.
Keep in mind that the resources (CPU and RAM) used by the backend, frontend, database, Redis, and crawler node containers are all customizable via the Helm chart. For our production deployments, we increase many of these values beyond the defaults by overriding them with a local chart, as documented in the Self-Hosting Guide. The Helm chart can also be used to manage horizontal pod autoscaling of the backend and frontend, which will automatically spin up additional backend and frontend containers as needed based on observed resource usage.
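As a rough illustration, a local override file might look something like the sketch below. Note that the key names here are hypothetical placeholders, not necessarily the actual values used by the Browsertrix chart; check the chart’s default `values.yaml` and the Self-Hosting Guide for the real keys before using anything like this.

```yaml
# local-overrides.yaml — illustrative sketch only.
# Key names are assumptions; consult the Browsertrix Helm chart's
# default values.yaml for the actual configuration keys.

# Resources for core service pods
backend_cpu: "500m"
backend_memory: "1Gi"
frontend_cpu: "250m"
frontend_memory: "512Mi"

# Resources allocated per crawler node / browser window
crawler_cpu: "1000m"
crawler_memory: "2Gi"
```

A file like this would then be passed to Helm when installing or upgrading, e.g. `helm upgrade --install browsertrix <chart> -f local-overrides.yaml` (chart name and source omitted here; see the Self-Hosting Guide for the exact command).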
Resource needs will largely depend on the load. Running many large crawls simultaneously increases resource usage by spinning up additional crawler containers, requiring more resources per crawler node/browser, and increasing the demands on the backend (including for backend jobs like file replication) and database. So resource usage will differ greatly depending on whether average load is at the more typical end of a handful of simultaneous crawls using a modest number of browser windows and crawling thousands of pages per crawl (as Alex’s example illustrates), or at the other end of many simultaneous large crawls, each employing a large number of browser windows and crawling millions of pages. To illustrate the deep end of the pool, we know of at least one institution that manages a self-hosted installation at the latter scale: it uses an 8 CPU/32 GB machine for the core services plus two additional machines of 24–36 CPUs and 64–128 GB of RAM dedicated exclusively to crawler nodes, but that is not typical of most deployments.
Our own deployments are not on a single machine but on managed Kubernetes clusters that spin up additional machines as needed, but I suspect @ilya will have additional useful thoughts to share.
It would also be great to include this type of information in our documentation, so thank you for asking the question, and thanks to everyone who answers!
Thank you so much, @alexd and @tessa-webrecorder. That really helps in terms of determining our infrastructure request to our IT department to see if this is even possible for them to support. I’m a little out of my depth in understanding how these operations scale up, and your examples helped a lot.
Our project shouldn’t be as ambitious as what Tessa described with the addition of two more machines. We’re looking to archive maybe hundreds of thousands of pages rather than into the millions.