Request for advice - this is not strictly related to Browsertrix but rather to Kubernetes and networking.
On my workstation, I restored a website that had disappeared; it originally ran on a classic LAMP stack with WordPress. The site was hacked 10+ years ago and was running an old WordPress version with dozens of outdated plugins. I managed to restore it by creating a Docker Compose stack and then cleaning and patching the database.
I manually edited my /etc/hosts to point the original domain to 127.0.0.1. Since the original site didn’t have HTTPS, I’m able to browse a complete working copy of the site.
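For context, the setup is roughly the following (a minimal sketch only; image tags, credentials, and the domain name are placeholders, and my real compose file has more services and volumes):

```yaml
# Minimal sketch of the restored stack (hypothetical names and versions).
# Paired with an /etc/hosts entry on the workstation:
#   127.0.0.1  example-restored-site.tld
services:
  db:
    image: mariadb:10.6                 # assumed version
    environment:
      MYSQL_ROOT_PASSWORD: example
      MYSQL_DATABASE: wordpress
    volumes:
      - ./db-data:/var/lib/mysql
  wordpress:
    image: wordpress:php7.4-apache      # old enough to run the legacy plugins
    ports:
      - "127.0.0.1:80:80"               # plain HTTP, bound to localhost only
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: root
      WORDPRESS_DB_PASSWORD: example
      WORDPRESS_DB_NAME: wordpress
    volumes:
      - ./wp-content:/var/www/html/wp-content
```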
On the same workstation, I run Browsertrix on k3s.
What’s your best advice for crawling the site? Should I move the Compose stack into the same k8s context, or handle this at the DNS resolution or networking level?
All choices require me to study Kubernetes, so please suggest the easiest approach.
I would prefer to use Browsertrix rather than just browsertrix-crawler (which would be easy), because the WordPress site had some calendar plugins that create an endless trap of infinite links, so setting up exclusions through the UI is necessary.
Hm, that’s a good question! I assume you want to crawl it under the original domain, right? I think the simplest option might be to configure Browsertrix to crawl through a proxy running either on the host machine or inside the Docker Compose network, where name resolution is already set up.
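Something along these lines might work (a sketch only: the proxy image, port, and domain are placeholders, and the Browsertrix-side settings depend on your deployment, so check the proxies section of the Browsertrix docs for the exact keys):

```yaml
# Sketch: add a SOCKS5 proxy to the existing Compose stack, so traffic that
# enters through the proxy resolves the original domain to the WordPress
# container via a network alias (no /etc/hosts needed on the crawler side).
services:
  wordpress:
    # ...existing service from your compose file...
    networks:
      default:
        aliases:
          - example-restored-site.tld   # original domain resolves to this container
  socks-proxy:
    image: some/socks5-proxy:latest     # placeholder; any small SOCKS5 server image
    ports:
      - "1080:1080"                     # published on the host so the k3s pods can reach it
    networks:
      - default
```

If your Browsertrix version supports predefined crawler proxies, you would define one pointing at something like `socks5://<workstation-IP>:1080` in the deployment configuration and select it in the crawl workflow; assuming the crawler browsers resolve hostnames through the SOCKS5 proxy (remote DNS), the network alias above is what they will hit rather than the public DNS record.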
Great, thanks! Configuring the proxy seems easy. I’ll try it ASAP.
Yes, I want to crawl it under the original domain. Luckily, the original site was plain HTTP; otherwise, I would have had to create a self-signed certificate and CA and figure out how to use it in a Browsertrix browser.