Archive a locally running website

Request for advice - this is not strictly related to Browsertrix but rather to Kubernetes and networking.

I restored a disappeared website on my workstation that was running a classic LAMP stack with WordPress. The site was hacked 10+ years ago and was running an old WordPress version with dozens of outdated plugins. I managed to restore it by creating a Docker Compose stack and later cleaning and patching the database.

I manually edited my /etc/hosts to point the original domain to 127.0.0.1. Since the original site didn’t have HTTPS, I’m able to browse a complete working copy of the site.
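To confirm that the hosts override is actually in effect, a quick check along these lines may help. This is only a sketch: the function name `resolves_to_loopback` is made up for illustration, and it only looks at IPv4 results from the system resolver (which consults /etc/hosts on a typical Linux setup).

```python
import ipaddress
import socket


def resolves_to_loopback(hostname: str) -> bool:
    """Return True if every IPv4 address the OS resolver returns is a loopback address."""
    try:
        infos = socket.getaddrinfo(
            hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # The name could not be resolved at all.
        return False
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_loopback for address in addresses
    )
```

Calling `resolves_to_loopback(...)` with the original domain should return `True` on the workstation once the /etc/hosts entry is active. Note that this says nothing about what pods inside k3s will resolve, since they do not read the host's /etc/hosts.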

On the same workstation, I run Browsertrix on k3s.

What’s your best advice for crawling the site? Should I move the Compose stack inside the same k8s context? Should I work with DNS resolution or networking?

All choices require me to study Kubernetes, so please suggest the easiest approach.
I would prefer to use Browsertrix rather than just browsertrix-crawler (which would be easy), because the WordPress site had some calendar plugins that create an endless trap of infinite links, so being able to exclude them in the UI is necessary.

Hm, that’s a good question! I assume you want to crawl it under the original domain, right? I think the simplest option might be to configure Browsertrix to crawl through a proxy, which is running on the host machine or inside the Docker Compose network where everything is already configured.

It sounds like the main question is how to access the host network from within the k3s cluster. From https://stackoverflow.com/questions/74795408/clean-way-to-connect-to-services-running-on-the-same-host-as-the-kubernetes-clus it looks like k3s supports accessing the host directly by its hostname, so perhaps that will work?

Then you can run 3proxy (GitHub: tarampampam/3proxy-docker, a Docker image with 3proxy, a tiny free proxy server), either inside the Compose network with the proxy port mapped, or directly on the host. After that, configure Browsertrix to use a proxy as described in "Configuring Proxies" in the Browsertrix Docs, and point it at the mapped port.
Hopefully this will work!
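Before pointing Browsertrix at the proxy, it may be worth checking from the host that the proxy really serves the site under its original domain. Here is a minimal sketch, assuming the proxy speaks plain HTTP proxying; the helper names and the host/port values in the usage note are placeholders, not something taken from the 3proxy or Browsertrix docs. With an HTTP proxy, the hostname is resolved on the proxy side, which is why the proxy needs to run somewhere the original domain already points at the restored site.

```python
import urllib.request


def proxy_map(proxy_host: str, proxy_port: int) -> dict:
    """Build the proxy mapping that urllib's ProxyHandler expects (plain HTTP only)."""
    return {"http": f"http://{proxy_host}:{proxy_port}"}


def fetch_status_via_proxy(
    url: str, proxy_host: str, proxy_port: int, timeout: float = 10.0
) -> int:
    """Request url through the given HTTP proxy and return the response status code.

    Raises urllib.error.URLError (or its subclass HTTPError) if the proxy is
    unreachable or the site answers with an error status.
    """
    handler = urllib.request.ProxyHandler(proxy_map(proxy_host, proxy_port))
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.status
```

For example, `fetch_status_via_proxy("http://<original-domain>/", "localhost", 3128)` should return 200 if everything is wired up, where `localhost` and `3128` stand in for wherever the proxy port ends up being mapped.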

Great, thanks! Configuring the proxy seems easy. I’ll try it ASAP.
Yes, I want to crawl it under the original domain. Luckily, the original site was plain HTTP; otherwise, I would have had to create a self-signed certificate and CA and figure out how to use it in a Browsertrix browser.

I’ll keep you updated on my progress.