Kubernetes pod help

wxsampson · October 4, 2024, 5:14pm

I’m running Browsertrix locally with Docker, Helm and Kubernetes.

I set a crawl overnight, and it appears to have caused some chaos with Kubernetes.

kubectl wait --for=condition=ready pod --all --timeout=300s

…prints a long list of timed-out pods, and I’m not sure how to begin troubleshooting this so that Kubernetes runs again and I can get back to the local Browsertrix page. I want to keep all my previous crawl data, so I’m trying to avoid a mass delete of pods.

Any help appreciated. I’m comfortable on the command line but not very familiar with Kubernetes.

wvengen · October 8, 2024, 7:04am

Hi! Nice to hear you’re running Browsertrix in Kubernetes. I think running your own instance of Browsertrix is not completely trivial, so be prepared to do some studying if you’re somewhat new to Kubernetes.
In this case, it could be that the job is being restarted multiple times because of an error condition. This can happen, for example, when there are not enough resources (e.g. the job runs into a memory limit, or perhaps storage is full): the job fails and is re-run. If you have a cluster with nodes that appear and disappear, it could also be that jobs are rescheduled on another node (but this is unlikely to keep happening).
I would suggest starting with a small job (e.g. one page) and seeing if you can get that working.
Also, dive deeper into the reasons. Use kubectl describe pod <pod name> to see what could be going on, and look at the logs with kubectl logs <pod name>. If you have Prometheus running, see what happened around the time a job was (re)started. Perhaps the Kubernetes Dashboard could help here too (I’m not really sure what metrics it keeps, though).
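
To make the describe/logs step a bit more concrete, here is a rough sketch that loops over every pod whose status is not Running or Completed and prints the end of its description (where the events usually are) plus its recent logs. The function name inspect_unhealthy_pods is made up for this example, and it assumes your current kubectl context and namespace already point at the Browsertrix install; add -n <namespace> to each kubectl call if they do not. It only reads information and does not delete anything.

```shell
# Hypothetical helper: inspect pods that are neither Running nor Completed.
# Assumes the current kubectl context/namespace is the Browsertrix one.
inspect_unhealthy_pods() {
  # STATUS is the third column of `kubectl get pods --no-headers`.
  kubectl get pods --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1 }' \
    | while read -r pod; do
        echo "=== $pod ==="
        # The Events section sits at the end of the describe output.
        kubectl describe pod "$pod" | tail -n 20
        # Prefer logs from the previous (crashed) container, if there is one;
        # otherwise fall back to the current container's logs.
        kubectl logs "$pod" --tail=50 --previous 2>/dev/null \
          || kubectl logs "$pod" --tail=50
      done
}
```

Running inspect_unhealthy_pods and looking for things like OOMKilled, failed scheduling, or volume errors in the output should narrow down which of the causes above applies. Note that it will not list pods that are Running but not ready; for those, run kubectl describe pod on them by hand.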

If you really want to get the data out, you could look into gathering all the files from the persistent volumes and creating the WACZ manually from there. But since the crawl was only running for one night, that might not be worth the effort.

Hope this helps a bit.