Troubleshooting and debugging in a Kubernetes environment

Submitted by Admin on

Application containers running in Kubernetes are usually slim and don't include the tools needed for debugging. Most of the time they also don't run as root, so you can't capture packets from inside them. This article discusses some alternatives.

TCPDUMP

1) Most engineers use tcpdump to capture network packets when analyzing connectivity issues. Since containers normally neither give you root nor have tcpdump installed, we can run tcpdump on the host instead. If you are using Google Kubernetes Engine (GKE), the GKE worker nodes are just normal GCE VMs.

2) To capture traffic for a given pod, first find the pod's IP and the node it is running on (eg: kubectl get pod -o wide), then find that node's IP (kubectl get node -o wide) and ssh to it.

3) By default, GKE nodes run Container-Optimized OS (COS), which doesn't ship with tcpdump or other debugging tools. Once you ssh into the node, run toolbox to get into a Debian debug environment where you can install and run tools such as tcpdump using the Debian package manager,

eg: apt install tcpdump

4) Run tcpdump with a filter for the pod IP or the application's traffic port. If you ran tcpdump inside "toolbox" and saved a pcap file, the file may end up in a toolbox-specific directory on the node. You may need to run the command below to find the exact location: find / -name file.pcap
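Steps 2 and 4 can be scripted roughly as follows. This is only a sketch: the helper names pod_location and pod_capture_filter are made up for illustration, and the pod name, namespace, IP, and port you pass in are your own values.

```shell
# Hypothetical helper: print "<pod IP> <node name>" for a pod.
# Uses kubectl's jsonpath output; the namespace defaults to "default".
pod_location() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.podIP} {.spec.nodeName}'
}

# Hypothetical helper: build a tcpdump filter for a pod IP and an optional port.
pod_capture_filter() {
  if [ -n "${2:-}" ]; then
    printf 'host %s and port %s\n' "$1" "$2"
  else
    printf 'host %s\n' "$1"
  fi
}
```

On the node, a capture limited to one pod could then look like: tcpdump -i any -nn -w /tmp/file.pcap $(pod_capture_filter 10.8.0.5 8080), where 10.8.0.5 and 8080 are example values.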

Debug pods without shell

Sometimes we need to check connections (netstat) inside a pod that has neither a shell nor netstat installed. We can work around this by finding the worker node the pod is running on and logging in to that node to run netstat.

$ kubectl exec -it pod-name -- sh

OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown

command terminated with exit code 126

Since the pod doesn't have a shell, we can find the node where the pod is running (kubectl get pods -o wide | grep pod-name) and ssh to that node to run netstat. Once you are on the node, find the PID (eg: 1234) of the Docker container process corresponding to your pod, then run:

$ sudo nsenter -t 1234 --net netstat -an

This approach also works for other network commands installed on the worker node: nsenter runs them in the network namespace of the application container that we can't shell into with a normal kubectl exec.
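One way to find that PID on a Docker-based node is docker inspect. The sketch below assumes the node uses Docker as its container runtime and that you already know the container ID (for example from docker ps); container_pid and netstat_in_container are hypothetical helper names.

```shell
# Hypothetical helper: print the host PID of a container's main process.
container_pid() {
  docker inspect --format '{{.State.Pid}}' "$1"
}

# Hypothetical helper: run netstat inside that container's network namespace.
netstat_in_container() {
  pid="$(container_pid "$1")" || return 1
  sudo nsenter -t "$pid" --net netstat -an
}
```

Calling netstat_in_container with the container ID then replaces the manual PID lookup. Since all containers in a pod share one network namespace, any container belonging to the pod should give the same view.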

If we really need to shell into the running pod, we can launch a busybox container as root to act as a dummy sidecar, sharing the network and process namespaces of the application container.

$ docker run --privileged -v /tmp:/temp --rm --pid container:container_id --net container:container_id -it busybox /bin/sh

With this, we can see the main application's processes and network namespace, but with our own filesystem. Since we are root, we can install additional tools as needed if we use another container image such as alpine (apk add) or debian (apt install). The docker run option -v /tmp:/temp gives us a directory, /temp, to write output files to, so that they remain accessible from the worker node's /tmp after we exit the container.
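The same command can be wrapped so that the image becomes a parameter, which makes it easy to swap busybox for alpine or debian. A minimal sketch, with debug_sidecar as a hypothetical helper name:

```shell
# Hypothetical helper: start a root debug container that shares the PID and
# network namespaces of an existing container.
debug_sidecar() {
  cid="$1"
  image="${2:-busybox}"   # defaults to busybox, as in the command above
  docker run --privileged --rm -it \
    -v /tmp:/temp \
    --pid "container:$cid" --net "container:$cid" \
    "$image" /bin/sh
}
```

For example, debug_sidecar container_id alpine opens an alpine shell next to the application, where apk add tcpdump can install the capture tool and pcap files written under /temp end up in the node's /tmp.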