Etcd Down After Running A Day Or Less
etcd goes down after running for an unpredictable length of time
- Problem
I deployed a Kubernetes cluster on 3 VMs (4C8G, i.e. 4 cores and 8 GB of RAM). After running for an unpredictable length of time (a day, or sometimes just an hour), the k8s services became unavailable.
At first, I found that the etcd and apiserver containers were down and were being restarted by kubelet. I tried stopping the kubelet service and the docker service, then starting docker and then kubelet again.
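The restart sequence described above can be sketched as a small shell function. This is only an illustration: it assumes kubelet and docker run as systemd units named "kubelet" and "docker", and the function name and the DRY_RUN switch are made up for this example.

```shell
# Restart sequence from the post: stop kubelet, stop docker,
# then start docker, then start kubelet.
# Assumption: both services are systemd units with these names.
# DRY_RUN=1 prints the commands instead of running them.
restart_k8s_runtime() {
    for step in "stop kubelet" "stop docker" "start docker" "start kubelet"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "systemctl $step"
        else
            # $step is left unquoted on purpose so it splits into verb and unit
            systemctl $step || return 1
        fi
    done
}
```

Running `DRY_RUN=1 restart_k8s_runtime` previews the four commands; running it as root without DRY_RUN executes them in order and stops at the first failure.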
- First try
This did not work every time: sometimes I had to repeat the procedure three more times, and sometimes I had to reboot the VM hosting the etcd container.
- Checking log
When I checked the messages log, nothing useful turned up.
When I checked the etcd container log with "docker logs ETCD-CONTAINER-ID", there were no errors, yet kubelet kept restarting etcd over and over.
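For anyone repeating these checks, here is a rough sketch of the log lookups. It assumes a kubeadm-style setup on Docker where kubelet names the container with a "k8s_etcd" prefix, and that kubelet logs go to the systemd journal. Both helper names are hypothetical, and the exact wording of kubelet's probe messages varies between versions, so the filter only looks for the word "liveness".

```shell
# Print the tail of the newest etcd container's log, including exited ones.
# Assumption: the container name starts with "k8s_etcd".
etcd_container_log() {
    cid=$(docker ps -a --filter "name=k8s_etcd" --format '{{.ID}}' | head -n 1)
    [ -n "$cid" ] || { echo "no etcd container found" >&2; return 1; }
    docker logs --tail 50 "$cid"
}

# Keep only the log lines on stdin that mention liveness probes.
probe_failures() {
    grep -i 'liveness'
}
```

A possible use is `journalctl -u kubelet --since "1 hour ago" --no-pager | probe_failures`. A clean etcd log combined with probe-related lines in the kubelet log would suggest that kubelet, not etcd itself, is ending the container.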
- Fixed
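After a week of struggling with this problem, it suddenly occurred to me that performance might be the cause.
I checked the YAML of the etcd pod and found that "initialDelaySeconds" and "timeoutSeconds" for etcd were both just 15. Presumably etcd on these small VMs sometimes took longer than that to answer its health probe, so kubelet killed and restarted it. I raised both values to 60
and repeated the procedure above to restart the k8s cluster services.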
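The probe change above can be sketched as a small sed wrapper. This is an illustration under assumptions: the function name is made up, and the path in the usage note is the usual kubeadm location for the etcd manifest, which may differ on other setups. The function rewrites every "initialDelaySeconds" and "timeoutSeconds" line in the file it is given, so if the manifest defines more than one probe, all of them are changed.

```shell
# Print a copy of a manifest with every initialDelaySeconds and
# timeoutSeconds value replaced. The input file is not modified.
# Usage: raise_probe_timeouts FILE [SECONDS]   (SECONDS defaults to 60)
raise_probe_timeouts() {
    secs=${2:-60}
    sed -e "s/^\([[:space:]]*initialDelaySeconds:\).*/\1 $secs/" \
        -e "s/^\([[:space:]]*timeoutSeconds:\).*/\1 $secs/" \
        "$1"
}
```

One way to use it is to copy /etc/kubernetes/manifests/etcd.yaml to a location outside the manifests directory, then run `raise_probe_timeouts` on that copy and redirect the output back to the original path. Keeping the backup outside the manifests directory avoids kubelet picking it up as a second pod definition.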
The problem was fixed. :)