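Pods on a node cannot be deleted: troubleshooting kubelet
Problem
While deleting pods, I found that the resources on one particular node were stuck in the Terminating state: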
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
default test 1/1 Terminating 20 26d 192.168.196.133 node01 <none>
kube-system fabric-node-7p2z8 0/2 Terminating 0 21m <none> node01 <none>
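On a busy cluster the stuck pods are easier to spot with a filter. The following is a minimal sketch, not part of the original troubleshooting session: the helper name terminating_on_node is hypothetical, and it assumes the whitespace-separated column layout shown above (STATUS in column 4, NODE in column 8), which may differ between kubectl versions.

```shell
# Hypothetical helper: print "namespace/name" for every pod that is in the
# Terminating state on the given node.
# Reads the output of `kubectl get pods -A -o wide` on stdin.
# Assumes STATUS is field 4 and NODE is field 8, as in the listing above.
terminating_on_node() {
  awk -v node="$1" '$4 == "Terminating" && $8 == node { print $1 "/" $2 }'
}
```

It could be used as `kubectl get pods -A -o wide | terminating_on_node node01`.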
Running describe on the pods and checking their events shows that some have no events at all, while others stop at the Scheduled step:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned kube-system/fabric-node-7p2z8 to node01
After this point the kubelet should take over and clean up the pod, but nothing happens. So let's check the kubelet status on the affected node:
[root@node01 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since 一 2020-01-20 15:21:47 CST; 5s ago
Docs: https://kubernetes.io/docs/
Process: 23737 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 23737 (code=exited, status=255)
1月 20 15:21:47 node01 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
1月 20 15:21:47 node01 systemd[1]: Unit kubelet.service entered failed state.
1月 20 15:21:47 node01 systemd[1]: kubelet.service failed.
The kubelet is down and stuck in a restart loop, but the status output does not say why. The kubelet log should have more useful information:
[root@node01 ~]# journalctl -l -u kubelet
...
Jan 20 15:05:34 node01 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Jan 20 15:05:34 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 20 15:05:34 node01 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 20 15:05:34 node01 kubelet[1797]: F0120 15:05:34.624977 1797 server.go:190] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such
Jan 20 15:05:34 node01 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Jan 20 15:05:34 node01 systemd[1]: Unit kubelet.service entered failed state.
Jan 20 15:05:34 node01 systemd[1]: kubelet.service failed.
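The one useful line here is easy to miss among the systemd restart messages. As a rough sketch (the helper name kubelet_fatal_lines is hypothetical), the filter below keeps only the kubelet's fatal log lines. It relies on the klog convention that each line starts with a severity letter, here F for fatal, followed by the month and day, and it assumes the journal line format shown above.

```shell
# Hypothetical helper: keep only fatal klog lines emitted by the kubelet.
# Reads journal output on stdin.
# klog lines start with a severity letter (I, W, E or F) followed by MMDD,
# so a fatal line logged on January 20 begins with "F0120".
kubelet_fatal_lines() {
  grep -E 'kubelet\[[0-9]+\]: F[0-9]{4} '
}
```

It could be used as `journalctl -u kubelet --no-pager | kubelet_fatal_lines | tail -n 5`.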
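The log shows that the kubelet configuration file /var/lib/kubelet/config.yaml is missing, and checking the node confirms it. So how do we fix this?
A Google/Baidu search mostly turns up advice to redeploy. The reason is that this file is generated by kubeadm init or kubeadm join, so once it is lost the fix is to generate it again. The affected node is a worker node, so I simply rejoined it to the cluster: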
# On the master
# Generate a new token
kubeadm token create
[root@master kubelet]# kubeadm token list
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
mwy6r6.wc7s9fkwsyth85xq 23h 2020-01-21T15:50:05+08:00 authentication,signing <none> system:bootstrappers:kubeadm:default-node-token
# Compute the SHA-256 hash of the cluster CA public key (used for --discovery-token-ca-cert-hash)
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
# Run on the node to join the cluster. Note that the hash must carry the sha256: prefix, which names the hash algorithm
kubeadm join 10.20.9.12:6443 \
--token mwy6r6.wc7s9fkwsyth85xq \
--discovery-token-ca-cert-hash \
sha256:aceb1a082cdffa655e77f89c25aa0e5ad24e4ef5b41a6aa459131890aef0d7c6
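Because the token and the hash are copied by hand between machines, it may be worth checking them before running the join. The helper below is only a sketch with a hypothetical name, build_join_command. It checks that the token has the kubeadm bootstrap token shape (6 characters, a dot, 16 characters) and that the hash is 64 hexadecimal characters, then prints the join command without running it.

```shell
# Hypothetical helper: print a kubeadm join command after sanity-checking
# its pieces. It only prints the command; it does not run it.
# Usage: build_join_command <endpoint> <token> <ca-cert-sha256-hex>
build_join_command() {
  endpoint="$1"; token="$2"; hash="$3"
  # kubeadm bootstrap tokens have the form [a-z0-9]{6}.[a-z0-9]{16}
  printf '%s\n' "$token" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$' || {
    echo "invalid token format" >&2; return 1; }
  # a SHA-256 digest is 64 hexadecimal characters
  printf '%s\n' "$hash" | grep -Eq '^[0-9a-f]{64}$' || {
    echo "invalid CA cert hash" >&2; return 1; }
  printf 'kubeadm join %s --token %s --discovery-token-ca-cert-hash sha256:%s\n' \
    "$endpoint" "$token" "$hash"
}
```

Alternatively, running `kubeadm token create --print-join-command` on the master prints a complete join command in one step, which avoids computing the hash by hand.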
The whole cluster is now back to normal, and the missing file at /var/lib/kubelet/config.yaml has been regenerated.
Mission Complete!!!
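- Author: 战神西红柿
- Original link: https://tomatoares.github.io/posts/cloud/kubelet-Q/
- License: This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For non-commercial reposts, please credit the source (author and original link); for commercial reposts, please contact the author for permission.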