Problem

While deleting pods, I found that the resources on one particular node were stuck in the Terminating state:

NAMESPACE     NAME                             READY   STATUS              RESTARTS   AGE   IP                NODE     NOMINATED NODE
default       test                             1/1     Terminating         20         26d   192.168.196.133   node01   <none>
kube-system   fabric-node-7p2z8                0/2     Terminating         0          21m   <none>            node01   <none>
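
To confirm that the stuck pods all sit on the same node, the listing can be filtered by status and node. The helper below is only a sketch: the function name terminating_on_node is hypothetical, and it assumes the input has the same columns as the listing above (output of kubectl get pods --all-namespaces -o wide).

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces -o wide` output
# on stdin and prints "namespace/name" for every pod that is Terminating on
# the given node. Assumes the column layout shown in the listing above.
terminating_on_node() {
  awk -v node="$1" '
    NR == 1 {
      # Header row: locate the STATUS and NODE columns by name.
      # "NOMINATED NODE" also contains the word NODE, so keep the first match.
      for (i = 1; i <= NF; i++) {
        if ($i == "STATUS" && !s) s = i
        if ($i == "NODE" && !n)   n = i
      }
      next
    }
    s && n && $s == "Terminating" && $n == node { print $1 "/" $2 }
  '
}

# Usage (needs access to the cluster):
#   kubectl get pods --all-namespaces -o wide | terminating_on_node node01
```

If every stuck pod is on one node, the node itself (rather than the individual pods) is the likely culprit.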

Running kubectl describe pod and checking the events shows that some pods have no events at all, while others stop at the Scheduled event:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  22m   default-scheduler  Successfully assigned kube-system/fabric-node-7p2z8 to node01

From here, the kubelet should take over and clean up the pod, but nothing happens. Let's check the kubelet status on that node:

[root@node01 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since 一 2020-01-20 15:21:47 CST; 5s ago
     Docs: https://kubernetes.io/docs/
  Process: 23737 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
 Main PID: 23737 (code=exited, status=255)

1月 20 15:21:47 node01 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
1月 20 15:21:47 node01 systemd[1]: Unit kubelet.service entered failed state.
1月 20 15:21:47 node01 systemd[1]: kubelet.service failed.

As we can see, kubelet is down, but the status output does not say why. Let's look at the kubelet logs for something more useful:

[root@node01 ~]# journalctl -l -u kubelet
...
Jan 20 15:05:34 node01 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Jan 20 15:05:34 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 20 15:05:34 node01 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 20 15:05:34 node01 kubelet[1797]: F0120 15:05:34.624977    1797 server.go:190] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such
Jan 20 15:05:34 node01 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Jan 20 15:05:34 node01 systemd[1]: Unit kubelet.service entered failed state.
Jan 20 15:05:34 node01 systemd[1]: kubelet.service failed.
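
Because systemd keeps restarting kubelet, the journal fills up with repeated start/stop lines. A minimal sketch for keeping only the fatal lines, assuming the log format shown above (kubelet writes fatal messages with an F followed by the month and day, e.g. F0120); the function name kubelet_fatal_lines is hypothetical:

```shell
# Hypothetical helper: reads journal output on stdin and keeps only the lines
# that kubelet itself logged at fatal level ("F" + MMDD, as in "F0120").
kubelet_fatal_lines() {
  grep -E 'kubelet\[[0-9]+\]: F[0-9]{4} '
}

# Usage (on the affected node):
#   journalctl -u kubelet --no-pager | kubelet_fatal_lines | tail -n 5
```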

The log shows that the kubelet config file /var/lib/kubelet/config.yaml is missing, and checking the node confirms it. So how do we fix this?

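A search on Google/Baidu mostly turns up advice to redeploy, because this file is generated by kubeadm init/join. Since the config file is lost, the fix is to generate it again. This node is a worker node, so I simply re-joined it to the cluster.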

// On the master
// Generate a new token
kubeadm token create

[root@master kubelet]# kubeadm token list
TOKEN                     TTL       EXPIRES                     USAGES                   DESCRIPTION   EXTRA GROUPS
mwy6r6.wc7s9fkwsyth85xq   23h       2020-01-21T15:50:05+08:00   authentication,signing   <none>        system:bootstrappers:kubeadm:default-node-token

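// Compute the SHA-256 hash of the CA certificate's public key (the value for --discovery-token-ca-cert-hash)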
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

// Run on the node to join the cluster. Note: the hash must carry the sha256: prefix, which names the hash algorithm
kubeadm join 10.20.9.12:6443 \
  --token mwy6r6.wc7s9fkwsyth85xq \
  --discovery-token-ca-cert-hash \
  sha256:aceb1a082cdffa655e77f89c25aa0e5ad24e4ef5b41a6aa459131890aef0d7c6
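
The join command is just the API server endpoint, the token, and the CA hash put together, so a typo in any of them is easy to make. The sketch below assembles the command and rejects malformed values first. The function name build_join_command is hypothetical; the token pattern (6 characters, a dot, 16 characters) matches the token shown above. As an alternative to the manual steps, kubeadm token create --print-join-command on the master prints a complete join command.

```shell
# Hypothetical helper: assemble a `kubeadm join` command from its three parts,
# refusing values that do not have the expected shape.
build_join_command() {
  endpoint="$1"; token="$2"; hash="$3"

  # Bootstrap tokens look like <6 chars>.<16 chars>, lowercase letters and digits.
  printf '%s\n' "$token" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$' || {
    echo "invalid token: $token" >&2; return 1; }

  # The CA hash is a SHA-256 digest: 64 hex characters, passed WITHOUT the
  # sha256: prefix (the prefix is added below).
  printf '%s\n' "$hash" | grep -Eq '^[0-9a-f]{64}$' || {
    echo "invalid hash: expected 64 hex characters" >&2; return 1; }

  printf 'kubeadm join %s --token %s --discovery-token-ca-cert-hash sha256:%s\n' \
    "$endpoint" "$token" "$hash"
}

# Usage:
#   build_join_command 10.20.9.12:6443 <token> <hash>
```

The function only prints the command; review it and run it on the node yourself.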

The whole cluster is now back to normal, and /var/lib/kubelet/config.yaml exists again on the node.
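
A small sketch for re-checking this later, for example after a reboot. The function name kubelet_config_present is hypothetical, and the default path is the one from the error message above:

```shell
# Hypothetical helper: succeed only if the kubelet config file exists and is
# not empty. The path defaults to the one kubelet complained about.
kubelet_config_present() {
  path="${1:-/var/lib/kubelet/config.yaml}"
  if [ -s "$path" ]; then
    echo "ok: $path"
  else
    echo "missing or empty: $path" >&2
    return 1
  fi
}

# Usage (on the node):
#   kubelet_config_present && systemctl is-active kubelet
# Usage (on the master):
#   kubectl get nodes
```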

Mission Complete!!!