PVC Troubleshooting Notes
Discovering the problem
- The issue was already reported on GitHub in December 2016
- describe pod
- describe pvc
- kube-controller-manager logs
kubectl describe pod ceph-static
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 102s (x1480 over 15h) default-scheduler pod has unbound immediate PersistentVolumeClaims
kubectl describe pvc ceph-kube-claim
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 54s (x417 over 15h) persistentvolume-controller Failed to provision volume with StorageClass "rbd": failed to create rbd image: executable file not found in $PATH, command output:
Mounted By: ceph-static
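Two known fixes
- Replace the kube-controller-manager image
- Use CSI
Notes
- The kube-controller-manager image has no rbd binary (verified the same way)
- Problems encountered with the out-of-tree provisioner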
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 11s persistentvolume-controller waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Normal Provisioning 9s ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 External provisioner is provisioning volume for claim "default/claim1"
Warning ProvisioningFailed 2s ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
Mounted By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 7m3s ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
Normal Provisioning 3m15s (x5 over 7m10s) ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 External provisioner is provisioning volume for claim "default/claim1"
Warning ProvisioningFailed 3m14s (x4 over 6m45s) ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 failed to provision volume with StorageClass "rbd": missing Ceph monitors
Normal ExternalProvisioning 57s (x26 over 7m12s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Mounted By: test-pod
Core issue: PV/StorageClass --> fetching the secret fails --> the user is not allowed to get the resource
failed to provision volume with StorageClass "rbd":
failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden:
User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
Workaround: create the same secret in the default namespace as the one in kube-system
- Need to brush up on Kubernetes permissions (RBAC)
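As a rough illustration of that workaround, a copy of the admin secret in the default namespace might look like the manifest below. The secret name comes from the error message; the `kubernetes.io/rbd` type and the `key` field follow the layout commonly used for Ceph RBD secrets, and the value is a placeholder. This is a sketch under those assumptions, not the exact manifest used in these notes.

```yaml
# Sketch only: a copy of the Ceph admin secret placed in the default namespace.
apiVersion: v1
kind: Secret
metadata:
  name: ceph-admin-secret
  namespace: default
type: kubernetes.io/rbd
stringData:
  # Placeholder: the raw key as printed by `ceph auth get-key client.admin`.
  key: REPLACE_WITH_CEPH_ADMIN_KEY
```

For the provisioner to read this copy instead of the one in kube-system, adminSecretNamespace in the StorageClass would presumably also have to point at default.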
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 10s ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 External provisioner is provisioning volume for claim "default/claim1"
Normal ExternalProvisioning 10s persistentvolume-controller waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Warning ProvisioningFailed 10s ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 failed to provision volume with StorageClass "rbd": missing Ceph monitors
Mounted By: test-pod
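Core issue: according to a related issue, this may be a DNS resolution problem. Replace the Ceph address in the StorageClass with an IP so that no name resolution is needed.
Yet another problem: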
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 5m56s (x7 over 21m) ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 External provisioner is provisioning volume for claim "default/claim1"
Warning ProvisioningFailed 5m56s (x7 over 21m) ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6 failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
Normal ExternalProvisioning 64s (x83 over 21m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Mounted By: test-pod
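On top of that, the permission error keeps coming back intermittently.
After changing the namespace of the provisioner Deployment, it would not start at all.
Deploying without RBAC (non-rbac)
Environment changed to: Kubernetes 1.15.1
- Secrets in the default namespace
  - admin
  - kube
- Deployment: provisioner in the default namespace
- StorageClass
- PVC
- Pod
Everything looks normal, but the PVC stays Pending. Checking the provisioner logs: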
kubectl logs -f rbd-provisioner-5dfb574774-5tfcg
...
E1125 07:15:35.106667 1 leaderelection.go:234] error retrieving resource lock default/ceph.com-rbd: endpoints "ceph.com-rbd" is forbidden: User "system:serviceaccount:default:default" cannot get resource "endpoints" in API group "" in the namespace "default"
Related issue: deploying with RBAC is recommended
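The log line above names the exact permission that is missing: the default ServiceAccount in namespace default cannot get endpoints, which leader election uses for its lock. A minimal sketch of a Role and RoleBinding that would grant just that is shown below, assuming the provisioner keeps running as the default ServiceAccount. The object name rbd-provisioner-leader-election is made up for illustration, and a working RBAC deployment needs more rules than this one.

```yaml
# Sketch only: grants the permission named in the leader-election error.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rbd-provisioner-leader-election   # hypothetical name
  namespace: default
rules:
  # The lock lives in an Endpoints object (default/ceph.com-rbd in the log).
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rbd-provisioner-leader-election   # hypothetical name
  namespace: default
subjects:
  # The account named in the error: system:serviceaccount:default:default
  - kind: ServiceAccount
    name: default
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rbd-provisioner-leader-election
```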
Deploying with RBAC
It is still a permission problem after all
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 50s (x6 over 8m35s) ceph.com/rbd_rbd-provisioner-98b88f5d6-n8mk8_a21cced6-26d8-11ea-95e4-12902d150902 External provisioner is provisioning volume for claim "default/claim1"
Warning ProvisioningFailed 50s (x6 over 8m35s) ceph.com/rbd_rbd-provisioner-98b88f5d6-n8mk8_a21cced6-26d8-11ea-95e4-12902d150902 failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
Normal ExternalProvisioning 25s (x42 over 10m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
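According to a related issue, the ClusterRole is most likely missing read permission on secrets.
Even when the reported error is "missing Ceph monitors", the cause is still the permission problem.
Edit clusterrole.yaml and add:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create","get","list","watch"]
After redeploying, the PVC finally works.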
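The ClusterRole change above lets the provisioner read secrets in every namespace. A narrower alternative, sketched here and not tested in these notes, is a Role inside kube-system bound to the provisioner's ServiceAccount in default. The ServiceAccount and secret names are taken from the error messages; the object name rbd-provisioner-secrets is hypothetical, and whether the provisioner needs any secret besides ceph-admin-secret was not checked.

```yaml
# Sketch only: read access to a single secret in kube-system.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rbd-provisioner-secrets   # hypothetical name
  namespace: kube-system
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["ceph-admin-secret"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rbd-provisioner-secrets   # hypothetical name
  namespace: kube-system
subjects:
  # The account named in the error: system:serviceaccount:default:rbd-provisioner
  - kind: ServiceAccount
    name: rbd-provisioner
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rbd-provisioner-secrets
```

A RoleBinding may reference a ServiceAccount from another namespace, which is what makes the cross-namespace read possible without a cluster-wide rule.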
- One remaining minor issue
kubectl describe pvc on the claim still reports the volumeMode as Filesystem:
VolumeMode: Filesystem
Testing the resize feature
RBD supports resizing, which can be tested directly from the command line:
[root@k8s-master01 kube]# rbd ls -p kube
kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392
kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7
[root@k8s-master01 kube]# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
claim1 Bound pvc-ca954649-175b-4668-a667-2336a2470a6c 1Gi RWO rbd 70m
[root@k8s-master01 kube]# rbd info kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7 -p kube
rbd image 'kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1c1af6b8b4567
format: 2
features: layering
flags:
[root@k8s-master01 kube]# rbd info kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392 -p kube
rbd image 'kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1dc536b8b4567
format: 2
features: layering
flags:
[root@k8s-master01 kube]# rbd resize --size 2048 kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7 -p kube
Resizing image: 100% complete...done.
[root@k8s-master01 kube]# rbd resize --size 2048 kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392 -p kube
Resizing image: 100% complete...done.
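What we actually need to test is automatic expansion of the RBD images created through the StorageClass: editing the capacity of an existing PVC should expand the RBD image dynamically, with no manual step.
Rough procedure
- Enable allowVolumeExpansion on the StorageClass
- Enable the Kubernetes admission controller (PersistentVolumeClaimResize)
- Edit the PVC capacity => the image should expand automatically
- Check whether the mounted volume has grown => whether the filesystem has grown
Add to storageclass.yaml: allowVolumeExpansion: true
The complete YAML after the change: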
allowVolumeExpansion: true
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: rbd
provisioner: ceph.com/rbd
parameters:
  monitors: 10.20.9.22:6789
  pool: kube
  adminId: admin
  adminSecretNamespace: kube-system
  adminSecretName: ceph-admin-secret
  userId: kube
  userSecretNamespace: kube-system
  userSecretName: ceph-secret
  imageFormat: "2"
  imageFeatures: layering
Create the PVC, create the pod, then edit the PVC.
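For reference, the edit amounts to raising spec.resources.requests.storage on the claim. The manifest below is a sketch of what the claim might look like after the edit. The claim name claim2 and the StorageClass rbd appear in the events; the access mode and both sizes (1Gi before, 2Gi after) are assumptions, not values recorded in these notes.

```yaml
# Sketch only: the claim after the capacity edit.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim2
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: rbd
  resources:
    requests:
      # Assumed: raised from 1Gi. A claim can only grow, never shrink.
      storage: 2Gi
```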
Observed result: VolumeResizeFailed
Conditions:
Type Status LastProbeTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
Resizing True Mon, 01 Jan 0001 00:00:00 +0000 Wed, 25 Dec 2019 20:13:14 +0800
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 21m persistentvolume-controller waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Normal Provisioning 21m ceph.com/rbd_rbd-provisioner-98b88f5d6-vdl99_bc97b613-26da-11ea-a08c-9e09d9def392 External provisioner is provisioning volume for claim "default/claim2"
Normal ProvisioningSucceeded 21m ceph.com/rbd_rbd-provisioner-98b88f5d6-vdl99_bc97b613-26da-11ea-a08c-9e09d9def392 Successfully provisioned volume pvc-2d89a03c-ca23-401e-bdfa-3daea35b228f
Warning VolumeResizeFailed 9m22s (x18 over 20m) volume_expand
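- The rbd commands are missing, so the controller-manager image has to be replaced after all: change the image directly in /etc/kubernetes/manifests/kube-controller-manager.yaml; the kubelet watches that manifest and restarts the controller-manager automatically
- Edit the PVC
- The pod that mounts the PVC has to be restarted; afterwards the PVC shows the new capacity
- Inside the pod, the size of the mounted directory has changed as well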
TODO
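- Understand the external-storage code: it simply starts a controller
  - Build the config from the command-line flags
  - Create the clientset from the config
  - Obtain the provisioner name and provisioner id
  - Start the provisioner: the logs show name = id = "ceph.com/rbd"
  - Build the complete provision controller and keep it running
- Understand the RBAC user definitions
  - The ServiceAccount as defined has no permission to read secrets in namespace kube-system
  - The ServiceAccount is bound to a Role (which has the secrets permission) and a ClusterRole (which does not), both set up for namespace default
  - The rbd provisioner behind the StorageClass runs as this ServiceAccount, in namespace default
  - Cross-namespace access is needed, so the secrets permission has to be added to the existing ClusterRole
- Test on a 1.12.6 environment
- Test the StorageClass with a domain name in place of the IP
  - It indeed does not work; extra name resolution has to be configured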
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 14s ceph.com/rbd_rbd-provisioner-98b88f5d6-b77qb_4274cbd3-2733-11ea-a019-36c21d3844ee failed to provision volume with StorageClass "rbd": failed to create rbd image: exit status 22, command output: did not load config file, using default settings.
2019-12-25 16:56:16.028 7f431ab1e900 -1 Errors while parsing config file!
2019-12-25 16:56:16.028 7f431ab1e900 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-12-25 16:56:16.028 7f431ab1e900 -1 parse_file: cannot open /root/.ceph/ceph.conf: (2) No such file or directory
2019-12-25 16:56:16.028 7f431ab1e900 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2019-12-25 16:56:16.029 7f431ab1e900 -1 Errors while parsing config file!
2019-12-25 16:56:16.029 7f431ab1e900 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2019-12-25 16:56:16.029 7f431ab1e900 -1 parse_file: cannot open /root/.ceph/ceph.conf: (2) No such file or directory
2019-12-25 16:56:16.029 7f431ab1e900 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
server name not found: ceph (Name or service not known)
unable to parse addrs in 'ceph:6789'
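- Build a controller-manager image that includes ceph-common and use it in place of the stock kube-controller-manager image, to fix the missing rbd command
  - This may also fix the resize failure
- Figure out why the video did not need any of these steps (using external-storage, or replacing the controller-manager)
- How to read the reclaim policy
PersistentVolumes can have several reclaim policies, including "Retain", "Recycle", and "Delete". For dynamically provisioned PersistentVolumes the default reclaim policy is "Delete": when a user deletes the corresponding PersistentVolumeClaim, the dynamically provisioned volume is deleted automatically. That automatic behavior may be inappropriate when the volume holds important data, in which case "Retain" is the better choice. With "Retain", deleting the PersistentVolumeClaim does not delete the PersistentVolume; it moves to the Released phase instead, and the data can then be recovered manually.
Summary:
  - Delete: when the PVC is deleted, the PV is deleted automatically
  - Retain: when the PVC is deleted, the PV is kept
  - Recycle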
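To make the Retain case concrete, the reclaim policy for dynamically provisioned volumes can be set on the StorageClass through its reclaimPolicy field. The sketch below reuses the parameters of the rbd class from these notes under a made-up name, rbd-retain. Whether the external ceph.com/rbd provisioner honors the field was not verified here, so treat this as an assumption.

```yaml
# Sketch only: a StorageClass whose dynamically provisioned PVs are kept
# after their PVC is deleted.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: rbd-retain   # hypothetical name
provisioner: ceph.com/rbd
reclaimPolicy: Retain   # defaults to Delete when omitted
parameters:
  monitors: 10.20.9.22:6789
  pool: kube
  adminId: admin
  adminSecretNamespace: kube-system
  adminSecretName: ceph-admin-secret
  userId: kube
  userSecretNamespace: kube-system
  userSecretName: ceph-secret
  imageFormat: "2"
  imageFeatures: layering
```

The policy in effect for an existing volume shows up in the RECLAIM POLICY column of kubectl get pv.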
- Detailed usage of kubectl edit
- The difference between apply and create
- The upstream Dockerfile for controller-manager
- Why the non-rbac deployment does not run
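- Original author: 战神西红柿
- Original link: https://tomatoares.github.io/posts/storage/pvc/
- Copyright notice: This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For non-commercial reposts, credit the source (author and original link); for commercial reposts, contact the author for permission.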