Discovering the problem

  • This was already raised on GitHub back in December 2016
  1. describe pod
  2. describe pvc
  3. kube-controller logs

kubectl describe pod ceph-static

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  102s (x1480 over 15h)  default-scheduler  pod has unbound immediate PersistentVolumeClaims

 kubectl describe  pvc ceph-kube-claim

  Type       Reason              Age                  From                         Message
  ----       ------              ----                 ----                         -------
  Warning    ProvisioningFailed  54s (x417 over 15h)  persistentvolume-controller  Failed to provision volume with StorageClass "rbd": failed to create rbd image: executable file not found in $PATH, command output:
Mounted By:  ceph-static
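
The `executable file not found in $PATH` message suggests that the component doing the provisioning cannot find the `rbd` client. A minimal sketch for checking this, assuming the controller manager runs as a pod in `kube-system` and that its image ships `sh` (neither is confirmed by the notes); the pod name in the example call is hypothetical:

```shell
# Report whether the rbd client exists inside a given pod.
# Assumptions: the pod lives in kube-system and its image contains `sh`.
has_rbd_binary() {
  pod="$1"
  if kubectl -n kube-system exec "$pod" -- sh -c 'command -v rbd' >/dev/null 2>&1; then
    echo "rbd found in $pod"
  else
    echo "rbd missing in $pod"
  fi
}

# Example call; the pod name below is a placeholder, not taken from the notes.
# has_rbd_binary kube-controller-manager-k8s-master01
```

A "missing" result also covers the case where the exec itself fails, so it is worth running the exec by hand once if the answer is surprising.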

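Two known solutions

  1. Replace the kube-controller image
  2. Use CSI

Notes

  1. The kube-controller image has no rbd binary; verify in the same way

  2. Replace the image

  3. Use CSI: in-tree --> out-of-tree, sidecar mode

  4. Problems encountered with the out-of-tree approach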

Events:
  Type       Reason                Age   From                                                                              Message
  ----       ------                ----  ----                                                                              -------
  Normal     ExternalProvisioning  11s   persistentvolume-controller                                                       waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
  Normal     Provisioning          9s    ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  External provisioner is provisioning volume for claim "default/claim1"
  Warning    ProvisioningFailed    2s    ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
Mounted By:  <none>

Events:
  Type       Reason                Age                    From                                                                              Message
  ----       ------                ----                   ----                                                                              -------
  Warning    ProvisioningFailed    7m3s                   ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
  Normal     Provisioning          3m15s (x5 over 7m10s)  ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  External provisioner is provisioning volume for claim "default/claim1"
  Warning    ProvisioningFailed    3m14s (x4 over 6m45s)  ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  failed to provision volume with StorageClass "rbd": missing Ceph monitors
  Normal     ExternalProvisioning  57s (x26 over 7m12s)   persistentvolume-controller                                                       waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Mounted By:  test-pod

Core issue: pv-sc --> fetching the secret fails --> the user is not allowed to get the resource
failed to provision volume with StorageClass "rbd":
 failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: 
 User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
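
A quick way to reproduce this kind of denial without waiting for the next provisioning retry is `kubectl auth can-i` with impersonation. The helper below is only a sketch; the function name is made up for illustration, and the caller needs permission to impersonate ServiceAccounts:

```shell
# Ask the API server whether a ServiceAccount may perform a verb on a resource.
# Usage: sa_can_i <sa-namespace> <sa-name> <verb> <resource> <target-namespace>
sa_can_i() {
  kubectl auth can-i "$3" "$4" \
    --as="system:serviceaccount:$1:$2" \
    --namespace="$5"
}

# The case from the event above; it should answer "no" while the permission is missing.
# sa_can_i default rbd-provisioner get secrets kube-system
```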

Create the same secret in default as the one in kube-system
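
One possible way to duplicate the secret is sketched here. It assumes the secret is an ordinary Secret object whose server-populated metadata can simply be dropped; the function name is hypothetical, and the `sed` filter is a rough text-level approach rather than a proper YAML transformation:

```shell
# Copy a Secret from one namespace to another.
# Usage: copy_secret <name> <from-namespace> <to-namespace>
# Server-populated metadata is stripped so the object can be created afresh.
copy_secret() {
  kubectl -n "$2" get secret "$1" -o yaml \
    | sed -e '/^  namespace:/d' \
          -e '/^  resourceVersion:/d' \
          -e '/^  uid:/d' \
          -e '/^  creationTimestamp:/d' \
          -e '/^  selfLink:/d' \
    | kubectl -n "$3" apply -f -
}

# copy_secret ceph-admin-secret kube-system default
```

Note that the StorageClass still has to point at the namespace where the copy lives (`adminSecretNamespace` / `userSecretNamespace`) for the copy to be picked up.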

  • Need to brush up on Kubernetes permissions (RBAC)
Events:
  Type       Reason                Age   From                                                                              Message
  ----       ------                ----  ----                                                                              -------
  Normal     Provisioning          10s   ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  External provisioner is provisioning volume for claim "default/claim1"
  Normal     ExternalProvisioning  10s   persistentvolume-controller                                                       waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
  Warning    ProvisioningFailed    10s   ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  failed to provision volume with StorageClass "rbd": missing Ceph monitors
Mounted By:  test-pod

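Core issue: according to a related issue, this may be a DNS resolution problem. Replace the Ceph address in the StorageClass with an IP so that no name resolution is needed.

Another problem: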

  Type       Reason                Age                  From                                                                              Message
  ----       ------                ----                 ----                                                                              -------
  Normal     Provisioning          5m56s (x7 over 21m)  ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  External provisioner is provisioning volume for claim "default/claim1"
  Warning    ProvisioningFailed    5m56s (x7 over 21m)  ceph.com/rbd_rbd-provisioner-db574c5c-r7bn8_2da3f67c-261a-11ea-862b-4a11e1eb43d6  failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
  Normal     ExternalProvisioning  64s (x83 over 21m)   persistentvolume-controller                                                       waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
Mounted By:  test-pod

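The permission error also keeps coming back from time to time

After changing the namespace of the provisioner Deployment, it would not start at all

Deploying the non-RBAC way

Environment changed to: Kubernetes 1.15.1

  1. Secrets in default
    1. admin
    2. kube
  2. Provisioner Deployment in default
  3. storageclass
  4. pvc
  5. pod

Everything looks normal, but the PVC stays Pending. Check the provisioner logs: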

 kubectl logs -f rbd-provisioner-5dfb574774-5tfcg

 ...
E1125 07:15:35.106667       1 leaderelection.go:234] error retrieving resource lock default/ceph.com-rbd: endpoints "ceph.com-rbd" is forbidden: User "system:serviceaccount:default:default" cannot get resource "endpoints" in API group "" in the namespace "default"

Related issue: deploying with RBAC is recommended

Deploying with RBAC

In the end it is still a permission problem

Events:
  Type     Reason                Age                  From                                                                               Message
  ----     ------                ----                 ----                                                                               -------
  Normal   Provisioning          50s (x6 over 8m35s)  ceph.com/rbd_rbd-provisioner-98b88f5d6-n8mk8_a21cced6-26d8-11ea-95e4-12902d150902  External provisioner is provisioning volume for claim "default/claim1"
  Warning  ProvisioningFailed    50s (x6 over 8m35s)  ceph.com/rbd_rbd-provisioner-98b88f5d6-n8mk8_a21cced6-26d8-11ea-95e4-12902d150902  failed to provision volume with StorageClass "rbd": failed to get admin secret from ["kube-system"/"ceph-admin-secret"]: secrets "ceph-admin-secret" is forbidden: User "system:serviceaccount:default:rbd-provisioner" cannot get resource "secrets" in API group "" in the namespace "kube-system"
  Normal   ExternalProvisioning  25s (x42 over 10m)   persistentvolume-controller                                                        waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator

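From a related issue: the ClusterRole is probably missing read permission on secrets

Even when the reported error is "missing Ceph monitors", the cause is still a permission problem

Edit clusterrole.yaml and add: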

  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create","get","list","watch"]
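
An alternative to editing the existing file is to grant the same rule through a separate, additive ClusterRole, which leaves the original rules untouched. This is a sketch of that variation, not what the notes actually did: the name `rbd-provisioner-secrets` is hypothetical, while the ServiceAccount `rbd-provisioner` in `default` is taken from the events above.

```shell
# Print an additive ClusterRole plus binding that lets the provisioner's
# ServiceAccount read secrets in any namespace.
# "rbd-provisioner-secrets" is a made-up name for illustration.
render_secret_reader_rbac() {
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rbd-provisioner-secrets
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rbd-provisioner-secrets
subjects:
  - kind: ServiceAccount
    name: rbd-provisioner
    namespace: default
roleRef:
  kind: ClusterRole
  name: rbd-provisioner-secrets
  apiGroup: rbac.authorization.k8s.io
EOF
}

# render_secret_reader_rbac | kubectl apply -f -
```

Cluster-wide access to secrets is broad; a Role and RoleBinding inside kube-system would be a narrower option if only that one namespace is needed.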

After redeploying, the PVC finally works

  • A remaining minor issue

describe pvc on the claim still shows the volumeMode as Filesystem

VolumeMode:    Filesystem

Starting the resize test

rbd supports resizing, which can be tested directly from the command line

[root@k8s-master01 kube]# rbd ls -p kube
kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392
kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7
[root@k8s-master01 kube]# kubectl get pvc
NAME     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim1   Bound    pvc-ca954649-175b-4668-a667-2336a2470a6c   1Gi        RWO            rbd            70m
[root@k8s-master01 kube]# rbd info kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7 -p kube
rbd image 'kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7':
	size 1024 MB in 256 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.1c1af6b8b4567
	format: 2
	features: layering
	flags: 
[root@k8s-master01 kube]# rbd info kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392 -p kube
rbd image 'kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392':
	size 1024 MB in 256 objects
	order 22 (4096 kB objects)
	block_name_prefix: rbd_data.1dc536b8b4567
	format: 2
	features: layering
	flags: 
[root@k8s-master01 kube]# rbd resize --size 2048 kubernetes-dynamic-pvc-d5d7f8b2-fbb8-11e9-b33f-2ae96f292ca7  -p kube
Resizing image: 100% complete...done.
[root@k8s-master01 kube]# rbd resize --size 2048 kubernetes-dynamic-pvc-c70dd221-26da-11ea-a08c-9e09d9def392  -p kube
Resizing image: 100% complete...done.

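What actually needs to be tested is automatic expansion of the rbd image created through the StorageClass: after the capacity of an existing PVC is edited, the rbd image should grow dynamically and automatically

Rough procedure

  1. Enable allowVolumeExpansion on the StorageClass
  2. Enable the Kubernetes admission controller
  3. Edit the PVC capacity => the image should expand automatically
  4. Check whether the mounted volume has grown => whether the filesystem has grown

Add to storageclass.yaml: allowVolumeExpansion: true

The complete YAML after the change: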

allowVolumeExpansion: true
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: rbd
provisioner: ceph.com/rbd
parameters:
  monitors: 10.20.9.22:6789
  pool: kube
  adminId: admin
  adminSecretNamespace: kube-system
  adminSecretName: ceph-admin-secret
  userId: kube
  userSecretNamespace: kube-system
  userSecretName: ceph-secret
  imageFormat: "2"
  imageFeatures: layering

Create the PVC, create the pod, edit the PVC
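
The edit step can also be done non-interactively with `kubectl patch`, which is easier to repeat during testing. A small sketch; the function name is hypothetical and the target size in the example is an assumed value:

```shell
# Request a new size for an existing PVC without opening an editor.
# Usage: resize_pvc <pvc-name> <new-size> [namespace]
resize_pvc() {
  kubectl -n "${3:-default}" patch pvc "$1" \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$2\"}}}}"
}

# Example: grow claim2 (the claim seen in the events) to an assumed 2Gi.
# resize_pvc claim2 2Gi
```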

Observed result: VolumeResizeFailed

Conditions:
  Type       Status  LastProbeTime                     LastTransitionTime                Reason  Message
  ----       ------  -----------------                 ------------------                ------  -------
  Resizing   True    Mon, 01 Jan 0001 00:00:00 +0000   Wed, 25 Dec 2019 20:13:14 +0800           
Events:
  Type     Reason                 Age                   From                                                                               Message
  ----     ------                 ----                  ----                                                                               -------
  Normal   ExternalProvisioning   21m                   persistentvolume-controller                                                        waiting for a volume to be created, either by external provisioner "ceph.com/rbd" or manually created by system administrator
  Normal   Provisioning           21m                   ceph.com/rbd_rbd-provisioner-98b88f5d6-vdl99_bc97b613-26da-11ea-a08c-9e09d9def392  External provisioner is provisioning volume for claim "default/claim2"
  Normal   ProvisioningSucceeded  21m                   ceph.com/rbd_rbd-provisioner-98b88f5d6-vdl99_bc97b613-26da-11ea-a08c-9e09d9def392  Successfully provisioned volume pvc-2d89a03c-ca23-401e-bdfa-3daea35b228f
  Warning  VolumeResizeFailed     9m22s (x18 over 20m)  volume_expand
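  • The rbd command is missing, so the controller-manager image still has to be replaced: change the image directly in /etc/kubernetes/manifests/kube-controller-manager.yaml, and the controller manager is restarted automatically (the kubelet watches the manifest)

  • Edit the PVC

  • The pod that mounts the PVC has to be restarted; afterwards the PVC shows the new capacity

  • Inside the pod, the size of the corresponding directory has changed as well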
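
The last two checks above can be bundled into one helper. This is a sketch under assumptions: the function name is made up, the mount path in the example is hypothetical, and the pod image is assumed to provide `df`:

```shell
# Compare the capacity the API reports with what the pod actually sees.
# Usage: check_resize <pvc-name> <pod-name> <mount-path>
check_resize() {
  echo "PVC capacity: $(kubectl get pvc "$1" -o jsonpath='{.status.capacity.storage}')"
  kubectl exec "$2" -- df -h "$3"
}

# Example; /mnt is an assumed mount path, not taken from the notes.
# check_resize claim2 test-pod /mnt
```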

TODO

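  1. Understand the external-storage code: it simply starts a controller

    1. Build the config from the command-line flags
    2. Create the client (clientset) from the config
    3. Obtain the provisioner name and provisioner id
    4. Start the provisioner: the logs show name=id="ceph.com/rbd"
    5. Build the complete provision controller and keep it running
  2. Understand the RBAC user definitions

    1. The ServiceAccount as defined has no permission to get secrets in namespace=kube-system
    2. This ServiceAccount is bound to a Role (which has secret permission, but only within namespace=default) and a ClusterRole (which has no secret permission)
    3. The rbd provisioner behind the rbd StorageClass runs as this ServiceAccount, in default
    4. Cross-namespace access is needed, so secret permission has to be added to the existing ClusterRole
  3. Test in a 1.12.6 environment

  4. Replace the IP in the StorageClass with a domain name and test

    1. It indeed does not work; additional name resolution has to be configured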
    Events:
    Type     Reason              Age   From                                                                               Message
    ----     ------              ----  ----                                                                               -------
    Warning  ProvisioningFailed  14s   ceph.com/rbd_rbd-provisioner-98b88f5d6-b77qb_4274cbd3-2733-11ea-a019-36c21d3844ee  failed to provision volume with StorageClass "rbd": failed to create rbd image: exit status 22, command output: did not load config file, using default settings.
    2019-12-25 16:56:16.028 7f431ab1e900 -1 Errors while parsing config file!
    2019-12-25 16:56:16.028 7f431ab1e900 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
    2019-12-25 16:56:16.028 7f431ab1e900 -1 parse_file: cannot open /root/.ceph/ceph.conf: (2) No such file or directory
    2019-12-25 16:56:16.028 7f431ab1e900 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
    2019-12-25 16:56:16.029 7f431ab1e900 -1 Errors while parsing config file!
    2019-12-25 16:56:16.029 7f431ab1e900 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
    2019-12-25 16:56:16.029 7f431ab1e900 -1 parse_file: cannot open /root/.ceph/ceph.conf: (2) No such file or directory
    2019-12-25 16:56:16.029 7f431ab1e900 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
    server name not found: ceph (Name or service not known)
    unable to parse addrs in 'ceph:6789'
    
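  5. Build a controller-manager image that includes ceph-common and use it to replace kube-controller-manager, which solves the missing rbd command

    1. It may also solve the resize failure
  6. Figure out why the video did not need any of these steps (using external-storage, or replacing the controller-manager)

  7. How to check the reclaim policy

    PersistentVolumes can have several reclaim policies, including "Retain", "Recycle", and "Delete". For dynamically provisioned PersistentVolumes, the default reclaim policy is "Delete". This means that a dynamically provisioned volume is deleted automatically when the user deletes the corresponding PersistentVolumeClaim. This automatic behavior may be inappropriate if the volume contains important data. In that case, the "Retain" policy is a better fit. With "Retain", if the user deletes a PersistentVolumeClaim, the corresponding PersistentVolume is not deleted. Instead, it moves to the Released state, in which all of its data can be recovered manually.

    Summary:

    1. Delete: when the PVC is deleted, the PV is deleted automatically
    2. Retain: when the PVC is deleted, the PV is kept
    3. Recycle
  8. Detailed usage of edit

  9. Difference between apply and create

  10. The original Dockerfile of controller-manager

  11. Why the non-RBAC deployment does not run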
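
For the reclaim-policy item in the list above (item 7), the policy of each PV can be read from `spec.persistentVolumeReclaimPolicy` and changed with `kubectl patch`. A brief sketch; the two function names are made up for illustration:

```shell
# List every PV with its reclaim policy and the claim it is bound to.
list_reclaim_policies() {
  kubectl get pv -o custom-columns=NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name
}

# Switch one PV to Retain so that deleting its PVC keeps the volume.
# Usage: retain_pv <pv-name>
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# list_reclaim_policies
# retain_pv pvc-ca954649-175b-4668-a667-2336a2470a6c
```

Patching an individual PV only affects that volume; the default for newly provisioned volumes comes from the `reclaimPolicy` field of the StorageClass.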