[TOC]

0x00 Overview

Description: when learning any new technology you will inevitably hit pitfalls. Once you get into the habit of writing them down, the next time the same problem shows up you can deal with it immediately.


0x01 Configuration Files and Startup Parameters

1. Kubelet Startup Parameters

Summary of startup parameters:

--register-node [Boolean] # whether the node automatically registers itself with the apiserver

/etc/kubernetes/kubelet.conf
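
A quick way to check which flags and which config file the kubelet on a node was actually started with (a sketch; unit and binary paths may differ by distribution):

systemctl cat kubelet.service   # shows the unit file plus drop-ins that carry the startup flags
ps -ef | grep kubelet           # shows the flags of the running process, e.g. --config=/var/lib/kubelet/config.yaml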

About the build environment

Depending on your situation you can separate the build environment from the deployment environment, for example:

When learning, follow this tutorial and build/push images on the Kubernetes master node
When developing, build and push images on your own laptop
At work, use a Jenkins Pipeline or gitlab-runner Pipeline to build and push images


K8s / containerd image garbage-collection (GC) parameter configuration
Reference: https://kubernetes.io/docs/concepts/architecture/garbage-collection/

$ vim /var/lib/kubelet/config.yaml
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
maxPods: 180 # maximum number of pods per node

Whenever the kubelet configuration is modified, apply it with the following:

systemctl daemon-reload
systemctl restart kubelet.service


0x02 Pitfalls and Fixes

Problem 1. Image pull failure when initializing the master node

Description: APISERVER_NAME must not be the master's hostname; it may only contain lowercase letters, digits and dots, and must not contain hyphens, e.g. export APISERVER_NAME=apiserver.weiyi
The subnet used for POD_SUBNET must not overlap with the subnet the master/worker nodes are on (CIDR: Classless Inter-Domain Routing), e.g. export POD_SUBNET=10.100.0.1/16
Solution:

# 1. If the kubernetes docker images cannot be downloaded, switch to a mirror repository and initialize manually
# --image-repository=mirrorgcrio
# --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers
~$ kubeadm config images list --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.19.6
registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.2
registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.13-0
registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.7.0


# 2. Check the environment variables
echo MASTER_IP=${MASTER_IP} && echo APISERVER_NAME=${APISERVER_NAME} && echo POD_SUBNET=${POD_SUBNET}

Tips: before re-initializing the master node, run kubeadm reset -f first.


Problem 2. Pods shown as Pending / ImagePullBackOff when checking master and pod status

Description:

  • 1. If the output shows ImagePullBackOff or a pod stays in Pending for a long time
    $kubectl get pods calico-node-4vql2 -n kube-system -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-node-4vql2 0/1 Pending[ImagePullBackoff ] 0 7m22s <none> node <none> <none>
    NAME READY STATUS RESTARTS AGE
    coredns-94d74667-6dj45 1/1 ImagePullBackOff 0 12m
    calico-node-4vql2 1/1 Pending 0 12m
    Solution:
    # (1) Use kubectl get pods to find which node the pod was scheduled to, and identify the container images it uses:
    kubectl get pods calico-node-4vql2 -n kube-system -o yaml | grep image:
    - image: calico/node:v3.13.1
    - image: calico/cni:v3.13.1
    - image: calico/pod2daemon-flexvol:v3.13.1

    kubectl get pods coredns-94d74667-6dj45 -n kube-system -o yaml | grep image:
    - image: registry.aliyuncs.com/google_containers/coredns:1.3.1

    # (2) Run docker pull directly on the node the pod was scheduled to (this also works when the node is NotReady, though it is not the only option)
    docker pull calico/node:v3.13.1
    docker pull calico/cni:v3.13.1
    docker pull calico/pod2daemon-flexvol:v3.13.1

    docker pull registry.aliyuncs.com/google_containers/coredns:1.3.1

    # (3) Then confirm on the master node that the status is back to normal
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    calico-node-4vql2 1/1 Running 0 36m 10.10.107.192 node <none> <none>

  • 2. A pod in the output stays in ContainerCreating, PodInitializing or Init:0/3 for a long time:
    Solution:
    # (1) Inspect the pod's status
    kubectl describe pods -n kube-system calico-node-4vql2
    kubectl describe pods -n kube-system coredns-8567978547-bmd9f

    # (2) If the last line of the output shows Pulling image, just wait patiently
    Normal Pulling 44s kubelet, k8s-worker-02 Pulling image "calico/pod2daemon-flexvol:v3.13.1"

    # (3) Delete the pod; the controller will automatically recreate a new one
    kubectl delete pod kube-flannel-ds-amd64-8l25c -n kube-system


Problem 3. Several cases where a worker node fails to join the cluster

  • 1. The worker node cannot reach the apiserver
    • If the master node can reach the apiserver but the worker cannot, check your own network settings: is /etc/hosts configured correctly? Is there a security-group or firewall restriction?
      # Verify on the master node
      curl -ik https://localhost:6443
      # Verify on the worker node
      curl -ik https://apiserver.weiyi:6443
      # A normal response looks like this:
      HTTP/1.1 403 Forbidden
      Cache-Control: no-cache, private
      Content-Type: application/json
      X-Content-Type-Options: nosniff
      Date: Fri, 15 Nov 2019 04:34:40 GMT
      Content-Length: 233
      {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
      ...
  • 2. The worker node's default network interface
    • The IP address used by the kubelet must be directly reachable from the master node (no NAT mapping), with no firewall or security-group isolation in between
  • 3. The token generated on the master node has expired (tokens are valid for 2 hours); create a new one with kubeadm token create, as sketched below
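
    A sketch of regenerating the join information on the master once the old token has expired:

    kubeadm token list                          # list existing tokens and their TTL
    kubeadm token create --print-join-command   # create a fresh token and print the full kubeadm join command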


Problem 4. Running kubectl on the master node fails with `localhost:8080 was refused`

Error message:

kubectl apply -f calico-3.13.1.yaml
The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cause: after initialization, /etc/kubernetes/admin.conf was not copied into the user's home directory as /root/.kube/config
Solution:
# (1) Set up the cluster access config for a regular user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# (2) Automatically set the KUBECONFIG environment variable and enable kubectl/kubeadm command completion
grep -q "export KUBECONFIG" ~/.profile || echo "export KUBECONFIG=$HOME/.kube/config" >> ~/.profile
tee -a ~/.profile <<'EOF'
source <(kubectl completion bash)
source <(kubeadm completion bash)
# source <(helm completion bash)
EOF
source ~/.profile

PS: if you run kubeadm as a regular user when joining or initializing the cluster, prefix the command with sudo (e.g. sudo kubeadm init ...) to elevate privileges; otherwise you will get [ERROR IsPrivilegedUser]: user is not running as root.


Problem 5. During K8s installation, kubelet reports `Container runtime network not ready`

Error message:

systemctl status kubelet
6月 23 09:04:02 master-01 kubelet[8085]: E0623 09:04:02.186893 8085 kubelet.go:2187] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady mes...ninitialized
6月 23 09:04:04 master-01 kubelet[8085]: W0623 09:04:04.938700 8085 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d

Cause: the master node's initialization failed and was then re-run without a reset (or with an incomplete reset); another common case is that no network add-on (e.g. flannel or calico) has been installed.
Solution: run the following commands to reset the node, then initialize it again.
systemctl stop kubelet
docker stop $(docker ps -aq)
docker rm -f $(docker ps -aq)
systemctl stop docker
kubeadm reset
rm -rf $HOME/.kube /etc/kubernetes
rm -rf /var/lib/cni/ /etc/cni/ /var/lib/kubelet/*
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
systemctl start docker
systemctl start kubelet

# Install the calico network plugin (non-HA)
rm -f calico-3.13.1.yaml
wget -L https://kuboard.cn/install-script/calico/calico-3.13.1.yaml
kubectl apply -f calico-3.13.1.yaml


Problem 6. kubeadm reset cannot reset the node and reports retrying of unary invoker failed

Error message:

[reset] Removing info for node "master-01" from the ConfigMap "kubeadm-config" in the "kube-system" Namespace
{"level":"warn","ts":"2020-06-23T09:10:30.074+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-174bf993-5731-4b29-9b30-7e958ade79a4/10.10.107.191:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}

Cause: the etcd container was still running before the reset, which prevented the node from being reset.
Solution: stop all containers and the docker service, then run the reset again.
docker stop $(docker ps -aq) && systemctl stop docker


Problem 7. Node initialization fails during preflight with `error execution phase preflight: [ERROR ImagePull]`

Description:

kubeadm init --config=kubeadm-config.yaml --upload-certs
[init] Using Kubernetes version: v1.18.4
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR ImagePull]: failed to pull image registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4: output: Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.18.4 not found: manifest unknown: manifest unknown
, error: exit status 1

Cause: images cannot be downloaded from the official k8s.gcr.io registry, and the mirror registry registry.cn-hangzhou.aliyuncs.com/google_containers/ does not contain the component images for the requested Kubernetes version.
Solution: try another mirror, or import the image archives into docker offline (see the earlier note 2-Kubernetes入门手动安装部署). Before running the command above, it is recommended to run kubeadm config images pull --image-repository mirrorgcrio --kubernetes-version=1.18.4 first to check whether the images can actually be pulled; see the pre-check sketch below the mirror list.
# Common k8s.gcr.io mirror registries
# gcr.azk8s.cn/google_containers/ # no longer available
registry.aliyuncs.com/google_containers/
registry.cn-hangzhou.aliyuncs.com/google_containers/

# k8s.gcr.io mirror in a harbor registry
mirrorgcrio
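
As suggested above, a quick pre-check (sketch) to confirm the chosen mirror actually carries the images before running kubeadm init:

kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers --kubernetes-version=1.18.4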


Problem 8. A Kubernetes Service cannot be pinged from inside a container

Description:

PING gateway-example.example.svc.cluster.local (10.105.141.232) 56(84) bytes of data.
From 172.17.76.171 (172.17.76.171) icmp_seq=1 Time to live exceeded
From 172.17.76.171 (172.17.76.171) icmp_seq=2 Time to live exceeded

Cause: in Kubernetes networking a Service simply cannot be pinged, because Kubernetes only assigns it a virtual IP address, implemented by one of the proxy modes: userspace, iptables or IPVS.
Whichever proxy mode is used, there is no real endpoint behind the Service IP that can answer ICMP (Internet Control Message Protocol); the Service can, however, still be reached over TCP with curl or telnet.
Solution:
$ kubectl cluster-info
# Kubernetes master is running at https://k8s.weiyigeek.top:6443
# KubeDNS is running at https://k8s.weiyigeek.top:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
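
Since ICMP will never get a reply, a sketch of verifying the Service over TCP instead (the IP and names come from the example output above; port 80 is an assumption, use the Service's real port):

kubectl get svc -n example gateway-example   # confirm the ClusterIP and port
curl -I http://10.105.141.232:80             # expect an HTTP status line instead of a ping reply
# or: telnet 10.105.141.232 80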


Problem 9. kubeadm init fails with [ERROR Swap]: running with swap on is not supported. Please disable swap

Error message:

$ sudo kubeadm init --config=/home/weiyigeek/k8s-init/kubeadm-init-config.yaml --upload-certs | tee kubeadm_init.log
[init] Using Kubernetes version: v1.19.6
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause: swap has not been disabled on the machine joining the cluster.
weiyigeek@weiyigeek-107:~$ free
total used free shared buff/cache available
Mem: 8151908 299900 7270588 956 581420 7600492
Swap: 4194300 0 4194300

Solution: disable the swap partition
sudo swapoff -a && sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab && free  # CentOS 
sudo swapoff -a && sudo sed -i 's/^\/swap.img\(.*\)$/#\/swap.img \1/g' /etc/fstab && free # Ubuntu
total used free shared buff/cache available
Mem: 8151908 304428 7260196 956 587284 7595204
Swap: 0 0 0


Problem 10. kubeadm init issue: coredns pods stay in Pending

  • Environment: OS: Ubuntu-20.04 / K8s: 1.19.3 / docker: 19.03.13 / flannel: v0.13.0
  • Error message: 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  • Cause: the flannel network plugin was not installed after kubeadm init completed
  • Fix: install and deploy the flannel network plugin
    $ kubectl get pod --all-namespaces
    # NAMESPACE NAME READY STATUS RESTARTS AGE
    # kube-system coredns-6c76c8bb89-8cgjz 0/1 Pending 0 99s
    # kube-system coredns-6c76c8bb89-wgbs9 0/1 Pending 0 99s

    $ kubectl describe pod -n kube-system coredns-6c76c8bb89-8cgjz
    ...
    # Events:
    # Type Reason Age From Message
    # ---- ------ ---- ---- -------
    # Warning FailedScheduling 39s (x2 over 39s) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.

    $ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    # podsecuritypolicy.policy/psp.flannel.unprivileged created
    # clusterrole.rbac.authorization.k8s.io/flannel created
    # clusterrolebinding.rbac.authorization.k8s.io/flannel created
    # serviceaccount/flannel created
    # configmap/kube-flannel-cfg created
    # daemonset.apps/kube-flannel-ds created

    $ kubectl get pod --all-namespaces
    # NAMESPACE NAME READY STATUS RESTARTS AGE
    # kube-system coredns-6c76c8bb89-8cgjz 1/1 Running 0 5m12s
    # kube-system coredns-6c76c8bb89-wgbs9 1/1 Running 0 5m12s

    $ kubectl get node
    # NAME STATUS ROLES AGE VERSION
    # ubuntu Ready master 30m v1.19.3


Problem 11. kubeadm init issue: coredns pods stay in ContainerCreating

  • Environment: OS: Ubuntu-20.04 / K8s: 1.19.3 / docker: 19.03.13 / flannel: v0.13.0
  • Error message: rpc error: code = Unknown desc = [failed to set up sandbox container "355.....4ec7" network for pod "coredns-": networkPlugin cni failed to set up pod "coredns-6c76c8bb89-6xgjl_kube-system"
  • Cause: the CNI network configuration used at kubeadm init time was wrong
  • Fix: re-run the kubeadm initialization and verify that serviceSubnet is 10.96.0.0/12.
    # Resource status
    weiyigeek@ubuntu:~$ kubectl get pod -n kube-system
    NAME READY STATUS RESTARTS AGE
    coredns-6c76c8bb89-87zh7 0/1 ContainerCreating 0 18h
    coredns-6c76c8bb89-p68x8 0/1 ContainerCreating 0 18h
    etcd-ubuntu 1/1 Running 0 18h
    kube-apiserver-ubuntu 1/1 Running 0 18h
    kube-controller-manager-ubuntu 1/1 Running 0 18h
    kube-proxy-22t2f 1/1 Running 0 17h
    kube-proxy-wcjrv 1/1 Running 0 18h
    kube-scheduler-ubuntu 1/1 Running 0 18h

    # Delete the pods so they are rebuilt
    weiyigeek@ubuntu:~$ kubectl delete pod -n kube-system coredns-6c76c8bb89-87zh7 coredns-6c76c8bb89-p68x8
    pod "coredns-6c76c8bb89-87zh7" deleted
    pod "coredns-6c76c8bb89-p68x8" deleted


Problem 12. Cluster IP unreachable: dial tcp 10.96.0.1:443: connect: no route to host

Error message:

dial tcp 10.96.0.1:443: i/o timeout
dial tcp 10.96.0.1:443: connect: no route to host

Causes:

  • the coredns pod has not started properly
  • the calico network plugin is not installed, or the calico-kube-controllers pod has not started properly
    Solution: look at the corresponding error messages and fix them accordingly;

    ~$ kubectl get pod -n kube-system | grep -E "calico|coredns"

    ~$ curl http://10.96.0.1:443
    Client sent an HTTP request to an HTTPS server.


Problem 13. kubeadm init reports ERROR Swap and WARNING IsDockerSystemdCheck

Error message:

sudo kubeadm init
[init] Using Kubernetes version: v1.21.0
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

Cause: ERROR Swap means the swap partition has not been disabled on this OS; the WARNING IsDockerSystemdCheck warning means the Docker cgroup driver is not systemd.

Solution:

# 1. Disable the swap partition
sudo swapoff -a && sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab && free # CentOS
sudo swapoff -a && sudo sed -i 's/^\/swap.img\(.*\)$/#\/swap.img \1/g' /etc/fstab && free #Ubuntu

# 2. Change docker's cgroup driver to systemd
cat /etc/docker/daemon.json
{
  "registry-mirrors": [
    "https://registry.cn-hangzhou.aliyuncs.com"
  ],
  "max-concurrent-downloads": 10,
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "storage-driver": "overlay2",
  "insecure-registries": ["harbor.weiyigeek", "harbor.weiyi", "harbor.cloud"],
  "data-root":"/home/data/docker/"
}
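
The new cgroup driver only takes effect after Docker is restarted; a sketch of applying and verifying the change:

sudo systemctl daemon-reload && sudo systemctl restart docker
docker info | grep -i "cgroup driver"   # should now report: Cgroup Driver: systemd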


Problem 14. Required images for the k8s master cannot be pulled

Error message:

Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0 not found: manifest unknown: manifest unknown
Error response from daemon: No such image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0
Error: No such image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0

Solution: search for the image on Docker Hub (https://hub.docker.com/), pull it from there, and re-tag it.
docker pull coredns/coredns:1.8.0
docker tag coredns/coredns:1.8.0 registry.cn-hangzhou.aliyuncs.com/google_containers/coredns/coredns:v1.8.0

Problem 15. Pod stuck in Pending

Description: Pending means the pod has not yet been scheduled to a node. Use kubectl describe pod <pod-name> to view the pod's events and work out why it was not scheduled.

  • Error message:
    $ kubectl describe pod mypod
    ...
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning FailedScheduling 12s (x6 over 27s) default-scheduler 0/4 nodes are available: 2 Insufficient cpu.
  • Possible causes (see the sketch after this list):
    - Insufficient resources: no node in the cluster satisfies the pod's CPU, memory, GPU or ephemeral-storage requests. Delete unused pods or add new nodes.
    - The requested HostPort is already in use; it is usually better to expose the port through a Service instead
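
    A minimal sketch of exposing a port through a Service instead of a HostPort (all names and ports are placeholders, not taken from the pod above):

    apiVersion: v1
    kind: Service
    metadata:
      name: myapp-svc
    spec:
      type: NodePort        # allocates a port from the node-port range instead of pinning a fixed HostPort
      selector:
        app: myapp          # must match the pod's labels
      ports:
        - port: 80          # port exposed on the ClusterIP
          targetPort: 8080  # container port inside the pod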


Problem 16. Pod stuck in Waiting or ContainerCreating

Description: as before, start by viewing the pod's events with kubectl describe pod

  • Error message:
    $ kubectl -n kube-system describe pod nginx-pod
    # Events:
    # Type Reason Age From Message
    # ---- ------ ---- ---- -------
    # Normal Scheduled 1m default-scheduler Successfully assigned nginx-pod to node1
    # Normal SuccessfulMountVolume 1m kubelet, gpu13 MountVolume.SetUp succeeded for volume "config-volume"
    # Normal SuccessfulMountVolume 1m kubelet, gpu13 MountVolume.SetUp succeeded for volume "coredns-token-sxdmc"
    # Warning FailedSync 2s (x4 over 46s) kubelet, gpu13 Error syncing pod
    # Normal SandboxChanged 1s (x4 over 46s) kubelet, gpu13 Pod sandbox changed, it will be killed and re-created.
  • Cause: the cni0 bridge was configured with an IP address from a different subnet; deleting the bridge (the network plugin recreates it automatically) fixes the problem
    # The pod's sandbox container cannot start; the exact reason is in the kubelet log:
    $ journalctl -u kubelet
    ...
    Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649912 29801 cni.go:294] Error adding network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
    Mar 14 04:22:04 node1 kubelet[29801]: E0314 04:22:04.649941 29801 cni.go:243] Error while adding to cni network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
    Mar 14 04:22:04 node1 kubelet[29801]: W0314 04:22:04.891337 29801 cni.go:258] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "c4fd616cde0e7052c240173541b8543f746e75c17744872aa04fe06f52b5141c"
    Mar 14 04:22:05 node1 kubelet[29801]: E0314 04:22:05.965801 29801 remote_runtime.go:91] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod "nginx-pod" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.4.1/24
  • Solution:
    $ ip link set cni0 down
    $ brctl delbr cni0
  • Other possible causes:
    Image pull failure, for example:
      a wrong image name was configured
      the kubelet cannot reach the registry (accessing gcr.io from mainland China needs special handling)
      the pull secret for a private registry is misconfigured
      the image is too large and the pull times out (consider tuning the kubelet's --image-pull-progress-deadline and --runtime-request-timeout options)
    CNI network errors; check the CNI plugin configuration, for example:
      the pod network cannot be configured
      an IP address cannot be allocated
    The container itself cannot start; check whether the correct image was built and whether the container arguments are correct


Problem 17. Pod in ImagePullBackOff

Description: this is usually caused by a wrong image name or a misconfigured pull secret for a private registry. Run docker pull <image> on the node to verify whether the image can actually be pulled.

  • Error message:
    $ kubectl describe pod mypod
    ...
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Normal Scheduled 36s default-scheduler
    Normal Pulling 17s (x2 over 33s) kubelet, k8s-agentpool1-38622806-0 pulling image "a1pine"
    Warning Failed 14s (x2 over 29s) kubelet, k8s-agentpool1-38622806-0 Failed to pull image "a1pine": rpc error: code = Unknown desc = Error response from daemon: repository a1pine not found: does not exist or no pull access
    Warning Failed 14s (x2 over 29s) kubelet, k8s-agentpool1-38622806-0 Error: ErrImagePull
    Normal SandboxChanged 4s (x7 over 28s) kubelet, k8s-agentpool1-38622806-0 Pod sandbox changed, it will be killed and re-created.
    Normal BackOff 4s (x5 over 25s) kubelet, k8s-agentpool1-38622806-0 Back-off pulling image "a1pine"
    Warning Failed 1s (x6 over 25s) kubelet, k8s-agentpool1-38622806-0 Error: ImagePullBackOff
  • Solution:
    # 1. For a private image, first create a Secret of type docker-registry
    kubectl create secret docker-registry my-secret --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL

    # 2. Then reference the Secret in the pod spec
    spec:
      containers:
      - name: private-reg-container
        image: <your-private-image>
      imagePullSecrets:
      - name: my-secret


Problem 18. Pod stuck in CrashLoopBackOff

Description: CrashLoopBackOff means the container did start but then exited abnormally. The pod's restart count is usually greater than 0 at this point; start by checking the container's logs, for example as sketched below.
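
A sketch of pulling the logs, including those of the previous (crashed) container instance (the pod name is a placeholder):

kubectl logs mypod              # logs of the current container
kubectl logs mypod --previous   # logs of the last terminated instance, often where the crash reason is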

  • Causes:
    * the container process exited
    * the health check failed and the container was killed
    * OOMKilled
    $ kubectl describe pod mypod
    ...
    Containers:
    sh:
    Container ID: docker://3f7a2ee0e7e0e16c22090a25f9b6e42b5c06ec049405bc34d3aa183060eb4906
    Image: alpine
    Image ID: docker-pullable://alpine@sha256:7b848083f93822dd21b0a2f14a110bd99f6efb4b838d499df6d04a49d0debf8b
    Port: <none>
    Host Port: <none>
    State: Terminated
    Reason: OOMKilled
    Exit Code: 2
    Last State: Terminated
    Reason: OOMKilled
    Exit Code: 2
    Ready: False
    Restart Count: 3
    Limits:
    cpu: 1
    memory: 1G
    Requests:
    cpu: 100m
    memory: 500M
    ...
    * If there is still no clue at this point, exec into the container and run commands to dig further into the exit reason
    kubectl exec cassandra -- cat /var/log/cassandra/system.log
    * If that still yields nothing, SSH into the node the pod runs on and check the kubelet or docker logs
    # Query Node
    kubectl get pod <pod-name> -o wide

    # SSH to Node
    ssh <username>@<node-name>


Problem 19. Pod in Error state

An Error state usually means something went wrong while the pod was starting. Common causes include:

A referenced ConfigMap, Secret or PV does not exist

The requested resources exceed limits set by the administrator, for example a LimitRange

The pod violates a cluster security policy, for example a PodSecurityPolicy

The container is not authorized to operate on cluster resources; with RBAC enabled, the ServiceAccount needs a proper role binding (see the sketch below)
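
A minimal sketch of granting a ServiceAccount read access to pods in a namespace (all names are placeholders):

kubectl create role pod-reader --verb=get --verb=list --verb=watch --resource=pods -n myns
kubectl create rolebinding pod-reader-binding --role=pod-reader --serviceaccount=myns:mysa -n myns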


Problem 20. Pod in Terminating or Unknown state

Starting with v1.5, Kubernetes no longer deletes the pods running on a node just because the node has lost contact; instead it marks them as Terminating or Unknown. There are three ways to remove pods in these states:

Delete the Node from the cluster. On public clouds, kube-controller-manager removes the corresponding Node automatically after the VM is deleted; in bare-metal clusters the administrator has to delete the Node manually (e.g. kubectl delete node <node-name>).

The node recovers. The kubelet re-syncs with kube-apiserver to confirm the expected state of these pods and then decides whether to delete them or keep running them.

Force deletion by the user. Run kubectl delete pods <pod> --grace-period=0 --force to force-delete the pod. Unless you know for certain that the pod has really stopped (for example the node's VM or physical machine is powered off), this is not recommended; in particular, force-deleting pods managed by a StatefulSet can easily lead to split-brain or data loss.

If the kubelet itself runs as a Docker container, you may find errors like the following in its log:

{"log":"E0926 19:59:39.977461   54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}
{"log":"E0926 19:59:39.977461 54420 nestedpendingoperations.go:262] Operation for \"\\\"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\\\" (\\\"30f3ffec-a29f-11e7-b693-246e9607517c\\\")\" failed. No retries permitted until 2017-09-26 19:59:41.977419403 +0800 CST (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume \"default-token-6tpnm\" (UniqueName: \"kubernetes.io/secret/30f3ffec-a29f-11e7-b693-246e9607517c-default-token-6tpnm\") pod \"30f3ffec-a29f-11e7-b693-246e9607517c\" (UID: \"30f3ffec-a29f-11e7-b693-246e9607517c\") : remove /var/lib/kubelet/pods/30f3ffec-a29f-11e7-b693-246e9607517c/volumes/kubernetes.io~secret/default-token-6tpnm: device or resource busy\n","stream":"stderr","time":"2017-09-26T11:59:39.977728079Z"}

If that is the case, the kubelet container needs the --containerized flag and the following volume mounts:
# Example for the calico network plugin
-v /:/rootfs:ro,shared \
-v /sys:/sys:ro \
-v /dev:/dev:rw \
-v /var/log:/var/log:rw \
-v /run/calico/:/run/calico/:rw \
-v /run/docker/:/run/docker/:rw \
-v /run/docker.sock:/run/docker.sock:rw \
-v /usr/lib/os-release:/etc/os-release \
-v /usr/share/ca-certificates/:/etc/ssl/certs \
-v /var/lib/docker/:/var/lib/docker:rw,shared \
-v /var/lib/kubelet/:/var/lib/kubelet:rw,shared \
-v /etc/kubernetes/ssl/:/etc/kubernetes/ssl/ \
-v /etc/kubernetes/config/:/etc/kubernetes/config/ \
-v /etc/cni/net.d/:/etc/cni/net.d/ \
-v /opt/cni/bin/:/opt/cni/bin/ \

Pods in the Terminating state are normally removed automatically once the kubelet is running properly again. Occasionally they cannot be deleted even with kubectl delete pods --grace-period=0 --force; this is usually caused by finalizers, and removing the finalizers with kubectl edit resolves it (a non-interactive alternative is sketched after the snippet below).
"finalizers": [
"foregroundDeletion"
]
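
A non-interactive alternative to kubectl edit for clearing the finalizers (pod name and namespace are placeholders):

kubectl patch pod mypod -n myns --type=merge -p '{"metadata":{"finalizers":null}}'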

Problem 21. A static pod is not recreated after its manifest is modified

The kubelet uses inotify to watch the static pod manifests in /etc/kubernetes/manifests (configurable via the kubelet's --pod-manifest-path option) and recreates the corresponding pod when a file changes. Occasionally the new pod is not created after the manifest is modified; a simple fix is to restart the kubelet (see the sketch below).
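
A sketch of two ways to nudge the kubelet when a static pod change is not picked up (the manifest name is just an example; the move-out-and-back trick is a commonly used workaround, not from the original text):

sudo systemctl restart kubelet
# or move the manifest out of the watched directory and back after a short pause:
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/ && sleep 20 && sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/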

Problem 22. Namespace stuck in Terminating

A namespace stuck in Terminating usually has one of two causes:

Resources in the namespace are still being deleted

The namespace's finalizers were not cleaned up properly

For the first cause, list every resource still left in the namespace:
kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n $NAMESPACE

For the second cause, the namespace's finalizer list has to be cleared manually:
kubectl get namespaces $NAMESPACE -o json | jq '.spec.finalizers=[]' > /tmp/ns.json
kubectl proxy &
curl -k -H "Content-Type: application/json" -X PUT --data-binary @/tmp/ns.json http://127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize

Problem 23. Pod fails to create: setting cgroup config for cpu.cfs_quota_us fails with invalid argument

 Warning  Failed            2m19s (x4 over 3m4s)  kubelet            Error: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: process_linux.go:508: setting cgroup config for procHooks process caused: failed to write "107374182400000": write /sys/fs/cgroup/cpu,cpuacct/system.slice/containerd.service/kubepods-burstable-pod6e586bca_1fd9_412a_892c_a77b38d7f3ec.slice:cri-containerd:app/cpu.cfs_quota_us: invalid argument: unknown


resources:
  requests:
    memory: "512Mi"
    cpu: "100m"
  limits:
    memory: "2048Mi"
    cpu: "1000m"

Cause: the cpu limit had been written with a Gi suffix, but cpu has no Gi unit; the kubelet therefore computed the absurd cfs_quota value 107374182400000, which the container runtime rejects. Specify cpu in millicores (m) or whole cores, as in the corrected resources block above.


Additional Problem Notes

  • Problem 1. MountVolume.SetUp failed for volume "default-token-zglkd" : failed to sync secret cache: timed out waiting for the condition
    Reproduction:
    ~/K8s/Day8/demo7$ kubectl get pod
    NAME READY STATUS RESTARTS AGE
    web-pvc-demo-0 0/1 ContainerCreating 0 58s

    ~/K8s/Day8/demo7$ kubectl describe pod web-pvc-demo-0
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Normal Scheduled 2m11s default-scheduler Successfully assigned default/web-pvc-demo-0 to k8s-node-5
    Warning FailedMount 2m11s kubelet MountVolume.SetUp failed for volume "default-token-zglkd" : failed to sync secret cache: timed out waiting for the condition
    Warning FailedMount 9s kubelet Unable to attach or mount volumes: unmounted volumes=[diskpv], unattached volumes=[diskpv default-token-zglkd]: timed out waiting for the condition
    Cause: Kubernetes caches MountVolume state, so a PV whose binding has been deleted cannot simply be mounted again.
    Solution: delete the PV and the PVC that fail to mount; if that still does not help, restart the cluster.

  • Problem 2. With NFS dynamic provisioning, a newly created PVC stays in Pending, waiting for a volume to be created either by the external provisioner "fuseim.pri/ifs" or manually by the system administrator
    Environment: k8s (v1.23.1)
    Reproduction: kubectl describe shows waiting for a volume to be created, either by external provisioner "fuseim.pri/ifs" or manually created by system administrator; in addition, kubectl logs on the nfs-client-provisioner pod shows unexpected error getting claim reference: selfLink was empty, can't make reference.
    Cause: the SelfLink field of ObjectMeta and ListMeta was deprecated in v1.16 and disabled by default from v1.20 onward (it can still be restored via a flag).
    Solution: on the k8s master, find the kube-apiserver.yaml file and add - --feature-gates=RemoveSelfLink=false to its command arguments (or add the flag to the apiserver's systemd unit), then restart.
    /etc/kubernetes/manifests/kube-apiserver.yaml
    - --feature-gates=RemoveSelfLink=false
  • Problem 3. Kubelet startup error: Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
    Reproduction:
    E0704 15:20:03.875017    7912 kubelet.go:1292] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
    E0704 15:20:03.920105 7912 kubelet.go:1853] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
    Solution: since the cgroup driver in use is systemd, upgrade systemd.
    yum update -y

Joining an additional host to the k8s cluster fails with failure loading certificate for CA: couldn't load the certificate file

Error message: when setting up cluster high availability and joining another master to the existing one, the following error appears.

failure loading certificate for CA: couldn't load the certificate file /etc/kubernetes/pki/ca.crt: open /etc/kubernetes/pki/ca.crt: no such file or directory

Cause: the new node does not have the CA certificates from the cluster's pki directory.
Solution:

scp -rp /etc/kubernetes/pki/ca.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/sa.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/front-proxy-ca.* master02:/etc/kubernetes/pki
scp -rp /etc/kubernetes/pki/etcd/ca.* master02:/etc/kubernetes/pki/etcd
scp -rp /etc/kubernetes/admin.conf master02:/etc/kubernetes
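
Once the certificates are in place, the second master joins as a control-plane node; a sketch (the token and hash are placeholders printed by kubeadm init or kubeadm token create --print-join-command; since the certificates were copied manually, no --certificate-key is needed):

kubeadm join apiserver.weiyi:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane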


0x03 FAQ