#kubernetes How to debug an application that fails to start on a k8s cluster

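No application's development goes smoothly all the time, so knowing how to debug is an important skill.

On a k8s cluster, all kinds of errors can still appear even when you deploy a mature service step by step from its documentation.

For example, I recently ran into a number of problems while deploying ELK by following the documentation. This article uses that deployment as an example to show how to debug and track down the cause of an error.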

1. List the pods

kubectl get pods -n <namespace>

Example

kubectl get pod -n k8s-logging

NAME                      READY   STATUS                  RESTARTS   AGE
es-logging-es-default-0   0/1     Init:CrashLoopBackOff   7          14m
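
In a namespace with many pods, the failing ones are easy to miss. The following is a minimal sketch, not part of the original deployment: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -n <namespace>` (NAME, READY, STATUS, ...) shown above.

```shell
# Hypothetical helper: reads `kubectl get pods -n <namespace>` output on stdin
# and keeps the header plus every pod that is not fully ready.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR == 1 { print; next }             # keep the header row
       $3 == "Completed" { next }          # finished jobs are not failures
       { split($2, ready, "/") }           # READY looks like "0/1"
       $3 != "Running" || ready[1] != ready[2] { print }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -n k8s-logging | unhealthy_pods
```

With the sample output above, only the header and the `es-logging-es-default-0` row would pass through the filter.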

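2. View the pod's details

Use the following command to view the details of a pod in a failed state:

kubectl describe pod <pod name> -n <namespace>

The output includes the pod's Events list, which shows the events recorded while the pod was running.

Sometimes the Events list already contains the detailed cause of the error.

Example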

kubectl describe pod es-logging-es-default-0 -n k8s-logging

Name:         es-logging-es-default-0
Namespace:    k8s-logging
Priority:     0
Node:         k8s-node-02/192.168.1.15
Start Time:   Fri, 14 May 2021 02:35:49 +0000
Labels:       common.k8s.elastic.co/type=elasticsearch
              controller-revision-hash=es-logging-es-default-7ffcbbf5
              elasticsearch.k8s.elastic.co/cluster-name=es-logging
              elasticsearch.k8s.elastic.co/config-hash=1754400308
              elasticsearch.k8s.elastic.co/http-scheme=https
              elasticsearch.k8s.elastic.co/node-data=true
              elasticsearch.k8s.elastic.co/node-ingest=true
              elasticsearch.k8s.elastic.co/node-master=true
              elasticsearch.k8s.elastic.co/node-ml=true
              elasticsearch.k8s.elastic.co/node-remote_cluster_client=true
              elasticsearch.k8s.elastic.co/node-transform=true
              elasticsearch.k8s.elastic.co/node-voting_only=false
              elasticsearch.k8s.elastic.co/statefulset-name=es-logging-es-default
              elasticsearch.k8s.elastic.co/version=7.12.1
              statefulset.kubernetes.io/pod-name=es-logging-es-default-0
Annotations:  co.elastic.logs/module: elasticsearch
              update.k8s.elastic.co/timestamp: 2021-05-14T02:35:52.442769569Z
Status:       Pending
IP:           10.244.2.114
IPs:
  IP:           10.244.2.114
Controlled By:  StatefulSet/es-logging-es-default
Init Containers:
  elastic-internal-init-filesystem:
    Container ID:  docker://ef9155f4c23f12cc95baf4eb56256497ef6495355c0c2a87adfd4c8973686855
    Image:         docker.elastic.co/elasticsearch/elasticsearch:7.12.1
    Image ID:      docker-pullable://docker.elastic.co/elasticsearch/elasticsearch@sha256:561bf27aa989803bfbac48ebd48e32daadb4215cf7940c599a62c13f225427fa
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
      /mnt/elastic-internal/scripts/prepare-fs.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 14 May 2021 02:56:55 +0000
      Finished:     Fri, 14 May 2021 02:56:55 +0000
    Ready:          False
    Restart Count:  9
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_IP:                  (v1:status.podIP)
      POD_NAME:               es-logging-es-default-0 (v1:metadata.name)
      NODE_NAME:               (v1:spec.nodeName)
      NAMESPACE:              k8s-logging (v1:metadata.namespace)
      HEADLESS_SERVICE_NAME:  es-logging-es-default
    Mounts:
      /mnt/elastic-internal/downward-api from downward-api (ro)
      /mnt/elastic-internal/elasticsearch-bin-local from elastic-internal-elasticsearch-bin-local (rw)
      /mnt/elastic-internal/elasticsearch-config from elastic-internal-elasticsearch-config (ro)
      /mnt/elastic-internal/elasticsearch-config-local from elastic-internal-elasticsearch-config-local (rw)
      /mnt/elastic-internal/elasticsearch-plugins-local from elastic-internal-elasticsearch-plugins-local (rw)
      /mnt/elastic-internal/probe-user from elastic-internal-probe-user (ro)
      /mnt/elastic-internal/scripts from elastic-internal-scripts (ro)
      /mnt/elastic-internal/transport-certificates from elastic-internal-transport-certificates (ro)
      /mnt/elastic-internal/unicast-hosts from elastic-internal-unicast-hosts (ro)
      /mnt/elastic-internal/xpack-file-realm from elastic-internal-xpack-file-realm (ro)
      /usr/share/elasticsearch/config/http-certs from elastic-internal-http-certificates (ro)
      /usr/share/elasticsearch/config/transport-remote-certs/ from elastic-internal-remote-certificate-authorities (ro)
      /usr/share/elasticsearch/data from elasticsearch-data (rw)
      /usr/share/elasticsearch/logs from elasticsearch-logs (rw)
Containers:
  elasticsearch:
    Container ID:
    Image:          docker.elastic.co/elasticsearch/elasticsearch:7.12.1
    Image ID:
    Ports:          9200/TCP, 9300/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2Gi
    Requests:
      memory:   2Gi
    Readiness:  exec [bash -c /mnt/elastic-internal/scripts/readiness-probe-script.sh] delay=10s timeout=5s period=5s #success=1 #failure=3
    Environment:
      POD_IP:                     (v1:status.podIP)
      POD_NAME:                  es-logging-es-default-0 (v1:metadata.name)
      NODE_NAME:                  (v1:spec.nodeName)
      NAMESPACE:                 k8s-logging (v1:metadata.namespace)
      PROBE_PASSWORD_PATH:       /mnt/elastic-internal/probe-user/elastic-internal-probe
      PROBE_USERNAME:            elastic-internal-probe
      READINESS_PROBE_PROTOCOL:  https
      HEADLESS_SERVICE_NAME:     es-logging-es-default
      NSS_SDB_USE_CACHE:         no
    Mounts:
      /mnt/elastic-internal/downward-api from downward-api (ro)
      /mnt/elastic-internal/elasticsearch-config from elastic-internal-elasticsearch-config (ro)
      /mnt/elastic-internal/probe-user from elastic-internal-probe-user (ro)
      /mnt/elastic-internal/scripts from elastic-internal-scripts (ro)
      /mnt/elastic-internal/unicast-hosts from elastic-internal-unicast-hosts (ro)
      /mnt/elastic-internal/xpack-file-realm from elastic-internal-xpack-file-realm (ro)
      /usr/share/elasticsearch/bin from elastic-internal-elasticsearch-bin-local (rw)
      /usr/share/elasticsearch/config from elastic-internal-elasticsearch-config-local (rw)
      /usr/share/elasticsearch/config/http-certs from elastic-internal-http-certificates (ro)
      /usr/share/elasticsearch/config/transport-certs from elastic-internal-transport-certificates (ro)
      /usr/share/elasticsearch/config/transport-remote-certs/ from elastic-internal-remote-certificate-authorities (ro)
      /usr/share/elasticsearch/data from elasticsearch-data (rw)
      /usr/share/elasticsearch/logs from elasticsearch-logs (rw)
      /usr/share/elasticsearch/plugins from elastic-internal-elasticsearch-plugins-local (rw)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  elasticsearch-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  elasticsearch-data-es-logging-es-default-0
    ReadOnly:   false
  downward-api:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
  elastic-internal-elasticsearch-bin-local:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  elastic-internal-elasticsearch-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  es-logging-es-default-es-config
    Optional:    false
  elastic-internal-elasticsearch-config-local:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  elastic-internal-elasticsearch-plugins-local:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  elastic-internal-http-certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  es-logging-es-http-certs-internal
    Optional:    false
  elastic-internal-probe-user:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  es-logging-es-internal-users
    Optional:    false
  elastic-internal-remote-certificate-authorities:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  es-logging-es-remote-ca
    Optional:    false
  elastic-internal-scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      es-logging-es-scripts
    Optional:  false
  elastic-internal-transport-certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  es-logging-es-default-es-transport-certs
    Optional:    false
  elastic-internal-unicast-hosts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      es-logging-es-unicast-hosts
    Optional:  false
  elastic-internal-xpack-file-realm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  es-logging-es-xpack-file-realm
    Optional:    false
  elasticsearch-logs:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  24m                   default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         23m                   default-scheduler  Successfully assigned k8s-logging/es-logging-es-default-0 to k8s-node-02
  Normal   Pulled            22m (x5 over 23m)     kubelet            Container image "docker.elastic.co/elasticsearch/elasticsearch:7.12.1" already present on machine
  Normal   Created           22m (x5 over 23m)     kubelet            Created container elastic-internal-init-filesystem
  Normal   Started           22m (x5 over 23m)     kubelet            Started container elastic-internal-init-filesystem
  Warning  BackOff           3m57s (x92 over 23m)  kubelet            Back-off restarting failed container

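As the example shows, the last warning recorded for this pod is:

Warning  BackOff           3m57s (x92 over 23m)  kubelet            Back-off restarting failed container

Unfortunately, the Events list here carries only a brief message: the container failed to start and is being restarted with back-off.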
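
The describe output is long, and the Events table sits at the very end. As a rough sketch, the Warning rows can be pulled out with a small filter. `pod_warnings` is a hypothetical helper name, and the filter assumes the layout shown above, where the table follows a line starting with `Events:`.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# prints only the Warning rows of its Events table.
pod_warnings() {
  awk '/^Events:/ { in_events = 1; next }   # the table starts after this line
       in_events && $1 == "Warning" { print }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl describe pod es-logging-es-default-0 -n k8s-logging | pod_warnings
```

An alternative that avoids parsing human-readable output is `kubectl get events -n <namespace> --field-selector involvedObject.name=<pod name>`, which lists the events recorded for a single object.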

3. View the pod's logs

When the Events list does not show a detailed error, check the pod's logs to locate it:

kubectl logs <pod name> -n <namespace>

Example

kubectl logs -n k8s-logging es-logging-es-default-0

Error from server (BadRequest): container "elasticsearch" in pod "es-logging-es-default-0" is waiting to start: PodInitializing

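The message shows that the pod's default container is elasticsearch and that it is still waiting for pod initialization to finish, so it has no logs yet.

In other words, the failing container is not the default one.

4. View the logs of the specific container

In that case, view the logs of a specific container: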

kubectl logs <pod name> -n <namespace> -c <container>

Example

From the pod details above, find the Init Containers list.

In this example there is only one init container, elastic-internal-init-filesystem, which matches what the Events list shows.
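
Instead of reading the init container names out of the describe output, they can also be queried directly. The sketch below is one possible approach, not the method used in this walkthrough: `init_container_names` and `init_container_logs` are hypothetical helper names, and the 20-line tail is an arbitrary choice.

```shell
# Hypothetical helper: print the names of a pod's init containers, one per line.
# Usage: init_container_names <pod name> <namespace>
init_container_names() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.initContainers[*]}{.name}{"\n"}{end}'
}

# Hypothetical helper: print the last 20 log lines of every init container.
# Usage: init_container_logs <pod name> <namespace>
init_container_logs() {
  init_container_names "$1" "$2" | while read -r name; do
    echo "== init container: $name =="
    # stdin is redirected so the loop's input is never consumed here
    kubectl logs "$1" -n "$2" -c "$name" --tail=20 < /dev/null
  done
}

# Usage (assumes kubectl is configured for the cluster):
#   init_container_logs es-logging-es-default-0 k8s-logging
```

For a pod with a single init container this does the same as the manual command that follows, but it saves a lookup when a pod has several.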

kubectl logs -c elastic-internal-init-filesystem -n k8s-logging es-logging-es-default-0

# Log output follows
Starting init script
Linking /mnt/elastic-internal/xpack-file-realm/users to /usr/share/elasticsearch/config/users
Linking /mnt/elastic-internal/xpack-file-realm/roles.yml to /usr/share/elasticsearch/config/roles.yml
Linking /mnt/elastic-internal/xpack-file-realm/users_roles to /usr/share/elasticsearch/config/users_roles
Linking /mnt/elastic-internal/elasticsearch-config/elasticsearch.yml to /usr/share/elasticsearch/config/elasticsearch.yml
Linking /mnt/elastic-internal/unicast-hosts/unicast_hosts.txt to /usr/share/elasticsearch/config/unicast_hosts.txt
File linking duration: 0 sec.
Copying /usr/share/elasticsearch/config/* to /mnt/elastic-internal/elasticsearch-config-local/
removed '/mnt/elastic-internal/elasticsearch-config-local/elasticsearch.yml'
'/usr/share/elasticsearch/config/elasticsearch.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/elasticsearch.yml'
'/usr/share/elasticsearch/config/http-certs/..2021_05_14_02_35_50.501721750/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2021_05_14_02_35_50.501721750/ca.crt'
'/usr/share/elasticsearch/config/http-certs/..2021_05_14_02_35_50.501721750/tls.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2021_05_14_02_35_50.501721750/tls.crt'
'/usr/share/elasticsearch/config/http-certs/..2021_05_14_02_35_50.501721750/tls.key' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..2021_05_14_02_35_50.501721750/tls.key'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt'
'/usr/share/elasticsearch/config/http-certs/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/ca.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt'
'/usr/share/elasticsearch/config/http-certs/tls.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key'
'/usr/share/elasticsearch/config/http-certs/tls.key' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/tls.key'
removed '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data'
'/usr/share/elasticsearch/config/http-certs/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/http-certs/..data'
'/usr/share/elasticsearch/config/jvm.options' -> '/mnt/elastic-internal/elasticsearch-config-local/jvm.options'
'/usr/share/elasticsearch/config/log4j2.file.properties' -> '/mnt/elastic-internal/elasticsearch-config-local/log4j2.file.properties'
'/usr/share/elasticsearch/config/log4j2.properties' -> '/mnt/elastic-internal/elasticsearch-config-local/log4j2.properties'
'/usr/share/elasticsearch/config/role_mapping.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/role_mapping.yml'
removed '/mnt/elastic-internal/elasticsearch-config-local/roles.yml'
'/usr/share/elasticsearch/config/roles.yml' -> '/mnt/elastic-internal/elasticsearch-config-local/roles.yml'
'/usr/share/elasticsearch/config/transport-remote-certs/..2021_05_14_02_35_50.623420157/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..2021_05_14_02_35_50.623420157/ca.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt'
'/usr/share/elasticsearch/config/transport-remote-certs/ca.crt' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/ca.crt'
removed '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data'
'/usr/share/elasticsearch/config/transport-remote-certs/..data' -> '/mnt/elastic-internal/elasticsearch-config-local/transport-remote-certs/..data'
removed '/mnt/elastic-internal/elasticsearch-config-local/unicast_hosts.txt'
'/usr/share/elasticsearch/config/unicast_hosts.txt' -> '/mnt/elastic-internal/elasticsearch-config-local/unicast_hosts.txt'
removed '/mnt/elastic-internal/elasticsearch-config-local/users'
'/usr/share/elasticsearch/config/users' -> '/mnt/elastic-internal/elasticsearch-config-local/users'
removed '/mnt/elastic-internal/elasticsearch-config-local/users_roles'
'/usr/share/elasticsearch/config/users_roles' -> '/mnt/elastic-internal/elasticsearch-config-local/users_roles'
Empty dir /usr/share/elasticsearch/plugins
Copying /usr/share/elasticsearch/bin/* to /mnt/elastic-internal/elasticsearch-bin-local/
'/usr/share/elasticsearch/bin/elasticsearch' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch'
'/usr/share/elasticsearch/bin/elasticsearch-certgen' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-certgen'
'/usr/share/elasticsearch/bin/elasticsearch-certutil' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-certutil'
'/usr/share/elasticsearch/bin/elasticsearch-cli' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-cli'
'/usr/share/elasticsearch/bin/elasticsearch-croneval' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-croneval'
'/usr/share/elasticsearch/bin/elasticsearch-env' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-env'
'/usr/share/elasticsearch/bin/elasticsearch-env-from-file' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-env-from-file'
'/usr/share/elasticsearch/bin/elasticsearch-keystore' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-keystore'
'/usr/share/elasticsearch/bin/elasticsearch-migrate' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-migrate'
'/usr/share/elasticsearch/bin/elasticsearch-node' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-node'
'/usr/share/elasticsearch/bin/elasticsearch-plugin' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-plugin'
'/usr/share/elasticsearch/bin/elasticsearch-saml-metadata' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-saml-metadata'
'/usr/share/elasticsearch/bin/elasticsearch-setup-passwords' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-setup-passwords'
'/usr/share/elasticsearch/bin/elasticsearch-shard' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-shard'
'/usr/share/elasticsearch/bin/elasticsearch-sql-cli' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-sql-cli'
'/usr/share/elasticsearch/bin/elasticsearch-sql-cli-7.12.1.jar' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-sql-cli-7.12.1.jar'
'/usr/share/elasticsearch/bin/elasticsearch-syskeygen' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-syskeygen'
'/usr/share/elasticsearch/bin/elasticsearch-users' -> '/mnt/elastic-internal/elasticsearch-bin-local/elasticsearch-users'
'/usr/share/elasticsearch/bin/x-pack-env' -> '/mnt/elastic-internal/elasticsearch-bin-local/x-pack-env'
'/usr/share/elasticsearch/bin/x-pack-security-env' -> '/mnt/elastic-internal/elasticsearch-bin-local/x-pack-security-env'
'/usr/share/elasticsearch/bin/x-pack-watcher-env' -> '/mnt/elastic-internal/elasticsearch-bin-local/x-pack-watcher-env'
Files copy duration: 0 sec.
chowning /usr/share/elasticsearch/data to elasticsearch:elasticsearch
chown: changing ownership of '/usr/share/elasticsearch/data': Operation not permitted
failed to change ownership of '/usr/share/elasticsearch/data' from 1024:users to elasticsearch:elasticsearch

The cause of the failure appears at the end of the log:

failed to change ownership of '/usr/share/elasticsearch/data' from 1024:users to elasticsearch:elasticsearch

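From here, investigate the cause based on that error. In this case the init script could not change the ownership of the data directory, which currently belongs to 1024:users, so the permissions on the mounted data volume are the place to look.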
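
Because the useful lines are buried under a long list of copy operations, it can help to filter the log before reading it. This is a minimal sketch: `likely_errors` is a hypothetical helper name and the keyword list is a guess that should be adapted to the application.

```shell
# Hypothetical helper: reads container logs on stdin and prints the last
# few lines that look like errors (default: 5 lines).
# The keyword list is an assumption, not an exhaustive set.
likely_errors() {
  grep -iE 'error|fail|denied|not permitted' | tail -n "${1:-5}"
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl logs es-logging-es-default-0 -n k8s-logging \
#     -c elastic-internal-init-filesystem | likely_errors
```

When the container has just restarted and the current log is empty, adding `--previous` to `kubectl logs` shows the log of the previous run instead.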

More debugging methods

https://kubernetes.io/zh/docs/tasks/debug-application-cluster/debug-running-pod/