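On a newly created k8s cluster, pushing container images to Harbor failed intermittently. The Harbor Pods were all running normally and the service itself showed no errors. Because Harbor is exposed externally through an Ingress, I suspected the problem was in nginx-ingress-controller.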

The logs of the nginx-ingress-controller Pod looked like this:

$ kubectl logs -n kube-system nginx-ingress-controller-d5d6d6954-vb8wt
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.21.0
  Build:      git-b65b85cd9
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

nginx version: nginx/1.15.6
W0329 08:20:25.976376       9 client_config.go:548] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0329 08:20:25.976784       9 main.go:196] Creating API client for https://10.96.0.1:443
I0329 08:20:25.991435       9 main.go:240] Running in Kubernetes cluster version v1.11+ (v1.11.6) - git (clean) commit 612bdcbaa5ce6967727036073ae81414a0d25af8 - platform linux/amd64
I0329 08:20:25.993591       9 main.go:101] Validated kube-system/default-http-backend as the default backend.
I0329 08:20:26.240013       9 nginx.go:258] Starting NGINX Ingress controller
I0329 08:20:26.246790       9 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"udp-services", UID:"7682fb09-9064-11eb-824d-ecc89cd7753a", APIVersion:"v1", ResourceVersion:"867", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap kube-system/udp-services
I0329 08:20:26.246896       9 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"tcp-services", UID:"76819994-9064-11eb-824d-ecc89cd7753a", APIVersion:"v1", ResourceVersion:"866", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap kube-system/tcp-services
I0329 08:20:26.246945       9 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"nginx-configuration", UID:"767ff18c-9064-11eb-824d-ecc89cd7753a", APIVersion:"v1", ResourceVersion:"2879", FieldPath:""}): type: 'Normal' reason: 'CREATE' ConfigMap kube-system/nginx-configuration
I0329 08:20:27.342725       9 event.go:221] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"kube-system", Name:"harbor-ingress", UID:"7694f15b-9064-11eb-824d-ecc89cd7753a", APIVersion:"extensions/v1beta1", ResourceVersion:"903", FieldPath:""}): type: 'Normal' reason: 'CREATE' Ingress kube-system/harbor-ingress
I0329 08:20:27.344309       9 backend_ssl.go:68] Adding Secret "kube-system/secret-tls" to the local store
I0329 08:20:27.440598       9 nginx.go:279] Starting NGINX process
I0329 08:20:27.440772       9 leaderelection.go:187] attempting to acquire leader lease  kube-system/ingress-controller-leader-nginx...
I0329 08:20:27.441920       9 controller.go:172] Configuration changes detected, backend reload required.
I0329 08:20:27.443592       9 status.go:148] new leader elected: nginx-ingress-controller-d5d6d6954-pt64p
I0329 08:20:27.530325       9 backend_ssl.go:189] Updating local copy of SSL certificate "kube-system/secret-tls" with missing intermediate CA certs
I0329 08:20:28.214064       9 controller.go:190] Backend successfully reloaded.
I0329 08:20:28.214120       9 controller.go:202] Initial sync, sleeping for 1 second.
[29/Mar/2021:08:20:29 +0000] TCP 200 0 0 0.001
I0329 08:20:31.324847       9 controller.go:172] Configuration changes detected, backend reload required.
I0329 08:20:31.921507       9 controller.go:190] Backend successfully reloaded.
[29/Mar/2021:08:20:31 +0000] TCP 200 0 0 0.000
2021/03/29 08:20:31 [alert] 8520#8520: socketpair() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8544#8544: eventfd() failed (24: Too many open files)
2021/03/29 08:20:31 [alert] 8544#8544: socketpair() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8607#8607: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8638#8638: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8651#8651: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8678#8678: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8679#8679: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8680#8680: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8681#8681: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8682#8682: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8683#8683: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8684#8684: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8685#8685: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8686#8686: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8687#8687: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8688#8688: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8689#8689: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8690#8690: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8691#8691: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8692#8692: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:31 [emerg] 8693#8693: epoll_create() failed (24: Too many open files)
2021/03/29 08:20:32 [alert] 49#49: worker process 8607 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8638 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8651 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8678 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8679 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8680 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8681 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8682 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8683 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8684 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8685 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8686 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8687 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8688 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8689 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8690 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8691 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8692 exited with fatal code 2 and cannot be respawned
2021/03/29 08:20:32 [alert] 49#49: worker process 8693 exited with fatal code 2 and cannot be respawned

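The log contains a large number of 24: Too many open files errors. Error 24 (EMFILE) means a process has reached its limit on open file descriptors, so the next step was to check the maximum number of open files allowed on the node where nginx runs: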

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514376
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514376
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The limit is 1024. Other clusters that work normally have exactly the same setting, so why did the problem show up only on this cluster?
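One caveat: ulimit -a run in a login shell reports the limit of that shell, which may differ from the limit applied to the nginx processes inside the container. A minimal sketch for reading the limit of a specific process from its /proc/<pid>/limits file follows; the helper name soft_nofile_limit is made up for this example and is not part of any tool.

```shell
# Hypothetical helper: print the soft "Max open files" limit from a
# /proc/<pid>/limits-style file passed as the first argument.
# The relevant line looks like:
#   Max open files            1024                 4096                 files
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Usage on the node (replace <pid> with the PID of the nginx master process):
#   soft_nofile_limit /proc/<pid>/limits
```

If the value printed here matches the ulimit output, the comparison with the other clusters holds for the nginx processes as well.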

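Looking at the nginx processes on the node revealed an unusually large number of nginx worker processes. Because the worker-processes option was not configured, nginx starts as many workers as the node has CPUs. lscpu showed that this node has 128 CPUs, so there were a great many workers, enough to exhaust the maximum number of open files allowed.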
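The comparison between worker count and CPU count can be sketched as below. The function name count_nginx_workers is an assumption for illustration; it only counts lines of ps output that start with the nginx worker process title.

```shell
# Hypothetical helper: count nginx worker processes in `ps -eo args`
# output read from stdin. Anchoring the pattern at the start of the line
# keeps the grep process itself out of the count. Workers that are
# shutting down are counted too, since their title has the same prefix.
count_nginx_workers() {
    grep -c '^nginx: worker process' || true
}

# Usage on the node:
#   ps -eo args | count_nginx_workers   # number of nginx workers
#   nproc                               # number of CPUs available
```

When worker-processes is left unset, the two numbers should be roughly equal per nginx instance, which is what was observed on the 128-CPU node.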

The fix is to reduce the number of nginx workers by changing the nginx settings in the ingress ConfigMap.

$ kubectl edit cm -n kube-system nginx-configuration
# Add the setting below under data; choose the worker count to suit your environment.
# Reference: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#worker-processes
worker-processes: "8"
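As an alternative to editing interactively, the same key could be set with kubectl patch. This is only a sketch: the helper worker_patch is a made-up name that builds the merge-patch JSON, and the ConfigMap name and namespace are the ones used in this post.

```shell
# Hypothetical helper: build a JSON merge patch that sets the
# worker-processes key in a ConfigMap's data. ConfigMap values must be
# strings, so the number is quoted.
worker_patch() {
    printf '{"data":{"worker-processes":"%s"}}' "$1"
}

# Usage (assumes the ConfigMap name and namespace from this post):
#   kubectl patch configmap nginx-configuration -n kube-system \
#       --type merge -p "$(worker_patch 8)"
```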

After the change, recreate the nginx-ingress-controller pods. The log no longer prints any 24: Too many open files messages, and pushing images to Harbor works normally.
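A rough way to carry out and verify this step is sketched below. The helper count_emfile is a made-up name, and <pod-name> is a placeholder for the actual controller pod names; it assumes the pods are managed by a controller that recreates them after deletion.

```shell
# Hypothetical helper: count log lines read from stdin that report
# the "Too many open files" error.
count_emfile() {
    grep -c 'Too many open files' || true
}

# Usage:
#   # Delete the old pod so that it is recreated with the new setting.
#   kubectl delete pod -n kube-system <pod-name>
#   # Check the log of the new pod; a count of 0 means the error is gone.
#   kubectl logs -n kube-system <pod-name> | count_emfile
```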