The customer was running a customized system from Sangfor, based on a modified CentOS. After half a day of troubleshooting, the culprit turned out to be a corrupted file, not a Docker problem.
Environment information

Docker info:
$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.09.3
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 125.3GiB
Name: eds-1f21a854
ID: VZLV:26PU:ZUN6:QQEJ:GW3I:YETT:HMEU:CK6J:SIPL:CHKV:6ASN:5NDF
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 reg.wps.lan:5000
 treg.yun.wps.cn
 127.0.0.0/8
Registry Mirrors:
 https://registry.docker-cn.com/
 https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine
System information:
$ uname -a
Linux eds-1f21a854 3.10.0 #1 SMP Mon Sep 28 12:00:30 CST 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME='CentOS Linux'
VERSION='7 (Core)'
ID='centos'
ID_LIKE='rhel fedora'
VERSION_ID='7'
PRETTY_NAME='CentOS Linux 7 (Core)'
ANSI_COLOR='0;31'
CPE_NAME='cpe:/o:centos:centos:7'
HOME_URL='https://www.centos.org/'
BUG_REPORT_URL='https://bugs.centos.org/'
CENTOS_MANTISBT_PROJECT='CentOS-7'
CENTOS_MANTISBT_PROJECT_VERSION='7'
REDHAT_SUPPORT_PRODUCT='centos'
REDHAT_SUPPORT_PRODUCT_VERSION='7'
Server information:
$ cat product_name
SUGON-60G16
$ cat sys_vendor
SANGFOR
$ cat product_version
1.2

Troubleshooting process
Server "went down" while installing Docker
Time: 2020-10-29 19:51:
Deployment engineer: While installing Docker during deployment at the customer site, the server went down.
Me: After it came back up, is there anything in /var/log/messages?
Deployment engineer: We can only get in by restoring a snapshot; the server won't boot, so there's no way to see the logs.
Me: It won't come up at all without restoring the snapshot?
Deployment engineer: Right.
At this point I assumed some kernel bug had been triggered and the machine had kernel panicked, leaving the server unable to boot.
Time: 2020-10-30 9:07:
Me: When it wouldn't boot, did anyone get on the console to see why?
Deployment engineer: It's the customer's server, so we couldn't check.
Me: Didn't the customer take a look?
Then the deployment engineer simply sent me a Sunlogin (向日葵) remote session. Once connected I found it wasn't a stock OS but a modified CentOS, and I couldn't find /var/log/messages, so I ran our Docker install script by hand:
bash -x install-docker.sh
The output stopped at a certain step, so the machine had apparently "crashed" again. Checking the script, the step right after the last debug line was starting Docker, so starting Docker was presumably the trigger. Long afterwards I still couldn't connect to or ping the machine, so I asked the engineer to have the on-site staff check whether the physical server had iDRAC, iLO, or similar out-of-band management so we could see the tty console output.
The on-site staff reported the server had "booted normally", yet I still couldn't connect. They asked whether our operation had changed the routing. On site, systemctl showed Docker was running, but the machine still couldn't ping its gateway. It suddenly occurred to me that maybe it had never gone down at all...
I asked them to run uptime -s to check the last boot time; it turned out the machine had never rebooted...
On-site troubleshooting then traced it to iptables: starting Docker had wiped out their existing rules. They later adjusted the rules to allow the traffic through. So the earlier "server went down when starting Docker" was really a network outage caused by iptables; the machine never rebooted.
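A quick way to confirm this kind of thing is to dump the rule set before and after starting Docker and diff the two. Sketched below with two stand-in text files, since the real commands (iptables-save, then systemctl start docker, then iptables-save again) need root on the affected box; the rule contents here are illustrative, not the customer's actual rules.

```shell
# Hypothetical check: capture iptables rules before/after starting Docker,
# then diff. Two small files stand in for real iptables-save output.
printf -- '-A INPUT -j ACCEPT\n' > /tmp/rules.before
printf -- '-A INPUT -j ACCEPT\n-A FORWARD -j DOCKER\n' > /tmp/rules.after
diff /tmp/rules.before /tmp/rules.after || echo "rules changed after docker start"
```

On the real machine, any line present before but missing after the daemon start is a rule Docker clobbered.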
Crash when loading the container images
Moving on: the deployment engineer said that on other machines of the same kind, this problem hadn't appeared during Docker installation but at startup instead. I ran the deployment by hand and it failed; with -x debugging enabled, the script errored while loading the deployment images:
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: EOF
Running it manually:
$ docker load -i ./kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer
journalctl showed no related logs from the docker daemon at all. Searching for the error, some people said the docker group for /var/run/docker.sock didn't exist; others fixed it with a blunt chmod 777. Neither worked here. Next, run the daemon in the foreground in debug mode to look for anything useful:
systemctl stop docker
pkill dockerd
dockerd -D
In another terminal, run the image load:
$ docker load -i ./kube/images/deploy.tar
ab6425526dab: Loading layer [==================================================>] 126.3MB/126.3MB
c7fe3ea715ef: Loading layer [==================================================>] 340.3MB/340.3MB
7f7eae7934f7: Loading layer [==================================================>] 3.584kB/3.584kB
e99a66706b50: Loading layer [==================================================>] 2.105MB/2.105MB
245b79de6bb7: Loading layer [==================================================>] 283.5MB/283.5MB
2a56a4432cef: Loading layer [==================================================>] 93.18kB/93.18kB
0c2ea71066ab: Loading layer [==================================================>] 276.5kB/276.5kB
bb3f6df0f87c: Loading layer [==================================================>] 82.94kB/82.94kB
6f34ead3cef7: Loading layer [==================================================>] 946.7kB/946.7kB
c21422dd15f6: Loading layer [==================================================>] 1.97MB/1.97MB
940288517f4c: Loading layer [==================================================>] 458.2kB/458.2kB
0c52f1af61f4: Loading layer [==================================================>] 5.12kB/5.12kB
049e7edd48bc: Loading layer [==================================================>] 1.57MB/1.57MB
73307245d702: Loading layer [==================================================>] 5.632kB/5.632kB
f109309d8ffc: Loading layer [==================================================>] 2.175MB/2.175MB
Loaded image: xxxxxxxxxxxx.cn/platform/deploy-amd64:ubuntu.16.04
$ docker images
REPOSITORY                              TAG            IMAGE ID       CREATED        SIZE
xxxxxxxxxxxx.cn/platform/deploy-amd64   ubuntu.16.04   3ad94a76af5d   3 months ago   734MB
With the daemon in the foreground, the debug log looked perfectly normal:
...
DEBU[2020-10-30T14:48:39.955963986+08:00] Applied tar sha256:049e7edd48bc46e3dd5edf89c9caa8f0f7efbb41af403c5a54dd4f1008f604a7 to d58edd0d97bb672ef40e82e45c1603ca3ceaad847d9b9fc7c9b0588087019649, size: 1518278
DEBU[2020-10-30T14:48:39.960091040+08:00] Applying tar in /data/kube/docker/overlay2/b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18/diff storage-driver=overlay2
DEBU[2020-10-30T14:48:40.059510528+08:00] Applied tar sha256:73307245d7021f9627ca0b2cbfeab3aac0b65abfd476f6ec26bb92c75892d7e2 to b044bd592ae800ed071208c6b2f650c5cbdc7452702f56a23b9b4ffe4236ac18, size: 3284
DEBU[2020-10-30T14:48:40.063040538+08:00] Applying tar in /data/kube/docker/overlay2/03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f/diff storage-driver=overlay2
DEBU[2020-10-30T14:48:40.148209852+08:00] Applied tar sha256:f109309d8ffcb76589ad6389e80335d986b411c80122d990ab00a02a3a916e3e to 03918b1d275aa284532b8b9c59ca158409416f904e13cc7084c598ed343e844f, size: 2072803
^CINFO[2020-10-30T14:48:55.593523177+08:00] Processing signal 'interrupt'
DEBU[2020-10-30T14:48:55.593617229+08:00] daemon configured with a 15 seconds minimum shutdown timeout
DEBU[2020-10-30T14:48:55.593638628+08:00] start clean shutdown of all containers with a 15 seconds timeout...
DEBU[2020-10-30T14:48:55.594074457+08:00] Unix socket /run/docker/libnetwork/ebd15186e86385c48c4c5508d5f30eb83d5d74e56f09af5c82b6d6d9d63ec8b8.sock doesn't exist. cannot accept client connections
DEBU[2020-10-30T14:48:55.594106623+08:00] Cleaning up old mountid : start.
INFO[2020-10-30T14:48:55.594157536+08:00] stopping event stream following graceful shutdown  error="<nil>" module=libcontainerd namespace=moby
DEBU[2020-10-30T14:48:55.594343122+08:00] Cleaning up old mountid : done.
DEBU[2020-10-30T14:48:55.594501828+08:00] Clean shutdown succeeded
INFO[2020-10-30T14:48:55.594520918+08:00] stopping healthcheck following graceful shutdown  module=libcontainerd
INFO[2020-10-30T14:48:55.594531978+08:00] stopping event stream following graceful shutdown  error="context canceled" module=libcontainerd namespace=plugins.moby
DEBU[2020-10-30T14:48:55.594603119+08:00] received signal  signal=terminated
INFO[2020-10-30T14:48:55.594739890+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, TRANSIENT_FAILURE  module=grpc
INFO[2020-10-30T14:48:55.594751465+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201a61b0, CONNECTING  module=grpc
The systemd unit config had nothing unusual in it, which was baffling: why would the load succeed only when the daemon ran in the foreground? Out of other ideas, I suspected a socket problem and tried forwarding the unix socket over TCP with socat, which still failed. (In hindsight I should have added a TCP listener on 127.0.0.1 to the daemon itself instead of going through the socket; socat ultimately still goes through the unix socket.)
$ socat -d -d TCP-LISTEN:2375,fork,bind=127.0.0.1 UNIX:/var/run/docker.sock
2020/10/30 17:39:58 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N accepting connection from AF=2 127.0.0.1:35370 on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[5201] N forked off child process 11501
2020/10/30 17:42:06 socat[5201] N listening on AF=2 127.0.0.1:2375
2020/10/30 17:42:06 socat[11501] N opening connection to AF=1 '/var/run/docker.sock'
2020/10/30 17:42:06 socat[11501] N successfully connected from local address AF=1 '000000 x0Dx@7xE9xECx7E000x010000'
2020/10/30 17:42:06 socat[11501] N starting data transfer loop with FDs [6,6] and [5,5]
2020/10/30 17:42:12 socat[11501] E write(5, 0x55f098349920, 8192): Broken pipe
2020/10/30 17:42:12 socat[11501] N exit(1)
2020/10/30 17:42:12 socat[5201] N childdied(): handling signal 17

$ docker --log-level debug -H tcp://127.0.0.1:2375 load -i kube/images/deploy.tar
c7fe3ea715ef: Loading layer [==========================================>        ] 286.9MB/340.3MB
unexpected EOF
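The better test suggested in the parenthetical above would have been to make the daemon itself listen on loopback TCP, taking the unix socket out of the picture entirely. One way is the hosts key in /etc/docker/daemon.json; this fragment is a sketch (note that on systemd installs whose unit file already passes -H to dockerd, the flag and the config key conflict and the daemon refuses to start, so the -H line must be dropped from the unit first):

```json
{
  "debug": true,
  "hosts": ["unix:///var/run/docker.sock", "tcp://127.0.0.1:2375"]
}
```

With that in place, `docker -H tcp://127.0.0.1:2375 load -i …` exercises the daemon over TCP directly rather than through a socat relay.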
This dragged on for quite a while. I was busy and went off to look at another customer's issue; when I came back, on a whim I tried loading the other images, and they all worked...
$ docker load -i kube/images/pause_3.1.tar
e17133b79956: Loading layer [==================================================>] 744.4kB/744.4kB
Loaded image: mirrorgooglecontainers/pause-amd64:3.1
$ docker load -i kube/images/tiller_v2.16.1.tar
77cae8ab23bf: Loading layer [==================================================>] 5.815MB/5.815MB
679105aa33fb: Loading layer [==================================================>] 6.184MB/6.184MB
639eab5d05b1: Loading layer [==================================================>] 40.46MB/40.46MB
87e5687e03f2: Loading layer [==================================================>] 41.13MB/41.13MB
Loaded image: gcr.io/kubernetes-helm/tiller:v2.16.1
$ docker load -i kube/images/calico_v3.1.3.tar
cd7100a72410: Loading layer [==================================================>] 4.403MB/4.403MB
ddc4cb8dae60: Loading layer [==================================================>] 7.84MB/7.84MB
77087b8943a2: Loading layer [==================================================>] 249.3kB/249.3kB
c7227c83afaf: Loading layer [==================================================>] 4.801MB/4.801MB
2e0e333a66b6: Loading layer [==================================================>] 231.8MB/231.8MB
Loaded image: calico/node:v3.1.3
2580685bfb60: Loading layer [==================================================>] 50.84MB/50.84MB
Loaded image: calico/kube-controllers:v3.1.3
0314be9edf00: Loading layer [==================================================>] 1.36MB/1.36MB
15db169413e5: Loading layer [==================================================>] 28.05MB/28.05MB
4252efcc5013: Loading layer [==================================================>] 2.818MB/2.818MB
76cf2496cf36: Loading layer [==================================================>] 3.03MB/3.03MB
91d3d3a16862: Loading layer [==================================================>] 2.995MB/2.995MB
18a58488ba3b: Loading layer [==================================================>] 3.474MB/3.474MB
8d8197f49da2: Loading layer [==================================================>] 27.34MB/27.34MB
7520364e0845: Loading layer [==================================================>] 9.216kB/9.216kB
b9d064622bd6: Loading layer [==================================================>] 2.56kB/2.56kB
Loaded image: calico/cni:v3.1.3
Only this one image failed to load:
$ docker load -i kube/images/deploy.tar
error during connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.39/images/load?quiet=0: read unix @->/var/run/docker.sock: read: connection reset by peer
Then I compared the file's checksum against the copy on the machine that produced the package, and it didn't match...
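The comparison itself is just a digest computed on both ends (sha256sum or md5sum against the tarball on the build machine and on site). A minimal illustration of why this catches the problem, using a throwaway file with one flipped byte standing in for the real deploy.tar:

```shell
# Compute a digest, corrupt one byte, recompute: the digests diverge
# completely, so comparing checksums across machines reveals corruption.
printf 'layer-data' > /tmp/deploy-demo.tar
good=$(sha256sum /tmp/deploy-demo.tar | awk '{print $1}')
printf 'layer-datX' > /tmp/deploy-demo.tar    # simulate in-transit corruption
bad=$(sha256sum /tmp/deploy-demo.tar | awk '{print $1}')
[ "$good" != "$bad" ] && echo "checksum mismatch: file corrupted in transit"
```

Doing this check right after copying artifacts to site would have saved most of this debugging session.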
Summary

One question remains: why did the load succeed when the daemon ran in the foreground? And when importing the corrupted file, why did the docker daemon log nothing at all and just reset the connection? I haven't tested this scenario on newer versions.
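For what it's worth, a damaged archive can often be caught before docker load ever reaches the daemon, because tar can list the archive without extracting it. A sketch, assuming GNU tar and using junk data in place of the real tarball:

```shell
# Dry-run the archive listing; a truncated or garbage tar fails here,
# so this check can gate 'docker load'. Junk data simulates corruption.
printf 'definitely not a tar archive' > /tmp/broken-demo.tar
if ! tar -tf /tmp/broken-demo.tar >/dev/null 2>&1; then
    echo "archive is corrupt, refusing to docker load"
fi
```

Note this only validates tar structure; a bit flip inside a layer blob can still pass, which is why the checksum comparison above is the authoritative check.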