Troubleshooting a Strange 502 from the Nginx Proxy to the Backend

Published: 2020-08-19    Category: Ops

The Problem

While load-testing an API of a new internal service, QA kept getting 502 responses. The backend service received no requests, and neither did ingress-nginx, so suspicion fell on the front-most nginx.

Investigation

  1. The basic request chain of the service:

OpenResty (nginx) ->> nginx ingress ->> backend server

The relevant nginx configuration:


upstream pool_k8s_http {
    #ip_hash;
    server 10.10.2.208:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.199:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.2:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.207:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.191:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.192:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.193:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.194:80 weight=1 max_fails=2 fail_timeout=5s;
    server 10.10.2.195:80 weight=1 max_fails=2 fail_timeout=5s;
    keepalive 1024;
}
....... some configuration omitted .......
location /
{
    #add_header Access-Control-Allow-Origin *;
    proxy_next_upstream http_502 http_504 error timeout invalid_header;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header REMOTE-HOST $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_pass http://pool_k8s_http;
}

.... some configuration omitted .....
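A side note on the keepalive 1024 directive in the upstream block: nginx only reuses upstream connections when it speaks HTTP/1.1 to the upstream and clears the Connection header. Those directives are not visible in the excerpt above (they may be part of the omitted configuration); a minimal sketch of what the location block would additionally need:

    # assumption: not shown in the excerpt above; without these, nginx talks
    # HTTP/1.0 to the upstream and the keepalive pool is effectively unused
    proxy_http_version 1.1;
    proxy_set_header Connection "";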

The nginx access (info) log is shown below. The status code is 502, and the backend ingress IP is not recorded at all; only the upstream name pool_k8s_http shows up. Something is clearly wrong here, so let's look at the nginx error log.

[16/Jul/2020:18:29:48 +0800] INFO xxxxxxx.cn 502 154 "0.000"  78456579 1471 "112.xxx.13.xxx" pool_k8s_http  "POST /api/toolkit/qr/mapping HTTP/1.1" "-" -   0.000 -  - - 112.253.13.xxx  Apache-HttpClient/4.5.7 (Java/1.8.0_77)  [db_log=1; CASPRIVACY=; org_id=640; noriental=8e62fd9c-9909-461b-9b54-b31b0b224bc1; TGC=eyJhbGciOiJIUzUxMiJ9.WlhsS2FHSkhZMmxQYVVwcllWaEphVXhEU214aWJVMXBUMmxLUWsxVVNUUlJNRXBFVEZWb1ZFMXFWVEpKYmpBdUxtUTRZazU2VTJVMmRGWjBUa293TjNObFpVNUpZM2N1WjJWTlVWQkRRVUZEWTFwQlZrZHRVRGRhZVU5V1dVdENSR3RxUjJ0YVdsSkVWemhGveFYyWkdSWGxCVUc0dFVYRjZVSEJSZGtwSGFETlNkVlIwYTIxa05XNHlVM0pHUlhOSWREbDJWbEEwVGtwRFkyY3dYeTB5UVRBeVluTkNMVmczUTNscFVIY3hSVEpFZEhWYVgzTk5kVFpDTUdsNFVuSnBUV1E0ZVRVM1kxQlBjSFpXVlZOd1pHdGxXR2xKWW5oNGVVNXRNbll6Y0hsWVEwWmpkMmx5ZERCUk5GbE5NRE51YWxWemIwcDBZbk5YVkcxNk9WSTBVREZaWTNoRk16SktPVU5mUkdWU1FtRnhZalUwTjFwSmJGbERkbVp4ZFdaVk5tbzRTRXcyWmpkbVYyMDVUbkJuTTFwTVYzSjJSa1JuVFZCRmVrNHpiMGR4VDJFNVIzTkdSMnBpVm05Qk0wRXRjSGRPVWpaMmFHVlFVRk5oYWpGVFpGZGZRbFJsTFVOc1RERmtZbWx4U2trdWFHOXFjbEIwU25WS2IwdGliWEYyYlhweU1YZEtRUT09.aeZZUiB1kh0Wa0iIDzhfHfdk7Z7W2zoo6euxgaOopV059M60J7IeZsPbCjEf5mZleSmkzb2KotNWxUwjueq4zA; user_action_cookie=user_action_ca450e81-5fee-49ea-a26b-97cffadf610d_62951387098; teacher_id=143411d4a2fc499aaee0db6da96ec540; Hm_lvt_2014de1ca4ec84db492ebee33b1dc46c=1594627133,1594693700,1594707825,1594707944; Hm_lpvt_2014de1ca4ec84db492ebee33b1dc46c=1594712216] -

The nginx error log is below. Clearly the connection to the backend ingress-nginx timed out:

2020/07/16 18:29:48 [error] 47088#0: *225891024 connect() failed (110: Connection timed out) while connecting to upstream, client: 112.253.13.197, server: *.xxx.cn, request: "POST /api/toolkit/qr/mapping HTTP/1.1", upstream: "http://pool_k8s_http/api/toolkit/qr/mapping", host: "xxxxxxx.cn"
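One quick way to observe the same symptom from the proxy host, outside of nginx, is to time a direct request to one of the ingress nodes. The IP and path below are taken from the config and log above, but the command itself is only illustrative (a GET against this POST endpoint may return 405; what matters here is the connect time):

# print HTTP status, TCP connect time and total time for a direct request to one ingress node
curl -o /dev/null -s -w 'status=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
    http://10.10.2.208/api/toolkit/qr/mapping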

During the load test both ping and curl against the ingress nodes would fail intermittently, and the connection count was high. Checking the system message log showed that some packets were being dropped, which turned out to be the cause of the 502s:

Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
Jul 16 06:39:39 xxx-10 kernel: nf_conntrack: table full, dropping packet
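If conntrack-tools happens to be installed on the host (an assumption; it is not mentioned in the original setup), the per-CPU counters confirm the drops without digging through the kernel log:

# per-CPU conntrack statistics; non-zero drop / insert_failed counters mean the table is overflowing
conntrack -S
# current number of tracked connections, equivalent to net.netfilter.nf_conntrack_count
conntrack -C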

Case Closed

The log errors above point straight at the cause: net.netfilter.nf_conntrack_max had been reached, so packets belonging to new connections were dropped, nginx could no longer proxy requests to the backend, and the result was 502. Once the cause is known the fix is simple: raise nf_conntrack_max.

# echo "net.netfilter.nf_conntrack_max = 524288" >>/etc/sysctl.conf
# sysctl -p
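Raising nf_conntrack_max on its own makes each hash bucket hold longer chains; the bucket count is a separate module parameter, and a common rule of thumb keeps it at roughly max/4. An optional follow-up step (not part of the original fix):

# hashsize can be changed at runtime; 131072 = 524288 / 4
echo 131072 > /sys/module/nf_conntrack/parameters/hashsize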

Check the limit:

# sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 524288

Check whether the current count is near the limit:

# sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 2015
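To avoid being surprised by this again, a small watchdog can compare those two values. Below is a minimal sketch that could run from cron; the 80% threshold and the script itself are illustrative, not part of the original fix:

#!/bin/bash
# warn when conntrack usage exceeds 80% of the configured maximum
max=$(sysctl -n net.netfilter.nf_conntrack_max)
count=$(sysctl -n net.netfilter.nf_conntrack_count)
usage=$(( count * 100 / max ))
if [ "$usage" -ge 80 ]; then
    echo "nf_conntrack usage at ${usage}% (${count}/${max})"
fi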

After adjusting the kernel parameter, the load test passed without a hitch~

So what is nf_conntrack?

nf_conntrack is the kernel module that records and tracks connection state. The table that went full in this incident is the one it uses to store the information and state (ESTABLISHED, TIME_WAIT, and so on) of every tracked connection (TCP, UDP, ...).

The table can be viewed here:

# cat /proc/net/nf_conntrack | more
ipv4 2 tcp 6 23 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=47804 dport=80 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=80 dport=47804 [ASSURED] mark=0 zone=0 use=2

The fields, in order:

ipv4 – the layer-3 address family
2 – the numeric code of that address family (2 = AF_INET, i.e. IPv4)
tcp – the layer-4 protocol (tcp, udp, icmp, etc.)
6 – the numeric protocol code for TCP
23 – remaining lifetime in seconds; if no new packet arrives before it expires the entry is removed, and each new packet resets the timer
TIME_WAIT – the connection state
src / dst – the source and destination addresses
sport / dport – the source and destination ports
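To see which states dominate the table (during the load test it would mostly be short-lived entries such as TIME_WAIT), the TCP entries can be grouped by that state field. A quick ad-hoc command based on the format shown above:

# count TCP conntrack entries per state (the state is the 6th field on tcp lines)
awk '$3 == "tcp" {print $6}' /proc/net/nf_conntrack | sort | uniq -c | sort -rn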

Why does the kernel track connection state at all?

Because connection tracking is the foundation of stateful firewalls and NAT.
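As a concrete illustration (not part of this incident), a typical stateful firewall rule only works because conntrack can classify every packet against that table:

# accept packets belonging to connections conntrack already knows about,
# so replies to outbound traffic get through without per-port rules
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT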

How should nf_conntrack_max be sized?

The default is 65536. The recommendation is to set it to the theoretical maximum, which depends on the machine's RAM: RAMSIZE (in bytes) / 16384 / (ARCH / 32). For a 64-bit machine with 64 GB of RAM, CONNTRACK_MAX = 64*1024*1024*1024 / 16384 / (64/32) = 2097152.
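The same ceiling can be computed on the machine itself from /proc/meminfo (MemTotal is in kB, hence the *1024; the final /2 assumes a 64-bit kernel, as in the example above):

# rough recommended ceiling for nf_conntrack_max on this machine
awk '/MemTotal/ {printf "CONNTRACK_MAX ~= %d\n", $2 * 1024 / 16384 / 2}' /proc/meminfo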

Author: WGY
Original link: http://geeklive.cn/2020/08/19/nginx502问题/undefined/nginx502问题/
License: Unless otherwise stated, all posts on this blog are released under the CC BY-NC-SA 4.0 license. Please credit the source when reposting!