2021.03.20 20:51:34.432503 [ 286 ] {a66f562c-18c4-40b0-9e83-ddf2e757f882} <Warning> TCPHandler: Client has gone away.
2021.03.20 20:51:34.849092 [ 286 ] {a66f562c-18c4-40b0-9e83-ddf2e757f882} <Information> TCPHandler: Processed in 266.890 sec.
The scheduled task fails after 3 minutes every time. Is the query running so long that the client connection gets actively closed? And what exactly triggers the disconnect?
The default is 0. This is the Go client's timeout control; since it was never configured, it shouldn't be the cause either.
Example: clickhouse_test.go:Test_Timeout
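For comparison, a minimal sketch of the equivalent client-side timeouts using the Python clickhouse-driver from the test script further below (the Go client exposes analogous options; the values here are placeholders):

from clickhouse_driver import Client

# With these left at their defaults, a long-running query simply blocks
# instead of being cut off by the client library itself.
client = Client(
    host='172.16.8.39',
    port=9001,
    connect_timeout=10,        # seconds allowed to establish the TCP connection
    send_receive_timeout=300,  # socket timeout while sending/receiving packets
)
print(client.execute('SELECT 1'))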
The default is 0. This is the execution-time timeout. Since the query does finish normally, this isn't the cause either.
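If this limit were set, the server itself would abort the query rather than letting the TCP session linger; a quick per-query sketch (the 1-second limit is arbitrary):

from clickhouse_driver import Client

client = Client.from_url('clickhouse://172.16.8.39:9001/mt')

try:
    # system.numbers is unbounded, so this query must hit the 1 s limit.
    client.execute(
        'SELECT count() FROM system.numbers',
        settings={'max_execution_time': 1},
    )
except Exception as e:
    print(e)  # expect a "Timeout exceeded" DB::Exception from the server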
Default 3600 s: closes idle TCP connections after the specified number of seconds. Not the cause here, though I can't seem to find documentation on where to set it?
This is especially important for large clusters with multiple distributed tables on every server, because every server can possibly keep a connection pool to every other server, and after peak query concurrency, connections will stall.
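The notes never name the knob; assuming the setting described above is idle_connection_timeout, whether it is tunable at the session level can at least be checked from system.settings:

from clickhouse_driver import Client

client = Client.from_url('clickhouse://172.16.8.39:9001/mt')

# Assumption: the idle-connection timeout is a regular setting; if so,
# it shows up here together with its current value.
print(client.execute(
    "SELECT name, value, changed FROM system.settings WHERE name LIKE '%idle%'"
))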
Configured in query_settings.go:querySettingList; like receive_timeout, these are ClickHouse query settings exposed through the client.
The default is 300 seconds, covering the time allowed for inserting and receiving data. This error probably isn't the cause of this case: here the TCP connection was dropped by a slow query before any data had been received, not by a timeout while receiving too much data. Moreover, if receive_timeout were the trigger, the server should report DB::Exception: Timeout exceeded while receiving data from client.
- Timeout exceeded while receiving data from client: governed by ClickHouse's receive_timeout and send_timeout settings (see the sketch after this list).
- Docs: connect-timeout-receive-timeout-send-timeout
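As hinted above, both settings can also be raised per query through the Python driver's settings dictionary; a hedged sketch (the values are arbitrary, and whether a given server version honors each per-query override should be verified):

from clickhouse_driver import Client

client = Client.from_url('clickhouse://172.16.8.39:9001/mt')

# Per-query overrides of the server-side send/receive timeouts (seconds).
rows = client.execute(
    'SELECT count() FROM numbers(1000000)',
    settings={'receive_timeout': 600, 'send_timeout': 600},
)
print(rows)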
The time in seconds the connection needs to remain idle before TCP starts sending keepalive probes
Superseded by http_keep_alive_timeout.
Can it be reproduced by simulating a slow query?
set max_execution_speed=10;
select count(1) from system.numbers;
Couldn't reproduce it. Worse, I found that this query is not killed even after the client program stops. TODO
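Until the cause is found, the orphaned query can at least be cleaned up by hand; a sketch using system.processes and KILL QUERY (the query_id is a placeholder):

from clickhouse_driver import Client

client = Client.from_url('clickhouse://172.16.8.39:9001/mt')

# List queries that are still running, with their elapsed time.
for query_id, elapsed, query in client.execute(
    'SELECT query_id, elapsed, query FROM system.processes'
):
    print(query_id, round(elapsed, 1), query[:60])

# Kill the runaway query by its id.
client.execute("KILL QUERY WHERE query_id = '<query_id>'")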
So it isn't a slow-query problem after all? Recalling that load was very high while this was happening: maybe a server-side connection setting actively killed the client's connection?
It may be better to wait for the next occurrence in production and investigate on the spot?
So what is the cause? TCP's own built-in connection-timeout settings?
Client has gone away
means the client died for some reason: possibly a TCP setting, or the client being killed for exceeding memory, and so on.
Running it inside the server container works:
clickhouse-client --send_logs_level=trace <<< 'SELECT * FROM mt.ad_aggs_outer' > /dev/null
Running it via a client container fails:
docker run -it --rm --link my-clickhouse-server-v2:clickhouse-server yandex/clickhouse-client --host clickhouse-server
docker run --rm --link my-clickhouse-server-v2:clickhouse-server yandex/clickhouse-client --host clickhouse-server --send_logs_level=trace --query 'SELECT * FROM mt.ad_aggs_outer' > /dev/null
Error: Code: 10. DB::Exception: Not found column thread_number in block. There are only columns: event_time, event_time_microseconds, host_name, query_id, thread_id, priority, source, text
Cause: ClickHouse/ClickHouse#9539 (comment):
The client cannot read logs that are sent with send_logs_level setting.
In distributed setup, servers will forward logs to the initiator server, then they are sent to the client and the client will show an error
- Prevent out of memory while constructing result of a large JOIN
- Fixed rare crashes when server run out of memory
After going through the relevant settings, the query-side limits appear to work correctly: whether it's a JOIN or a GROUP BY, max_memory_usage_for_all_queries keeps them in check.
So why did this memory problem occur at all?
The OOM allocation came from TCPHandler, so it was either reading or writing data. Perhaps a memory-heavy query happened to coincide with a large volume of writes, and together they caused the OOM? Worth checking whether writes have a memory limit of their own. But the allocation still seems too large: several GB of data? It doesn't feel plausible.
In the period before the incident, memory usage stayed above 3 GB and was only released by a restart. Was the baseline already at 3 GB, so that a new allocation, although within the configured limit, could not be satisfied? Similar to the earlier incident where the backup tool exhausted memory.
The problem occurred again. Investigation this time turned up a lot of INSERT SELECT operations. The likely cause is still that SELECTs and INSERTs together drive up the load: the settings only limit query memory, while the memory used by the merges that inserts trigger cannot be limited, leading to OOM.
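To check this hypothesis while it is happening, the memory held by in-flight merges can be read from system.merges (a sketch; column names as in recent ClickHouse versions):

from clickhouse_driver import Client

client = Client.from_url('clickhouse://172.16.8.39:9001/mt')

# system.merges exposes the RAM each running background merge holds,
# which is exactly the part the query-level limits do not cover.
for db, table, elapsed, mem in client.execute(
    'SELECT database, table, elapsed, memory_usage FROM system.merges'
):
    print(f'{db}.{table}: {elapsed:.1f}s, {mem / 1024**2:.1f} MiB')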
How can it be limited?
The number of rows that are read from the merged parts into memory.
A merge reads rows from the parts being merged in blocks of merge_max_block_size rows, merges them, and writes the result into a new part. The blocks being read are held in RAM, so merge_max_block_size determines how much RAM a merge needs. For tables with very wide rows, merges can therefore consume a lot of RAM: with an average row size of 100 KB, merging 10 parts needs about 100 KB * 10 * 8192 ≈ 8 GB of RAM. Reducing merge_max_block_size lowers the RAM a merge needs, at the cost of slower merging.
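The estimate above as a quick sanity check (numbers taken straight from the paragraph):

avg_row_size = 100 * 1024       # 100 KB per row, a very wide table
parts_merged = 10
merge_max_block_size = 8192     # the default

ram_needed = avg_row_size * parts_merged * merge_max_block_size
print(f'{ram_needed / 1024**3:.1f} GiB')  # ~7.8 GiB, the "~8 GB" from the text

If the server version supports MODIFY SETTING for MergeTree settings, the value can presumably be lowered per table, e.g. ALTER TABLE t MODIFY SETTING merge_max_block_size = 1024, trading merge speed for RAM.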
OOM struck yet again; upgrading to version 21 as soon as possible looks like the more reliable fix.
# Memory limits on version 19
docker run -d -m 2g -p 9001:9000 -p 8123:8123 --name my-clickhouse-server-19 --ulimit nofile=262144:262144 --volume=/home/youmi/data/ch-19:/var/lib/clickhouse -v /home/youmi/config/ch/config.xml:/etc/clickhouse-server/config.xml -v /home/youmi/config/ch/users.xml:/etc/clickhouse-server/users.xml -v /home/youmi/log/clickhouse:/var/log/clickhouse-server yandex/clickhouse-server:19.16.6.17
docker exec -it my-clickhouse-server-19 bash
clickhouse-client --query "SELECT * FROM ( SELECT now(), rand(number), number FROM system.numbers LIMIT 1000000 ) AS t1 LEFT JOIN ( SELECT now(), rand(number), number FROM system.numbers LIMIT 2000000 ) AS t2 ON t1.number = t2.number"
clickhouse-client --query "SELECT name, value FROM system.settings WHERE name LIKE '%mem%';"
Running several of these queries at once from multiple processes shows usage exceeding the max_memory_usage_for_all_queries limit (see the multiprocessing sketch after the script):
from clickhouse_driver import Client

client = Client.from_url("clickhouse://172.16.8.39:9001/mt")


def run():
    # A memory-hungry JOIN: the right-hand subquery is materialized into an
    # in-memory hash table, so each run allocates a large chunk of RAM.
    sql = """
    SELECT *
    FROM
    (
        SELECT
            now(),
            rand(number),
            number
        FROM system.numbers
        LIMIT 1000000
    ) AS t1
    LEFT JOIN
    (
        SELECT
            now(),
            rand(number),
            number
        FROM system.numbers
        LIMIT 1000000
    ) AS t2 ON t1.number = t2.number
    """
    for i in range(10):
        try:
            res = client.execute(sql)
            print('1')
        except Exception as e:
            # Expect a "Memory limit ... exceeded" error once the limit is hit.
            print(e)
            break


run()
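The loop above is sequential; to drive the server past max_memory_usage_for_all_queries, several copies must run at once. A minimal multiprocessing sketch of that (same server URL and JOIN; the process count of 4 is arbitrary):

from multiprocessing import Process

from clickhouse_driver import Client

SQL = """
SELECT * FROM
    (SELECT now(), rand(number), number FROM system.numbers LIMIT 1000000) AS t1
LEFT JOIN
    (SELECT now(), rand(number), number FROM system.numbers LIMIT 1000000) AS t2
ON t1.number = t2.number
"""


def worker():
    # Each process opens its own connection.
    client = Client.from_url('clickhouse://172.16.8.39:9001/mt')
    try:
        client.execute(SQL)
        print('ok')
    except Exception as e:
        # On 19.16 some workers succeed even as total usage exceeds the
        # max_memory_usage_for_all_queries cap; on 21.x the server-wide
        # limit fails them instead.
        print(e)


if __name__ == '__main__':
    procs = [Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()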
# Memory limits on version 21
docker run -d -m 2g -p 9001:9000 -p 8123:8123 --name my-clickhouse-server-21 --ulimit nofile=262144:262144 -v /home/youmi/log/clickhouse:/var/log/clickhouse-server -v /home/youmi/config/ch21/clickhouse-server/:/etc/clickhouse-server yandex/clickhouse-server:21.3.5
In the new version, max_memory_usage_for_all_queries no longer takes effect. Repeating the same test with only max_memory_usage set, memory cannot be kept under the limit when queries run concurrently.
Testing with the max_server_memory_usage setting changed:
DB::Exception: Memory limit (total) exceeded: would use 587.52 MiB (attempt to allocate chunk of 132647048 bytes), maximum: 476.84 MiB: While executing CreatingSetsTransform. Stack trace:
Memory never exceeds the configured server maximum; the query fails with an error instead.
Conclusion: the frequent OOMs on version 19 happened mainly because the memory limits only covered query execution. ClickHouse allocates memory in many other places, such as INSERT statements, the uncompressed-data cache, and sending result data to clients, so total usage kept drifting above the limit. Version 21 enforces a server-wide cap (max_server_memory_usage): allocations from every source count toward one limit, which prevents the OOM. Keep observing after the upgrade.
Also watch the memory consumed via parameters like uncompressed_cache_size, as checked below.
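A quick way to audit those knobs from a client session (a sketch; the asynchronous-metric names UncompressedCacheBytes and MarkCacheBytes are assumed to exist on this version):

from clickhouse_driver import Client

client = Client.from_url('clickhouse://172.16.8.39:9001/mt')

# Cache-related query settings (server-level sizes live in config.xml).
print(client.execute(
    "SELECT name, value FROM system.settings WHERE name LIKE '%cache%'"
))

# Current footprint of the uncompressed and mark caches, if these
# metric names are present on this server version.
print(client.execute(
    "SELECT metric, value FROM system.asynchronous_metrics "
    "WHERE metric IN ('UncompressedCacheBytes', 'MarkCacheBytes')"
))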