# 问题记录.md

## timeout

### Background

```
2021.03.20 20:51:34.432503 [ 286 ] {a66f562c-18c4-40b0-9e83-ddf2e757f882} <Warning> TCPHandler: Client has gone away.
2021.03.20 20:51:34.849092 [ 286 ] {a66f562c-18c4-40b0-9e83-ddf2e757f882} <Information> TCPHandler: Processed in 266.890 sec.
```

The scheduled job fails after about 3 minutes every time. Did the query run so long that the client connection was actively closed? What caused that disconnect?

### read_timeout

The default is 0. This is the Go client's timeout control; since it was never configured, it probably is not the cause.

Example: clickhouse_test.go:Test_Timeout
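For reference, a minimal sketch of setting client-side timeouts explicitly, using clickhouse_driver (the Python client used in the tests further below); the host/port are the test server from those tests and the timeout values are arbitrary examples:

```python
from clickhouse_driver import Client

# If these client-side limits are left at their defaults and the server takes
# longer than expected, the client side may give up on the connection first.
client = Client(
    host="172.16.8.39",
    port=9001,
    database="mt",
    connect_timeout=10,        # seconds allowed to establish the TCP connection
    send_receive_timeout=600,  # seconds allowed for a single send/receive
)

print(client.execute("SELECT 1"))
```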

### max-execution-time

The default is 0. It limits query execution time. Since the query does finish successfully, this is not the cause either.
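A small sketch of what passing the ClickHouse setting max_execution_time per query looks like with clickhouse_driver (the 1-second limit is only for illustration); with it set, the server aborts the query with a timeout error instead of the client side going away quietly:

```python
from clickhouse_driver import Client
from clickhouse_driver.errors import ServerException

client = Client(host="172.16.8.39", port=9001, database="mt")

try:
    # max_execution_time is a server-side query setting (seconds);
    # with the default of 0 the server never aborts a long-running query.
    client.execute(
        "SELECT count() FROM system.numbers",
        settings={"max_execution_time": 1},
    )
except ServerException as e:
    # Expect a TIMEOUT_EXCEEDED error ("Timeout exceeded: elapsed ...").
    print(e)
```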

### idle_connection_timeout

The default is 3600 s; it closes an idle TCP connection after the specified number of seconds. Not the cause here, though I can't seem to find documentation on how to configure it?

> This is especially important for large clusters with multiple distributed tables on every server, because every server can possibly keep a connection pool to every other server, and after peak query concurrency, connections will stall.

### query_settings

Configured in query_settings.go:querySettingList. Together with receive_timeout, these are the ClickHouse-related query settings the client can pass through.

### receive_timeout

The default is 300 seconds; it covers the time for inserting and receiving data. This should not be the cause of this case either: here the TCP connection dropped because of a slow query, before any data had been received, not because receiving too much data timed out. Besides, if receive_timeout were hit the server should report `DB::Exception: Timeout exceeded while receiving data from client`.

### tcp_keep_alive_timeout

> The time in seconds the connection needs to remain idle before TCP starts sending keepalive probes

Superseded by http_keep_alive_timeout.
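To see what the server actually has for the timeout settings discussed above, a quick check via system.settings (session-level settings only; parameters that can only be set in the server config may not show up here, depending on the version):

```python
from clickhouse_driver import Client

client = Client(host="172.16.8.39", port=9001)

# List every timeout-related session setting and whether it was changed
# from its default.
rows = client.execute(
    "SELECT name, value, changed FROM system.settings "
    "WHERE name LIKE '%timeout%' ORDER BY name"
)
for name, value, changed in rows:
    print(f"{name:40s} {value:>10s} changed={changed}")
```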

### Reproduction

Try to reproduce it by simulating a slow query?

```sql
set max_execution_speed=10;
select count(1) from system.numbers;
```

Could not reproduce it; instead I noticed that this query is not killed even after the client program stops. TODO
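Side note for the TODO: a sketch of finding the lingering query in system.processes and terminating it with KILL QUERY (clickhouse_driver again; the query_id used here is just the one from the log above, as an example):

```python
from clickhouse_driver import Client

client = Client(host="172.16.8.39", port=9001)

# List queries that are still running on the server.
for query_id, elapsed, query in client.execute(
    "SELECT query_id, elapsed, query FROM system.processes"
):
    print(query_id, round(elapsed), query[:60])

# Terminate one of them by its query_id; ASYNC returns without waiting
# for the query to actually stop.
qid = "a66f562c-18c4-40b0-9e83-ddf2e757f882"
client.execute(f"KILL QUERY WHERE query_id = '{qid}' ASYNC")
```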

So maybe it is not a slow-query problem at all? Recalling that the load was very high while it happened, perhaps a server-side connection setting actively killed the client's connection?

It may be better to investigate on the spot the next time it happens in production?

So what is the cause? TCP's own connection timeout settings? "Client has gone away" means the client died for some reason: it could be a TCP setting, or the client process being killed after running out of memory, and so on.


## explain

Running it inside the server container works:

```sh
clickhouse-client --send_logs_level=trace <<< 'SELECT * FROM mt.ad_aggs_outer' > /dev/null
```

Running the client from a separate container fails:

```sh
docker run -it --rm --link my-clickhouse-server-v2:clickhouse-server yandex/clickhouse-client --host clickhouse-server
docker run --rm --link my-clickhouse-server-v2:clickhouse-server yandex/clickhouse-client --host clickhouse-server --send_logs_level=trace --query 'SELECT * FROM mt.ad_aggs_outer' > /dev/null
```

Error:

```
Code: 10. DB::Exception: Not found column thread_number in block. There are only columns: event_time, event_time_microseconds, host_name, query_id, thread_id, priority, source, text
```

Cause: ClickHouse/ClickHouse#9539 (comment):

> The client cannot read logs that are sent with send_logs_level setting.
> In distributed setup, servers will forward logs to the initiator server, then they are sent to the client and the client will show an error

## oom

After reviewing the relevant settings, the query-related limits look fine: both JOINs and GROUP BYs are kept within max_memory_usage_for_all_queries.

So why did this memory problem happen?

The OOM came from an allocation made by the TCPHandler, so it must have been either reading or writing data. Did a memory-heavy query happen to coincide with a large write, and together they caused the OOM? Check whether there is any memory limit on writes. Still, that allocation seems far too large: several GB of data? It does not feel reasonable.

For a while before the incident, memory usage stayed above 3 GB, and it was released after a restart. Was it that usage was already at 3 GB, so an allocation that was within the configured limit could not be satisfied? Similar to the earlier case where the backup tool consumed memory.
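A sketch of checking where the memory actually is: currently running queries in system.processes and in-flight background merges in system.merges (clickhouse_driver; these columns should exist in both 19.x and 21.x, but I have not checked every version):

```python
from clickhouse_driver import Client

client = Client(host="172.16.8.39", port=9001)

# Memory held by currently running queries.
for query, mem in client.execute(
    "SELECT query, formatReadableSize(memory_usage) FROM system.processes "
    "ORDER BY memory_usage DESC"
):
    print(mem, query[:80])

# Memory held by in-flight merges, which query-level settings do not limit.
for table, mem in client.execute(
    "SELECT table, formatReadableSize(memory_usage) FROM system.merges"
):
    print(mem, table)
```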

### 2021.04.12

The problem appeared again. Investigation showed many INSERT ... SELECT operations. The likely cause is still that the SELECT and INSERT together increase the load: the settings only limit query memory, while the memory used by the merges triggered by INSERTs cannot be limited, which leads to the OOM.

How can it be limited?

### merge_max_block_size

> The number of rows that are read from the merged parts into memory.

A merge reads rows from the parts in blocks of merge_max_block_size rows, merges them, and writes the result into a new part. The blocks being read are held in RAM, so merge_max_block_size affects how much RAM a merge needs. For tables with very wide rows, merges can therefore consume a lot of RAM (with an average row size of 100 KB, merging 10 parts needs roughly 100 KB * 10 * 8192 ≈ 8 GB of RAM). Reducing merge_max_block_size lowers the RAM needed for merges, at the cost of slower merges.
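A sketch of inspecting the current value and lowering it for one table (ad_aggs_outer is simply the table that appears elsewhere in these notes, not necessarily the one doing heavy merges; per-table `ALTER TABLE ... MODIFY SETTING` exists in recent versions, I have not verified it on 19.16):

```python
from clickhouse_driver import Client

client = Client(host="172.16.8.39", port=9001, database="mt")

# Current default for all MergeTree tables (8192 unless overridden).
print(client.execute(
    "SELECT name, value FROM system.merge_tree_settings "
    "WHERE name = 'merge_max_block_size'"
))

# Lower it for one table; this trades merge speed for less RAM per merge.
# (Per-table MODIFY SETTING may not be supported on very old servers.)
client.execute(
    "ALTER TABLE ad_aggs_outer MODIFY SETTING merge_max_block_size = 2048"
)
```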

### 2021.04.28

OOM happened yet again; the safer bet is to upgrade to version 21 soon.

```sh
# v19 with a 2 GB container memory limit
docker run -d -m 2g -p 9001:9000 -p 8123:8123 --name my-clickhouse-server-19 --ulimit nofile=262144:262144 --volume=/home/youmi/data/ch-19:/var/lib/clickhouse  -v /home/youmi/config/ch/config.xml:/etc/clickhouse-server/config.xml -v /home/youmi/config/ch/users.xml:/etc/clickhouse-server/users.xml -v /home/youmi/log/clickhouse:/var/log/clickhouse-server yandex/clickhouse-server:19.16.6.17

docker exec -it my-clickhouse-server-19 bash

# a memory-hungry LEFT JOIN, run inside the container
clickhouse-client --query "SELECT * FROM (     SELECT         now(),         rand(number),         number     FROM system.numbers     LIMIT 1000000 ) AS t1 LEFT JOIN (     SELECT         now(),         rand(number),         number     FROM system.numbers     LIMIT 2000000 ) AS t2 ON t1.number = t2.number"

# check the memory-related settings
clickhouse-client --query "SELECT name, value FROM system.settings WHERE name LIKE '%mem%';"
```

Running several of these queries in parallel from multiple processes shows that memory can exceed the max_memory_usage_for_all_queries limit (see the multiprocessing sketch after the script):

```python
from clickhouse_driver import Client

client = Client.from_url("clickhouse://172.16.8.39:9001/mt")


def run():
    # A memory-hungry LEFT JOIN over system.numbers; each execution builds
    # the right-hand side hash table in RAM.
    sql = """
    SELECT *
    FROM
    (
        SELECT
            now(),
            rand(number),
            number
        FROM system.numbers
        LIMIT 1000000
    ) AS t1
    LEFT JOIN
    (
        SELECT
            now(),
            rand(number),
            number
        FROM system.numbers
        LIMIT 1000000
    ) AS t2 ON t1.number = t2.number
    """
    # Repeat the query; stop at the first server-side error
    # (e.g. a memory-limit exception).
    for _ in range(10):
        try:
            res = client.execute(sql)
            print('1')
        except Exception as e:
            print(e)
            break


run()
```
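As noted above, the limit is only exceeded when several queries run concurrently. A minimal sketch of driving that from multiple processes (the JOIN is a condensed version of the one in the script above; each worker opens its own connection, since a Client is not meant to be shared across processes):

```python
from multiprocessing import Process

from clickhouse_driver import Client

# Condensed version of the JOIN used in the script above.
SQL = """
SELECT * FROM
  (SELECT number FROM system.numbers LIMIT 1000000) AS t1
LEFT JOIN
  (SELECT number FROM system.numbers LIMIT 1000000) AS t2
  ON t1.number = t2.number
"""


def worker():
    # Each process gets its own connection to the server under test.
    client = Client.from_url("clickhouse://172.16.8.39:9001/mt")
    try:
        client.execute(SQL)
        print("ok")
    except Exception as e:
        print(e)


if __name__ == "__main__":
    procs = [Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```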

### Version 21

```sh
docker run -d -m 2g -p 9001:9000 -p 8123:8123 --name my-clickhouse-server-21 --ulimit nofile=262144:262144  -v /home/youmi/log/clickhouse:/var/log/clickhouse-server -v /home/youmi/config/ch21/clickhouse-server/:/etc/clickhouse-server yandex/clickhouse-server:21.3.5
```

In the new version, max_memory_usage_for_all_queries no longer takes effect. With the same test, limiting only max_memory_usage cannot cap memory under concurrency.

Testing a change to the max_server_memory_usage configuration:

```
DB::Exception: Memory limit (total) exceeded: would use 587.52 MiB (attempt to allocate chunk of 132647048 bytes), maximum: 476.84 MiB: While executing CreatingSetsTransform. Stack trace:
```

Memory no longer exceeds the configured maximum; the query simply fails with an error instead.

Conclusion: the frequent OOMs on version 19 happened mainly because the memory limits only covered query memory. ClickHouse consumes memory in many other places, such as INSERT statements, the uncompressed data cache, and clients reading data, so total usage kept drifting above the limit. Version 21 applies a server-wide limit, so allocations from all of these sources count toward it, which avoids the OOM. Keep observing after the upgrade.

Pay attention to the memory used by parameters such as uncompressed_cache_size.
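A sketch for keeping an eye on that cache (metric names taken from system.asynchronous_metrics; exact availability can differ between versions):

```python
from clickhouse_driver import Client

client = Client(host="172.16.8.39", port=9001)

# Bytes currently held by the uncompressed block cache, plus resident memory.
rows = client.execute(
    "SELECT metric, value FROM system.asynchronous_metrics "
    "WHERE metric IN ('UncompressedCacheBytes', 'UncompressedCacheCells', "
    "'MemoryResident')"
)
for metric, value in rows:
    print(f"{metric:25s} {value:,.0f}")
```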