many keepalive watchdog fired log in tiflash error log #4192

Closed
windtalker opened this issue Mar 8, 2022 · 7 comments · Fixed by #4655
Assignees
Labels
affects-6.0 severity/major type/bug The issue is confirmed as a bug.

Comments

@windtalker
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Set up a cluster with TLS enabled, then run MPP queries.

2. What did you expect to see? (Required)

The error log should be clean.

3. What did you see instead (Required)

Many "Keepalive watchdog fired" logs appear in the TiFlash error log:

[2022/03/08 12:04:16.189 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.85:9626: Keepalive watchdog fired. Closing transport."] [thread_id=221]                                                                    
[2022/03/08 12:04:18.544 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:9626: Keepalive watchdog fired. Closing transport."] [thread_id=115]

The log seems harmless since queries are not affected, but we still need to find the root cause and avoid this error.

4. What is your TiFlash version? (Required)

master @ a605801

windtalker added the type/bug label Mar 8, 2022
@zanmato1984
Contributor

Is this error only seen when TLS is enabled?

@windtalker
Contributor Author

Not sure, I need to test on a non-TLS TiDB cluster.

@windtalker
Contributor Author

Tested on a nightly (2021-03-13) TiDB cluster with TLS on:

[2022/03/18 09:22:19.042 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:9635: Keepalive watchdog fired. Closing transport."] [thread_id=10]
[2022/03/18 09:22:19.105 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.82:9635: Keepalive watchdog fired. Closing transport."] [thread_id=10]
[2022/03/18 09:22:23.612 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:9635: Keepalive watchdog fired. Closing transport."] [thread_id=10]
[2022/03/18 09:22:23.929 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.82:9635: Keepalive watchdog fired. Closing transport."] [thread_id=16]
[2022/03/18 09:22:34.146 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:7535: Keepalive watchdog fired. Closing transport."] [thread_id=16]
[2022/03/18 09:22:34.169 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:7535: Keepalive watchdog fired. Closing transport."] [thread_id=10]
[2022/03/18 09:24:08.301 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:7535: Keepalive watchdog fired. Closing transport."] [thread_id=16]
[2022/03/18 09:24:39.716 +08:00] [ERROR] [<unknown>] ["grpc:/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/contrib/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line number: 2882, log msg : ipv4:172.16.5.81:7535: Keepalive watchdog fired. Closing transport."] [thread_id=16]

172.16.5.81:7535 is a TiKV server, and 172.16.5.82:9635 is a TiFlash server.
Errors about the TiFlash server stop several minutes after the query stops, while errors about the TiKV server seem to last for the whole lifetime of the TiFlash server.

With TLS off, these errors are not found.

So it seems to be another problem that only shows up with TLS enabled.

@windtalker
Contributor Author

The client's keepalive config is almost the same in client-c and client-go, but no such errors are found in TiDB's log.

windtalker self-assigned this Mar 21, 2022
@windtalker
Contributor Author

windtalker commented Mar 21, 2022

The root cause is in client-c: we set GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS to 1, which means the client channel will send keepalive pings even without any outstanding streams. In the grpc-core implementation, a client channel without active polling threads (in my understanding, a client channel without any outstanding streams will eventually get into this situation) uses a backup poller to poll messages, including the ping ack. The backup polls run in the timer threads and the default poll interval is 5000 ms, while the keepalive timeout threshold is set to 3000 ms, so the keepalive timeout is hit randomly.
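
For illustration, a minimal sketch (not the actual client-c code) of a client channel configured the way described above. The helper name and the keepalive interval value are assumptions; only the 3000 ms timeout and the permit-without-calls setting come from this analysis.

```cpp
#include <grpcpp/grpcpp.h>

// Hypothetical helper, roughly mirroring the keepalive settings described above.
std::shared_ptr<grpc::Channel> makeMppChannel(const std::string & addr)
{
    grpc::ChannelArguments args;
    // Allow keepalive pings even when there are no outstanding streams;
    // this is what pushes an idle channel onto the backup poller.
    args.SetInt(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);
    // Keepalive ping interval (assumed value, for illustration only).
    args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, 10000);
    // Ping timeout of 3000 ms, i.e. shorter than the 5000 ms default backup
    // poll interval, so the ping ack may not be polled before the watchdog fires.
    args.SetInt(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 3000);
    // Credentials omitted for brevity; the issue was observed with TLS enabled.
    return grpc::CreateCustomChannel(addr, grpc::InsecureChannelCredentials(), args);
}
```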

Some possible solutions:

  • Disable GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS in client-c
  • Set the client channel keepalive timeout and the default poll interval carefully (see the sketch after this list):
    • The default poll interval must be less than the keepalive timeout. Assuming default_poll_interval = keepalive_timeout - x ms, the effective keepalive timeout in the worst case is only x ms: if a poll is triggered at time t, the next poll is triggered at t + default_poll_interval, and a keepalive ping sent at t + default_poll_interval - x has only x ms before the next poll, so if the ping ack has not arrived by then, it is treated as a timeout.
    • In TiDB the default keepalive ping timeout is 3000 ms, so I think x should be 3000 as well.
    • If x is set to 3000 and default_poll_interval = 5000, then the keepalive timeout in client-c has to be set to 8000 ms, which may be too big, so maybe in this case we need to set default_poll_interval to 2000/3000?
    • Furthermore, default_poll_interval cannot be set by channel arguments; it is a global variable and has to be set via the environment variable GRPC_CLIENT_CHANNEL_BACKUP_POLL_INTERVAL_MS, so we would need to set it in run_tiflash.sh.
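
A sketch of the second option, assuming the variable can be exported before gRPC initializes (in practice it would go into run_tiflash.sh as an environment variable, as noted above); the 2000 ms value is just one of the candidates mentioned:

```cpp
#include <cstdlib>

// Lower the backup poll interval so a keepalive ping ack is polled well before
// a 3000 ms ping timeout expires. The variable is read once at gRPC startup,
// so this must run before any channel is created.
void configureBackupPollInterval()
{
    setenv("GRPC_CLIENT_CHANNEL_BACKUP_POLL_INTERVAL_MS", "2000", /*overwrite=*/1);
}
```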

I prefer the first solution because:

  • It is much simpler
  • client-rust (which is also based on grpc-cpp) does not set it either
  • It does not seem to have significant side effects
  • GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS is disabled by default in gRPC

After some discussion with @zanmato1984 and @fuzhe1989, I decided to set the keepalive ping timeout to 8000 ms as a quick fix; if we update GRPC_CLIENT_CHANNEL_BACKUP_POLL_INTERVAL_MS some day, we can adjust the keepalive ping timeout back to a shorter value.
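
A sketch of the quick fix as described above (the helper name is hypothetical; only the 8000 ms value comes from the discussion):

```cpp
#include <grpcpp/grpcpp.h>

// Keep keepalive pings without calls, but raise the ping timeout above the
// 5000 ms backup poll interval so a late-polled ack no longer trips the watchdog.
void applyKeepaliveQuickFix(grpc::ChannelArguments & args)
{
    args.SetInt(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);
    args.SetInt(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 8000);
}
```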

@LittleFall
Contributor

I will cherry-pick this to 5.x.

@windtalker
Contributor Author

Unfortunately, after setting GRPC_ARG_KEEPALIVE_TIMEOUT_MS to 8000 ms, TiFlash still hits this error randomly when running under high load. The root cause is in gRPC v1.26.0:

  • Each epoll will pull at most 100 events.
  • If an epoll has polled more than 16 events, pollable_process_events only handles the first 16; the rest are handled the next time pollset_work is called.
  • The backup poller is triggered every 5 seconds, so if an event is not handled during one trigger, it has to wait at least 5 more seconds before being handled.

And since the keepalive timeout is 8 seconds, if the ping ack event is not handled the first time it is polled, there is a high probability of a timeout.

Since gRPC 1.31.0 there is no limit on the number of events handled in pollable_process_events, but an epoll still polls at most 100 events. So if we upgrade gRPC to 1.31.0 or newer, we can avoid this keepalive timeout in most scenarios, but there is still a chance that a ping ack event is not polled in time and causes the keepalive timeout error.
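
A back-of-the-envelope illustration of the worst case described above, using only the numbers from this comment (5 s backup poll period, 8 s keepalive timeout, one extra poll cycle when the ack is not among the first 16 handled events):

```cpp
#include <cstdio>

int main()
{
    const int backup_poll_interval_ms = 5000; // backup poller period
    const int keepalive_timeout_ms = 8000;    // GRPC_ARG_KEEPALIVE_TIMEOUT_MS after the quick fix

    // The ack is first seen at the next backup poll; if it is beyond the first
    // 16 events handled in that poll, it waits one more full period.
    const int worst_case_ack_delay_ms = 2 * backup_poll_interval_ms;

    std::printf("worst-case ack delay: %d ms, keepalive timeout: %d ms\n",
                worst_case_ack_delay_ms, keepalive_timeout_ms);
    // 10000 ms > 8000 ms, so the watchdog can still fire under high load.
    return 0;
}
```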

windtalker added a commit to windtalker/tiflash that referenced this issue Apr 14, 2022
windtalker added a commit to windtalker/tiflash that referenced this issue Apr 14, 2022
windtalker added a commit to windtalker/tiflash that referenced this issue Apr 21, 2022
windtalker added a commit to windtalker/tiflash that referenced this issue Apr 21, 2022
windtalker added a commit to windtalker/tiflash that referenced this issue Apr 21, 2022
windtalker added a commit to windtalker/tiflash that referenced this issue Apr 21, 2022