distributed table hung - test draft showing problem #9900
result = instance_with_dist_table.query("SELECT hostName(), x FROM distributed ORDER BY hostName(), x SETTINGS load_balancing='in_order', prefer_localhost_replica=0")
assert TSV(result) == TSV('node_1_1\t1\nnode_2_2\t2\nnode_2_2\t3')
# currently: hangs for 5 minutes (receive_timeout); the query cannot be stopped with Ctrl+C or by max_execution_time
This part is expected -- you're pausing the container, so the server process is suspended. The kernel still accepts the connection, so the connection timeout doesn't fire. But the paused server can't reply to the hello packet, hence the initiator waits for receive_timeout.
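The mechanics can be reproduced outside ClickHouse. The following is a minimal sketch, not ClickHouse code: a listening socket that never calls `accept()` stands in for the paused server, and the function name `probe` and both timeout values are hypothetical. It shows that `connect()` succeeds because the kernel completes the TCP handshake, while the read of the reply only ends when the receive timeout expires.

```python
import socket
import time


def probe(connect_timeout=1.0, receive_timeout=0.5):
    """Connect to a listener that never replies; report what happened."""
    # Stand-in for the paused server: it listens, but no process ever
    # calls accept() or sends a byte.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(connect_timeout)
    try:
        # The kernel finishes the handshake from the listen backlog,
        # so the connect timeout does not fire.
        client.connect(("127.0.0.1", port))
        connected = True

        client.settimeout(receive_timeout)
        started = time.monotonic()
        try:
            client.recv(1)  # wait for a "hello" reply that never arrives
            timed_out = False
        except socket.timeout:
            timed_out = True
        waited = time.monotonic() - started
    finally:
        client.close()
        server.close()
    return connected, timed_out, waited
```

Calling `probe(receive_timeout=0.2)` should report a successful connection followed by a receive timeout after roughly 0.2 seconds, which mirrors the initiator waiting for the full receive_timeout in the test above.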
I understand the mechanics, but I'm not satisfied with the result.
It needs to work better: we have another healthy replica.
# expected behavior:
# check whether the replica gives any acknowledgment (this could be an extra round trip, "are you alive? yes",
# or an ack after the query is sent, like "query accepted, will process it") within
# connect_timeout_with_failover_ms, and if there is no ack, go to another (healthy) replica
This won't help, because if the server can be suddenly paused before the hello packet, it can also be paused after it, with the same result.
heartbeats
Adding more kinds of packets to the protocol doesn't help -- they are all subject to the same problem, heartbeat packets included. The problem is not in the protocol but in the fact that we work with replicas synchronously. Anyway, we have some understanding of what can be done about this case. AFAIR, @tavplubix wanted to make a prototype but hasn't gotten to it yet.
Yeah, it should be async.
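To make the asynchronous idea concrete, here is a minimal sketch using `asyncio`. It is an illustration only, not the ClickHouse implementation: `query_replica`, `hedged_query`, `hedge_timeout`, and the simulated delays are all hypothetical. The initiator starts the query on the first replica, and if no reply arrives within `hedge_timeout` it also starts the query on the next replica, then takes whichever reply comes first.

```python
import asyncio


async def query_replica(name, delay):
    # Hypothetical stand-in for sending a query to a replica and
    # awaiting its reply; `delay` simulates how long the replica takes.
    await asyncio.sleep(delay)
    return name


async def hedged_query(replicas, hedge_timeout):
    """replicas: list of (name, simulated_delay) in preference order.

    Returns the result of whichever replica replies first.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    pending = set()
    try:
        for name, delay in replicas:
            pending.add(asyncio.ensure_future(query_replica(name, delay)))
            # Give the replicas started so far a bounded time to reply
            # before also trying the next one.
            done, pending = await asyncio.wait(
                pending,
                timeout=hedge_timeout,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                return done.pop().result()
        # Every replica has been started; wait for the first reply.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        return done.pop().result()
    finally:
        # Abandon the requests that lost the race.
        for task in pending:
            task.cancel()
```

For example, `asyncio.run(hedged_query([("paused", 3600.0), ("healthy", 0.01)], hedge_timeout=0.05))` gives up waiting on the unresponsive replica after 0.05 seconds and returns the healthy replica's reply, instead of blocking until a receive timeout.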
Fixed in #19291.
@Avogar did you add that test?
Yes, I added a similar test: ClickHouse/tests/integration/test_hedged_requests/test.py Lines 86 to 105 in 2e7f756
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
...
Detailed description / Documentation draft:
...