Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distributed table hung - test draft showing problem #9900

Closed
wants to merge 3 commits into from

Conversation

filimonov
Copy link
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Other

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

...

Detailed description / Documentation draft:

...

By adding documentation, you'll allow users to try your new feature immediately, not when someone else will have time to document it later. Documentation is necessary for all features that affect user experience in any way. You can add brief documentation draft above, or add documentation right into your patch as Markdown files in docs folder.

If you are doing this for the first time, it's recommended to read the lightweight Contributing to ClickHouse Documentation guide first.

@blinkov blinkov added the pr-other Pull request with changes not fitting to other categories label Apr 1, 2020

result = instance_with_dist_table.query("SELECT hostName(), x FROM distributed ORDER BY hostName(), x SETTINGS load_balancing='in_order', prefer_localhost_replica=0")
assert TSV(result) == TSV('node_1_1\t1\nnode_2_2\t2\nnode_2_2\t3')
# сейчас: висит 5 минут (receive_timeout), остановить запрос с помощью ctrl+c или по max_execution_time - невозможно
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is expected -- you're pausing the container, so the server process is suspended. The kernel still accepts the connection, so the connection timeout doesn't fire. But the paused server can't reply to hello packet, hence the initiator waits for receive_timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the mechanics. But i'm not satisfied with the result.

It need to work better. We have another healthy replica.

Comment on lines +137 to +140
# ожидаемое поведение:
# убеждаемся что реплика не даёт никакого acknowkegment (можно какой-то доп райунтрип "ты жива? да",
# или ack после отправки запроса по типу "запроос принят, буду обрабатывать") в течение
# connect_timeout_with_failover_ms, и если ack нет - идём в другую (здоровую) реплику
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't help, because if the server can be suddenly paused before hello packet, it can also be paused after it, with the same result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heartbeats

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding more kinds of packets to the protocol doesn't help -- they are all subject to the same problem, heartbeat packets included. The problem is not in the protocol but in the fact that we work with replicas synchronously. Anyway, we have some understanding of what can be done about this case, AFAIR @tavplubix wanted to make a prototype but haven't got to it yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it should be async.

@alexey-milovidov alexey-milovidov changed the title distibuted table hung - test draft showing problem distributed table hung - test draft showing problem Sep 16, 2020
@alexey-milovidov alexey-milovidov added the st-hold We've paused the work on issue for some reason label Dec 20, 2020
@Avogar
Copy link
Member

Avogar commented Mar 3, 2021

Fixed in #19291.

@Avogar Avogar closed this Mar 3, 2021
@filimonov
Copy link
Contributor Author

@Avogar did you add that test?

@Avogar
Copy link
Member

Avogar commented Mar 12, 2021

@Avogar did you add that test?

Yes, I added a similar test:

def test_stuck_replica(started_cluster):
cluster.pause_container("node_1")
check_query(expected_replica="node_2")
result = NODES['node'].query("SELECT slowdowns_count FROM system.clusters WHERE cluster='test_cluster' and host_name='node_1'")
assert TSV(result) == TSV("1")
result = NODES['node'].query("SELECT hostName(), id FROM distributed ORDER BY id LIMIT 1");
assert TSV(result) == TSV("node_2\t0")
# Check that we didn't choose node_1 first again and slowdowns_count didn't increase.
result = NODES['node'].query("SELECT slowdowns_count FROM system.clusters WHERE cluster='test_cluster' and host_name='node_1'")
assert TSV(result) == TSV("1")
cluster.unpause_container("node_1")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-docs-needed pr-other Pull request with changes not fitting to other categories st-hold We've paused the work on issue for some reason
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants