[Tablets] Requests served imbalance after adding nodes to cluster #19107

Closed
2 tasks
soyacz opened this issue Jun 5, 2024 · 87 comments
Labels: area/drivers, area/elastic cloud, area/tablets, P1 Urgent, symptom/performance, triage/master

Comments

@soyacz
Contributor

soyacz commented Jun 5, 2024

Packages

Scylla version: 6.1.0~dev-20240528.519317dc5833 with build-id 75e8987548653166f5131039236650c1ead746f4

Kernel Version: 5.15.0-1062-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The test scenario covers scaling out a cluster from 3 nodes to 6, with the new nodes added in parallel.
While the nodes are being added, one node takes over most of the cluster load, while the requests served by the remaining nodes drop significantly.
Also, after the cluster has grown, requests served are still not balanced (before the grow, all nodes serve requests equally).
The test uses c-s with Java driver 3.11.5.2, which is tablet-aware.

Impact

Degraded performance

How frequently does it reproduce?

Reproduces in all tablets elasticity tests (write, read, mixed)

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-6 (3.91.25.250 | 10.12.1.135) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-5 (44.204.102.74 | 10.12.0.9) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-4 (3.232.133.68 | 10.12.2.98) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-3 (107.23.70.58 | 10.12.0.47) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-2 (34.232.109.63 | 10.12.0.113) (shards: 7)
  • perf-latency-grow-shrink-ubuntu-db-node-f417745e-1 (3.239.252.252 | 10.12.2.208) (shards: 7)

OS / Image: ami-0a070c0d6ef92b552 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-grow-shrink
Test id: f417745e-0067-4479-95ee-24c9182267ce
Test name: scylla-staging/lukasz/scylla-master-perf-regression-latency-650gb-grow-shrink
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor f417745e-0067-4479-95ee-24c9182267ce
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs f417745e-0067-4479-95ee-24c9182267ce

Logs:

Date Log type Link
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240530_213037.tar.gz
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240531_001921.tar.gz
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240531_002918.tar.gz
20190101_010101 prometheus https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/prometheus_snapshot_20240531_003853.tar.gz
20240530_212207 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_212207/grafana-screenshot-overview-20240530_212207-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_212207 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_212207/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240530_212338-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_222727 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_222727/grafana-screenshot-overview-20240530_222727-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_222727 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_222727/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240530_222812-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_230010 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_230010/grafana-screenshot-overview-20240530_230032-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240530_230010 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240530_230010/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240530_230110-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_001123 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_001123/grafana-screenshot-overview-20240531_001144-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_001123 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_001123/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240531_001222-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_002134 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_002134/grafana-screenshot-overview-20240531_002134-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_002134 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_002134/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240531_002219-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_003108 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003108/grafana-screenshot-overview-20240531_003108-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_003108 grafana https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003108/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240531_003153-perf-latency-grow-shrink-ubuntu-monitor-node-f417745e-1.png
20240531_003920 db-cluster https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/db-cluster-f417745e.tar.gz
20240531_003920 loader-set https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/loader-set-f417745e.tar.gz
20240531_003920 monitor-set https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/monitor-set-f417745e.tar.gz
20240531_003920 sct https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/sct-f417745e.log.tar.gz
20240531_003920 event https://cloudius-jenkins-test.s3.amazonaws.com/f417745e-0067-4479-95ee-24c9182267ce/20240531_003920/sct-runner-events-f417745e.tar.gz

Jenkins job URL
Argus

soyacz added the symptom/performance and triage/master labels on Jun 5, 2024
@michoecho
Contributor

Seems like a driver-side load balancing issue. What's the driver's load balancing policy?

@bhalevy
Member

bhalevy commented Jun 5, 2024

Which metric(s) are imbalanced?
coordinator or replica side?
reads, writes or both?
what is the consistency level?
I agree with @michoecho that this could be a load-balancing issue on the driver side.

@michoecho
Contributor

michoecho commented Jun 5, 2024

> Which metric(s) are imbalanced? coordinator or replica side? reads, writes or both? what is the consistency level? I agree with @michoecho that this could be a load-balancing issue on the driver side.

Coordinator work is unbalanced. Replica work is balanced. There are writes only. CL doesn't matter, since it's writes only.

The test starts with 3 nodes (7 shards on each node, 18 tablets replicated on each shard). Initially, work is perfectly balanced across coordinators.

Then, 3 nodes are bootstrapped in parallel. As soon as the bootstrap starts, the balance shatters — one of the 3 original nodes starts handling 80% of requests, while the other two handle 10% each. (I'm only showing the per-instance graph here, because shards within each instance are mostly symmetric, so the per-shard view isn't very interesting in this case).

The coordinators never return to balance after that, even after replica work is eventually balanced perfectly.

Also: shard awareness seems to break down thoroughly after the bootstrap (i.e. most requests are sent to the wrong shard), and it never recovers. However, node awareness works (i.e. all requests are sent to the right node). Edit: this part is wrong, see my later comment.

Hypothesis: tablet awareness load balancing in the java driver gets broken by tablet migrations.

image

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

> Seems like a driver-side load balancing issue. What's the driver's load balancing policy?

I don't know the answer to this; the cassandra-stress default is used.

@bhalevy
Member

bhalevy commented Jun 6, 2024

Summoning @piodul

@piodul
Contributor

piodul commented Jun 6, 2024

> Summoning @piodul

I have no idea about the java driver's implementation of support for tablets. I might be wrong, but AFAIK @Bouncheck implemented it and @Lorak-mmk was reviewing it, so they might have some ideas.

@Lorak-mmk
Contributor

> > Summoning @piodul
>
> I have no idea about the java driver's implementation of support for tablets. I might be wrong, but AFAIK @Bouncheck implemented it and @Lorak-mmk was reviewing it, so they might have some ideas.

I'm reviewing the implementation in Java Driver 4.x (btw, why isn't c-s using 4.x?); I didn't really look at the 3.x implementation.

@piodul
Contributor

piodul commented Jun 6, 2024

cc: @avelanarius are there other people from your team who could take a look at it?

@Lorak-mmk
Contributor

Lorak-mmk commented Jun 6, 2024

@soyacz would it be hard to run a similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13, which has tablet awareness.
This way we'd know whether it's a bug in the Java Driver 3.x implementation or something more universal (server bug / inherent problem with how tablet awareness works).

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

> @soyacz would it be hard to run a similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13, which has tablet awareness. This way we'd know whether it's a bug in the Java Driver 3.x implementation or something more universal (server bug / inherent problem with how tablet awareness works).

I can try; if throttling is supported, it should work without much effort (only the prepare stage may need to be adjusted so as not to overload the cluster). But first it would be good to update cql-stress to the newest release with the newest drivers, see scylladb/scylla-cluster-tests#7582

@Lorak-mmk
Contributor

> > @soyacz would it be hard to run a similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13, which has tablet awareness. This way we'd know whether it's a bug in the Java Driver 3.x implementation or something more universal (server bug / inherent problem with how tablet awareness works).
>
> I can try; if throttling is supported, it should work without much effort (only the prepare stage may need to be adjusted so as not to overload the cluster). But first it would be good to update cql-stress to the newest release with the newest drivers, see scylladb/scylla-cluster-tests#7582

I see in that issue that @fruch managed to build the new version. Is there anything blocking the update?

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

> > > @soyacz would it be hard to run a similar scenario using cql-stress's c-s mode? Current master of cql-stress uses Rust Driver 0.13, which has tablet awareness. This way we'd know whether it's a bug in the Java Driver 3.x implementation or something more universal (server bug / inherent problem with how tablet awareness works).
> >
> > I can try; if throttling is supported, it should work without much effort (only the prepare stage may need to be adjusted so as not to overload the cluster). But first it would be good to update cql-stress to the newest release with the newest drivers, see scylladb/scylla-cluster-tests#7582
>
> I see in that issue that @fruch managed to build the new version. Is there anything blocking the update?

trying it.

@soyacz
Contributor Author

soyacz commented Jun 6, 2024

Unfortunately, the test failed before growing the cluster, due to a problem parsing the cql-stress result. Issue created: scylladb/cql-stress#95

@soyacz
Contributor Author

soyacz commented Jun 7, 2024

I managed to execute the test with cql-stress as the stress tool (based on the Rust driver with tablet support), and it looks different, but still bad (still unbalanced). This is a write-only test:
image

Unfortunately, there is no email report with details. For more metrics: hydra investigate show-monitor 7634564e-35e4-4cff-bbbb-f8aab3dc4d03
logs:

| Date | Log type | Link |
+-----------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_081603.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_111725.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_111949.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/prometheus_snapshot_20240607_112232.tar.gz |
| 20240607_081402 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_081402/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_081433-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_091847 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_091847/grafana-screenshot-overview-20240607_091847-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_091847 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_091847/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_091953-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_093035 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_093035/grafana-screenshot-overview-20240607_093035-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_093035 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_093035/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_093141-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_103923 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_103923/grafana-screenshot-overview-20240607_103945-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_103923 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_103923/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_104051-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111449 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111449/grafana-screenshot-overview-20240607_111511-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111449 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111449/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_111617-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111735 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111735/grafana-screenshot-overview-20240607_111735-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_111735 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_111735/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_111841-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_112018 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112018/grafana-screenshot-overview-20240607_112018-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_112018 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112018/grafana-screenshot-scylla-master-perf-regression-latency-650gb-grow-shrink-scylla-per-server-metrics-nemesis-20240607_112124-perf-latency-grow-shrink-ubuntu-monitor-node-7634564e-1.png |
| 20240607_112256 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/db-cluster-7634564e.tar.gz |
| 20240607_112256 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/loader-set-7634564e.tar.gz |
| 20240607_112256 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/monitor-set-7634564e.tar.gz |
| 20240607_112256 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/sct-7634564e.log.tar.gz |
| 20240607_112256 | event | https://cloudius-jenkins-test.s3.amazonaws.com/7634564e-35e4-4cff-bbbb-f8aab3dc4d03/20240607_112256/sct-runner-events-7634564e.tar.gz |

@michoecho
Contributor

> I managed to execute the test with cql-stress as the stress tool (based on the Rust driver with tablet support), and it looks different, but still bad (still unbalanced). This is a write-only test:

In this case the problem is different.

image

As you can see, this time coordinator work is directly proportional to replica work — which means that this time load balancing works.

This time, the imbalance doesn't come from bad load balancing on the client side, but from bad balancing of tablets on the server. cql-stress apparently creates two tables instead of one — standard1 and counter1, cassandra-stress creates only one of those, depending on the test — and their mix on each shard is arbitrary. So there are some shards with a majority of standard1 tablets and some with a majority of counter1 tablets, but only standard1 is used, so replica work is unbalanced.
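
To make the effect concrete, here is a toy sketch in Python (not ScyllaDB code). The shard names, the tablet counts, and the `replica_load` helper are all made up for illustration; the only thing taken from the explanation above is the assumption that the balancer evens out the number of tablets per shard regardless of which table they belong to.

```python
# Toy model: every shard holds the same NUMBER of tablets, but the mix of
# tables differs. All numbers below are invented for illustration.
shards = {
    "shard0": {"standard1": 5, "counter1": 1},
    "shard1": {"standard1": 3, "counter1": 3},
    "shard2": {"standard1": 1, "counter1": 5},
}


def replica_load(shards, active_table, requests_per_tablet=100):
    """Replica work per shard when only `active_table` receives traffic."""
    return {
        shard: tables.get(active_table, 0) * requests_per_tablet
        for shard, tables in shards.items()
    }


# What a count-based balancer sees: perfectly even.
tablet_counts = {shard: sum(tables.values()) for shard, tables in shards.items()}

# What the workload produces: proportional to standard1 tablets only.
load = replica_load(shards, "standard1")
```

In this sketch `tablet_counts` is identical for all three shards, while `load` differs by a factor of five between `shard0` and `shard2`, which is the kind of replica-side skew described above.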

Note that there is still a high rate of cross-shard ops. I've said earlier that "shard awareness seems to break down", but I've just realized that this isn't true — it's a server-side issue. Shards only communicate with their siblings on other nodes. With vnodes, a replica set for a given token range is always replicated on a set of sibling shards, so shards can send replica requests to their siblings directly, and there are no cross-shard ops. With tablets, there is no such property — on different nodes, the same tablet will be replicated on shards with different numbers, so cross-shard ops are unavoidable.
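
The sibling-shard argument can be sketched as a small Python toy model. This is only an illustration of the reasoning above, not ScyllaDB's actual routing code: the node names, shard numbers, and the `cross_shard_hops` helper are hypothetical, and the model assumes that a replica request always arrives at the shard with the same number as the coordinator's shard.

```python
def cross_shard_hops(coordinator_shard, replica_shards):
    """Count replicas whose owning shard differs from the coordinator's shard.

    `replica_shards` maps node name -> shard number that owns the data there.
    A request lands on the sibling shard (same number) and needs an extra
    hop on that node whenever the owner is a different shard.
    """
    return sum(1 for shard in replica_shards.values() if shard != coordinator_shard)


# vnodes (assumed): the token-to-shard mapping is the same on every node,
# so the owning shard has the same number everywhere.
vnode_replicas = {"node1": 4, "node2": 4, "node3": 4}

# tablets (assumed): each node places its replica of the tablet on whichever
# shard the balancer picked, independently of the other nodes.
tablet_replicas = {"node1": 4, "node2": 0, "node3": 6}

vnode_hops = cross_shard_hops(4, vnode_replicas)
tablet_hops = cross_shard_hops(4, tablet_replicas)
```

With these made-up placements, the vnode case needs no extra hops, while the tablet case needs one on each node where the replica lives on a differently numbered shard.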

@michoecho
Contributor

michoecho commented Jun 7, 2024

So, to sum up: with cql-stress, the results look (to me) as expected. So it would appear that the problem is with the java driver. (And the problem is probably just with load balancing. I was wrong earlier about requests being sent to non-replica shards).

@Lorak-mmk
Contributor

> So, to sum up: with cql-stress, the results look (to me) as expected. So it would appear that the problem is with the java driver. (And the problem is probably just with load balancing. I was wrong earlier about requests being sent to non-replica shards).

In that case @Bouncheck will be the best person to investigate this

@roydahan

> So, to sum up: with cql-stress, the results look (to me) as expected.

@michoecho regarding cql-stress and the counter table, something doesn't add up.
AFAIU from @muzarski, the counters table was created but there are no writes/reads to it.
Furthermore, the counters table isn't supported with tablets, so according to @avikivity it should have been rejected and not created at all.

@michoecho
Contributor

> @michoecho regarding cql-stress and the counter table, something doesn't add up.
> AFAIU from @muzarski, the counters table was created but there are no writes/reads to it.

Correct. What about this doesn't add up? Tablet load balancer doesn't care about traffic, only the number of tablets.

> Furthermore, the counters table isn't supported with tablets, so according to @avikivity it should have been rejected and not created at all.

Then apparently the codebase didn't get the memo, because they aren't rejected.

@bhalevy
Member

bhalevy commented Jun 20, 2024

> > @michoecho regarding cql-stress and the counter table, something doesn't add up.
> > AFAIU from @muzarski, the counters table was created but there are no writes/reads to it.
>
> Correct. What about this doesn't add up? Tablet load balancer doesn't care about traffic, only the number of tablets.
>
> > Furthermore, the counters table isn't supported with tablets, so according to @avikivity it should have been rejected and not created at all.
>
> Then apparently the codebase didn't get the memo, because they aren't rejected.

issue number?

@michoecho
Contributor

> > > Furthermore, the counters table isn't supported with tablets, so according to @avikivity it should have been rejected and not created at all.
> >
> > Then apparently the codebase didn't get the memo, because they aren't rejected.
>
> issue number?

Are you asking whether there is an existing ticket for this, or are you asking me to create one?

Opened #19449.

@bhalevy
Member

bhalevy commented Jun 24, 2024

> > > > Furthermore, the counters table isn't supported with tablets, so according to @avikivity it should have been rejected and not created at all.
> > >
> > > Then apparently the codebase didn't get the memo, because they aren't rejected.
> >
> > issue number?
>
> Are you asking whether there is an existing ticket for this, or are you asking me to create one?

Either...

> Opened #19449.

Thanks!

@fee-mendes
Member

I've been playing with cql-stress (I manually patched it myself to stop creating counter tables) and I get a similar issue as described here. While scaling or downscaling a cluster, I observe the driver/stressor emit the following log lines:

2024-06-20T22:33:46.136372Z  WARN scylla::transport::topology: Failed to establish control connection and fetch metadata on all known peers. Falling back to initial contact points.
2024-06-20T22:33:46.136430Z  WARN scylla::transport::topology: Failed to fetch metadata using current control connection control_connection_address="172.31.16.15:9042" error=Protocol Error: system.peers or system.local has invalid column type
2024-06-20T22:33:46.146289Z ERROR scylla::transport::topology: Could not fetch metadata error=Protocol Error: system.peers or system.local has invalid column type

What happens afterwards is that throughput basically tanks. A naive user will then try restarting the client to see if it "helps"; this actually worsens the situation, because the driver will now only connect to the initial contact point. For example:

image

If anyone would like more metrics I can easily reproduce it and make it available somewhere.

@Lorak-mmk
Contributor

> I've been playing with cql-stress (I manually patched it myself to stop creating counter tables) and I get a similar issue as described here. While scaling or downscaling a cluster, I observe the driver/stressor emit the following log lines:
>
> 2024-06-20T22:33:46.136372Z  WARN scylla::transport::topology: Failed to establish control connection and fetch metadata on all known peers. Falling back to initial contact points.
> 2024-06-20T22:33:46.136430Z  WARN scylla::transport::topology: Failed to fetch metadata using current control connection control_connection_address="172.31.16.15:9042" error=Protocol Error: system.peers or system.local has invalid column type
> 2024-06-20T22:33:46.146289Z ERROR scylla::transport::topology: Could not fetch metadata error=Protocol Error: system.peers or system.local has invalid column type
>
> What happens afterwards is that throughput basically tanks. A naive user will then try restarting the client to see if it "helps"; this actually worsens the situation, because the driver will now only connect to the initial contact point. For example:
> image
>
> If anyone would like more metrics I can easily reproduce it and make it available somewhere.

Could you describe how I can reproduce it myself? I'd like to debug this from the driver side.

@fee-mendes
Member

> Could you describe how I can reproduce it myself? I'd like to debug this from the driver side.

I run the following load (starting with a small 3-node i4i.xlarge cluster):

  1. Ingest whatever:
cargo run --release --bin cql-stress-cassandra-stress -- write n=100M cl=local_quorum keysize=100 -col n=5 size='FIXED(200)' -mode cql3 -rate throttle=120000/s threads=8 -pop seq=1..100M -node 172.31.16.15
  2. Run a mixed workload from (1):
cargo run --release --bin cql-stress-cassandra-stress -- mixed duration=6h cl=local_quorum keysize=100 'ratio(read=8,write=2)' -col n=5 size='FIXED(200)' -mode cql3 -rate throttle=120000/s threads=32 -pop seq=1..1M -node 172.31.16.15
  3. Add 3 nodes to the cluster:
---
- name: Double cluster-size
  hosts: double_cluster
  become: True

  tasks: 
    - name: Start ScyllaDB Service
      ansible.builtin.systemd_service:
        name: scylla-server.service
        state: started

    - name: Waiting for CQL port readiness
      wait_for:
        port: 9042
        host: 127.0.0.1
        connect_timeout: 3
        delay: 3
        sleep: 10
        timeout: 1200
        state: present

Provided you ingested enough data in (1), you'll see the problem shortly after you run (3) above, and throughput and clients won't recover until the full tablet migration is complete. As soon as you see the warning/error in the logs, restart the client: you will notice that the driver only routes traffic to the contact point you specified on the command line.

@Lorak-mmk
Contributor

> Drivers are supposed to occasionally "forget" the tablet mapping in order to get a fresh one.

They don't forget the whole mapping. When they send a statement to the wrong node, they get back a payload with the correct tablet for that statement. The driver then removes from its local mapping any tablets that overlap with the newly received one, and inserts the newly received one.
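
A minimal Python sketch of that update step, under stated assumptions: the `Tablet` type, the `apply_tablet_hint` function, the replica strings, and the convention that a tablet covers the token range `(first_token, last_token]` are all hypothetical names chosen for illustration, not the API of any ScyllaDB driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tablet:
    first_token: int          # exclusive lower bound (assumed convention)
    last_token: int           # inclusive upper bound (assumed convention)
    replicas: tuple = ()      # e.g. ("node1:shard3", ...), illustrative only


def overlaps(a, b):
    """True if the half-open ranges (first, last] of a and b intersect."""
    return a.first_token < b.last_token and b.first_token < a.last_token


def apply_tablet_hint(tablets, hint):
    """Drop every cached tablet that overlaps `hint`, then insert `hint`.

    Returns a new list sorted by first_token; the input list is not modified.
    Tablets that do not overlap the hint are kept untouched.
    """
    kept = [t for t in tablets if not overlaps(t, hint)]
    kept.append(hint)
    kept.sort(key=lambda t: t.first_token)
    return kept


cached = [
    Tablet(0, 100, ("node1:0",)),
    Tablet(100, 200, ("node2:1",)),
    Tablet(200, 300, ("node3:2",)),
]
# Hint received after a request was routed to the wrong node.
updated = apply_tablet_hint(cached, Tablet(50, 150, ("node4:3",)))
```

In this example the hint overlaps the first two cached tablets, so both are dropped and only the hint plus the untouched `(200, 300]` tablet remain; the token ranges that are no longer covered would presumably be filled in by later hints.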

@tgrabiec
Contributor

tgrabiec commented Jun 27, 2024 via email

@mykaul
Contributor

mykaul commented Oct 6, 2024

@dimakr - can you please ensure there's nothing to do here in any of the drivers?

@dimakr

dimakr commented Oct 6, 2024

I assume the question is for @dkropachev

@mykaul
Contributor

mykaul commented Nov 4, 2024

> I assume the question is for @dkropachev

@dkropachev ?

@Bouncheck

I'll prioritize investigating the java driver (3.x) side now

@dkropachev
Contributor

dkropachev commented Nov 4, 2024

> @dimakr - can you please ensure there's nothing to do here in any of the drivers?

There is definitely a problem on the java-driver 3.x side with imbalanced load after nodes are added.
The Rust driver problem was fixed here: scylladb/scylla-rust-driver#1023

@Bouncheck

Java-driver tablets seem to be fine but there's a different problem at the same time. Root cause of requests imbalance problem is that cassandra-stress uses TokenAwarePolicy with ReplicaOrdering.NEUTRAL. The default for java-driver 3.x and older versions of cassandra-stress is ReplicaOrdering.RANDOM. This ordering is what is doing the shuffling for the tablet replicas (and vnode token map replicas too, I assume). So in this test, the LBP is just not doing the shuffling because that's how it's set up in c-s code.

TokenAwarePolicy is usually wrapped around another policy. Default for java-driver 3.x is DCAwareRoundRobinPolicy and for c-s it depends on the settings. ReplicaOrdering was changed around a year ago to NEUTRAL in c-s to make driver respect ordering of RackAwareRoundRobinPolicy in case we use it. This neutral ordering now applies in all cases (Rack, Dc, etc). With neutral we do not do the shuffling, leading to imbalance.
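
The effect of skipping the shuffle can be illustrated with a toy simulation in Python. This is not the Java driver's code: the `TABLET_REPLICAS` table, the `first_host` and `simulate` functions, and the node names are invented, and the model assumes that without shuffling the replica list for a tablet is always tried in the same stored order.

```python
import random
from collections import Counter

# Hypothetical replica lists per tablet (RF=3 on three nodes). The order in
# which replicas happen to be stored is skewed towards node1 on purpose.
TABLET_REPLICAS = {
    0: ["node1", "node2", "node3"],
    1: ["node1", "node3", "node2"],
    2: ["node2", "node1", "node3"],
    3: ["node1", "node2", "node3"],
}


def first_host(replicas, ordering, rng):
    """Return the host tried first (the coordinator) for one request."""
    plan = list(replicas)
    if ordering == "RANDOM":
        rng.shuffle(plan)   # shuffle replicas for every request
    # "NEUTRAL" in this toy model: keep the stored order unchanged
    return plan[0]


def simulate(ordering, requests=12000, seed=0):
    """Spread requests evenly over tablets and count chosen coordinators."""
    rng = random.Random(seed)
    counts = Counter()
    for i in range(requests):
        replicas = TABLET_REPLICAS[i % len(TABLET_REPLICAS)]
        counts[first_host(replicas, ordering, rng)] += 1
    return counts


neutral = simulate("NEUTRAL")
shuffled = simulate("RANDOM")
```

In this model, `neutral` sends three quarters of the requests to `node1` and none to `node3`, because coordinator choice simply mirrors whichever replica is listed first, whereas `shuffled` comes out close to one third per node.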

I'm still running some tests, but after reverting back to random ordering the requests served look normal (this is with java-driver 3.x master):
Image

Before I make a PR with the fix I need to look at what was the problem with RackAwareRoundRobinPolicy and make it so that it will work too after reverting (seems it's not compatible with TokenAwarePolicy). Or maybe I'll come up with a different fix.

mykaul added the area/drivers label on Nov 18, 2024
@dkropachev
Contributor

So, we can close it: it was a misconfiguration on the c-s side that led to this. Does anyone have a reason to keep it open?

@roydahan

Do we have a version of c-s that "fixes" it?

@dkropachev
Contributor

> Do we have a version of c-s that "fixes" it?

Not yet. This change was made in c-s to fix a problem with the RackAware policy; we are evaluating that fix and the original problem to understand what a proper fix would be.

@Bouncheck

I've created scylladb/cassandra-stress#32 to switch back to RANDOM ordering where we can right now.

@roydahan

IIUC the comments in scylladb/scylla-cluster-tests#9296 the fix for c-s solves the imbalance but doesn't improve or fix the latency issue (https://github.com/scylladb/scylla-enterprise/issues/4504#issuecomment-2495859043)

@michoecho
Contributor

> IIUC the comments in scylladb/scylla-cluster-tests#9296 the fix for c-s solves the imbalance but doesn't improve or fix the latency issue (scylladb/scylla-enterprise#4504 (comment))

@roydahan This thread doesn't say anything about latency, though.

Since this thread is about the coordinator imbalance, which has been addressed, we can probably close it.

@roydahan

Yes, I agree.
Just wanted to clarify that, because there was an assumption/hope that closing this one would close all the other performance issues.

@bhalevy
Member

bhalevy commented Nov 26, 2024

@dkropachev is there an ETA for fixing this issue?

@roydahan

It's already fixed in the latest c-s release.
