Metadata queries with configurable server-side timeouts #1171

wprzytula · 2025-01-17T09:07:28Z

Motivation

Some users tighten their server-side timeouts. On clusters with a large schema, however, this sometimes makes schema queries time out. This PR exempts metadata queries from the default server-side timeout by overriding it with a custom one, so that fetching metadata does not time out.

What's done

Adds USING TIMEOUT clause to metadata queries when applicable.

By metadata I mean schema + topology. I did it for both for consistency, even though only schema was required in the issue.

Note: this is purposefully opened against branch-0.15.x, as it has been considered an urgent fix to real user problems. We're going to release a minor (0.15.2) with this. After it's accepted and #1166 is done, I'll port this to main.

Implementation

To ensure that no fetches are accidentally omitted from having the timeout added, both now and in the future, I created a new abstraction: ControlConnection, which is just a wrapper over Arc<Connection> that exposes some methods of Arc<Connection> (query_iter, execute_iter, prepare, and some minor getters), taking care of adding the timeout before execution if applicable.

A corresponding setting is added to SessionConfig, and SessionBuilder gets a new method for configuring it.
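
For illustration, here is a minimal sketch of the idea described above: a wrapper that appends the timeout clause only when one is configured and the target node is ScyllaDB. The type, field and method names (`ControlConnectionSketch`, `is_scylladb`, `overridden_timeout`, `with_timeout`) are hypothetical and are not the driver's actual API.

```rust
use std::fmt::Write as _;
use std::time::Duration;

/// Hypothetical stand-in for the PR's `ControlConnection`; names are
/// illustrative only and do not mirror the driver's real types.
struct ControlConnectionSketch {
    /// Whether the node on the other end is ScyllaDB. Cassandra does not
    /// get the clause appended.
    is_scylladb: bool,
    /// Custom server-side timeout; `None` leaves the query untouched.
    overridden_timeout: Option<Duration>,
}

impl ControlConnectionSketch {
    /// Returns the statement text that would be sent to the node.
    fn with_timeout(&self, query: &str) -> String {
        let mut contents = query.to_owned();
        if let (true, Some(timeout)) = (self.is_scylladb, self.overridden_timeout) {
            // Writing to a `String` cannot fail, so `unwrap` is safe here.
            write!(contents, " USING TIMEOUT {}ms", timeout.as_millis()).unwrap();
        }
        contents
    }
}
```

Because every metadata fetch has to go through the wrapper, a new fetch added later cannot skip the clause by accident.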

Testing

I added two tests:

- A unit test of ControlConnection, which asserts that for ScyllaDB the timeout is enforced (if set) and for Cassandra it is always ignored, in the following cases:
  - when explicitly disabled (no custom timeout follows),
  - when explicitly set to some value (only set for ScyllaDB).
- An integration test for the driver, which asserts that for ScyllaDB the timeout is enforced (if set) and for Cassandra it is always ignored, in the following cases:
  - when explicitly disabled (no custom timeout follows),
  - when explicitly set to some value (only set for ScyllaDB),
  - when left as the implicit default (only set for ScyllaDB).

Fixes: #1052

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

github-actions · 2025-01-17T09:13:14Z

cargo semver-checks found no API-breaking changes in this PR! 🎉🥳
Checked commit: 0a53197

wprzytula · 2025-01-17T09:16:24Z

As you can see, our SSL and Authenticate workflows fail, because the ScyllaDB run there is so old that it does not recognize USING TIMEOUT syntax...:

scylladb/scylla-passauth
•Updated almost 4 years ago
scylladb/scylla-tls
•Updated almost 4 years ago

I guess we urgently need to create new images, to really test that the driver is compatible with those features in the new Scyllas, not 4-years-old ones...

cc @dkropachev

dkropachev · 2025-01-18T02:18:26Z

As you can see, our SSL and Authenticate workflows fail, because the ScyllaDB run there is so old that it does not recognize USING TIMEOUT syntax...:

scylladb/scylla-passauth
•Updated almost 4 years ago

scylladb/scylla-tls
•Updated almost 4 years ago

I guess we urgently need to create new images, to really test that the driver is compatible with those features in the new Scyllas, not 4-years-old ones...

cc @dkropachev

From what I see, these are regular images, based off scylladb/scylla/4.3.rc0 with some patches to /etc/scylla on top.
What I recommend doing: create the configuration you need in /etc/scylla/, put it in the repo, mount it at /etc/scylla, and use scylladb/scylla as the image.

dkropachev · 2025-01-18T04:32:09Z

As you can see, our SSL and Authenticate workflows fail, because the ScyllaDB run there is so old that it does not recognize USING TIMEOUT syntax...:

scylladb/scylla-passauth
•Updated almost 4 years ago

scylladb/scylla-tls
•Updated almost 4 years ago

I guess we urgently need to create new images, to really test that the driver is compatible with those features in the new Scyllas, not 4-years-old ones...
cc @dkropachev

From what I see, these are regular images, based off scylladb/scylla/4.3.rc0 with some patches to /etc/scylla on top. What I recommend doing: create the configuration you need in /etc/scylla/, put it in the repo, mount it at /etc/scylla, and use scylladb/scylla as the image.

Done up here

muzarski

LGTM. Nice piece of code - especially the test cases

Lorak-mmk · 2025-01-20T15:40:54Z

scylla/src/transport/topology.rs

+        /// Tests that ControlConnection enforces the provided custom timeout
+        /// iff ScyllaDB is the target node (else ignores the custom timeout).
+        #[cfg(not(scylla_cloud_tests))]
+        #[tokio::test]
+        #[ntest::timeout(2000)]
+        async fn test_custom_timeouts() {
+            setup_tracing();
+
+            let proxy_addr = SocketAddr::new(scylla_proxy::get_exclusive_local_address(), 9042);
+            let uri = std::env::var("SCYLLA_URI").unwrap_or_else(|_| "127.0.0.1:9042".to_string());
+            let node_addr: SocketAddr = resolve_hostname(&uri).await;
+


❓ This test requires a running Scylla / C* cluster, right? Is it possible to avoid this?

wprzytula · 2025-01-26

It is possible to avoid this by using the proxy in the dry mode. We could mock the whole CQL handshake (OPTIONS -> SUPPORTED -> STARTUP -> READY -> REGISTER -> READY) and then intercept QUERYs and PREPAREs as before. An additional gain would be the ability to test multiple scenarios at once, i.e. sharded and non-sharded endpoints, independently of the actual cluster deployed for tests.

wprzytula · 2025-01-26

I've made the test use the dry mode to avoid dependency on a running cluster.
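
The mocked handshake from this thread can be pictured as a small request-to-action mapping. This is only a sketch under simplifying assumptions: the enums and the `mock_action` function are hypothetical and do not reflect the `scylla_proxy` API.

```rust
/// Request frame kinds relevant here; a hypothetical, simplified enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Request {
    Options,
    Startup,
    Register,
    Query,
    Prepare,
}

/// Response frame kinds forged by the mock during the handshake.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Response {
    Supported,
    Ready,
}

/// What the mock does with an incoming request when no real node exists.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MockAction {
    /// Reply with a forged response frame of the given kind.
    Reply(Response),
    /// Hand the frame to the test, e.g. to check for `USING TIMEOUT`.
    Intercept,
}

/// Encodes OPTIONS -> SUPPORTED, STARTUP -> READY, REGISTER -> READY,
/// and interception of QUERY and PREPARE frames.
fn mock_action(request: Request) -> MockAction {
    match request {
        Request::Options => MockAction::Reply(Response::Supported),
        Request::Startup | Request::Register => MockAction::Reply(Response::Ready),
        Request::Query | Request::Prepare => MockAction::Intercept,
    }
}
```

Since the SUPPORTED frame is forged as well, the same test can present itself as either a sharded or a non-sharded endpoint.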

Lorak-mmk · 2025-01-20T15:56:00Z

scylla/src/transport/topology.rs

+impl ControlConnection {
+    async fn query_metadata(
+        self,
+        connect_port: u16,
+        keyspace_to_fetch: &[String],
+        fetch_schema: bool,
+    ) -> Result<Metadata, QueryError> {
+        let peers_query = self.clone().query_peers(connect_port);
+        let keyspaces_query = self.query_keyspaces(keyspace_to_fetch, fetch_schema);


Commit: "topology: make metadata fetchers methods on ControlConnection"

🔧 For me it is extremely weird and unintuitive to have multiple impl blocks for a struct (ControlConnection), scattered in various places of 2 modules (control_connection and topology).
Can we please put all the impl ControlConnection into the control_connection module, and preferably make them a single impl block?

wprzytula · 2025-01-23

https://www.reddit.com/r/rust/comments/w5l320/is_having_multiple_impl_blocks_idiomatic/

It's a perfectly valid way of logical separation. In this particular case, we separate metadata-related functionalities (defined in the impl block in the main metadata.rs module) from the lower-level functionalities (defined in the impl block in the control_connection module). Also, what we gain is that methods defined outside the control_connection module can't access the module-private fields and methods of ControlConnection.
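
The privacy argument can be shown with a small, self-contained sketch. The module layout mirrors the discussion, but the field and method names (`timeout_ms`, `decorate`, `peers_query`) are made up for illustration.

```rust
mod control_connection {
    /// Hypothetical wrapper; the private field is visible only in this module.
    pub struct ControlConnection {
        timeout_ms: Option<u64>,
    }

    // Lower-level functionality lives next to the private state.
    impl ControlConnection {
        pub fn new(timeout_ms: Option<u64>) -> Self {
            Self { timeout_ms }
        }

        pub fn decorate(&self, query: &str) -> String {
            match self.timeout_ms {
                Some(ms) => format!("{query} USING TIMEOUT {ms}ms"),
                None => query.to_owned(),
            }
        }
    }
}

use control_connection::ControlConnection;

// A second impl block in the outer module, for higher-level metadata logic.
// It can only use the public methods: `self.timeout_ms` would not compile here.
impl ControlConnection {
    pub fn peers_query(&self) -> String {
        self.decorate("SELECT peer FROM system.peers")
    }
}
```

The outer `impl` block cannot bypass `decorate`, so the higher-level fetchers have no way to build a query that skips the timeout handling.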

Lorak-mmk · 2025-01-20T15:57:53Z

scylla/src/transport/topology.rs

+impl ControlConnection {
+    async fn query_metadata(
+        self,
+        connect_port: u16,
+        keyspace_to_fetch: &[String],
+        fetch_schema: bool,
+    ) -> Result<Metadata, QueryError> {
+        let peers_query = self.clone().query_peers(connect_port);


🌱 When porting the PR to the main branch, after applying my previous comment about ControlConnection impls, I think it would be a good idea to extract control_connection module to a separate file.

wprzytula · 2025-01-26

I think ControlConnection could be extracted to a separate file, but with those higher-level methods (query_peers etc.) left in the metadata.rs file. The reason is that ControlConnection itself is a medium, which should be logically separate from the logic performing queries using it.

wprzytula · 2025-01-26T11:54:31Z

v1.1:
- resolved comments,
- made the unit test not require a running cluster.

The test asserts that for ScyllaDB the timeout is enforced (if set) and for Cassandra it is always ignored, in the following cases:
- when explicitly disabled (no custom timeout follows),
- when explicitly set to some value (only set for ScyllaDB).

The timeout is now stored in SessionConfig and passed down to the MetadataReader, which sets up all its `ControlConnection`s with that timeout.

The test asserts that for ScyllaDB the timeout is enforced (if set) and for Cassandra it is always ignored, in the following cases:
- when explicitly disabled (no custom timeout follows),
- when explicitly set to some value (only set for ScyllaDB),
- when left as the implicit default (only set for ScyllaDB).

roydahan · 2025-01-27T13:21:48Z

There is no point in merging this into the 0.15 branch; we're not going to make another 0.15 release.

appease new clippy requirement

0887454

It's in legacy serialization testing code, so it's going to be removed soon anyway. (cherry picked from commit 0eb98bc)

wprzytula self-assigned this Jan 17, 2025

wprzytula requested review from Lorak-mmk and muzarski January 17, 2025 09:07

wprzytula force-pushed the schema-query-increased-timeouts branch from 3271538 to a5adae0 Compare January 17, 2025 09:17

wprzytula added the area/metadata label Jan 17, 2025

muzarski approved these changes Jan 20, 2025

Lorak-mmk requested changes Jan 20, 2025

wprzytula added 6 commits January 26, 2025 12:33

iterator: derive Debug on QueryPager

65200e6

Without this derive, one cannot call unwrap_err() on query_iter()'s result.
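
The reason is a bound in the standard library: `Result<T, E>::unwrap_err` requires `T: Debug`, because it prints the unexpected `Ok` value when it panics. A minimal sketch, with `PagerSketch` as a hypothetical stand-in for `QueryPager`:

```rust
/// Hypothetical stand-in for `QueryPager`. Without this derive, the
/// `unwrap_err()` call below does not compile, since `Result::unwrap_err`
/// requires the `Ok` type to implement `Debug`.
#[derive(Debug)]
struct PagerSketch;

/// Simulates a `query_iter()` call that fails.
fn failing_query_iter() -> Result<PagerSketch, String> {
    Err("timed out".to_owned())
}

/// Extracts the error, as a test asserting on a failure would do.
fn error_of_failing_query() -> String {
    failing_query_iter().unwrap_err()
}
```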

topology: introduce ControlConnection

ad24e15

topology: make metadata fetchers methods on ControlConnection

01b50d7

topology: customise ControlConnection's request timeout

6bf6036

sharding: refactor ShardInfo try_from options

3428c9b

This makes the code more rusty and leverages the type system. Also, it extracts the constants for shared use in the next commit.

routing: implement ShardInfo::add_to_options for tests

4e2bf40

Tests need this function to mock ScyllaDB's SUPPORTED frames. For use with the proxy, especially in the dry mode.
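
As a rough sketch of what such a pair of functions could look like: the struct below is hypothetical and simplified, and only the option keys are ScyllaDB's sharding entries as they appear in SUPPORTED frames. The driver's real `ShardInfo` may differ.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

// Option keys shared by the parsing and the mocking code.
const SHARD_ENTRY: &str = "SCYLLA_SHARD";
const NR_SHARDS_ENTRY: &str = "SCYLLA_NR_SHARDS";
const MSB_IGNORE_ENTRY: &str = "SCYLLA_SHARDING_IGNORE_MSB";

/// Options of a SUPPORTED frame: a string multimap.
type Options = HashMap<String, Vec<String>>;

/// Hypothetical, simplified counterpart of the driver's `ShardInfo`.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ShardInfoSketch {
    shard: u16,
    nr_shards: u16,
    msb_ignore: u8,
}

impl ShardInfoSketch {
    /// Writes the sharding entries, so a test can forge a ScyllaDB-like frame.
    fn add_to_options(&self, options: &mut Options) {
        for (key, value) in [
            (SHARD_ENTRY, self.shard.to_string()),
            (NR_SHARDS_ENTRY, self.nr_shards.to_string()),
            (MSB_IGNORE_ENTRY, self.msb_ignore.to_string()),
        ] {
            options.insert(key.to_owned(), vec![value]);
        }
    }
}

impl TryFrom<&Options> for ShardInfoSketch {
    type Error = String;

    /// Reads the entries back; fails if any is missing or not a number.
    fn try_from(options: &Options) -> Result<Self, Self::Error> {
        fn first<'a>(options: &'a Options, key: &str) -> Result<&'a str, String> {
            options
                .get(key)
                .and_then(|values| values.first())
                .map(String::as_str)
                .ok_or_else(|| format!("missing {key}"))
        }
        let parse_err = |e: std::num::ParseIntError| e.to_string();
        Ok(Self {
            shard: first(options, SHARD_ENTRY)?.parse::<u16>().map_err(parse_err)?,
            nr_shards: first(options, NR_SHARDS_ENTRY)?.parse::<u16>().map_err(parse_err)?,
            msb_ignore: first(options, MSB_IGNORE_ENTRY)?.parse::<u8>().map_err(parse_err)?,
        })
    }
}
```

In this sketch, an options map without the sharding entries fails to parse, which is how a forged Cassandra-like frame would look.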

wprzytula force-pushed the schema-query-increased-timeouts branch from a5adae0 to 6ab1f65 Compare January 26, 2025 11:53

wprzytula requested review from Lorak-mmk and muzarski January 26, 2025 11:54

wprzytula added 4 commits January 26, 2025 13:23

session: add metadata request timeout to SessionConfig

c92854c

The timeout is now stored in SessionConfig and passed down to the MetadataReader, which sets up all its `ControlConnection`s with that timeout.
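
A minimal sketch of that propagation, assuming simplified stand-in types. All struct, field and method names here are hypothetical, not the driver's definitions.

```rust
use std::time::Duration;

/// Hypothetical counterpart of the setting added to `SessionConfig`.
struct SessionConfigSketch {
    metadata_request_timeout: Option<Duration>,
}

/// Hypothetical control connection that remembers its timeout.
struct ControlConnectionSketch {
    timeout: Option<Duration>,
}

/// Hypothetical metadata reader: it keeps the timeout so that every control
/// connection it opens, including replacements, gets the same value.
struct MetadataReaderSketch {
    timeout: Option<Duration>,
    control_connection: ControlConnectionSketch,
}

impl MetadataReaderSketch {
    fn new(config: &SessionConfigSketch) -> Self {
        let timeout = config.metadata_request_timeout;
        Self {
            timeout,
            control_connection: ControlConnectionSketch { timeout },
        }
    }

    /// Replaces the control connection, e.g. after the previous one broke.
    fn reopen_control_connection(&mut self) {
        self.control_connection = ControlConnectionSketch {
            timeout: self.timeout,
        };
    }
}
```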

session_builder: add metadata request timeout setting

72e25ed

wprzytula force-pushed the schema-query-increased-timeouts branch from 6ab1f65 to 0a53197 Compare January 26, 2025 12:24
