Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata queries with configurable server-side timeouts #1171

Open
wants to merge 11 commits into
base: branch-0.15.x
Choose a base branch
from

Conversation

wprzytula
Copy link
Collaborator

@wprzytula wprzytula commented Jan 17, 2025

Motivation

Some users tune their server-side timeouts so that they are tighter. For clusters with a large schema, however, this sometimes made schema queries time out. This PR excludes metadata queries from the default server-side timeout by overriding it with a custom one, to prevent timeouts upon querying metadata.

What's done

Adds USING TIMEOUT clause to metadata queries when applicable.

By metadata I mean schema + topology. I did it for both for consistency, even though only schema was required in the issue).

Note: this is purposefully opened against branch-0.15.x, as it has been considered an urgent fix to real user problems. We're going to release a minor (0.15.2) with this. After it's accepted and #1166 is done, I'll port this to main.

Implementation

To ensure that no fetches are accidentally omitted from having the timeout added, both now and in the future, I created a new abstraction: ControlConnection, which is just a wrapper over Arc<Connection> that exposes some methods of Arc<Connection> (query_iter, execute_iter, prepare, and some minor getters), taking care of adding the timeout before execution if applicable.

A corresponding setting is added to SessionConfig, and SessionBuilder gets a new method for configuring it.

Testing

I added two tests:

  1. A unit test of ControlConnection, which asserts that for ScyllaDB the timeout is enforced (if set) and for Cassandra is always ignored, in the following cases:
    • when explicitly disabled (no custom timeout follows),
    • when explicitly set to some (only set for ScyllaDB).
  2. An integration test for the driver, which asserts that for ScyllaDB the timeout is enforced (if set) and for Cassandra is always ignored, in the following cases:
    • when explicitly disabled (no custom timeout follows),
    • when explicitly set to some (only set for ScyllaDB),
    • when left as implicit default (only set to ScyllaDB).

Fixes: #1052

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

It's in legacy serialization testing code, so it's going to be removed
soon anyway.

(cherry picked from commit 0eb98bc)
@wprzytula wprzytula self-assigned this Jan 17, 2025
Copy link

github-actions bot commented Jan 17, 2025

cargo semver-checks found no API-breaking changes in this PR! 🎉🥳
Checked commit: 0a53197

@wprzytula
Copy link
Collaborator Author

wprzytula commented Jan 17, 2025

As you can see, our SSL and Authenticate workflows fail, because the ScyllaDB run there is so old that it does not recognize USING TIMEOUT syntax...:

I guess we urgently need to create new images, to really test that the driver is compatible with those features in the new Scyllas, not 4-years-old ones...

cc @dkropachev

@wprzytula wprzytula force-pushed the schema-query-increased-timeouts branch from 3271538 to a5adae0 Compare January 17, 2025 09:17
@dkropachev
Copy link
Collaborator

As you can see, our SSL and Authenticate workflows fail, because the ScyllaDB run there is so old that it does not recognize USING TIMEOUT syntax...:

I guess we urgently need to create new images, to really test that the driver is compatible with those features in the new Scyllas, not 4-years-old ones...

cc @dkropachev

From what I see these are regular images, based of scylladb/scylla/4.3.rc0 with some patches at /etc/scylla on top of it.
What I recomend you doing, create a configuration that you need to have at /etc/scylla/ put it in repo and mount it at /etc/scylla, and use scylladb/scylla as an image.

@dkropachev
Copy link
Collaborator

As you can see, our SSL and Authenticate workflows fail, because the ScyllaDB run there is so old that it does not recognize USING TIMEOUT syntax...:

I guess we urgently need to create new images, to really test that the driver is compatible with those features in the new Scyllas, not 4-years-old ones...
cc @dkropachev

From what I see these are regular images, based of scylladb/scylla/4.3.rc0 with some patches at /etc/scylla on top of it. What I recomend you doing, create a configuration that you need to have at /etc/scylla/ put it in repo and mount it at /etc/scylla, and use scylladb/scylla as an image.

Done up here

Copy link
Contributor

@muzarski muzarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice piece of code - especially the test cases

scylla/src/transport/topology.rs Show resolved Hide resolved
scylla/src/transport/connection.rs Outdated Show resolved Hide resolved
Comment on lines 894 to 908
/// Tests that ControlConnection enforces the provided custom timeout
/// iff ScyllaDB is the target node (else ignores the custom timeout).
#[cfg(not(scylla_cloud_tests))]
#[tokio::test]
#[ntest::timeout(2000)]
async fn test_custom_timeouts() {
setup_tracing();

let proxy_addr = SocketAddr::new(scylla_proxy::get_exclusive_local_address(), 9042);
let uri = std::env::var("SCYLLA_URI").unwrap_or_else(|_| "127.0.0.1:9042".to_string());
let node_addr: SocketAddr = resolve_hostname(&uri).await;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ This test requires a running Scylla / C* cluster, right? Is it possible to avoid this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to avoid this by using the proxy in the dry mode. We could mock the whole CQL handshake (OPTIONS -> SUPPORTED -> STARTUP -> READY -> REGISTER -> READY) and then intercept QUERYs and PREPAREs as before. An additional gain would be to test multiple scenarios at ones, i.e. sharded and non-sharded endpoints, independently of the actual cluster being deployed for tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the test use the dry mode to avoid dependency on a running cluster.

Comment on lines +746 to +758
impl ControlConnection {
async fn query_metadata(
self,
connect_port: u16,
keyspace_to_fetch: &[String],
fetch_schema: bool,
) -> Result<Metadata, QueryError> {
let peers_query = self.clone().query_peers(connect_port);
let keyspaces_query = self.query_keyspaces(keyspace_to_fetch, fetch_schema);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit: "topology: make metadata fetchers methods on ControlConnection "

🔧 For me it is extremely weird and unintuitive to have multiple impl blocks for a struct (ControlConnection), scattered in various places of 2 modules (control_connection and topology).
Can we please put all the impl ControlConnection into the control_connection module, and preferably make them a single impl block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.reddit.com/r/rust/comments/w5l320/is_having_multiple_impl_blocks_idiomatic/

It's a perfectly valid way of logical separation. In this particular case, we separate metadata-related functionalities (defined in the impl block in metadata.rs main module) from the lower-level functionalities (defined in the impl block in control_connection module). Also, what we get is that methods defined out of control_connection module can't access mod-private fields and methods of ControlConnection.

Comment on lines +746 to +757
impl ControlConnection {
async fn query_metadata(
self,
connect_port: u16,
keyspace_to_fetch: &[String],
fetch_schema: bool,
) -> Result<Metadata, QueryError> {
let peers_query = self.clone().query_peers(connect_port);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌱 When porting the PR to the main branch, after applying my previous comment about ControlConnection impls, I think it would be a good idea to extract control_connection module to a separate file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ControlConnection could be extracted to a separate file, but with those higher-level methods (query_peers etc.) left in the metadata.rs file. The reason is that ControlConnection itself is a medium, which should be logically separate from the logic performing queries using it.

Without this derive, one cannot call unwrap_err() on query_iter()'s
result.
This makes the code more rusty and leverages the type system. Also, it
extracts the constants for shared use in the next commit.
Tests need this function to mock ScyllaDB's SUPPORTED frames. For use
with the proxy, especially in the dry mode.
@wprzytula wprzytula force-pushed the schema-query-increased-timeouts branch from a5adae0 to 6ab1f65 Compare January 26, 2025 11:53
@wprzytula
Copy link
Collaborator Author

v1.1: resolved comments,

  • made the unit test not require a running cluster.

The test asserts that for ScyllaDB the timeout is enforced (if set) and
for Cassandra is always ignored, in the following cases:
- when explicitly disabled (no custom timeout follows),
- when explicitly set to some (only set for ScyllaDB),
The timeout is now stored in SessionConfig and passed down to the
MetadataReader, which sets up all its `ControlConnection`s with that
timeout.
The tests asserts that for ScyllaDB the timeout is enforced (if set) and
for Cassandra is always ignored, in the following cases:
- when explicitly disabled (no custom timeout follows),
- when explicitly set to some (only set for ScyllaDB),
- when left as implicit default (only set to ScyllaDB).
@wprzytula wprzytula force-pushed the schema-query-increased-timeouts branch from 6ab1f65 to 0a53197 Compare January 26, 2025 12:24
@roydahan
Copy link
Collaborator

There is no point in merging this into 0.15. branch, we're not going to release it another 0.15 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants