Don't panic on cass_session_get_client_id() on disconnected sessions #327


Open

wants to merge 5 commits into base: master

Conversation

wprzytula
Collaborator

@wprzytula wprzytula commented Jun 24, 2025

Problem statement

cass_session_get_client_id() contained a bug: for disconnected sessions, it would panic on unwrap(), because the client_id was only stored on the CassConnectedSession, which is None when the session is disconnected.

Solution

This PR fixes the bug by storing the client_id on the CassSessionInner struct, so that it is always available, even when
the session is disconnected.
Upon session creation (cass_session_new()), a random UUID v4 is generated and stored in the CassSessionInner. If the user sets a custom client id in the CassCluster, it will replace the random one upon cass_session_connect().

Testing

A unit test is added, which asserts the following:

  1. We can get a client ID from a not-yet-connected session.
  2. The client ID is inherited from the cluster when connecting.
  3. We can still get the client ID after disconnecting the session, which is still the cluster's client ID, if set.

Open for discussion

Note that we're using RUNTIME.block_on() to drive the future returned from RwLock::read_owned() to completion. This is necessary because RwLock::blocking_read() does not have an owned version.

I had a longer thought about blocking_read() (which we use in some places) vs RUNTIME.block_on(read[_owned]()), and I think that blocking_read() is strictly worse: it simply blocks the current thread until the lock is acquired, while RUNTIME.block_on() will also allow other tasks to run while waiting for the lock, which is not only more efficient, but also prevents deadlocks in some cases. In a follow-up PR, I'm thus going to replace all blocking_read() with RUNTIME.block_on(read[_owned]()).

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have implemented Rust unit tests for the features/changes introduced.
  • [ ] I have enabled appropriate tests in Makefile in {SCYLLA,CASSANDRA}_(NO_VALGRIND_)TEST_FILTER.
  • [ ] I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula added this to the 0.6 milestone Jun 24, 2025
@wprzytula wprzytula self-assigned this Jun 24, 2025
@wprzytula wprzytula added the bug Something isn't working label Jun 24, 2025
@wprzytula wprzytula requested a review from Lorak-mmk June 24, 2025 15:11
@wprzytula wprzytula force-pushed the fix-session-client-id branch from 8ee5289 to 86e8d28 on June 24, 2025 15:36
@wprzytula wprzytula requested a review from Copilot June 24, 2025 15:39

@Copilot Copilot AI left a comment


Pull Request Overview

Fix panic in cass_session_get_client_id() for disconnected sessions by persisting a client_id in the session wrapper.

  • Refactor session internals: split CassConnectedSession (connected state) from CassSessionInner (wrapper with persistent client_id and optional connection)
  • Update all FFI calls and methods to use the new wrapper and access client_id regardless of connection state
  • Add a unit test verifying cass_session_get_client_id() works before connect, after connect, and (intended) after disconnect

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
scylla-rust-wrapper/src/session.rs Refactored session state to persist client_id, updated FFI APIs, added test
scylla-rust-wrapper/src/exec_profile.rs Switched exec-profile resolution to use CassConnectedSession
Comments suppressed due to low confidence (1)

scylla-rust-wrapper/src/session.rs:1809

  • The test intends to verify behavior after disconnecting but never actually calls cass_session_close() before this assertion; add an explicit close call to ensure client_id is still retrievable after a real disconnect.
            let session_client_id = cass_session_get_client_id(session_raw.borrow());

@@ -243,17 +256,17 @@ pub unsafe extern "C" fn cass_session_execute_batch(

     let future = async move {
         let session_guard = session_opt.read().await;
-        if session_guard.is_none() {
+        if session_guard.connected.is_none() {

Copilot AI Jun 24, 2025


The pattern of checking .connected.is_none() and then immediately unwrapping appears repeatedly; consider extracting a helper like get_connected_session() on CassSessionInner to centralize this logic and reduce duplication.


Comment on lines 642 to 644
for (keyspace_name, keyspace) in cass_session
.blocking_read()

Copilot AI Jun 24, 2025


[nitpick] Using blocking_read() can block the current thread. For consistency with the write side and to allow other tasks to run, consider using RUNTIME.block_on(self.read_owned()) instead.

Suggested change
-    for (keyspace_name, keyspace) in cass_session
-        .blocking_read()
+    for (keyspace_name, keyspace) in RUNTIME
+        .block_on(cass_session.read_owned())


Collaborator


Commit: session: store client id on disconnected session too

and I think that
blocking_read() is strictly worse: it simply blocks the current
thread until the lock is acquired, while RUNTIME.block_on() will
also allow other tasks to run while waiting for the lock

When blocking_read is executed in a user thread, this isn't really true, right? And I think most of our usages fall into this case.

Collaborator Author


When blocking_read is executed in a user thread, this isn't really true, right?
Why?

  1. If using current_thread executor, using blocking_read() ends up with a deadlock.
  2. If using multi_thread executor, using RUNTIME.block_on(read()) will make the thread join the executor pool and temporarily serve as an executor thread, which may be beneficial or not.

Comment on lines 84 to 89
let mut session_guard = RUNTIME.block_on(session_opt.write_owned());

if let Some(cluster_client_id) = cluster.get_client_id() {
// If the user set a client id, use it instead of the random one.
session_guard.client_id = cluster_client_id;
}

Collaborator


I don't see a point of doing it this way, instead of passing cluster.get_client_id() to Self::connect_fut and performing the assignment there. It would require less changes.

Collaborator Author


This is absolutely necessary for correctness.
Consider the previous way of doing the assignment inside the future. Then, we would have no control over when the assignment really takes place with respect to the end of the cass_session_connect() function. In that case, the unit test I wrote would fail, because having returned from cass_session_connect() would not guarantee that the client id had already been set.

Collaborator Author

@wprzytula wprzytula Jun 24, 2025


A related bug (one about locking the RwLock only in the async block instead of doing it synchronously beforehand) exists in other places in the code, which I uncovered today. By playing with request execution delays I managed to create a reproducer that makes a test fail due to unfortunate interleavings. You'll see more about this once I open another PR fixing that.

Collaborator

@Lorak-mmk Lorak-mmk Jun 24, 2025


because having returned from cass_session_connect() would not guarantee that the client id had been already set.

I don't see why we need this guarantee. cass_session_connect returns CassFuture that performs some work, and does not guarantee anything about session before that future completes.

Collaborator Author

@wprzytula wprzytula Jun 25, 2025


OK, I understand the reasoning you described. However, it still makes more sense to me to perform as much of the connection configuration as possible synchronously, before the function returns. In fact, this is what the CPP Driver does (with some irrelevant code removed):

Future::Ptr SessionBase::connect(const Config& config, const String& keyspace) {
  Future::Ptr future(new SessionFuture());

  ScopedMutex l(&mutex_);
  if (state_ != SESSION_STATE_CLOSED) {
    future->set_error(CASS_ERROR_LIB_UNABLE_TO_CONNECT,
                      "Already connecting, closing, or connected");
    return future;
  }

  if (config.is_client_id_set()) {
    client_id_ = config.client_id();
  }

  config_ = config.new_instance();
  connect_keyspace_ = keyspace;
  connect_future_ = future;
  state_ = SESSION_STATE_CONNECTING;

  if (config.use_randomized_contact_points()) {
    random_.reset(new Random());
  } else {
    random_.reset();
  }

  cluster_.reset();
  ClusterConnector::Ptr connector(
      new ClusterConnector(config_.contact_points(), config_.protocol_version(),
                           bind_callback(&SessionBase::on_initialize, this)));

  ClusterSettings settings(config_);
  settings.control_connection_settings.connection_settings.client_id = to_string(client_id_);
  settings.disable_events_on_startup = true;

  connector->with_listener(this)
      ->with_settings(settings)
      ->with_random(random_.get())
      ->with_metrics(metrics_.get())
      ->connect(event_loop_.get());

  return future;
}

Collaborator


However, it still makes more sense to me to perform as big part of the connection configuration as possible in a synchronous manner, before the function returns.

Why? It complicates the code.

Collaborator Author


Because it gives stronger guarantees, that's all. And it keeps us closer to the CPP Driver's behaviour.
I don't see how this complicates the code.

Collaborator Author


Ping. I'd like to merge this, and I'm convinced the approach I've taken is correct.

@wprzytula wprzytula added the P2 P2 item - probably some people use this, let's implement that label Jun 29, 2025
@wprzytula wprzytula requested a review from Lorak-mmk June 30, 2025 16:52
@wprzytula
Collaborator Author

I've observed that the CI fails in a specific way when b93e8b9 is missing.

wprzytula added 5 commits July 5, 2025 16:58
This name change reflects that this structure corresponds to a session that is connected, because the next commits introduce parts of the session that exist even when it is not connected.
The new CassSessionInner is a struct that is held under the
CassSession's RwLock. At the moment, it contains only an
`Option<CassConnectedSession>`, but soon it will be extended.
`cass_session_get_client_id()` contained a bug: for disconnected
sessions, it would panic on `unwrap()`, because the `client_id` was only
stored on the `CassConnectedSession`, which is `None` when the session
is disconnected. This commit fixes that by storing the `client_id` on
the `CassSessionInner` struct, so that it is always available, even when
the session is disconnected.
Upon session creation (`cass_session_new()`), a random UUID v4 is
generated and stored in the `CassSessionInner`. If the user sets a
custom client id in the `CassCluster`, it will replace the random
one upon `cass_session_connect()`.

Note that we're using `RUNTIME.block_on()` to drive the future returned
from `RwLock::read_owned()` to completion. This is necessary because
`RwLock::blocking_read()` does not have an owned version.
I had a longer thought about `blocking_read()` (which we use in some
places) vs `RUNTIME.block_on(read[_owned]())`, and I think that
`blocking_read()` is strictly worse: it simply blocks the current
thread until the lock is acquired, while `RUNTIME.block_on()` will
also allow other tasks to run while waiting for the lock, which is not
only more efficient, but also prevents deadlocks in some cases.
In a follow-up PR, I'm thus going to replace all `blocking_read()` with
`RUNTIME.block_on(read[_owned]())`.
It asserts the following:
1. We can get a client ID from a not-yet-connected session.
2. The client ID is inherited from the cluster when connecting.
3. We can still get the client ID after disconnecting the session,
   which is still the cluster's client ID, if set.
After the recent refactor, the `session_opt` variable is renamed
to `session` to better reflect its purpose.
@wprzytula wprzytula force-pushed the fix-session-client-id branch from 366626a to b66cf27 on July 5, 2025 15:00
@wprzytula
Collaborator Author

Rebased on master.
