Don't panic on cass_session_get_client_id() on disconnected sessions #327
base: master
Conversation
Force-pushed 8ee5289 to 86e8d28
Pull Request Overview
Fix panic in cass_session_get_client_id() for disconnected sessions by persisting a client_id in the session wrapper.
- Refactor session internals: split CassConnectedSession (connected state) from CassSessionInner (wrapper with persistent client_id and optional connection); see the sketch after this list
- Update all FFI calls and methods to use the new wrapper and access client_id regardless of connection state
- Add a unit test verifying cass_session_get_client_id() works before connect, after connect, and (intended) after disconnect
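For orientation, a minimal sketch of what this split could look like (field names and types are assumptions, not copied from the actual diff):

use uuid::Uuid;

// Hypothetical sketch of the refactored session state, not the real code.
pub struct CassSessionInner {
    // Persisted for the whole lifetime of the CassSession object,
    // so it stays readable even while the session is disconnected.
    pub client_id: Uuid,
    // Present only while the session is connected.
    pub connected: Option<CassConnectedSession>,
}

pub struct CassConnectedSession {
    // Stands in for the Rust driver's session plus any per-connection state.
    pub session: scylla::Session,
}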
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scylla-rust-wrapper/src/session.rs | Refactored session state to persist client_id, updated FFI APIs, added test |
| scylla-rust-wrapper/src/exec_profile.rs | Switched exec-profile resolution to use CassConnectedSession |
Comments suppressed due to low confidence (1)
scylla-rust-wrapper/src/session.rs:1809
The test intends to verify behavior after disconnecting but never actually calls cass_session_close() before this assertion; add an explicit close call to ensure client_id is still retrievable after a real disconnect.
let session_client_id = cass_session_get_client_id(session_raw.borrow());
@@ -243,17 +256,17 @@ pub unsafe extern "C" fn cass_session_execute_batch(
     let future = async move {
         let session_guard = session_opt.read().await;
-        if session_guard.is_none() {
+        if session_guard.connected.is_none() {
The pattern of checking .connected.is_none() and then immediately unwrapping appears repeatedly; consider extracting a helper like get_connected_session() on CassSessionInner to centralize this logic and reduce duplication.
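A minimal sketch of what such a helper might look like (the error type here is a simplification; the real code would return whatever error representation the wrapper uses):

impl CassSessionInner {
    // Hypothetical helper centralizing the "is the session connected?" check,
    // so call sites don't repeat `.connected.is_none()` followed by an unwrap.
    fn get_connected_session(&self) -> Result<&CassConnectedSession, &'static str> {
        self.connected.as_ref().ok_or("session is not connected")
    }
}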
for (keyspace_name, keyspace) in cass_session
    .blocking_read()
[nitpick] Using blocking_read() can block the current thread. For consistency with the write side and to allow other tasks to run, consider using RUNTIME.block_on(self.read_owned()) instead.
Suggested change:
-for (keyspace_name, keyspace) in cass_session
-    .blocking_read()
+for (keyspace_name, keyspace) in RUNTIME
+    .block_on(cass_session.read_owned())
Commit: session: store client id on disconnected session too
and I think that blocking_read() is strictly worse: it simply blocks the current thread until the lock is acquired, while RUNTIME.block_on() will also allow other tasks to run while waiting for the lock
When blocking_read is executed in a user thread, this isn't really true, right? And I think most of our usages fall into this case.
When blocking_read is executed in a user thread, this isn't really true, right?
Why?
- If using the current_thread executor, using blocking_read() ends up in a deadlock.
- If using the multi_thread executor, using RUNTIME.block_on(read()) will make the thread join the executor pool and temporarily serve as an executor thread, which may be beneficial or not.
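To make the two patterns concrete, a small self-contained sketch using tokio's RwLock (the wrapper's RUNTIME global is replaced by an explicit runtime parameter):

use std::sync::Arc;
use tokio::{runtime::Runtime, sync::RwLock};

// Parks the calling OS thread until the lock is acquired; it must not be
// called from within an async context and cannot help the runtime make
// progress while it waits.
fn read_with_blocking_read(lock: &RwLock<u32>) -> u32 {
    *lock.blocking_read()
}

// Drives the owned read() future on the given runtime; while the calling
// thread waits, tasks already spawned on a multi_thread runtime keep running
// on its worker threads.
fn read_with_block_on(runtime: &Runtime, lock: Arc<RwLock<u32>>) -> u32 {
    *runtime.block_on(lock.read_owned())
}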
scylla-rust-wrapper/src/session.rs
Outdated
let mut session_guard = RUNTIME.block_on(session_opt.write_owned());

if let Some(cluster_client_id) = cluster.get_client_id() {
    // If the user set a client id, use it instead of the random one.
    session_guard.client_id = cluster_client_id;
}
I don't see the point of doing it this way, instead of passing cluster.get_client_id() to Self::connect_fut and performing the assignment there. It would require fewer changes.
This is absolutely necessary for correctness.
Consider the previous way of doing the assignment in the future. Then, we would have no control over when the assignment really takes place with respect to the end of the cass_session_connect() function. In that case, the unit test I wrote would fail, because having returned from cass_session_connect() would not guarantee that the client id had already been set.
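To illustrate the ordering being argued for, a simplified self-contained sketch (the wrapper's actual types and RUNTIME global are replaced by stand-ins):

use std::sync::Arc;
use tokio::{runtime::Runtime, sync::RwLock};
use uuid::Uuid;

struct SessionState {
    client_id: Uuid,
    connected: bool,
}

// Hypothetical outline: the client id override is applied synchronously,
// so it is already observable when this function hands back the connect future.
fn connect(
    runtime: &Runtime,
    session: Arc<RwLock<SessionState>>,
    cluster_client_id: Option<Uuid>,
) -> impl std::future::Future<Output = ()> {
    {
        // Synchronous part: done before the caller gets the future back,
        // so a subsequent client id read observes the override.
        let mut guard = runtime.block_on(session.clone().write_owned());
        if let Some(id) = cluster_client_id {
            guard.client_id = id;
        }
    }
    // Asynchronous part: the actual connection work happens only when the
    // returned future is driven to completion.
    async move {
        session.write_owned().await.connected = true;
    }
}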
A related bug (one about locking the RwLock only in the async block instead of doing it synchronously beforehand) exists in other places in the code, which I uncovered today. By playing with request execution delays I managed to create a reproducer that makes a test fail due to unfortunate interleavings. You'll see more about this once I open another PR fixing that.
because having returned from cass_session_connect() would not guarantee that the client id had already been set.
I don't see why we need this guarantee. cass_session_connect returns a CassFuture that performs some work, and does not guarantee anything about the session before that future completes.
OK, I understand the reasoning you described. However, it still makes more sense to me to perform as big a part of the connection configuration as possible in a synchronous manner, before the function returns. In fact, this is what CPP Driver does (with some irrelevant code removed):
Future::Ptr SessionBase::connect(const Config& config, const String& keyspace) {
Future::Ptr future(new SessionFuture());
ScopedMutex l(&mutex_);
if (state_ != SESSION_STATE_CLOSED) {
future->set_error(CASS_ERROR_LIB_UNABLE_TO_CONNECT,
"Already connecting, closing, or connected");
return future;
}
if (config.is_client_id_set()) {
client_id_ = config.client_id();
}
config_ = config.new_instance();
connect_keyspace_ = keyspace;
connect_future_ = future;
state_ = SESSION_STATE_CONNECTING;
if (config.use_randomized_contact_points()) {
random_.reset(new Random());
} else {
random_.reset();
}
cluster_.reset();
ClusterConnector::Ptr connector(
new ClusterConnector(config_.contact_points(), config_.protocol_version(),
bind_callback(&SessionBase::on_initialize, this)));
ClusterSettings settings(config_);
settings.control_connection_settings.connection_settings.client_id = to_string(client_id_);
settings.disable_events_on_startup = true;
connector->with_listener(this)
->with_settings(settings)
->with_random(random_.get())
->with_metrics(metrics_.get())
->connect(event_loop_.get());
return future;
}
However, it still makes more sense to me to perform as big a part of the connection configuration as possible in a synchronous manner, before the function returns.
Why? It complicates the code.
Because it gives stronger guarantees, that's all. And we stay closer to the CPP Driver's behaviour.
I don't see how this complicates the code.
Ping. I'd like to merge this, and I'm convinced the approach I've taken is correct.
I've observed that the CI fails in the specific way when b93e8b9 is missing.
This name change reflects that this structure corresponds to a session that is connected. This is because in the next commits we will introduce parts of the not-necessarily-connected session.
The new CassSessionInner is a struct that is held under the CassSession's RwLock. At the moment, it contains only an `Option<CassConnectedSession>`, but soon it will be extended.
`cass_session_get_client_id()` contained a bug: for disconnected sessions, it would panic on `unwrap()`, because the `client_id` was only stored on the `CassConnectedSession`, which is `None` when the session is disconnected.
This commit fixes that by storing the `client_id` on the `CassSessionInner` struct, so that it is always available, even when the session is disconnected. Upon session creation (`cass_session_new()`), a random UUID v4 is generated and stored in the `CassSessionInner`. If the user sets a custom client id in the `CassCluster`, it will replace the random one upon `cass_session_connect()`.
Note that we're using `RUNTIME.block_on()` to drive the future returned from `RwLock::read_owned()` to completion. This is necessary because `RwLock::blocking_read()` does not have an owned version.
I had a longer thought about `blocking_read()` (which we use in some places) vs `RUNTIME.block_on(read[_owned]())`, and I think that `blocking_read()` is strictly worse: it simply blocks the current thread until the lock is acquired, while `RUNTIME.block_on()` will also allow other tasks to run while waiting for the lock, which is not only more efficient, but also prevents deadlocks in some cases. In a follow-up PR, I'm thus going to replace all `blocking_read()` with `RUNTIME.block_on(read[_owned]())`.
It asserts the following:
1. We can get a client ID from a not-yet-connected session.
2. The client ID is inherited from the cluster when connecting.
3. We can still get the client ID after disconnecting the session, which is still the cluster's client ID, if set.
After the recent refactor, the `session_opt` variable is renamed to `session` to better reflect its purpose.
Force-pushed 366626a to b66cf27
Rebased on master.
Problem statement
`cass_session_get_client_id()` contained a bug: for disconnected sessions, it would panic on `unwrap()`, because the `client_id` was only stored on the `CassConnectedSession`, which is `None` when the session is disconnected.
Solution
This PR fixes the bug by storing the `client_id` on the `CassSessionInner` struct, so that it is always available, even when the session is disconnected.
Upon session creation (`cass_session_new()`), a random UUID v4 is generated and stored in the `CassSessionInner`. If the user sets a custom client id in the `CassCluster`, it will replace the random one upon `cass_session_connect()`.
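A minimal sketch of this flow with simplified types (not the wrapper's actual API):

use uuid::Uuid;

struct SessionInner {
    client_id: Uuid,
}

// On session creation, a random UUID v4 is generated up front, so reading
// the client id never depends on a connection existing.
fn session_new() -> SessionInner {
    SessionInner { client_id: Uuid::new_v4() }
}

// On connect, a client id configured on the cluster (if any) replaces the
// random one.
fn session_connect(inner: &mut SessionInner, cluster_client_id: Option<Uuid>) {
    if let Some(id) = cluster_client_id {
        inner.client_id = id;
    }
}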
Testing
A unit test is added, which asserts the following:
1. We can get a client ID from a not-yet-connected session.
2. The client ID is inherited from the cluster when connecting.
3. We can still get the client ID after disconnecting the session, which is still the cluster's client ID, if set.
Open for discussion
Note that we're using `RUNTIME.block_on()` to drive the future returned from `RwLock::read_owned()` to completion. This is necessary because `RwLock::blocking_read()` does not have an owned version.
I had a longer thought about `blocking_read()` (which we use in some places) vs `RUNTIME.block_on(read[_owned]())`, and I think that `blocking_read()` is strictly worse: it simply blocks the current thread until the lock is acquired, while `RUNTIME.block_on()` will also allow other tasks to run while waiting for the lock, which is not only more efficient, but also prevents deadlocks in some cases. In a follow-up PR, I'm thus going to replace all `blocking_read()` with `RUNTIME.block_on(read[_owned]())`.
Pre-review checklist
[ ] I have enabled appropriate tests in `Makefile` in `{SCYLLA,CASSANDRA}_(NO_VALGRIND_)TEST_FILTER`.
[ ] I added appropriate `Fixes:` annotations to PR description.