RUST-565 Properly report connection closures due to error #258

patrickfreed · 2020-09-30T23:12:15Z

This PR fixes a bug where connection closure events emitted due to errors never carried ConnectionClosedReason::Error as their reason. It also fixes a bug where ok: 0 responses from the initial handshake were ignored.

For convenience in testing, this PR also adds a FailPoint type that facilitates the enabling and cleanup of fail points.

In order to test this, RUST-304 and RUST-568 are also fixed in this PR.

patrickfreed · 2020-09-30T23:14:28Z

src/cmap/establish/handshake/mod.rs

@@ -201,6 +201,7 @@ impl Handshaker {
        let response = conn.send_command(command, None).await?;
        let end_time = PreciseTime::now();

+        response.validate()?;


We weren't checking the initial handshake response to see if it returned an error, mistakenly returning Ok(connection) here even if the handshake failed.

Good catch!
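For readers following this thread, here is a minimal sketch of the kind of check that response.validate()? adds. CommandResponse, CommandError, and finish_handshake are hypothetical stand-ins, not the driver's actual types; only the idea (turn an ok: 0 reply into an Err instead of returning Ok) comes from the comment above.

```rust
use std::fmt;

/// Hypothetical stand-in for a server reply (not the driver's real type).
#[derive(Debug)]
pub struct CommandResponse {
    /// Value of the reply's `ok` field: 1 on success, 0 on failure.
    pub ok: i32,
    /// Error message supplied by the server, if any.
    pub errmsg: Option<String>,
}

/// Hypothetical error type used only in this sketch.
#[derive(Debug, PartialEq)]
pub struct CommandError {
    pub message: String,
}

impl fmt::Display for CommandError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "command failed: {}", self.message)
    }
}

impl std::error::Error for CommandError {}

impl CommandResponse {
    /// Returns an error when the server reported `ok: 0`.
    pub fn validate(&self) -> Result<(), CommandError> {
        if self.ok == 1 {
            Ok(())
        } else {
            Err(CommandError {
                message: self
                    .errmsg
                    .clone()
                    .unwrap_or_else(|| "unknown error".to_string()),
            })
        }
    }
}

/// Hypothetical handshake step: a reply that arrived is not necessarily a
/// reply that succeeded, so the failure is propagated with `?`.
pub fn finish_handshake(response: CommandResponse) -> Result<(), CommandError> {
    response.validate()?;
    Ok(())
}
```

Without the validate call, finish_handshake would return Ok(()) for any reply that was received, which is the bug described above.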

patrickfreed · 2020-09-30T23:15:11Z

src/cmap/mod.rs

@@ -348,6 +353,14 @@ impl ConnectionPoolInner {
                // Establishing a pending connection failed, so that must be reflected in to total
                // connection count.
                self.total_connection_count.fetch_sub(1, Ordering::SeqCst);
+                self.emit_event(|handler| {


According to CMAP, we are supposed to be emitting ConnectionClosed events here. This makes sense because we emitted a ConnectionCreatedEvent when this pending connection was created.
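As a rough illustration of the pairing described here (every created event for a pending connection gets a matching closed event if establishment fails), the sketch below uses a hypothetical Pool and PoolEvent. Only total_connection_count and ConnectionClosedReason::Error are names taken from this PR; everything else is assumed for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Only the `Error` variant is mentioned in the PR; others are omitted here.
#[derive(Clone, Debug, PartialEq)]
pub enum ConnectionClosedReason {
    Error,
}

/// Hypothetical, simplified event type for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum PoolEvent {
    ConnectionCreated { id: u32 },
    ConnectionClosed { id: u32, reason: ConnectionClosedReason },
}

/// Hypothetical pool that records events in memory instead of calling a handler.
#[derive(Default)]
pub struct Pool {
    total_connection_count: AtomicUsize,
    events: Mutex<Vec<PoolEvent>>,
}

impl Pool {
    fn emit(&self, event: PoolEvent) {
        self.events.lock().unwrap().push(event);
    }

    /// Registers a pending connection and announces its creation.
    pub fn create_pending(&self, id: u32) {
        self.total_connection_count.fetch_add(1, Ordering::SeqCst);
        self.emit(PoolEvent::ConnectionCreated { id });
    }

    /// Finishes establishment. On failure, the count is decremented and a
    /// closed event with reason `Error` balances the earlier created event.
    pub fn establish(&self, id: u32, outcome: Result<(), String>) -> Result<(), String> {
        if let Err(e) = outcome {
            self.total_connection_count.fetch_sub(1, Ordering::SeqCst);
            self.emit(PoolEvent::ConnectionClosed {
                id,
                reason: ConnectionClosedReason::Error,
            });
            return Err(e);
        }
        Ok(())
    }

    pub fn total(&self) -> usize {
        self.total_connection_count.load(Ordering::SeqCst)
    }

    pub fn events(&self) -> Vec<PoolEvent> {
        self.events.lock().unwrap().clone()
    }
}
```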

patrickfreed · 2020-09-30T23:16:33Z

src/cmap/test/event.rs

        self.events.write().unwrap().push(event);
    }
+
+    pub fn subscribe(&self) -> EventSubscriber {


I added a mechanism by which threads can subscribe to the event handler and wait for a particular event. This seemed like a more async-y approach than our current pattern of sleeping briefly and checking again. I updated the cmap spec tests to use it too, which should reduce the intermittent test failures we get from not waiting long enough. I filed RUST-572 to cover the work of introducing this to the test suite as a whole.
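To make the subscribe-and-wait idea concrete, here is a minimal std-only sketch. It is not the implementation in this PR (the commit list suggests that one is built on a broadcast channel); it uses std::sync::mpsc so the example has no dependencies. Event, handle, and wait_for_event are assumed names for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical, simplified event type for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum Event {
    CheckedIn(u32),
    Closed(u32),
}

#[derive(Default)]
pub struct EventHandler {
    subscribers: Mutex<Vec<Sender<Event>>>,
}

impl EventHandler {
    /// Registers a new subscriber that sees every event handled from now on.
    pub fn subscribe(&self) -> EventSubscriber {
        let (tx, rx) = channel();
        self.subscribers.lock().unwrap().push(tx);
        EventSubscriber { receiver: rx }
    }

    /// Forwards a copy of the event to each live subscriber and forgets
    /// subscribers whose receiving end has been dropped.
    pub fn handle(&self, event: Event) {
        self.subscribers
            .lock()
            .unwrap()
            .retain(|tx| tx.send(event.clone()).is_ok());
    }
}

pub struct EventSubscriber {
    receiver: Receiver<Event>,
}

impl EventSubscriber {
    /// Blocks until an event matching `filter` arrives or `timeout` elapses.
    /// Non-matching events are discarded.
    pub fn wait_for_event<F>(&self, timeout: Duration, mut filter: F) -> Option<Event>
    where
        F: FnMut(&Event) -> bool,
    {
        let deadline = Instant::now() + timeout;
        loop {
            // `None` here means the deadline has already passed.
            let remaining = deadline.checked_duration_since(Instant::now())?;
            match self.receiver.recv_timeout(remaining) {
                Ok(event) if filter(&event) => return Some(event),
                Ok(_) => continue,
                Err(_) => return None,
            }
        }
    }
}
```

Compared with sleeping and polling, the waiter wakes as soon as the event is delivered, and the timeout only bounds the failure case, so it can be generous without slowing passing tests.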

patrickfreed · 2020-09-30T23:21:49Z

src/cmap/test/integration.rs

@@ -137,3 +138,73 @@ async fn concurrent_connections() {
        .await
        .expect("disabling fail point should succeed");
 }
+
+#[cfg_attr(feature = "tokio-runtime", tokio::test(threaded_scheduler))]


In order to reduce the boilerplate needed to enable failpoints, I introduced a FailPoint type (see failpoint.rs). Enabling this type returns a FailPointGuard that automatically disables the failpoint server side in its drop. This is especially useful because the drop gets called even when the thread panics, as is common in failing tests. The problem with this is that we can't use the "dispatch a task from drop" methodology we use elsewhere in the driver, since after a panic the runtime itself is usually dropped and the tasks won't get executed. To get around this, I used RUNTIME.block_on, but this deadlocks without using the threaded scheduler in tokio. Using the basic scheduler for the tests is nice because it ensures we're not blocking in the driver anywhere, though, so we don't want to completely eliminate it. I figured a decent compromise would be to only use the threaded scheduler in tests that require fail points and leave the rest of the tests on basic to keep verifying we're not blocking in the driver.

I wonder if it's possible to take this a step further, e.g. automatically acquire the lock exclusively when we create a failpoint. This would presumably require us not to acquire the lock manually in the tests to avoid deadlocks, but maybe we could get around that by always acquiring a read lock and then passing that into the constructor for FailPoint (which then drops it and acquires a write lock instead). There wouldn't be any way to get the read lock back in the test function after the FailPoint is dropped, but I'm guessing there isn't really anywhere we'd need that.

In a lot of cases we need to check the server version before we enable failpoints, so we definitely need to acquire at least a read lock beforehand. Also, we may need to repeatedly enable failpoints over the duration of a test (e.g. in a spec test), so I don't think we can use the one that upgrades the lock either. I think we'll just have to rely on our diligence until we come up with a macro for automating the lock acquisition or something.

Fair enough
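The drop-based cleanup described at the top of this thread can be sketched roughly as follows. This is a simplified illustration only: FakeServer is a hypothetical in-memory stand-in, and the guard just flips a flag, whereas the guard in this PR is described as sending a disable command via RUNTIME.block_on from its drop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the server-side fail point state.
#[derive(Clone, Default)]
pub struct FakeServer {
    fail_point_enabled: Arc<AtomicBool>,
}

impl FakeServer {
    pub fn is_enabled(&self) -> bool {
        self.fail_point_enabled.load(Ordering::SeqCst)
    }
}

/// Simplified fail point description; the real type carries more options.
pub struct FailPoint {
    name: String,
}

impl FailPoint {
    pub fn new(name: &str) -> Self {
        FailPoint { name: name.to_string() }
    }

    /// Enables the fail point and returns a guard that disables it on drop.
    pub fn enable(self, server: &FakeServer) -> FailPointGuard {
        server.fail_point_enabled.store(true, Ordering::SeqCst);
        FailPointGuard {
            name: self.name,
            server: server.clone(),
        }
    }
}

pub struct FailPointGuard {
    name: String,
    server: FakeServer,
}

impl FailPointGuard {
    pub fn name(&self) -> &str {
        &self.name
    }
}

impl Drop for FailPointGuard {
    fn drop(&mut self) {
        // Runs on normal scope exit and also while unwinding from a panic,
        // so a failing test still cleans up after itself.
        self.server.fail_point_enabled.store(false, Ordering::SeqCst);
    }
}
```

Because the cleanup lives in Drop, a test only needs to keep the guard alive (for example, let _guard = FailPoint::new("failCommand").enable(&server);) for as long as the fail point should stay active.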

patrickfreed · 2020-09-30T23:23:52Z

src/event/cmap.rs

@@ -19,7 +19,7 @@ fn empty_address() -> StreamAddress {
 }

 /// Event emitted when a connection pool is created.
-#[derive(Debug, Deserialize, PartialEq)]
+#[derive(Clone, Debug, Deserialize, PartialEq)]


This is an API change driven by testing, but I figured it would be useful for both us and users so I went ahead with it.

patrickfreed · 2020-09-30T23:27:26Z

src/test/util/failpoint.rs

+use crate::{error::Result, operation::append_options, RUNTIME};
+
+#[derive(Debug)]
+pub struct FailPoint {


I'm currently only using this type in the new test I wrote for this work, but we should try to slowly convert all usages of failpoints in the driver to use it, including ones we deserialize in spec tests (we do this in Swift).

patrickfreed · 2020-09-30T23:28:16Z

src/test/util/failpoint.rs

+        });
+
+        if let Err(e) = result {
+            println!("failed disabling failpoint: {:?}", e);


We can't really do much besides log the error here.

patrickfreed · 2020-10-06T21:40:10Z

src/cmap/test/mod.rs

+                    drop(conn);
+
+                    // wait for event to be emitted to ensure check in has completed.
+                    subscriber


The new subscriber functionality allows us to test the actual path we use to close connections rather than a test-only shortcut. Ditto for pools.

patrickfreed · 2020-10-06T22:05:34Z

src/test/spec/crud_v1/aggregate.rs

@@ -19,6 +19,7 @@ struct Arguments {

 #[function_name::named]
 async fn run_aggregate_test(test_file: TestFile) {
+    let _guard: RwLockReadGuard<()> = LOCK.run_concurrently().await;


Because we're setting failpoints in tests, the lock needs to be acquired before any I/O is performed whatsoever (e.g. in TestClient::new or in the background threads of a client). To that end, I updated the many tests that violated this rule, since they were interfering with the failpoints set in the tests required for this work.
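A rough sketch of the locking discipline, using a synchronous std::sync::RwLock for simplicity. The suite's actual lock is async (note the .await in the hunk above), and run_exclusively is an assumed name for the write-lock counterpart, which this hunk does not show.

```rust
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

/// Hypothetical synchronous version of a global test lock.
pub struct TestLock {
    inner: RwLock<()>,
}

impl TestLock {
    pub const fn new() -> Self {
        TestLock { inner: RwLock::new(()) }
    }

    /// Shared access: any number of tests that set no fail points may hold this at once.
    pub fn run_concurrently(&self) -> RwLockReadGuard<'_, ()> {
        // A panicking test poisons the lock; later tests should still proceed.
        self.inner.read().unwrap_or_else(|e| e.into_inner())
    }

    /// Exclusive access (assumed name): for tests whose fail points would affect others.
    pub fn run_exclusively(&self) -> RwLockWriteGuard<'_, ()> {
        self.inner.write().unwrap_or_else(|e| e.into_inner())
    }
}

pub static LOCK: TestLock = TestLock::new();

/// Shape of a test that follows the rule: take the guard first, do I/O second.
pub fn example_test_body() -> &'static str {
    let _guard = LOCK.run_concurrently();
    // Client construction and any other I/O would only start after this point.
    "ran with shared lock held"
}
```

The ordering matters because a client created before the guard is taken may already be talking to the server while another test holds the exclusive lock with a fail point enabled.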

saghm · 2020-10-13T16:47:06Z

src/cmap/establish/handshake/mod.rs

@@ -201,6 +201,7 @@ impl Handshaker {
        let response = conn.send_command(command, None).await?;
        let end_time = PreciseTime::now();

+        response.validate()?;


Good catch!

saghm · 2020-10-13T16:53:40Z

src/cmap/test/integration.rs

@@ -137,3 +138,73 @@ async fn concurrent_connections() {
        .await
        .expect("disabling fail point should succeed");
 }
+
+#[cfg_attr(feature = "tokio-runtime", tokio::test(threaded_scheduler))]


I wonder if it's possible to take this a step further, e.g. automatically acquire the lock exclusively when we create a failpoint. This would presumably require us not to acquire the lock manually in the tests to avoid deadlocks, but maybe we could get around that by always acquiring a read lock and then passing that into the constructor for FailPoint (which then drops it and acquires a write lock instead). There wouldn't be any way to get the read lock back in the test function after the FailPoint is dropped, but I'm guessing there isn't really anywhere we'd need that.

saghm · 2020-10-13T16:57:02Z

src/cmap/test/mod.rs

+                    drop(conn);
+
+                    // wait for event to be emitted to ensure check in has completed.
+                    subscriber


patrickfreed commented Oct 6, 2020

patrickfreed marked this pull request as ready for review October 6, 2020 22:08

patrickfreed requested review from saghm and isabelatkinson October 6, 2020 22:08

patrickfreed mentioned this pull request Oct 10, 2020

RUST-556 POC of maxConnecting #259

Merged

saghm reviewed Oct 13, 2020

patrickfreed added 24 commits October 19, 2020 12:24

properly report Conncection closures due to errors

343acd8

fix failing tests

b25146d

use subscriber in cmap spec tests

1978770

use large timeout when waiting for events

cf7e3b5

up broadcast channel capacity, handle lag

cf019b9

update spec tests to use longer timeout, wait for events in ops

6b5958c

remove ClientPool::check_in

9b9405c

fail monitor checks due to command errors

e369b27

accept impl Into in fail_command

68a4813

monitoring threads wait for check request

d5edff1

limit max pool size in integration test

b699e15

rename test

214b3ea

fix heartbeat frequency test

676a97e

acquire lock in tests

c0558af

remove redundant lock

8c30198

switch message manager to a subscription model

bec0200

acquire lock at beginning of crudv1 tests

7efd55d

remove heartbeat frequency test

af7f239

fix more test lock acquisition

3f84524

increase times on failpoints

d448df2

fix clippy

04a6b9c

is_errored -> has_errored

2234827

add comment to EventSubscriber

4a58c32

fix rustfmt

d469eca

patrickfreed force-pushed the RUST-565/connection-closed-error branch from 931887a to d469eca Compare October 19, 2020 16:27

saghm approved these changes Oct 20, 2020

patrickfreed mentioned this pull request Oct 20, 2020

RUST-549 Increase ulimit in evergreen tests #262

Merged

isabelatkinson approved these changes Oct 21, 2020

patrickfreed merged commit 9c0f1f1 into mongodb:master Oct 21, 2020
