Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in PipeNetCon #8643

Merged
merged 17 commits into from
Oct 27, 2021
Merged

Fix race condition in PipeNetCon #8643

merged 17 commits into from
Oct 27, 2021

Conversation

tcsc
Copy link
Contributor

@tcsc tcsc commented Oct 18, 2021

The race condition detector is being triped by a concurrent Write and Close in the PipeNetCon in several integration tests.

Example race detector report:

==================
WARNING: DATA RACE
Read at 0x00c00307e060 by goroutine 318:
  golang.org/x/crypto/ssh.(*channel).WriteExtended()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:233 +0x5b
  golang.org/x/crypto/ssh.(*channel).Write()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:535 +0x88
  golang.org/x/crypto/ssh.(*sessionStdin).Write()
      <autogenerated>:1 +0x87
  github.com/gravitational/teleport/lib/utils.(*PipeNetConn).Write()
      /workspace/lib/utils/pipenetconn.go:56 +0x79
  github.com/gravitational/teleport/lib/client.(*NodeSession).pipeInOut.func2()
      /workspace/lib/client/session.go:618 +0x20d

Previous write at 0x00c00307e060 by goroutine 678:
  ??()
      -:0 +0xffffffffffffffff
  internal/poll.ignoringEINTRIO()
      /opt/go/src/internal/poll/fd_unix.go:581 +0x16e
  internal/poll.(*FD).Write()
      /opt/go/src/internal/poll/fd_unix.go:274 +0x294
  net.(*netFD).Write()
      /opt/go/src/net/fd_posix.go:73 +0x68
  net.(*conn).Write()
      /opt/go/src/net/net.go:195 +0xeb
  net.(*TCPConn).Write()
      <autogenerated>:1 +0x69
  bufio.(*Writer).Flush()
      /opt/go/src/bufio/bufio.go:607 +0x141
  golang.org/x/crypto/ssh.(*connectionState).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:182 +0x148
  golang.org/x/crypto/ssh.(*transport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:172 +0xbe
  golang.org/x/crypto/ssh.(*handshakeTransport).pushPacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:223 +0x52a
  golang.org/x/crypto/ssh.(*handshakeTransport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:516 +0x535
  golang.org/x/crypto/ssh.(*channel).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:215 +0x12c
  golang.org/x/crypto/ssh.(*channel).WriteExtended()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:271 +0x644
  golang.org/x/crypto/ssh.(*channel).Write()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:535 +0x88
  golang.org/x/crypto/ssh.(*sessionStdin).Write()
      <autogenerated>:1 +0x87
  github.com/gravitational/teleport/lib/utils.(*PipeNetConn).Write()
      /workspace/lib/utils/pipenetconn.go:56 +0x79
  bufio.(*Writer).Flush()
      /opt/go/src/bufio/bufio.go:607 +0x141
  golang.org/x/crypto/ssh.(*connectionState).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:182 +0x148
  golang.org/x/crypto/ssh.(*transport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:172 +0xbe
  golang.org/x/crypto/ssh.(*handshakeTransport).pushPacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:223 +0x52a
  golang.org/x/crypto/ssh.(*handshakeTransport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:516 +0x535
  golang.org/x/crypto/ssh.(*channel).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:215 +0x12c
  golang.org/x/crypto/ssh.(*channel).sendMessage()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:227 +0x211
  golang.org/x/crypto/ssh.(*channel).SendRequest()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:586 +0x1bc
  golang.org/x/crypto/ssh.(*Session).RequestPty()
      /workspace/vendor/golang.org/x/crypto/ssh/session.go:209 +0x4f9
  github.com/gravitational/teleport/lib/client.(*NodeSession).allocateTerminal()
      /workspace/lib/client/session.go:299 +0x13a
  github.com/gravitational/teleport/lib/client.(*NodeSession).interactiveSession()
      /workspace/lib/client/session.go:248 +0xf7
  github.com/gravitational/teleport/lib/client.(*NodeSession).runShell()
      /workspace/lib/client/session.go:441 +0x72
  github.com/gravitational/teleport/lib/client.(*TeleportClient).runShell()
      /workspace/lib/client/api.go:1956 +0x24f
  github.com/gravitational/teleport/lib/client.(*TeleportClient).SSH()
      /workspace/lib/client/api.go:1397 +0x5ec
  github.com/gravitational/teleport/integration.testPAM.func1.2()
      /workspace/integration/integration_test.go:3388 +0x34a

Goroutine 318 (running) created at:
  github.com/gravitational/teleport/lib/client.(*NodeSession).pipeInOut()
      /workspace/lib/client/session.go:592 +0x9a
  github.com/gravitational/teleport/lib/client.(*NodeSession).interactiveSession()
      /workspace/lib/client/session.go:271 +0x210
  github.com/gravitational/teleport/lib/client.(*NodeSession).runShell()
      /workspace/lib/client/session.go:441 +0x72
  github.com/gravitational/teleport/lib/client.(*TeleportClient).runShell()
      /workspace/lib/client/api.go:1956 +0x24f
  github.com/gravitational/teleport/lib/client.(*TeleportClient).SSH()
      /workspace/lib/client/api.go:1397 +0x5ec
  github.com/gravitational/teleport/integration.testPAM.func1.2()
      /workspace/integration/integration_test.go:3388 +0x34a
==================

This is a naive fix to serialise the write and close operations to resolve the race condition.

The affected tests were also not handling asynchronous error reporting correctly (i.e. it's not legal to call require.XYZ() from a goroutine other than the one executing the test function.). This patch introduces some plumbing to marshal asynchronous errors back into the main test routine before failing the test.

This is an incredibly naive way to address the race condition, an i hope that the review process will help bring out a better solution.

The race condition detector is being triped by a concurrent `Write` and
`Close` in the `PipeNetCon` in several integration tests. This is a naive
fix to serialise the write and close operations to resolve the race
condition.

The affected tests were also not handling asynchronous error reporting
correctly (i.e. it's not legal to call `require.XYZ()` from a goroutine
other than the one executing the test function.). This patch introduces
some plumbing to marshal asynchronous errors back into the main test
routine before failing the test.
Copy link
Collaborator

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts about changing to a RWMutex and acquiring a read lock for the Read call? Feels odd to synchronize Close/Write and not Read.

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved
lib/utils/pipenetconn.go Outdated Show resolved Hide resolved
Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Trent, sorry for the delay, I meant to get to this one yesterday. Thanks for once again fixing our tests.

I agree with Zac on the RWMutex for PipeNetConn - guarding all operations seems easier to reason about.

In regards to the assertions, I think we can do away with asyncError and asyncAssertion and use plain error to communicate what we need - see the suggestions and let me know your thoughts. Ideally we would "simply" run the assertions in the proper goroutine, but I'm assuming that is fairly difficult to achieve.

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved
integration/integration_test.go Outdated Show resolved Hide resolved
integration/integration_test.go Outdated Show resolved Hide resolved
integration/integration_test.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@tcsc tcsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would "simply" run the assertions in the proper goroutine, but I'm assuming that is fairly difficult to achieve.

In a perfect world, yes. Unfortunately, the go testing package doesn't help us. From the docs for testing.T:

A test ends when its Test function returns or calls any of the methods FailNow, Fatal, Fatalf, SkipNow, Skip, or Skipf. Those methods, as well as the Parallel method, must be called only from the goroutine running the Test function.

Given that testify is a pretty thin layer over these functions, it doesn't help in this case either.

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved
lib/utils/pipenetconn.go Outdated Show resolved Hide resolved
integration/integration_test.go Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@tcsc tcsc force-pushed the tcsc/net-pipe-race branch from 3714ead to 2fcd7f5 Compare October 27, 2021 00:26
@tcsc
Copy link
Contributor Author

tcsc commented Oct 27, 2021

Any thoughts about changing to a RWMutex and acquiring a read lock for the Read call? Feels odd to synchronize Close/Write and not Read.

I tried, and it introduces a bunch of deadlocks, which is reasonable as I'd expect calls to Read() and Write() to be independent. I've added some documentation as why we only care about serializing Read() and Close() to the PipeNetCon.

@tcsc tcsc requested a review from codingllama October 27, 2021 01:57
@tcsc tcsc requested a review from zmb3 October 27, 2021 01:57
Copy link
Contributor

@russjones russjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot.

@tcsc tcsc merged commit 5463c79 into master Oct 27, 2021
@tcsc tcsc deleted the tcsc/net-pipe-race branch October 27, 2021 22:38
tcsc added a commit that referenced this pull request Dec 1, 2021
The race condition detector is being tripped by a concurrent `Write` and
`Close` in the `PipeNetCon` in several integration tests. This is a naive
fix to serialize the write and close operations to resolve the race
condition.

The affected tests were also not handling asynchronous error reporting
correctly (i.e. it's not legal to call `require.XYZ()` from a goroutine
other than the one executing the test function.). This patch introduces
some plumbing to marshal asynchronous errors back into the main test
routine before failing the test.
tcsc added a commit that referenced this pull request Dec 1, 2021
The race condition detector is being tripped by a concurrent `Write` and
`Close` in the `PipeNetCon` in several integration tests. This is a naive
fix to serialize the write and close operations to resolve the race
condition.

The affected tests were also not handling asynchronous error reporting
correctly (i.e. it's not legal to call `require.XYZ()` from a goroutine
other than the one executing the test function.). This patch introduces
some plumbing to marshal asynchronous errors back into the main test
routine before failing the test.
tcsc added a commit that referenced this pull request Dec 1, 2021
Part of this change is implementing a "no secrets" policy for CI. Given that

  1.  we have to support CI for arbitrary external contributors, and
  2.  it is easy to craft a malicious PR that exfiltrates secrets during a CI build

any test that runs under CI must be able to do so without any injected secrets.

This means that several of the test we currently run under Drone will not be 
run on GCB, at least as part of the regular CI. The plan is to create a separate
task that periodically runs tests that require external credentials (e.g. Kube tests,
various backend data stores, etc.) in a more secure way and report failures
asynchronously. And while these tests will not run under CI, the should still be
built under CI so that required changes are caught during review.

Note: this backport includes various data race fixes added separately in the master branch:

See-Also: #8643
See-Also: #8888
See-Also: #9117
See-Also: #9119
tcsc added a commit that referenced this pull request Dec 13, 2021
russjones pushed a commit that referenced this pull request Dec 15, 2021
tcsc added a commit that referenced this pull request Dec 16, 2021
Part of this change is implementing a "no secrets" policy for CI. Given that

    we have to support CI for arbitrary external contributors, and
    it is easy to craft a malicious PR that exfiltrates secrets during a CI build

any test that runs under CI must be able to do so without any injected secrets.

This means that several of the test we currently run under Drone will not be run
on GCB, at least as part of the regular CI. The plan is to create a separate task
that periodically runs tests that require external credentials (e.g. Kube tests,
various backend data stores, etc.) in a more secure way and report failures
asynchronously. And while these tests will not run under CI, the should still be
built under CI so that required changes are caught during review.

See-Also: #8608
See-Also: #8643
See-Also: #8888
See-Also: #9117
See-Also: #9119
@russjones russjones mentioned this pull request Dec 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants