Fix race condition in PipeNetCon #8643

tcsc · 2021-10-18T11:07:47Z

The race condition detector is being triped by a concurrent Write and Close in the PipeNetCon in several integration tests.

Example race detector report:

==================
WARNING: DATA RACE
Read at 0x00c00307e060 by goroutine 318:
  golang.org/x/crypto/ssh.(*channel).WriteExtended()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:233 +0x5b
  golang.org/x/crypto/ssh.(*channel).Write()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:535 +0x88
  golang.org/x/crypto/ssh.(*sessionStdin).Write()
      <autogenerated>:1 +0x87
  github.com/gravitational/teleport/lib/utils.(*PipeNetConn).Write()
      /workspace/lib/utils/pipenetconn.go:56 +0x79
  github.com/gravitational/teleport/lib/client.(*NodeSession).pipeInOut.func2()
      /workspace/lib/client/session.go:618 +0x20d

Previous write at 0x00c00307e060 by goroutine 678:
  ??()
      -:0 +0xffffffffffffffff
  internal/poll.ignoringEINTRIO()
      /opt/go/src/internal/poll/fd_unix.go:581 +0x16e
  internal/poll.(*FD).Write()
      /opt/go/src/internal/poll/fd_unix.go:274 +0x294
  net.(*netFD).Write()
      /opt/go/src/net/fd_posix.go:73 +0x68
  net.(*conn).Write()
      /opt/go/src/net/net.go:195 +0xeb
  net.(*TCPConn).Write()
      <autogenerated>:1 +0x69
  bufio.(*Writer).Flush()
      /opt/go/src/bufio/bufio.go:607 +0x141
  golang.org/x/crypto/ssh.(*connectionState).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:182 +0x148
  golang.org/x/crypto/ssh.(*transport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:172 +0xbe
  golang.org/x/crypto/ssh.(*handshakeTransport).pushPacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:223 +0x52a
  golang.org/x/crypto/ssh.(*handshakeTransport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:516 +0x535
  golang.org/x/crypto/ssh.(*channel).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:215 +0x12c
  golang.org/x/crypto/ssh.(*channel).WriteExtended()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:271 +0x644
  golang.org/x/crypto/ssh.(*channel).Write()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:535 +0x88
  golang.org/x/crypto/ssh.(*sessionStdin).Write()
      <autogenerated>:1 +0x87
  github.com/gravitational/teleport/lib/utils.(*PipeNetConn).Write()
      /workspace/lib/utils/pipenetconn.go:56 +0x79
  bufio.(*Writer).Flush()
      /opt/go/src/bufio/bufio.go:607 +0x141
  golang.org/x/crypto/ssh.(*connectionState).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:182 +0x148
  golang.org/x/crypto/ssh.(*transport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/transport.go:172 +0xbe
  golang.org/x/crypto/ssh.(*handshakeTransport).pushPacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:223 +0x52a
  golang.org/x/crypto/ssh.(*handshakeTransport).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/handshake.go:516 +0x535
  golang.org/x/crypto/ssh.(*channel).writePacket()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:215 +0x12c
  golang.org/x/crypto/ssh.(*channel).sendMessage()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:227 +0x211
  golang.org/x/crypto/ssh.(*channel).SendRequest()
      /workspace/vendor/golang.org/x/crypto/ssh/channel.go:586 +0x1bc
  golang.org/x/crypto/ssh.(*Session).RequestPty()
      /workspace/vendor/golang.org/x/crypto/ssh/session.go:209 +0x4f9
  github.com/gravitational/teleport/lib/client.(*NodeSession).allocateTerminal()
      /workspace/lib/client/session.go:299 +0x13a
  github.com/gravitational/teleport/lib/client.(*NodeSession).interactiveSession()
      /workspace/lib/client/session.go:248 +0xf7
  github.com/gravitational/teleport/lib/client.(*NodeSession).runShell()
      /workspace/lib/client/session.go:441 +0x72
  github.com/gravitational/teleport/lib/client.(*TeleportClient).runShell()
      /workspace/lib/client/api.go:1956 +0x24f
  github.com/gravitational/teleport/lib/client.(*TeleportClient).SSH()
      /workspace/lib/client/api.go:1397 +0x5ec
  github.com/gravitational/teleport/integration.testPAM.func1.2()
      /workspace/integration/integration_test.go:3388 +0x34a

Goroutine 318 (running) created at:
  github.com/gravitational/teleport/lib/client.(*NodeSession).pipeInOut()
      /workspace/lib/client/session.go:592 +0x9a
  github.com/gravitational/teleport/lib/client.(*NodeSession).interactiveSession()
      /workspace/lib/client/session.go:271 +0x210
  github.com/gravitational/teleport/lib/client.(*NodeSession).runShell()
      /workspace/lib/client/session.go:441 +0x72
  github.com/gravitational/teleport/lib/client.(*TeleportClient).runShell()
      /workspace/lib/client/api.go:1956 +0x24f
  github.com/gravitational/teleport/lib/client.(*TeleportClient).SSH()
      /workspace/lib/client/api.go:1397 +0x5ec
  github.com/gravitational/teleport/integration.testPAM.func1.2()
      /workspace/integration/integration_test.go:3388 +0x34a
==================

This is a naive fix to serialise the write and close operations to resolve the race condition.

The affected tests were also not handling asynchronous error reporting correctly (i.e. it's not legal to call require.XYZ() from a goroutine other than the one executing the test function.). This patch introduces some plumbing to marshal asynchronous errors back into the main test routine before failing the test.

This is an incredibly naive way to address the race condition, an i hope that the review process will help bring out a better solution.

The race condition detector is being triped by a concurrent `Write` and `Close` in the `PipeNetCon` in several integration tests. This is a naive fix to serialise the write and close operations to resolve the race condition. The affected tests were also not handling asynchronous error reporting correctly (i.e. it's not legal to call `require.XYZ()` from a goroutine other than the one executing the test function.). This patch introduces some plumbing to marshal asynchronous errors back into the main test routine before failing the test.

zmb3

Any thoughts about changing to a RWMutex and acquiring a read lock for the Read call? Feels odd to synchronize Close/Write and not Read.

lib/utils/pipenetconn.go

codingllama

Hey Trent, sorry for the delay, I meant to get to this one yesterday. Thanks for once again fixing our tests.

I agree with Zac on the RWMutex for PipeNetConn - guarding all operations seems easier to reason about.

In regards to the assertions, I think we can do away with asyncError and asyncAssertion and use plain error to communicate what we need - see the suggestions and let me know your thoughts. Ideally we would "simply" run the assertions in the proper goroutine, but I'm assuming that is fairly difficult to achieve.

lib/utils/pipenetconn.go

integration/integration_test.go

tcsc

Ideally we would "simply" run the assertions in the proper goroutine, but I'm assuming that is fairly difficult to achieve.

In a perfect world, yes. Unfortunately, the go testing package doesn't help us. From the docs for testing.T:

A test ends when its Test function returns or calls any of the methods FailNow, Fatal, Fatalf, SkipNow, Skip, or Skipf. Those methods, as well as the Parallel method, must be called only from the goroutine running the Test function.

Given that testify is a pretty thin layer over these functions, it doesn't help in this case either.

lib/utils/pipenetconn.go

integration/integration_test.go

Makefile

integration/integration_test.go

tcsc · 2021-10-27T01:51:20Z

Any thoughts about changing to a RWMutex and acquiring a read lock for the Read call? Feels odd to synchronize Close/Write and not Read.

I tried, and it introduces a bunch of deadlocks, which is reasonable as I'd expect calls to Read() and Write() to be independent. I've added some documentation as why we only care about serializing Read() and Close() to the PipeNetCon.

russjones

Bot.

The race condition detector is being tripped by a concurrent `Write` and `Close` in the `PipeNetCon` in several integration tests. This is a naive fix to serialize the write and close operations to resolve the race condition. The affected tests were also not handling asynchronous error reporting correctly (i.e. it's not legal to call `require.XYZ()` from a goroutine other than the one executing the test function.). This patch introduces some plumbing to marshal asynchronous errors back into the main test routine before failing the test.

Part of this change is implementing a "no secrets" policy for CI. Given that 1. we have to support CI for arbitrary external contributors, and 2. it is easy to craft a malicious PR that exfiltrates secrets during a CI build any test that runs under CI must be able to do so without any injected secrets. This means that several of the test we currently run under Drone will not be run on GCB, at least as part of the regular CI. The plan is to create a separate task that periodically runs tests that require external credentials (e.g. Kube tests, various backend data stores, etc.) in a more secure way and report failures asynchronously. And while these tests will not run under CI, the should still be built under CI so that required changes are caught during review. Note: this backport includes various data race fixes added separately in the master branch: See-Also: #8643 See-Also: #8888 See-Also: #9117 See-Also: #9119

Part of this change is implementing a "no secrets" policy for CI. Given that we have to support CI for arbitrary external contributors, and it is easy to craft a malicious PR that exfiltrates secrets during a CI build any test that runs under CI must be able to do so without any injected secrets. This means that several of the test we currently run under Drone will not be run on GCB, at least as part of the regular CI. The plan is to create a separate task that periodically runs tests that require external credentials (e.g. Kube tests, various backend data stores, etc.) in a more secure way and report failures asynchronously. And while these tests will not run under CI, the should still be built under CI so that required changes are caught during review. See-Also: #8608 See-Also: #8643 See-Also: #8888 See-Also: #9117 See-Also: #9119

tcsc requested review from klizhentas, r0mant and russjones as code owners October 18, 2021 11:07

tcsc requested review from nklaassen, codingllama and fspmarshall October 18, 2021 11:09

tcsc added 2 commits October 19, 2021 08:52

Merge branch 'master' into tcsc/net-pipe-race

290df2d

Merge branch 'master' into tcsc/net-pipe-race

9e1c432

zmb3 reviewed Oct 19, 2021

View reviewed changes

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved

codingllama reviewed Oct 19, 2021

View reviewed changes

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved

integration/integration_test.go Outdated Show resolved Hide resolved

integration/integration_test.go Outdated Show resolved Hide resolved

integration/integration_test.go Outdated Show resolved Hide resolved

tcsc added 2 commits October 20, 2021 10:50

Merge branch 'master' into tcsc/net-pipe-race

0a949a2

Merge remote-tracking branch 'origin/master' into tcsc/net-pipe-race

15ee115

tcsc commented Oct 21, 2021

View reviewed changes

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved

lib/utils/pipenetconn.go Outdated Show resolved Hide resolved

integration/integration_test.go Outdated Show resolved Hide resolved

tcsc added 7 commits October 21, 2021 17:30

Review feedback

f808533

linter appeasement

a295ddd

Remuve spurious exit code

9c43548

Explicitly set 0 on successful exit

e45b74d

disable test cache

8a1b0a3

disable test cache

b854a14

disable test cache

67d4f8e

zmb3 reviewed Oct 21, 2021

View reviewed changes

Makefile Outdated Show resolved Hide resolved

zmb3 reviewed Oct 21, 2021

View reviewed changes

integration/integration_test.go Outdated Show resolved Hide resolved

tcsc added 2 commits October 22, 2021 12:38

stash

d6ee9e7

Merge remote-tracking branch 'origin/master' into tcsc/net-pipe-race

2fcd7f5

tcsc force-pushed the tcsc/net-pipe-race branch from 3714ead to 2fcd7f5 Compare October 27, 2021 00:26

tcsc added 2 commits October 27, 2021 12:00

Merge branch 'master' into tcsc/net-pipe-race

3f053e3

Revert read lock

a25b025

tcsc requested a review from codingllama October 27, 2021 01:57

tcsc requested a review from zmb3 October 27, 2021 01:57

codingllama approved these changes Oct 27, 2021

View reviewed changes

zmb3 approved these changes Oct 27, 2021

View reviewed changes

russjones approved these changes Oct 27, 2021

View reviewed changes

Merge branch 'master' into tcsc/net-pipe-race

051ba69

tcsc merged commit 5463c79 into master Oct 27, 2021

tcsc deleted the tcsc/net-pipe-race branch October 27, 2021 22:38

tcsc mentioned this pull request Dec 1, 2021

[Backport] Google CloudBuild support (#9090) #9165

Merged

tcsc mentioned this pull request Dec 1, 2021

[Backport] Google CloudBuild support + build fixes (#9090) #9166

Merged

tcsc mentioned this pull request Dec 13, 2021

[Backport] Google CloudBuild support + Build fixes (#9090) #9167

Merged

tcsc added a commit that referenced this pull request Dec 13, 2021

Partial backport of #8643

7eb8e77

russjones pushed a commit that referenced this pull request Dec 15, 2021

Partial backport of #8643

19c0635

russjones mentioned this pull request Dec 19, 2021

Flaky Test Tracker #9492

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition in PipeNetCon #8643

Fix race condition in PipeNetCon #8643

tcsc commented Oct 18, 2021

zmb3 left a comment

codingllama left a comment

tcsc left a comment

tcsc commented Oct 27, 2021 •

edited

Loading

russjones left a comment

Fix race condition in PipeNetCon #8643

Fix race condition in PipeNetCon #8643

Conversation

tcsc commented Oct 18, 2021

zmb3 left a comment

Choose a reason for hiding this comment

codingllama left a comment

Choose a reason for hiding this comment

tcsc left a comment

Choose a reason for hiding this comment

tcsc commented Oct 27, 2021 • edited Loading

russjones left a comment

Choose a reason for hiding this comment

tcsc commented Oct 27, 2021 •

edited

Loading