Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data race in TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge #2711

Closed
zmt opened this issue Jan 31, 2022 · 7 comments

Comments

@zmt
Copy link
Contributor

zmt commented Jan 31, 2022

  • Version: 2067119
  • Platform: Linux [redacted] 5.4.129+ #1 SMP Wed Aug 18 19:58:18 PDT 2021 x86_64 GNU/Linux
  • Subsystem: tpm nodeattestor plugin

The TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge exhibits a data race between the code under test and the t.Cleanup call to sim.Close(). That is the only subtest which uses streambuilder.FailAndBuild from the nodeattestortest package.

See example failure.

Discovered while addressing #2379.

@zmt
Copy link
Contributor Author

zmt commented Jan 31, 2022

Also note typo in test name: "Failiures".

@zmt
Copy link
Contributor Author

zmt commented Jan 31, 2022

I may have a simple patch for this.

@zmt
Copy link
Contributor Author

zmt commented Jan 31, 2022

Nope. If I remove the t.Cleanup(...sim.Close()...) in favor of a defer sim.Close() after each call to setupSimulator, it avoids the data race, but makes a different subtest flaky with or without the race detector:

% go test -race -count 10 -failfast ./pkg/agent/plugin/nodeattestor/tpmdevid
--- FAIL: TestAidAttestationFailures (0.48s)
    --- FAIL: TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started (0.48s)
        devid_test.go:467:
            	Error Trace:	devid_test.go:467
            	Error:      	"rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to get endorsement certificate: failed to read NV index 01c00002: GetCapability for TPM_PT_NV_BUFFER_MAX failed: attempting to use a closed simulator" does not contain "rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to start a new TPM session: cannot load DevID key on TPM"
            	Test:       	TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started
FAIL
FAIL	github.com/spiffe/spire/pkg/agent/plugin/nodeattestor/tpmdevid	9.383s
% go test -count 10 -failfast ./pkg/agent/plugin/nodeattestor/tpmdevid
--- FAIL: TestAidAttestationFailures (0.35s)
    --- FAIL: TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started (0.35s)
        devid_test.go:467:
            	Error Trace:	devid_test.go:467
            	Error:      	"rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to get endorsement certificate: failed to read NV index 01c00002: GetCapability for TPM_PT_NV_BUFFER_MAX failed: attempting to use a closed simulator" does not contain "rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to start a new TPM session: cannot load DevID key on TPM"
            	Test:       	TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started
FAIL
FAIL	github.com/spiffe/spire/pkg/agent/plugin/nodeattestor/tpmdevid	7.081s

@zmt
Copy link
Contributor Author

zmt commented Jan 31, 2022

Also see #2379 & #2705.

@azdagron
Copy link
Member

azdagron commented Feb 1, 2022

Thanks for opening this @zmt !

This race is ultimately a deficiency in the way the TPMSimulator wrapper has been designed. When the test executes, the expected error condition is met, and the test closes the TPMSimulator. Unfortunately, the AidAttestation handler is still being executed concurrently and closes its session, causing the data race. The TPM simulator is not designed to be accessed from more than one goroutine. The plugin is "unloaded" before the TPM simulator is torn down, however, the AidAttestation handler is still running is because the catalog only calls Stop() on the plugin gRPC server, which does not wait for pending RPCs to complete before returning.

We can and should restructure the tpm simulator wrapper and the way it is used that it is only handled by a single goroutine, but either way, this race can be solved by changing the catalog to perform a GracefulStop(), which I think is probably the right thing to do anyway.

I'll put up a PR.

azdagron added a commit to azdagron/spire that referenced this issue Feb 1, 2022
SPIRE currently calls `Stop` on the plugin servers. This immediately
closes the server and cancels outstanding RPCs. However, `Stop` does not
wait for RPCs to complete (i.e. respond to the cancelation).

This change updates SPIRE to instead call `GracefulStop`, which will
allow outstanding RPCs a chance to complete.

There is _some_ risk with this change, i.e. that a stalled plugin could
prevent the server from shutting down. However, this can (and should) be
mitigated by instituting timeouts on plugin requests, which we have
already added for all KeyManager requests. Other timeouts can be added
if necessary. In order to facilitate debugging this kind of scenario,
I've added logging when plugins are being unloaded.

There is an additional benefit to this change, which is that unit tests
can know that when a plugin is unloaded that there is no longer any
executing plugin code. Not having this feature makes it harder, in
certain circumstances, to construct stable (and data-race free) tests
(see spiffe#2711).

Signed-off-by: Andrew Harding <aharding@vmware.com>
@zmt
Copy link
Contributor Author

zmt commented Feb 1, 2022

The TPM simulator is not designed to be accessed from more than one goroutine.

I noticed it wasn't safe for concurrency and that the wrapper followed suit. My instinct was to add a mutex to the wrapper, but wasn't sure that was the right direction. I was also looking at a hack to skip the t.Cleanup() for the racy subtest in favor of a defer sim.Close(). When I changed everything to a defer sim.Close() other tests broke so was going to try a janky conditional thing. Your PR is def. a big step up from the hack I was considering to avoid the race.

azdagron added a commit that referenced this issue Feb 4, 2022
SPIRE currently calls `Stop` on the plugin servers. This immediately
closes the server and cancels outstanding RPCs. However, `Stop` does not
wait for RPCs to complete (i.e. respond to the cancelation).

This change updates SPIRE to instead call `GracefulStop`, which will
allow outstanding RPCs a chance to complete.

There is _some_ risk with this change, i.e. that a stalled plugin could
prevent the server from shutting down. However, this can (and should) be
mitigated by instituting timeouts on plugin requests, which we have
already added for all KeyManager requests. Other timeouts can be added
if necessary. In order to facilitate debugging this kind of scenario,
I've added logging when plugins are being unloaded.

There is an additional benefit to this change, which is that unit tests
can know that when a plugin is unloaded that there is no longer any
executing plugin code. Not having this feature makes it harder, in
certain circumstances, to construct stable (and data-race free) tests
(see #2711).

Signed-off-by: Andrew Harding <aharding@vmware.com>
@zmt
Copy link
Contributor Author

zmt commented Mar 11, 2022

I can no longer reproduce this. I guess #2722 fixed it.

@zmt zmt closed this as completed Mar 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants