data race in TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge #2711

zmt · 2022-01-31T19:36:11Z

Version: 2067119
Platform: Linux [redacted] 5.4.129+ #1 SMP Wed Aug 18 19:58:18 PDT 2021 x86_64 GNU/Linux
Subsystem: tpm nodeattestor plugin

The TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge exhibits a data race between the code under test and the t.Cleanup call to sim.Close(). That is the only subtest which uses streambuilder.FailAndBuild from the nodeattestortest package.

See example failure.

Discovered while addressing #2379.

The text was updated successfully, but these errors were encountered:

zmt · 2022-01-31T19:38:04Z

Also note typo in test name: "Failiures".

zmt · 2022-01-31T21:44:08Z

I may have a simple patch for this.

zmt · 2022-01-31T21:49:59Z

Nope. If I remove the t.Cleanup(...sim.Close()...) in favor of a defer sim.Close() after each call to setupSimulator, it avoids the data race, but makes a different subtest flaky with or without the race detector:

% go test -race -count 10 -failfast ./pkg/agent/plugin/nodeattestor/tpmdevid
--- FAIL: TestAidAttestationFailures (0.48s)
    --- FAIL: TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started (0.48s)
        devid_test.go:467:
            	Error Trace:	devid_test.go:467
            	Error:      	"rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to get endorsement certificate: failed to read NV index 01c00002: GetCapability for TPM_PT_NV_BUFFER_MAX failed: attempting to use a closed simulator" does not contain "rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to start a new TPM session: cannot load DevID key on TPM"
            	Test:       	TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started
FAIL
FAIL	github.com/spiffe/spire/pkg/agent/plugin/nodeattestor/tpmdevid	9.383s
% go test -count 10 -failfast ./pkg/agent/plugin/nodeattestor/tpmdevid
--- FAIL: TestAidAttestationFailures (0.35s)
    --- FAIL: TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started (0.35s)
        devid_test.go:467:
            	Error Trace:	devid_test.go:467
            	Error:      	"rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to get endorsement certificate: failed to read NV index 01c00002: GetCapability for TPM_PT_NV_BUFFER_MAX failed: attempting to use a closed simulator" does not contain "rpc error: code = Internal desc = nodeattestor(tpm_devid): unable to start a new TPM session: cannot load DevID key on TPM"
            	Test:       	TestAidAttestationFailures/AidAttestation_fails_if_a_new_session_cannot_be_started
FAIL
FAIL	github.com/spiffe/spire/pkg/agent/plugin/nodeattestor/tpmdevid	7.081s

zmt · 2022-01-31T22:12:28Z

Also see #2379 & #2705.

azdagron · 2022-02-01T17:07:10Z

Thanks for opening this @zmt !

This race is ultimately a deficiency in the way the TPMSimulator wrapper has been designed. When the test executes, the expected error condition is met, and the test closes the TPMSimulator. Unfortunately, the AidAttestation handler is still being executed concurrently and closes its session, causing the data race. The TPM simulator is not designed to be accessed from more than one goroutine. The plugin is "unloaded" before the TPM simulator is torn down, however, the AidAttestation handler is still running is because the catalog only calls Stop() on the plugin gRPC server, which does not wait for pending RPCs to complete before returning.

We can and should restructure the tpm simulator wrapper and the way it is used that it is only handled by a single goroutine, but either way, this race can be solved by changing the catalog to perform a GracefulStop(), which I think is probably the right thing to do anyway.

I'll put up a PR.

SPIRE currently calls `Stop` on the plugin servers. This immediately closes the server and cancels outstanding RPCs. However, `Stop` does not wait for RPCs to complete (i.e. respond to the cancelation). This change updates SPIRE to instead call `GracefulStop`, which will allow outstanding RPCs a chance to complete. There is _some_ risk with this change, i.e. that a stalled plugin could prevent the server from shutting down. However, this can (and should) be mitigated by instituting timeouts on plugin requests, which we have already added for all KeyManager requests. Other timeouts can be added if necessary. In order to facilitate debugging this kind of scenario, I've added logging when plugins are being unloaded. There is an additional benefit to this change, which is that unit tests can know that when a plugin is unloaded that there is no longer any executing plugin code. Not having this feature makes it harder, in certain circumstances, to construct stable (and data-race free) tests (see spiffe#2711). Signed-off-by: Andrew Harding <aharding@vmware.com>

zmt · 2022-02-01T20:38:56Z

The TPM simulator is not designed to be accessed from more than one goroutine.

I noticed it wasn't safe for concurrency and that the wrapper followed suit. My instinct was to add a mutex to the wrapper, but wasn't sure that was the right direction. I was also looking at a hack to skip the t.Cleanup() for the racy subtest in favor of a defer sim.Close(). When I changed everything to a defer sim.Close() other tests broke so was going to try a janky conditional thing. Your PR is def. a big step up from the hack I was considering to avoid the race.

SPIRE currently calls `Stop` on the plugin servers. This immediately closes the server and cancels outstanding RPCs. However, `Stop` does not wait for RPCs to complete (i.e. respond to the cancelation). This change updates SPIRE to instead call `GracefulStop`, which will allow outstanding RPCs a chance to complete. There is _some_ risk with this change, i.e. that a stalled plugin could prevent the server from shutting down. However, this can (and should) be mitigated by instituting timeouts on plugin requests, which we have already added for all KeyManager requests. Other timeouts can be added if necessary. In order to facilitate debugging this kind of scenario, I've added logging when plugins are being unloaded. There is an additional benefit to this change, which is that unit tests can know that when a plugin is unloaded that there is no longer any executing plugin code. Not having this feature makes it harder, in certain circumstances, to construct stable (and data-race free) tests (see #2711). Signed-off-by: Andrew Harding <aharding@vmware.com>

zmt · 2022-03-11T01:29:09Z

I can no longer reproduce this. I guess #2722 fixed it.

azdagron mentioned this issue Feb 1, 2022

Gracefully stop plugin servers #2722

Merged

rturner3 added the flaky-test label Feb 1, 2022

zmt closed this as completed Mar 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data race in TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge #2711

data race in TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge #2711

zmt commented Jan 31, 2022 •

edited

Loading

zmt commented Jan 31, 2022

zmt commented Jan 31, 2022

zmt commented Jan 31, 2022 •

edited

Loading

zmt commented Jan 31, 2022

azdagron commented Feb 1, 2022

zmt commented Feb 1, 2022

zmt commented Mar 11, 2022

data race in TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge #2711

data race in TestAidAttestationFailiures/AidAttestation_fails_if_server_does_not_sends_a_challenge #2711

Comments

zmt commented Jan 31, 2022 • edited Loading

zmt commented Jan 31, 2022

zmt commented Jan 31, 2022

zmt commented Jan 31, 2022 • edited Loading

zmt commented Jan 31, 2022

azdagron commented Feb 1, 2022

zmt commented Feb 1, 2022

zmt commented Mar 11, 2022

zmt commented Jan 31, 2022 •

edited

Loading

zmt commented Jan 31, 2022 •

edited

Loading