Windows test instability - hard to reproduce #163

brotskydotcom · 2024-01-02T07:07:31Z

I've noticed that the tests will occasionally fail on Windows. This is not repeatable (sometime it happens twice in a row, and sometimes it doesn't happen once in a hundred runs), and I'm not getting reports of failures from the field.

It most often happens in the explicitly multi-threaded tests (test_simultaneous_create_set_move and test_simultaneous_independent_create_set). I've seen it happen in the single-threaded tests as well (test_empty_service_and_user and test_update), but only when the single-threaded tests are being run in a multi-thread test environment, never when --test-threads=1

Almost all the tests set a password and then immediately read it back and delete it. The nature of the failure is that the read or delete gets a NoEntry failure. (When the delete gets a NoEntry failure, there is a following read that also fails because it expects to get no entry but finds one.)

The most common failure situation is that an entry is created and set on one thread and then passed into another where it is read and deleted. If I run the tests single-threaded then I can consistently get test_simultaneous_create_set_move to fail at threading.rs line 75 where the first thing that happens in the spawned thread is the read. But if I run the tests multi-threaded this rarely fails.

It feels like there is some sort of race condition between the threads in the keyring process and the threads in the windows credential manager, so that when the keyring process makes a call there's a thread in wincred that gets created to handle that call, and threads in wincred may run out of order relative to when they are started (so the read is serialized before the set even though they weren't spawned that way).

I don't see this happen on platforms other than windows. All help appreciated.

The text was updated successfully, but these errors were encountered:

brotskydotcom · 2024-01-02T07:43:59Z

@russellbanks I would be interested in your take on this. It certainly seems to be thread-related, because if I move the set of threading.rs line 73 one line down into the closure (see this link) and I run single-threaded I never get errors. So that suggests that calls made on the same entry from different threads to the windows credential manager are not synchronized with the execution of the calling threads.

But then again: I also get errors in basic tests from time to time if I run the tests multi-threaded. But the basic tests never make calls to the same entry from different threads.

I find the whole situation very puzzling.

brotskydotcom · 2024-01-02T07:45:29Z

It's probably worth noting that I'm running my windows tests on a simulation (Parallels Windows 11 on a Mac M1 machine). So it would be good to know if these tests also fail on windows hardware.

russellbanks · 2024-01-02T13:45:36Z

It's probably worth noting that I'm running my windows tests on a simulation (Parallels Windows 11 on a Mac M1 machine). So it would be good to know if these tests also fail on windows hardware.

I've been able to reproduce this on a Windows machine by rerunning the tests until it fails. It's confusing because they pass most of the time.

ReactorScram · 2024-01-23T18:39:50Z

https://github.com/ReactorScram/keyring-rs-repro/actions/runs/7630294543/job/20785648983

It fails a little different in CI than on my laptop. Maybe the CI CPUs are faster or slower or something, but it does seem to fail pretty often.

brotskydotcom · 2024-01-26T18:09:41Z

It's pretty clear that the Windows test failures are all related to setting a credential on one thread and then immediately accessing that credential on a different thread. While this feels, to me, like it's clearly a bug in the Windows credential manager, it's also one that's unlikely to affect keyring clients adversely, as this should be a very uncommon usage pattern and certainly one that clients can avoid.

I have submitted PR #164 which removes the problematic behavior from the Windows integration tests, and adds a feature which can be used to run the tests. I would appreciate if @ReactorScram and @russellbanks could retry the tests in their environment to confirm that this does "fix" the tests.

I will see if I can report the underlying behavior as a bug in the Window credential manager - any help in how to do that appreciated! :)

Once I've heard back from some others that this fixes the tests for them, I will release it and re-close this bug.

Fixes hwchen#163.

ReactorScram · 2024-01-26T18:42:10Z

@brotskydotcom But in my case I do set_password and get_password on the same credential and within the same thread, and it still failed https://github.com/ReactorScram/keyring-rs-repro/blob/main/src/main.rs

I think this only changes how the bug gets reported to Microsoft, and how to document it. If it can happen in a single thread, then two library authors could read the keyring-rs documentation and do everything right within their own code and their own thread, and then an app using both libraries in separate threads could cause them to fail even though they're both individually correct.

I speculated in a deleted comment that maybe Windows disables its page cache for the vault files for security, which can cause it to read stale data from disk in some scenarios. (AIUI most OS page caches take responsibility for making files look consistent even if they aren't flushed out to disk, and I'm guessing disabling the page cache is a code path that's almost never exercised)

I have no idea how to report bugs to Microsoft, though. If nothing else I'd say open an issue on the windows-rs Github and let them redirect you, at least Github is probably checked regularly :P

Fixes hwchen#163.

brotskydotcom · 2024-01-27T20:38:42Z

But in my case I do set_password and get_password on the same credential and within the same thread, and it still failed ReactorScram/keyring-rs-repro@main/src/main.rs

Your tests are doing a bit more than just set, get, and delete on the same credential; they are actually re-creating the (rust-level) entry on every call. I'm not sure why this would make a difference, but in my testing it absolutely does. If I simply use the same keyring entry I can easily set/get/delete/get thousands of times on the same thread, and even simultaneously on different threads, without failures.

I've updated the tests on all platforms to do multiple set/get/delete/get cycles on each platform (fewer on linux, because there are capacity limits in the inter-process calls to dbus for secret-service, and to the kernel for keyutils). I'm going to release this version and I'd be very interested in what happens if you update your regression tests not to create a new entry on each call.

ReactorScram · 2024-01-29T16:58:49Z

@brotskydotcom That's interesting. I hoisted the new_with_target call and they still fail on my dev laptop and on CI. (I'm still on keyring 2.3.1 for these) https://github.com/ReactorScram/keyring-rs-repro/actions/runs/7699494265

I ran the official tests for keyring 252f43c3e74 on the same laptop and they passed.

Just for science, I tried putting a SeqCst fence in front of each call, but it still failed.

brotskydotcom · 2024-01-31T02:10:01Z

@ReactorScram Thanks for spending so much time looking at this! Why is it always Windows where this stuff happens? I'm very glad to hear the tests are working for you in 2.3.2. I will try to report this problem to Microsoft.

brotskydotcom added the help wanted label Jan 2, 2024

This was referenced Jan 22, 2024

bug(windows): test_keyring unit test is flaky firezone/firezone#3256

Closed

wip(windows): #3256 replication firezone/firezone#3344

Closed

brotskydotcom mentioned this issue Jan 26, 2024

Track down and avoid the windows test failures #164

Merged

brotskydotcom closed this as completed in #164 Jan 26, 2024

brotskydotcom reopened this Jan 26, 2024

brotskydotcom self-assigned this Jan 26, 2024

brotskydotcom added a commit to brotskydotcom/keyring-rs that referenced this issue Jan 26, 2024

Prepare to release Windows test fixes.

0054e43

Fixes hwchen#163.

brotskydotcom mentioned this issue Jan 26, 2024

Release Windows test fixes. #165

Merged

brotskydotcom added a commit to brotskydotcom/keyring-rs that referenced this issue Jan 27, 2024

Prepare to release Windows test fixes.

de0e019

Fixes hwchen#163.

brotskydotcom closed this as completed in #165 Jan 27, 2024

ReactorScram mentioned this issue Jan 29, 2024

build(deps): Bump keyring from 2.3.1 to 2.3.2 in /rust firezone/firezone#3419

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Windows test instability - hard to reproduce #163

Windows test instability - hard to reproduce #163

brotskydotcom commented Jan 2, 2024

brotskydotcom commented Jan 2, 2024

brotskydotcom commented Jan 2, 2024

russellbanks commented Jan 2, 2024

ReactorScram commented Jan 23, 2024

brotskydotcom commented Jan 26, 2024

ReactorScram commented Jan 26, 2024

brotskydotcom commented Jan 27, 2024

ReactorScram commented Jan 29, 2024

brotskydotcom commented Jan 31, 2024

Windows test instability - hard to reproduce #163

Windows test instability - hard to reproduce #163

Comments

brotskydotcom commented Jan 2, 2024

brotskydotcom commented Jan 2, 2024

brotskydotcom commented Jan 2, 2024

russellbanks commented Jan 2, 2024

ReactorScram commented Jan 23, 2024

brotskydotcom commented Jan 26, 2024

ReactorScram commented Jan 26, 2024

brotskydotcom commented Jan 27, 2024

ReactorScram commented Jan 29, 2024

brotskydotcom commented Jan 31, 2024