-
-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Windows test instability - hard to reproduce #163
Comments
@russellbanks I would be interested in your take on this. It certainly seems to be thread-related, because if I move the set of threading.rs line 73 one line down into the closure (see this link) and I run single-threaded I never get errors. So that suggests that calls made on the same entry from different threads to the windows credential manager are not synchronized with the execution of the calling threads. But then again: I also get errors in basic tests from time to time if I run the tests multi-threaded. But the basic tests never make calls to the same entry from different threads. I find the whole situation very puzzling. |
It's probably worth noting that I'm running my windows tests on a simulation (Parallels Windows 11 on a Mac M1 machine). So it would be good to know if these tests also fail on windows hardware. |
I've been able to reproduce this on a Windows machine by rerunning the tests until it fails. It's confusing because they pass most of the time. |
It fails a little different in CI than on my laptop. Maybe the CI CPUs are faster or slower or something, but it does seem to fail pretty often. |
It's pretty clear that the Windows test failures are all related to setting a credential on one thread and then immediately accessing that credential on a different thread. While this feels, to me, like it's clearly a bug in the Windows credential manager, it's also one that's unlikely to affect keyring clients adversely, as this should be a very uncommon usage pattern and certainly one that clients can avoid. I have submitted PR #164 which removes the problematic behavior from the Windows integration tests, and adds a feature which can be used to run the tests. I would appreciate if @ReactorScram and @russellbanks could retry the tests in their environment to confirm that this does "fix" the tests. I will see if I can report the underlying behavior as a bug in the Window credential manager - any help in how to do that appreciated! :) Once I've heard back from some others that this fixes the tests for them, I will release it and re-close this bug. |
@brotskydotcom But in my case I do I think this only changes how the bug gets reported to Microsoft, and how to document it. If it can happen in a single thread, then two library authors could read the I speculated in a deleted comment that maybe Windows disables its page cache for the vault files for security, which can cause it to read stale data from disk in some scenarios. (AIUI most OS page caches take responsibility for making files look consistent even if they aren't flushed out to disk, and I'm guessing disabling the page cache is a code path that's almost never exercised) I have no idea how to report bugs to Microsoft, though. If nothing else I'd say open an issue on the |
Your tests are doing a bit more than just set, get, and delete on the same credential; they are actually re-creating the (rust-level) entry on every call. I'm not sure why this would make a difference, but in my testing it absolutely does. If I simply use the same keyring entry I can easily set/get/delete/get thousands of times on the same thread, and even simultaneously on different threads, without failures. I've updated the tests on all platforms to do multiple set/get/delete/get cycles on each platform (fewer on linux, because there are capacity limits in the inter-process calls to dbus for secret-service, and to the kernel for keyutils). I'm going to release this version and I'd be very interested in what happens if you update your regression tests not to create a new entry on each call. |
@brotskydotcom That's interesting. I hoisted the I ran the official tests for keyring Just for science, I tried putting a SeqCst fence in front of each call, but it still failed. |
@ReactorScram Thanks for spending so much time looking at this! Why is it always Windows where this stuff happens? I'm very glad to hear the tests are working for you in 2.3.2. I will try to report this problem to Microsoft. |
I've noticed that the tests will occasionally fail on Windows. This is not repeatable (sometime it happens twice in a row, and sometimes it doesn't happen once in a hundred runs), and I'm not getting reports of failures from the field.
It most often happens in the explicitly multi-threaded tests (
test_simultaneous_create_set_move
andtest_simultaneous_independent_create_set
). I've seen it happen in the single-threaded tests as well (test_empty_service_and_user
andtest_update
), but only when the single-threaded tests are being run in a multi-thread test environment, never when--test-threads=1
Almost all the tests set a password and then immediately read it back and delete it. The nature of the failure is that the read or delete gets a
NoEntry
failure. (When the delete gets a NoEntry failure, there is a following read that also fails because it expects to get no entry but finds one.)The most common failure situation is that an entry is created and set on one thread and then passed into another where it is read and deleted. If I run the tests single-threaded then I can consistently get
test_simultaneous_create_set_move
to fail at threading.rs line 75 where the first thing that happens in the spawned thread is the read. But if I run the tests multi-threaded this rarely fails.It feels like there is some sort of race condition between the threads in the keyring process and the threads in the windows credential manager, so that when the keyring process makes a call there's a thread in wincred that gets created to handle that call, and threads in wincred may run out of order relative to when they are started (so the read is serialized before the set even though they weren't spawned that way).
I don't see this happen on platforms other than windows. All help appreciated.
The text was updated successfully, but these errors were encountered: