-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] FileSettingsServiceTests testProcessFileChanges failing #115280
Comments
Pinging @elastic/es-core-infra (Team:Core/Infra) |
This has been muted on branch 8.x Mute Reasons:
Build Scans: |
… testProcessFileChanges #115280
I've cherry-picked the mute back to the |
… testProcessFileChanges #115280
This has been muted on branch main Mute Reasons:
Build Scans:
|
… testProcessFileChanges #115280
@mark-vieira thanks! I got a fix in progress for this. |
… testProcessFileChanges elastic#115280
Fix unit test on windows: it looks like the replace-existing flag is necessary to avoid AccessDeniedExceptions like this [example failure](https://gradle-enterprise.elastic.co/s/4tjgx5vzblv36/tests/task/:server:test/details/org.elasticsearch.reservedstate.service.FileSettingsServiceTests/testProcessFileChanges?top-execution=1). Resolves: elastic#115280
Fix unit test on windows: it looks like the replace-existing flag is necessary to avoid AccessDeniedExceptions like this [example failure](https://gradle-enterprise.elastic.co/s/4tjgx5vzblv36/tests/task/:server:test/details/org.elasticsearch.reservedstate.service.FileSettingsServiceTests/testProcessFileChanges?top-execution=1). Resolves: elastic#115280
Fix unit test on windows: it looks like the replace-existing flag is necessary to avoid AccessDeniedExceptions like this [example failure](https://gradle-enterprise.elastic.co/s/4tjgx5vzblv36/tests/task/:server:test/details/org.elasticsearch.reservedstate.service.FileSettingsServiceTests/testProcessFileChanges?top-execution=1). Resolves: elastic#115280
… testProcessFileChanges #115280
This has been muted on branch 8.x Mute Reasons:
Build Scans: |
Ughhhhh. Okay windows. I shall try again. |
… testProcessFileChanges elastic#115280
Fix unit test on windows: it looks like the replace-existing flag is necessary to avoid AccessDeniedExceptions like this [example failure](https://gradle-enterprise.elastic.co/s/4tjgx5vzblv36/tests/task/:server:test/details/org.elasticsearch.reservedstate.service.FileSettingsServiceTests/testProcessFileChanges?top-execution=1). Resolves: elastic#115280
@thecoop sorry -- this is not completed yet. It's failing after the initial round of fixes, and I'm still investigating. |
Yeah, you're right I don't think two concurrent calls can happen. It might still be the case that |
… testProcessFileChanges elastic#115280
Fix unit test on windows: it looks like the replace-existing flag is necessary to avoid AccessDeniedExceptions like this [example failure](https://gradle-enterprise.elastic.co/s/4tjgx5vzblv36/tests/task/:server:test/details/org.elasticsearch.reservedstate.service.FileSettingsServiceTests/testProcessFileChanges?top-execution=1). Resolves: elastic#115280
This PR addresses some of the failure causes tracked under elastic#115280 and elastic#115725: the latch-await setup was rather convoluted and the move command not always correctly invoked in the right order. This PR cleans up latching by separating awaiting the first processing call (on start) from waiting on the subsequent call. Also, it makes writing the file more robust w.r.t. OS'es where `atomic_move` may not be available. This should address failures around the timeout await, and the assertion failures around invoked methods tracked here: elastic#115725 (comment) But will likely require another round of changes to address the failures to delete files. Relates: elastic#115280 Relates: elastic#115725
… testProcessFileChanges elastic#115280
Does not reproduce for me, even after many many iterations on a buildkite instance. I'm going with more logging (#116192)... |
#115280 still mystifies me. It does not reproduce on a buildkite instance either. This PR turns up logging for the file watching service steps (in particular the polled events should be useful) and adds some logging to the test.
This has been muted on branch main Mute Reasons:
Build Scans: |
… testProcessFileChanges #115280
Failure was expected (I only included logging to get more info in the last PR). Let's see what the logs tell us... |
Interesting, based on the latest failure, on the second file write (for the update):
The exception is thrown by So I think what we're hitting here is:
And WindowsFS fails this step because you can't overwrite a file while it's being read. So the fix should be to avoid this whole business of creating temp files in the operator directory and moving; instead, we can create the settings file once, then when we want an update, we can just "touch" it (e.g., set a new timestamp), which should avoid the complexity of multiple file operations and still test what we actually want to test. |
This is great @n1v0lg! So your fix will prevent the error from happening on Windows by addressing the underlying root cause (the undesired race), correct? |
@ldematte I really really hope so 🤞 |
This test resolves two race conditions in `FileSettingsServiceTests#testProcessFileChanges`: 1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](#115280 (comment))). 2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed. Resolves: #115280
Re-opening until I backport to 8.16 and 8.x |
Okay backports done. Closing. Hope it's fixed for real this time... |
This test resolves two race conditions in `FileSettingsServiceTests#testProcessFileChanges`: 1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](elastic#115280 (comment))). 2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed. Resolves: elastic#115280
#115280 still mystifies me. It does not reproduce on a buildkite instance either. This PR turns up logging for the file watching service steps (in particular the polled events should be useful) and adds some logging to the test.
This test resolves two race conditions in `FileSettingsServiceTests#testProcessFileChanges`: 1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](#115280 (comment))). 2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed. Resolves: #115280
elastic#115280 still mystifies me. It does not reproduce on a buildkite instance either. This PR turns up logging for the file watching service steps (in particular the polled events should be useful) and adds some logging to the test.
… testProcessFileChanges elastic#115280
This test resolves two race conditions in `FileSettingsServiceTests#testProcessFileChanges`: 1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](elastic#115280 (comment))). 2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed. Resolves: elastic#115280
Build Scans:
Reproduction Line:
Applicable branches:
main
Reproduces locally?:
N/A
Failure History:
See dashboard
Failure Message:
Issue Reasons:
Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.
The text was updated successfully, but these errors were encountered: