Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] FileSettingsServiceTests testProcessFileChanges failing #115280

Closed
elasticsearchmachine opened this issue Oct 21, 2024 · 24 comments · Fixed by #115234 or #116309
Closed

[CI] FileSettingsServiceTests testProcessFileChanges failing #115280

elasticsearchmachine opened this issue Oct 21, 2024 · 24 comments · Fixed by #115234 or #116309
Assignees
Labels
:Core/Infra/Settings Settings infrastructure and APIs low-risk An open issue or test failure that is a low risk to future releases Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Copy link
Collaborator

elasticsearchmachine commented Oct 21, 2024

Build Scans:

Reproduction Line:

gradlew ":server:test" --tests "org.elasticsearch.reservedstate.service.FileSettingsServiceTests.testProcessFileChanges" -Dtests.seed=E175BB524F09E5B4 -Dtests.locale=ky -Dtests.timezone=CNT -Druntime.java=23

Applicable branches:
main

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.io.UncheckedIOException: could not write file [C:\bk\server\build\testrun\test\temp\org.elasticsearch.reservedstate.service.FileSettingsServiceTests_E175BB524F09E5B4-001\tempDir-018\config\operator\settings.json]

Issue Reasons:

  • [main] 2 failures in test testProcessFileChanges (6.3% fail rate in 32 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Core/Infra/Settings Settings infrastructure and APIs >test-failure Triaged test failures from CI labels Oct 21, 2024
@elasticsearchmachine
Copy link
Collaborator Author

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Oct 21, 2024
@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 2 failures in test testProcessFileChanges (5.0% fail rate in 40 executions)

Build Scans:

@mark-vieira
Copy link
Contributor

@n1v0lg I think this is related to #114295. This is only failing on Windows as well so it's something specific to that platform.

@mark-vieira
Copy link
Contributor

I've cherry-picked the mute back to the 8.16 branch as well.

mark-vieira pushed a commit that referenced this issue Oct 22, 2024
@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 6 failures in test testProcessFileChanges (1.5% fail rate in 413 executions)
  • [main] 3 failures in step part-1-windows (60.0% fail rate in 5 executions)
  • [main] 3 failures in pipeline elasticsearch-periodic-platform-support (50.0% fail rate in 6 executions)
  • [main] 3 failures in pipeline elasticsearch-pull-request (1.7% fail rate in 177 executions)

Build Scans:

@n1v0lg n1v0lg self-assigned this Oct 23, 2024
@n1v0lg n1v0lg added the low-risk An open issue or test failure that is a low risk to future releases label Oct 23, 2024
@elasticsearchmachine elasticsearchmachine removed the needs:risk Requires assignment of a risk label (low, medium, blocker) label Oct 23, 2024
@n1v0lg
Copy link
Contributor

n1v0lg commented Oct 23, 2024

@mark-vieira thanks! I got a fix in progress for this.

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this issue Oct 23, 2024
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this issue Oct 23, 2024
n1v0lg added a commit to n1v0lg/elasticsearch that referenced this issue Oct 24, 2024
n1v0lg added a commit to n1v0lg/elasticsearch that referenced this issue Oct 24, 2024
@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 3 failures in test testProcessFileChanges (6.4% fail rate in 47 executions)
  • [8.x] 2 failures in pipeline elasticsearch-periodic-platform-support (100.0% fail rate in 2 executions)

Build Scans:

@n1v0lg
Copy link
Contributor

n1v0lg commented Oct 24, 2024

Ughhhhh. Okay windows. I shall try again.

georgewallace pushed a commit to georgewallace/elasticsearch that referenced this issue Oct 25, 2024
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this issue Oct 25, 2024
@n1v0lg n1v0lg reopened this Nov 1, 2024
@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 1, 2024

@thecoop sorry -- this is not completed yet. It's failing after the initial round of fixes, and I'm still investigating.

@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 4, 2024

@rjernst

Why would process be called concurrently?

Yeah, you're right I don't think two concurrent calls can happen. It might still be the case that process gets called a second time by another iteration of the watcher loop though that would mean that watchedFileChanged evaluates to true again. I'm a bit stuck on this. I'll spent some more time today (maybe with a buildkite VM) and if I don't make headway, I'll bump log levels for the test to get the polled events here and go from there.

jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024
This PR addresses some of the failure causes tracked under
elastic#115280 and
elastic#115725: the latch-await
setup was rather convoluted and the move command not always correctly
invoked in the right order. This PR cleans up latching by separating
awaiting the first processing call (on start) from waiting on the
subsequent call. Also, it makes writing the file more robust w.r.t.
OS'es where `atomic_move` may not be available.

This should address failures around the timeout await, and the assertion
failures around invoked methods tracked here:

elastic#115725 (comment)

But will likely require another round of changes to address the failures
to delete files.

Relates: elastic#115280 
Relates: elastic#115725
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024
@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 4, 2024

Does not reproduce for me, even after many many iterations on a buildkite instance. I'm going with more logging (#116192)...

elasticsearchmachine pushed a commit that referenced this issue Nov 5, 2024
#115280 still mystifies
me. It does not reproduce on a buildkite instance either. This PR turns
up logging for the file watching service steps (in particular the polled
events should be useful) and adds some logging to the test.
@elasticsearchmachine
Copy link
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testProcessFileChanges (6.3% fail rate in 32 executions)

Build Scans:

@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 5, 2024

Failure was expected (I only included logging to get more info in the last PR). Let's see what the logs tell us...

@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 5, 2024

Interesting, based on the latest failure, on the second file write (for the update):

  1. We create the temp file (inside the operator directory)
  2. This gets us an ENTRY_CREATE:settings.json6060652565707102444tmp event
  3. We move the temp file to replace operator/settings.json (to update it) <- this step blows up
  4. Concurrently (at the same exact timestamp), we get a log line from the file settings service that we are processing file changes
  5. We get an ENTRY_MODIFY:operator event
  6. We delete the temp file (in finally block of writeTestFile)
  7. We start shutting down the file watcher service (in the test clean up method)
  8. We get another ENTRY_MODIFY:operator (presumably because we deleted the temp file)

The exception is thrown by move operation: "java.io.UncheckedIOException: could not write file: ...".

So I think what we're hitting here is:

  1. The temp file creation prompts the file-settings service to check the settings file
  2. Concurrently, writeTestFile attempts to perform the move

And WindowsFS fails this step because you can't overwrite a file while it's being read.

So the fix should be to avoid this whole business of creating temp files in the operator directory and moving; instead, we can create the settings file once, then when we want an update, we can just "touch" it (e.g., set a new timestamp), which should avoid the complexity of multiple file operations and still test what we actually want to test.

@ldematte
Copy link
Contributor

ldematte commented Nov 7, 2024

This is great @n1v0lg! So your fix will prevent the error from happening on Windows by addressing the underlying root cause (the undesired race), correct?

@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 7, 2024

@ldematte I really really hope so 🤞

elasticsearchmachine pushed a commit that referenced this issue Nov 7, 2024
This test resolves two race conditions in
`FileSettingsServiceTests#testProcessFileChanges`:

1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](#115280 (comment))). 
2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed.  

Resolves: #115280
@n1v0lg n1v0lg reopened this Nov 7, 2024
@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 7, 2024

Re-opening until I backport to 8.16 and 8.x

@n1v0lg
Copy link
Contributor

n1v0lg commented Nov 7, 2024

Okay backports done. Closing. Hope it's fixed for real this time...

@n1v0lg n1v0lg closed this as completed Nov 7, 2024
kderusso pushed a commit to kderusso/elasticsearch that referenced this issue Nov 7, 2024
This test resolves two race conditions in
`FileSettingsServiceTests#testProcessFileChanges`:

1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](elastic#115280 (comment))). 
2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed.  

Resolves: elastic#115280
jozala pushed a commit that referenced this issue Nov 13, 2024
#115280 still mystifies
me. It does not reproduce on a buildkite instance either. This PR turns
up logging for the file watching service steps (in particular the polled
events should be useful) and adds some logging to the test.
jozala pushed a commit that referenced this issue Nov 13, 2024
jozala pushed a commit that referenced this issue Nov 13, 2024
This test resolves two race conditions in
`FileSettingsServiceTests#testProcessFileChanges`:

1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](#115280 (comment))). 
2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed.  

Resolves: #115280
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this issue Nov 28, 2024
elastic#115280 still mystifies
me. It does not reproduce on a buildkite instance either. This PR turns
up logging for the file watching service steps (in particular the polled
events should be useful) and adds some logging to the test.
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this issue Nov 28, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this issue Nov 28, 2024
This test resolves two race conditions in
`FileSettingsServiceTests#testProcessFileChanges`:

1. The test used `writeTestFile` to update the settings.json file. It did so in multiple steps: first it created a temp file in the operator directory, then moved that file to replace the existing settings.json file. The first step (creating the temp file) triggered the watcher thread in the file settings service to access the settings.json file to check for changes. When this access happened concurrently with the `move` call inside `writeTestFile` the test would throw on a Windows file-system (mocked or real), since you can't move a file while it's open. To fix this, the PR changes `writeTestFile` to creating a temp file elsewhere and simplifies the method. Instead of relying on this method (and multiple file operations) to update the file, the PR instead simply "touches" the settings file with a timestamp update to trigger file processing (more details also in this [comment](elastic#115280 (comment))). 
2. The test awaited latches that would count down when `ReservedClusterStateService#process` was invoked. However, at this point in the file settings processing flow, the settings.json file is still open and would therefore likewise block subsequent writes that fall into the small window of the file still being open. This PR instead adds latches based on file-changed listeners which are reliably invoked _after_ the file is closed.  

Resolves: elastic#115280
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Settings Settings infrastructure and APIs low-risk An open issue or test failure that is a low risk to future releases Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI
Projects
None yet
5 participants