Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM and Process Hang on Windows 10 in LanguageServerWrapperTest #1103

Closed
sebthom opened this issue Sep 10, 2024 · 11 comments
Closed

OOM and Process Hang on Windows 10 in LanguageServerWrapperTest #1103

sebthom opened this issue Sep 10, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@sebthom
Copy link
Member

sebthom commented Sep 10, 2024

As of PR #1044 (@ava-fred) running integration tests reproducible results in a OOMs and process hangs on at least Windows 10. I tried with different JDKs - same effect. Reverting the commit from #1044 solves the issue.

The issue appears when the LanguageServerWrapperTest.testStopAndActivate() is executed.

Running mvn verify -Dtest=LanguageServerWrapperTest#testStopAndActivate immediately results in OOM within a few seconds and process hangs.

I am attaching a thread dump where one can see that thousands of threads are created.
threaddump-1725956087712.txt

Because of the LSPs async natures I am having difficulties tracking down the root cause.

Can anyone else reproduce that issue? @rubenporras @mickaelistria

Even though this only appears in that one integration test so prominently it probably has an underlying cause that may affect end-users during normal usage.

@sebthom sebthom added the bug Something isn't working label Sep 10, 2024
@ava-fred
Copy link
Contributor

I finally managed to reproduce the OOM exception on my Windows 10 machine, but I had to change the test to make it run longer. I guess my Windows 10 machine is much better at counting to 10_000_000 than yours :->.

As you mentioned, it is hard to track what is going on because of the async nature of things, but I'll see what I can find.

@sebthom
Copy link
Member Author

sebthom commented Sep 10, 2024

I switched to a Ryzen CPU some time ago. Maybe they behave somewhat different making the underlying issue more prominently visible.

@rubenporras
Copy link
Contributor

Hi @sebthom , thanks for looking at this. Since @ava-fred is looking into it, I will not look into this problem myself (Note: in case you did not notice from our activity in the project, we both work for the same company).

Regards

This was referenced Sep 17, 2024
@sebthom
Copy link
Member Author

sebthom commented Sep 17, 2024

This issue now also affects most of the Jenkins builds see https://ci.eclipse.org/lsp4e/job/lsp4e-github/job/main/75/console

@ava-fred
Copy link
Contributor

My analysis shows that LanguageServerWrapperTest.testStopAndActivate() leaks threads executing ConcurrentMessageProcessor.run() because the thread started in MockConnectionProviderMultiRootFolders.start() is never shut down. This was so before #1044, the only thing that apparently changed with this PR is that the balance between the 2 loops in the test has changed slightly so that the thread leak now causes actual problems.
I have been working on fixing the test, but I have also struggled with the asynchronicity of LSP4E. On reflection, I am not sure if fixing the test is worthwhile. I don't see what it actually tests and in what scenario it would fail (as opposed to crashing because it runs out of resources). The comments in the test are not helpful: the PR mentioned in the Javadoc was abandoned, and I don't see how LanguageServerWrapper.isActive() would ever throw, synchronisation or not.
@rubenporras , do you think the test should be kept and fixed or should it be removed?

@rubenporras
Copy link
Contributor

Hi @ava-fred ,

I see that the test was added as part of #691. It is true that it can be seen by simple code inspection than isActive and start are both synchronized, but on the other hand, it is preventing a race condition that happened before, so it could happen again if we make a mistake.

Do I understand correctly that the problem is that wrapper.stop(); does not wait for the process in the background to be finished, and then by calling wrapper.stop()/wrapper.start() in a loop, we just potentially create many threads?

I wonder if we can fix the test by waiting after wrapper.stop(); for the processing inside the future to be really finished before calling wrapper.start(); Would something as simple as adding a manual sleep in the test work? If so I think we could also call something after wrapper.stop() and before wrapper.start() that uses the future LanguageServerWrapper#initializeFuture, like LanguageServerWrapper#getServerCapabilities. What do you think of that?

@ava-fred
Copy link
Contributor

Hi @rubenporras

I'm afraid I did not express myself clearly. The primary issue is that the MockConnectionProviderMultiRootFolders class starts a LSP4J message processor and never stops it.
Changing that is not hard as such, but if I do that, then I run into problems caused by the fact that the test calls LanguageServerWrapper.stop() and LanguageServerWrapper.start() in such a tight loop. The 2 methods are synchronized, but they delegate significant work to 2 Futures which are executed asynchronously. If the semantics of the test are not affected by waiting in between these 2 calls, as you seem to suggest, then that is certainly an approach I can take.

@rubenporras
Copy link
Contributor

I think we can go the rout of waiting a bit. I will not stress test any race conditions as much as now, but I think it would be a good compromise.

@ava-fred
Copy link
Contributor

I think I know why the test is still crashing even when I fix the MockConnectionProviderMultiRootFolders shutdown. I also think I know how to fix the problem in LanguageServerWrapper which causes the crash. I discussed this with @rubenporras offline and we agreed that we would like to keep the test in principle, but I will change it so that it becomes more useful for detecting such problems.

ava-fred added a commit to ava-fred/lsp4e that referenced this issue Oct 1, 2024
In eclipse#1103, it was noted that
LanguageServerWrapperTest#testStopAndActivate causes OOM errors. On
investigation, it was discovered that there are multiple causes for
this.

1) The MockConnectionProviderMultiRootFolder used in the test created a
message processor thread in LSP4J on each call to start() but did not
shut them down.
2) The termination of the start/stop loop in the test was wrong: the
loop ran for as long as the VM running tests was alive.
3) The "already stopping" logic in LanguageServerWrapper#shutdown was
wrong: while the shutdown of a LanguageServerWrapper was processed,
further calls to shutdown were ignored.
4) There was a race between the initializationFuture of the
LanguageServerWrapper and the future used for shutdown.

With this commit, the test is rewritten to avoid crashes and to properly
test that calling stop / start repeatedly on LanguageServerWrapper does
not leave any connection providers running. The LanguageServerWrapper
class is refactored to make the test pass.
ava-fred added a commit to ava-fred/lsp4e that referenced this issue Oct 2, 2024
In eclipse#1103, it was noted that
LanguageServerWrapperTest#testStopAndActivate causes OOM errors. On
investigation, it was discovered that there are multiple causes for
this.

1) The MockConnectionProviderMultiRootFolder used in the test created a
message processor thread in LSP4J on each call to start() but did not
shut them down.
2) The termination of the start/stop loop in the test was wrong: the
loop ran for as long as the VM running tests was alive.
3) The "already stopping" logic in LanguageServerWrapper#shutdown was
wrong: while the shutdown of a LanguageServerWrapper was processed,
further calls to shutdown were ignored.
4) There was a race between the initializationFuture of the
LanguageServerWrapper and the future used for shutdown.

With this commit, the test is rewritten to avoid crashes and to properly
test that calling stop / start repeatedly on LanguageServerWrapper does
not leave any connection providers running. The LanguageServerWrapper
class is refactored to make the test pass.
ava-fred added a commit to ava-fred/lsp4e that referenced this issue Oct 2, 2024
In eclipse#1103, it was noted that
LanguageServerWrapperTest#testStopAndActivate causes OOM errors. On
investigation, it was discovered that there are multiple causes for
this.

1) The MockConnectionProviderMultiRootFolder used in the test created a
message processor thread in LSP4J on each call to start() but did not
shut them down.
2) The termination of the start/stop loop in the test was wrong: the
loop ran for as long as the VM running tests was alive.
3) The "already stopping" logic in LanguageServerWrapper#shutdown was
wrong: while the shutdown of a LanguageServerWrapper was processed,
further calls to shutdown were ignored.
4) There was a race between the initializationFuture of the
LanguageServerWrapper and the future used for shutdown.

With this commit, the test is rewritten to avoid crashes and to properly
test that calling stop / start repeatedly on LanguageServerWrapper does
not leave any connection providers running. The LanguageServerWrapper
class is refactored to make the test pass.
rubenporras pushed a commit that referenced this issue Oct 2, 2024
In #1103, it was noted that
LanguageServerWrapperTest#testStopAndActivate causes OOM errors. On
investigation, it was discovered that there are multiple causes for
this.

1) The MockConnectionProviderMultiRootFolder used in the test created a
message processor thread in LSP4J on each call to start() but did not
shut them down.
2) The termination of the start/stop loop in the test was wrong: the
loop ran for as long as the VM running tests was alive.
3) The "already stopping" logic in LanguageServerWrapper#shutdown was
wrong: while the shutdown of a LanguageServerWrapper was processed,
further calls to shutdown were ignored.
4) There was a race between the initializationFuture of the
LanguageServerWrapper and the future used for shutdown.

With this commit, the test is rewritten to avoid crashes and to properly
test that calling stop / start repeatedly on LanguageServerWrapper does
not leave any connection providers running. The LanguageServerWrapper
class is refactored to make the test pass.
@rubenporras
Copy link
Contributor

The test should run now without problems

@sebthom
Copy link
Member Author

sebthom commented Oct 2, 2024

I can confirm that the test works now. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants