Nodes fail to start in tests with "port out of range" #44134
Comments
Pinging @elastic/es-distributed
We have another failure with the same root cause in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.3+multijob-darwin-compatibility/15/console (build id: Error message:
java.lang.RuntimeException: failed to start nodes
    at __randomizedtesting.SeedInfo.seed([877116860D5FA36B:7885F42059360F20]:0)
    at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1685)
    at org.elasticsearch.test.InternalTestCluster.reset(InternalTestCluster.java:1221)
    at org.elasticsearch.test.InternalTestCluster.beforeTest(InternalTestCluster.java:1117)
    at org.elasticsearch.discovery.single.SingleNodeDiscoveryIT.testCannotJoinNodeWithSingleNodeDiscovery(SingleNodeDiscoveryIT.java:178)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: BindTransportException[Failed to bind to [65535-65539]]; nested: IllegalArgumentException[port out of range:65539];
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1680)
    ... 40 more
Caused by: BindTransportException[Failed to bind to [65535-65539]]; nested: IllegalArgumentException[port out of range:65539];
    at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:389)
    at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:355)
    at org.elasticsearch.transport.nio.MockNioTransport.doStart(MockNioTransport.java:118)
    at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
    at org.elasticsearch.transport.TransportService.doStart(TransportService.java:230)
    at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
    at org.elasticsearch.node.Node.start(Node.java:698)
    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:960)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.lang.IllegalArgumentException: port out of range:65539
Reproduction line (does not reproduce):
Frequency: According to build-stats this is the only build that has failed with this error so far.
Pinging @elastic/es-core-infra
We identified the root cause of this with @original-brownbear and @mark-vieira. The worker ID keeps increasing for the life of the build daemon, so for non-ephemeral workers it will eventually grow too large and generate a port range that is outside of the valid range. To fix this we should take modulo
@atorok, in As a short-term workaround your suggestion sounds fine, but the modulo should be something like 380 instead to avoid hitting the ephemeral port range. I think we could also explore a different port allocation algorithm using a locking scheme to prevent other VMs from using the same port. Unless there is an easy way to provide a better worker ID from Gradle to the tests? I.e., one that goes from 0/1 to number-of-parallel-tests, like the old JUnit worker ID?
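For illustration, the modulo workaround discussed above can be sketched as follows. This is only a sketch under assumptions: the base port 10300 and the 100 ports per worker are hypothetical values chosen for the example, and `WorkerPortRange` and `portRange` are made-up names, not the ones in the Elasticsearch test framework. Only the modulus of 380 comes from this thread. Under these assumptions an unwrapped worker ID of 553 would start its range at 10300 + 553 * 100 = 65600, which is past the maximum port 65535, while the wrapped range always ends at or below 48299, under the IANA dynamic port range that starts at 49152.

```java
final class WorkerPortRange {
    // Hypothetical values for this example; the real test framework may use different ones.
    static final int BASE_PORT = 10300;
    static final int PORTS_PER_WORKER = 100;
    // Modulus suggested in the thread to stay below the ephemeral port range.
    static final int MAX_WORKERS = 380;

    /** Maps an ever-growing worker ID onto a bounded "first-last" port range. */
    static String portRange(long workerId) {
        // floorMod keeps the slot non-negative even for a negative ID.
        int slot = (int) Math.floorMod(workerId, (long) MAX_WORKERS);
        int first = BASE_PORT + slot * PORTS_PER_WORKER;
        int last = first + PORTS_PER_WORKER - 1;
        return first + "-" + last;
    }

    public static void main(String[] args) {
        System.out.println(portRange(0));   // 10300-10399
        System.out.println(portRange(553)); // 27600-27699
        System.out.println(portRange(380)); // wraps around to 10300-10399
    }
}
```

The trade-off is that two live workers whose IDs differ by a multiple of 380 would get the same range, which is why the thread also considers locking schemes.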
We did explore the possibility of providing our own ID, but there's no way to do that because Gradle doesn't offer any hooks to configure the test workers, so there's no way to pass per-worker properties. The old way would have eventually broken down with The locking scheme would get fairly complex, as it has to work across JVM boundaries and needs to make sure allocations are freed when not used.
I think the locking scheme would not be too complicated and will be cleaned up when the JVM exits. Any problem in just hanging on to the lock for the duration of the JVM, thus allocating a stable worker ID or port?
This imo. Can't we just do the handling of the port allocation from within the test JVM (so we can use the onExit hook) and simply pass a path to a directory that contains a set of lock files for each port range we use via Gradle (and set a sensible default behavior for when running from outside of Gradle)?
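The lock-file idea from the last few comments might look roughly like this. It is a minimal sketch, not what was implemented: `LockFileWorkerId`, `acquire`, the `worker-N.lock` file naming and the `maxWorkers` parameter are all hypothetical. It relies on `java.nio.channels.FileLock`, which is held on behalf of the whole JVM and released by the operating system when the JVM exits, so no explicit cleanup hook is needed for the lock itself (the empty lock files do stay behind).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class LockFileWorkerId {
    // Kept for the lifetime of the JVM so the claim is never released early.
    private static FileLock heldLock;
    private static int heldId = -1;

    /** Claims the lowest free slot in lockDir and returns it as a stable worker ID. */
    static synchronized int acquire(Path lockDir, int maxWorkers) {
        if (heldLock != null) {
            return heldId; // already claimed by this JVM
        }
        try {
            Files.createDirectories(lockDir);
            for (int id = 0; id < maxWorkers; id++) {
                Path file = lockDir.resolve("worker-" + id + ".lock");
                FileChannel channel = FileChannel.open(
                    file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                FileLock lock;
                try {
                    lock = channel.tryLock(); // null if another process holds it
                } catch (OverlappingFileLockException e) {
                    lock = null; // already locked elsewhere in this JVM
                }
                if (lock != null) {
                    heldLock = lock;
                    heldId = id;
                    return id;
                }
                channel.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalStateException("no free worker slot in " + lockDir);
    }
}
```

A test JVM would call `acquire` once at startup with the directory passed in from Gradle and derive its port range from the returned ID.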
I was thinking about just binding to a port in a range (10200-10299?) and using the first available as the worker id. That avoids leaving files around.
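A rough sketch of that port-binding variant, assuming the 10200-10299 range floated above and binding on the loopback address only. `PortBackedWorkerId` and `acquire` are hypothetical names. The socket is never accepted on; it is simply kept open so the operating system holds the claim until the JVM exits.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

final class PortBackedWorkerId {
    static final int FIRST_PORT = 10200; // range suggested in the thread
    static final int LAST_PORT = 10299;

    // Held open for the lifetime of the JVM; closing it would release the ID.
    private static ServerSocket held;

    /** Binds the first free port in the range and returns its offset as the worker ID. */
    static synchronized int acquire() {
        if (held != null) {
            return held.getLocalPort() - FIRST_PORT;
        }
        for (int port = FIRST_PORT; port <= LAST_PORT; port++) {
            try {
                held = new ServerSocket(port, 1, InetAddress.getLoopbackAddress());
                return port - FIRST_PORT;
            } catch (IOException e) {
                // Port is taken (possibly by another test JVM); try the next one.
            }
        }
        throw new IllegalStateException(
            "no free port in " + FIRST_PORT + "-" + LAST_PORT);
    }
}
```

Unlike lock files, nothing is left on disk, but any unrelated process that happens to listen on a port in the range also consumes a worker ID.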
Relates to elastic#43983. The IDs Gradle uses are incremented for the lifetime of the daemon, which can result in port ranges that are outside the valid range. This change implements a modulo-based formula to wrap the port ranges when the IDs get too large. Addresses elastic#44134, but elastic#44157 is also required to be able to close it.
Simplifies AbstractSimpleTransportTestCase to use JVM-local ports and also adds an assertion so that cases like #44134 can be more easily debugged. The likely reason for that one is that a test that was repeated again and again, always spawning a fresh Gradle worker (due to the Gradle daemon), kept increasing the Gradle worker IDs, causing an overflow at some point.
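To show the kind of early check that makes such failures easier to debug, here is a small sketch that validates a "first-last" port range string before anything tries to bind to it. It is not the assertion added in that change; `PortRangeCheck` and `parseAndCheck` are hypothetical names. The only hard fact it encodes is that valid TCP ports are 0 to 65535, which is why the range 65535-65539 from the stack trace fails.

```java
final class PortRangeCheck {
    static final int MAX_PORT = 65535; // highest valid TCP port

    /** Parses "first-last" (or a single port) and rejects ranges outside 0..65535. */
    static int[] parseAndCheck(String range) {
        String[] parts = range.split("-", 2);
        int first = Integer.parseInt(parts[0].trim());
        int last = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : first;
        if (first < 0 || last > MAX_PORT || first > last) {
            throw new IllegalArgumentException(
                "port range [" + range + "] is not within 0-" + MAX_PORT);
        }
        return new int[] { first, last };
    }
}
```

Failing here, with the computed range in the message, points at the range calculation rather than at the later bind attempt.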
I'm inclined to go with the technically simpler solution (i.e. using modulo) for now and only if we see this causing issues create a more elaborate solution as outlined by @henningandersen.
Yes, let us see how the modulo solution goes and then decide if this is necessary.
* Fix port range allocation with large worker IDs. Relates to #43983. The IDs Gradle uses are incremented for the lifetime of the daemon, which can result in port ranges that are outside the valid range. This change implements a modulo-based formula to wrap the port ranges when the IDs get too large. Addresses #44134, but #44157 is also required to be able to close it.
Both PRs have been merged, closing the issue.
This just happened to me on master in org.elasticsearch.transport.netty4.SimpleNetty4TransportTests.testTcpHandshake:
Doesn't reproduce, but I suspect that this might have something to do with the number of VMs the tests execute causing random ports in MockTransportService to be selected from outside the valid port range.