Nodes fail to start in tests with "port out of range" #44134

Closed
original-brownbear opened this issue Jul 9, 2019 · 12 comments
Assignees
Labels
:Delivery/Build Build or test infrastructure :Distributed Coordination/Network Http and internode communication implementations Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

original-brownbear commented Jul 9, 2019

This just happened to me on master in org.elasticsearch.transport.netty4.SimpleNetty4TransportTests.testTcpHandshake:

  1> [2019-07-09T18:18:18,000][INFO ][o.e.t.TransportService   ] [testTcpHandshake] publish_address {127.0.0.1:43051}, bound_addresses {[::1]:48024}, {127.0.0.1:43051}
  1> [2019-07-09T18:18:18,049][INFO ][o.e.t.TransportService   ] [testTcpHandshake] publish_address {127.0.0.1:57712}, bound_addresses {[::1]:45963}, {127.0.0.1:57712}
  1> [2019-07-09T18:18:18,814][INFO ][o.e.t.n.SimpleNetty4TransportTests] [testTcpHandshake] after test
  2> REPRODUCE WITH: ./gradlew :modules:transport-netty4:test --tests "org.elasticsearch.transport.netty4.SimpleNetty4TransportTests.testTcpHandshake" -Dtests.seed=CEB10792715A5ADB -Dtests.security.manager=true -Dtests.locale=nl-BE -Dtests.timezone=Africa/Lome -Dcompiler.java=12 -Druntime.java=12
  2> BindTransportException[Failed to bind to [66000-66100]]; nested: IllegalArgumentException[port out of range:66100];

        Caused by:
        java.lang.IllegalArgumentException: port out of range:66100

This doesn't reproduce, but I suspect it has something to do with the number of VMs the tests execute in, causing the port range in MockTransportService to be selected from outside the valid port range.

@original-brownbear original-brownbear added :Distributed Coordination/Network Http and internode communication implementations >test-failure Triaged test failures from CI labels Jul 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@danielmitterdorfer
Member

We have another failure with the same root cause in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.3+multijob-darwin-compatibility/15/console (build id: 20190710041850-A5CC4B88):

Error message:

BindTransportException[Failed to bind to [65535-65539]]; nested: IllegalArgumentException[port out of range:65539]

java.lang.RuntimeException: failed to start nodes
	at __randomizedtesting.SeedInfo.seed([877116860D5FA36B:7885F42059360F20]:0)
	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1685)
	at org.elasticsearch.test.InternalTestCluster.reset(InternalTestCluster.java:1221)
	at org.elasticsearch.test.InternalTestCluster.beforeTest(InternalTestCluster.java:1117)
	at org.elasticsearch.discovery.single.SingleNodeDiscoveryIT.testCannotJoinNodeWithSingleNodeDiscovery(SingleNodeDiscoveryIT.java:178)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: BindTransportException[Failed to bind to [65535-65539]]; nested: IllegalArgumentException[port out of range:65539];
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1680)
	... 40 more
Caused by: BindTransportException[Failed to bind to [65535-65539]]; nested: IllegalArgumentException[port out of range:65539];
	at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:389)
	at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:355)
	at org.elasticsearch.transport.nio.MockNioTransport.doStart(MockNioTransport.java:118)
	at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
	at org.elasticsearch.transport.TransportService.doStart(TransportService.java:230)
	at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:59)
	at org.elasticsearch.node.Node.start(Node.java:698)
	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:960)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.IllegalArgumentException: port out of range:65539

Reproduction line (does not reproduce):

./gradlew :server:integTest --tests "org.elasticsearch.discovery.single.SingleNodeDiscoveryIT.testCannotJoinNodeWithSingleNodeDiscovery" \
  -Dtests.seed=877116860D5FA36B \
  -Dtests.security.manager=true \
  -Dtests.locale=lt \
  -Dtests.timezone=America/Monterrey \
  -Dcompiler.java=12 \
  -Druntime.java=8

Frequency: According to build-stats this is the only build that has failed with this error so far.

@danielmitterdorfer danielmitterdorfer changed the title Failure in org.elasticsearch.transport.netty4.SimpleNetty4TransportTests.testTcpHandshake Nodes fail to start in tests with "port out of range" Jul 10, 2019
@alpar-t alpar-t added the :Delivery/Build Build or test infrastructure label Jul 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@alpar-t alpar-t self-assigned this Jul 10, 2019
@alpar-t
Contributor

alpar-t commented Jul 10, 2019

We identified the root cause of this with @original-brownbear and @mark-vieira. The worker ID keeps increasing for the life of the build daemon, so for non-ephemeral workers it will eventually grow too large and generate a port range that is outside the valid range.

To fix this, we should take the worker ID modulo 650 to make sure that we generate a valid range.
This will work as long as we don't have that many workers within a build.
I'll prepare a PR.
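
As a rough illustration of the failure and of the proposed wrap, here is a minimal sketch. It is not the code from the eventual PR: the class and method names are hypothetical, and the base port, the number of ports per worker, and the modulus are parameters because the comment above does not state them.

```java
// Hypothetical sketch: derives a per-worker base port from a worker ID,
// with and without wrapping the ID by a modulus.
final class WrappedPortRange {
    static final int MAX_PORT = 65535;

    // Unwrapped: grows without bound as the worker ID increases.
    static int basePort(int firstPort, int portsPerWorker, long workerId) {
        return Math.toIntExact(firstPort + workerId * portsPerWorker);
    }

    // Wrapped: the worker ID is reduced to a slot in [0, modulo) first,
    // so the result never exceeds firstPort + (modulo - 1) * portsPerWorker.
    static int wrappedBasePort(int firstPort, int portsPerWorker, long workerId, int modulo) {
        long slot = Math.floorMod(workerId, (long) modulo);
        return Math.toIntExact(firstPort + slot * portsPerWorker);
    }

    static boolean isValidPort(int port) {
        return port >= 1 && port <= MAX_PORT;
    }
}
```

Assuming a first port of 10300 and 100 ports per worker (the width is inferred from the reported range `[66000-66100]`), worker ID 557 gives an unwrapped base port of 66000, which is the out-of-range value in the first report. Whether a given modulus keeps every range valid depends on the base port, so the two have to be chosen together.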

@henningandersen
Contributor

@atorok, in MockTransportService, the port range starts from 10300. Additionally, we need to stay out of the ephemeral port range 49152-65535.

As a short-term workaround your suggestion sounds fine, but the modulo should be something like 380 instead, to avoid hitting the ephemeral port range.

I think we could also explore a different port allocation algorithm using a locking scheme to prevent other VMs from using the same port.

Unless there is an easy way to provide a better worker ID from Gradle to the tests, i.e., one that runs from 0 or 1 up to the number of parallel tests, like the old JUnit worker ID?
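
The arithmetic behind the suggested modulus can be checked with a small sketch. It assumes 100 ports per worker and a range whose upper bound is the base port plus 100, as in the reported `[66000-66100]`; the class and method names are hypothetical.

```java
// Hypothetical sketch: how high can wrapped port ranges reach for a given modulus?
final class ModuloChoice {
    // Start of the ephemeral port range named in the comment above.
    static final int EPHEMERAL_START = 49152;

    // Highest port reachable when worker IDs are wrapped into [0, modulo).
    static int highestPort(int firstPort, int portsPerWorker, int modulo) {
        // The largest slot is modulo - 1; its range ends portsPerWorker above its base.
        return firstPort + (modulo - 1) * portsPerWorker + portsPerWorker;
    }

    // Largest modulus whose highest port still lies below the ephemeral range.
    static int largestSafeModulo(int firstPort, int portsPerWorker) {
        int modulo = 0;
        while (highestPort(firstPort, portsPerWorker, modulo + 1) < EPHEMERAL_START) {
            modulo++;
        }
        return modulo;
    }
}
```

With a first port of 10300, a modulus of 380 tops out at port 48300, below 49152, while a modulus of 650 would reach 75300, beyond the maximum port 65535. Under these assumptions the largest modulus that stays below the ephemeral range is 388, so 380 leaves a small margin.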

@alpar-t
Contributor

alpar-t commented Jul 11, 2019

We did explore the possibility of providing our own ID, but there's no way to do that because Gradle doesn't offer any hooks to configure the test workers, so there's no way to pass per-worker properties.

The old way would eventually have broken down with --parallel too: because tests from different projects can run in parallel, the allocation of IDs needs to be aware of that, and I don't think the old implementation was.

The locking scheme would get fairly complex, as it has to work across JVM boundaries and needs to make sure allocations are freed when not used.

@henningandersen
Contributor

I think the locking scheme would not be too complicated and will be cleaned up when the JVM exits. Any problem in just hanging on to the lock for the duration of the JVM, thus allocating a stable worker ID or port?

@original-brownbear
Member Author

original-brownbear commented Jul 11, 2019

> I think the locking scheme would not be too complicated and will be cleaned up when the JVM exits.

This imo. Can't we just do the handling of the port allocation from within the test JVM (so we can use the onExit hook) and simply pass a path to a directory that contains a set of lock files for each port range we use via Gradle (and set a sensible default behavior for when running from outside of Gradle)?
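
One way to picture the lock-file idea is the sketch below. It is only an illustration under assumptions: the directory layout, the file naming, and the `LockFileSlots` class are hypothetical, and it relies on `FileChannel.tryLock()`, whose locks are released when the JVM terminates.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each JVM claims the first slot whose lock file it can lock,
// and keeps the lock until the JVM exits.
final class LockFileSlots {
    // Channels are kept open (and reachable) so the locks stay held.
    private static final Map<Path, FileChannel> HELD = new HashMap<>();

    static synchronized int claim(Path lockDir, int slotCount) {
        try {
            Files.createDirectories(lockDir);
            for (int slot = 0; slot < slotCount; slot++) {
                Path lockFile = lockDir.resolve("slot-" + slot + ".lock").toAbsolutePath();
                if (HELD.containsKey(lockFile)) {
                    continue; // already claimed by this JVM; do not reopen the file
                }
                FileChannel channel = FileChannel.open(
                        lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                FileLock lock = channel.tryLock(); // null if another process holds it
                if (lock != null) {
                    HELD.put(lockFile, channel);
                    return slot;
                }
                channel.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        throw new IllegalStateException("no free slot in " + lockDir);
    }
}
```

The claimed slot could then stand in for the worker ID when computing a port range. The lock files themselves stay on disk after the JVM exits; only the locks are released.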

@henningandersen
Contributor

I was thinking about just binding to a port in a range (10200-10299?) and using the first available as the worker id. That avoids leaving files around.
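
A hedged sketch of that idea follows: the first port in the range that can be bound becomes the worker ID, and the socket is kept open for the life of the JVM. The `PortBackedWorkerId` class is a hypothetical name, and binding to the loopback address is an assumption.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical sketch: a worker ID backed by a bound port, so no files are left behind.
final class PortBackedWorkerId {
    final int id;
    // Keeping the socket open is what reserves the ID; hold on to this object
    // (for example in a static field) for as long as the ID is needed.
    private final ServerSocket holder;

    private PortBackedWorkerId(int id, ServerSocket holder) {
        this.id = id;
        this.holder = holder;
    }

    static PortBackedWorkerId claim(int firstPort, int lastPort) {
        for (int port = firstPort; port <= lastPort; port++) {
            try {
                ServerSocket socket = new ServerSocket(port, 1, InetAddress.getLoopbackAddress());
                return new PortBackedWorkerId(port - firstPort, socket);
            } catch (IOException e) {
                // Port is in use (possibly by another test JVM); try the next one.
            }
        }
        throw new IllegalStateException("no free port in " + firstPort + "-" + lastPort);
    }
}
```

Calling `PortBackedWorkerId.claim(10200, 10299)` would yield an ID between 0 and 99. The operating system frees the port when the JVM exits, which is what makes cleanup automatic.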

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 11, 2019
Relates to elastic#43983

The IDs gradle uses are incremented for the lifetime of the daemon which
can result in port ranges that are outside the valid range.
This change implements a modulo based formula to wrap the port ranges
when the IDs get too large.

Addresses elastic#44134, but elastic#44157 is also required to be able to close it.
ywelsch added a commit that referenced this issue Jul 11, 2019
Simplifies AbstractSimpleTransportTestCase to use JVM-local ports and also adds an assertion so
that cases like #44134 can be more easily debugged. The likely reason for that one is that a test,
which was repeated again and again while always spawning a fresh Gradle worker (due to Gradle
daemon) kept increasing Gradle worker IDs, causing an overflow at some point.
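
The kind of assertion described here can be pictured as an early range check that names the worker ID in its message, instead of the bare `port out of range:66100` seen in the reports. This is a sketch under assumptions, not the code from the commit: `PortRangeCheck`, `checkedRange`, and the parameters are hypothetical, and the range's upper bound is taken as the base port plus the width, following the `[66000-66100]` shape in the first report.

```java
// Hypothetical sketch: validate a computed port range before trying to bind to it.
final class PortRangeCheck {
    static final int MAX_PORT = 65535;

    // Returns "first-last" for the worker, or fails with a message naming the inputs.
    static String checkedRange(int firstPort, int portsPerWorker, long workerId) {
        long first = firstPort + workerId * portsPerWorker;
        long last = first + portsPerWorker;
        if (first < 1 || last > MAX_PORT) {
            throw new IllegalStateException(
                    "worker ID " + workerId + " maps to port range [" + first + "-" + last
                            + "], which is outside 1-" + MAX_PORT);
        }
        return first + "-" + last;
    }
}
```

With a first port of 10300 and a width of 100, worker ID 0 passes the check, while worker ID 557 fails with a message that points directly at the oversized ID.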
@ywelsch
Contributor

ywelsch commented Jul 11, 2019

I'm inclined to go with the technically simpler solution (i.e., using modulo) for now and to create a more elaborate solution, as outlined by @henningandersen, only if we see this causing issues.

@henningandersen
Contributor

Yes, let us see how the modulo solution goes and then decide if this is necessary.

@alpar-t alpar-t closed this as completed Jul 12, 2019
alpar-t added a commit that referenced this issue Jul 12, 2019
* Fix port range allocation with large worker IDs

Relates to #43983

The IDs gradle uses are incremented for the lifetime of the daemon which
can result in port ranges that are outside the valid range.
This change implements a modulo based formula to wrap the port ranges
when the IDs get too large.

Addresses #44134, but #44157 is also required to be able to close it.
@alpar-t
Contributor

alpar-t commented Jul 12, 2019

Both PRs have been merged, closing the issue.

@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020