-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce port range re-use in tests #85777
Reduce port range re-use in tests #85777
Conversation
596cf03
to
1de1f5a
Compare
1de1f5a
to
669bb28
Compare
Today we select the port range to use in tests by taking the Gradle worker ID mod 223. We now use significantly more than 223 Gradle workers in each test run which means port ranges are re-used, resulting in spurious failures. This commit removes this lenience, insisting that every worker now gets a distinct port range. It reduces the size of the port range available to each worker to avoid running out of ports. Future changes that add more workers may need to reduce the port range still further, or find a different solution to keeping the tests separate, but at least will not see spurious failures due to cross-test interactions.
669bb28
to
a71cb0a
Compare
Pinging @elastic/es-distributed (Team:Distributed) |
* Defines the size of the port range assigned to each worker, which must be large enough to supply enough ports to run the tests, but | ||
* not so large that we run out of ports. See also [NOTE: Port ranges for tests]. | ||
*/ | ||
private static final int PORTS_PER_WORKER = 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposing 30 here because empirically it works in CI, whereas 40 didn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will not work well with the gradle daemon? Perhaps we can still do the modulus and assert something around the worker id in the build somewhere?
// use a non-default base port otherwise some cluster in this JVM might reuse a port | ||
|
||
// We rely on Gradle implementation details here, the worker IDs are long values incremented by one for the | ||
// lifespan of the daemon this means that they can get larger than the allowed port range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this old comment is still true, I think this change would no longer work when running gradle check multiple times locally without restarting the gradle daemon? I think that could be annoying. Perhaps it works differently now?
I am not sure if CI restarts the daemon either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn it, yes. I've asked @elastic/es-delivery for some input here. I would hope we don't use the daemon in CI, at least we didn't seem to in the runs this PR has had so far, but it'd be good to confirm this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@henningandersen is right. The id generator we rely on here is created as part of a shared service in gradle. (see https://github.com/gradle/gradle/blob/394c3176354b8bb6565a0c92f1ea93175608f163/subprojects/core/src/main/java/org/gradle/internal/service/scopes/GradleUserHomeScopeServices.java#L185) These services are reused across multiple builds and operations scoped to a particular gradle user home directory.
My understanding is that we use the daemon locally AND on CI these days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we capture an offset very early in the build? I.e., capture the initial_worker_id
and use worker_id - initial_worker_id
for the port range?
Could we just use an ephemeral port? There are only a handful of places we use |
I think this falls down (nondeterministically) in tests that have multiple transport services in play if they shut one of them down while the other ones are still trying to reconnect. In that case another test may come along and grab the just-released port, receive a load of unexpected handshakes and/or other traffic, and fail. |
Ok, I understand how that might be needed for multi node tests (ie ESIntegTestCase). But it seems most of the places using this getPortRange() method are just constructing a single transport service directly. Could we at least isolate this just to ESIntegTestCase, so that test writers are not tempted to use this shared method that really shouldn't be necessary except in this one special case? |
(Note that I understand my request is unrelated to the original issue here, it is just something I noticed while reviewing, and would at least reduce the re-use a bit). |
Chatting Slack a bit about this but wanted to share a summary here for posterity. The concerns around the daemon are somewhat irrelevant, the core problem exists even with a single-use daemon in a single test invocation. That is, we now have a strict upper limit on the number of test workers that can get launched by a single build. Eventually we will hit this limit, and in fact, we might already as our PR checks run subsets of our entire test suite in parallel for efficiency and later on we have jobs that might fail because they run all tests in a single build execution. So essentially we're adding a timebomb to our build which we have no idea when it might go off. Not ideal. Instead we'll go forward with this change but add back the leniency bit from before. We still get the benefit of narrower port ranges so the likelyhood of overlap reduces. We can then log when we're reusing a port range that had been used before, at least giving us some insight as to why we might be seeing funny failures. Long term we'll need a better solution here which I've described in #85786. |
Ok, I reinstated the wraparound behaviour but left |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Today we select the port range to use in tests by taking the Gradle worker ID mod 223. We now use significantly more than 223 Gradle workers in each test run which means port ranges are re-used, resulting in spurious failures. This commit replaces the magic number 223 with a computation defined in terms of the available ports and the number of ports to allocate per worker. It also reduces the number of ports per worker to 30 to try and reduce the frequency of port clashes.
Today we select the port range to use in tests by taking the Gradle worker ID mod 223. We now use significantly more than 223 Gradle workers in each test run which means port ranges are re-used, resulting in spurious failures. This commit replaces the magic number 223 with a computation defined in terms of the available ports and the number of ports to allocate per worker. It also reduces the number of ports per worker to 30 to try and reduce the frequency of port clashes.
💔 Backport failed
You can use sqren/backport to manually backport by running |
Today we select the port range to use in tests by taking the Gradle worker ID mod 223. We now use significantly more than 223 Gradle workers in each test run which means port ranges are re-used, resulting in spurious failures. This commit replaces the magic number 223 with a computation defined in terms of the available ports and the number of ports to allocate per worker. It also reduces the number of ports per worker to 30 to try and reduce the frequency of port clashes.
Today we select the port range to use in tests by taking the Gradle worker ID mod 223. We now use significantly more than 223 Gradle workers in each test run which means port ranges are re-used, resulting in spurious failures. This commit replaces the magic number 223 with a computation defined in terms of the available ports and the number of ports to allocate per worker. It also reduces the number of ports per worker to 30 to try and reduce the frequency of port clashes.
Today we select the port range to use in tests by taking the Gradle worker ID mod 223. We now use significantly more than 223 Gradle workers in each test run which means port ranges are re-used, resulting in spurious failures. This commit replaces the magic number 223 with a computation defined in terms of the available ports and the number of ports to allocate per worker. It also reduces the number of ports per worker to 30 to try and reduce the frequency of port clashes.
…n/elasticsearch into datastream-reuse-pipeline-source * 'datastream-reuse-pipeline-source' of github.com:weizijun/elasticsearch: (28 commits) Add JDK 19 to Java testing matrix [ML] add nlp config update serialization tests (elastic#85867) [ML] A text categorization aggregation that works like ML categorization (elastic#80867) [ML] Fix serialisation of text embedding updates (elastic#85863) TSDB: fix wrong initial value of tsidOrd in TimeSeriesIndexSearcher (elastic#85713) Enforce external id uniqueness during DesiredNode construction (elastic#84227) Fix Intellij integration (elastic#85866) Upgrade Azure SDK to version 12.14.4 (elastic#83884) [discovery-gce] Fix initialisation of transport in FIPS mode (elastic#85817) Remove unnecessary docs/changelog/85534.yaml Prevent ThreadContext header leak when sending response (elastic#68649) Add support for impact_areas to health impacts (elastic#85830) Reduce port range re-use in tests (elastic#85777) Fix TranslogTests#testStats (elastic#85828) Remove hppc from cat allocation api (elastic#85842) Fix BuildTests serialization (elastic#85827) Use urgent priority for node shutdown cluster state update (elastic#85838) Remove Task classes from HLRC (elastic#85835) Remove unused migration classes (elastic#85834) Remove uses of Charset name parsing (elastic#85795) ...
Today we select the port range to use in tests by taking the Gradle
worker ID mod 223. We now use significantly more than 223 Gradle workers
in each test run which means port ranges are re-used, resulting in
spurious failures.
This commit replaces the magic number 223 with a computation defined in
terms of the available ports and the number of ports to allocate per
worker. It also reduces the number of ports per worker to 30 to try and
reduce the frequency of port clashes.