Skip to content

Conversation

@DiannaHohensee
Copy link
Contributor

@DiannaHohensee DiannaHohensee commented Jan 5, 2026

The work adding not-preferred to the allocation explain output
(#137228) resulted in shards not evacuating a shutting down node
during release testing. There is some unidentified bug in play. Since
a transport version was added with that code change, we wish to
retain the enum changes necessitating the transport version, but
undo / revert all logic changes associated with the commit.

Relates ES-13903

@DiannaHohensee DiannaHohensee self-assigned this Jan 5, 2026
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.4.0 labels Jan 5, 2026
@DiannaHohensee DiannaHohensee force-pushed the 2026/01/05/fix-allocation branch from 4947525 to 974cb98 Compare January 5, 2026 22:53
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jan 5, 2026
Copy link
Contributor

@prdoyle prdoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified:

  • All the (non-test) code changes are putting the code back the way it was before #137228

Not checked:

  • Test code
  • That all the changes from #137228 were reverted. I don't understand the changes, so I assume the changes not reverted were omitted deliberately.

Remarks:

  • I don't see a new Transport Version yet, nor the corresponding checks.
  • We're not changing the AllocationDecision IDs back here; I believe that's deliberate.(FWIW, if we're changing them anyway, I'd consider spacing them out (say, counting by fives) so there's space to add new ones inbetween.)

Comment on lines +534 to +537
// rebalance to the node, only will get overwritten if the decision here is to
// THROTTLE and we get a decision with YES on another node
bestRebalanceCanAllocateDecisionType = canAllocate.type();
if (canAllocate.type().higherThan(Type.NOT_PREFERRED)) {
// Movement is only allowed to THROTTLE/YES nodes. NOT_PREFERRED is the same as no for rebalancing, since
// rebalancing aims to distribute resource usage and NOT_PREFERRED means the move could cause hot-spots.
targetNode = node;
}
targetNode = node;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

// NOT_PREFERRED, as opposed to NO.
bestDecision = Type.NOT_PREFERRED;
// Relocating a shard from one NOT_PREFERRED node to another NOT_PREFERRED node would not improve the situation.
// Relocating a shard from one NOT_PREFERRED node to another would not improve the situation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

} else if (bestDecision == Type.THROTTLE) {
assert allocation.isSimulating() == false;
// THROTTLE is better than NOT_PREFERRED, we just need to wait for a YES.
targetNode = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

// but NOT_PREFERRED and THROTTLED are generally not comparable.
// NOT_PREFERRED and THROTTLED should not be compared.
THROTTLE,
NOT_PREFERRED,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these get serialized over Transport? If so, are they sent by name or by ordinal?

I ask because if it's by ordinal, then reordering these would be a breaking change.

(From here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about those specifics, but I do think the original change had a backwards compatibility bug: I should have changed the writeTo / readFrom methods, needing a transport version gate.

*/
public boolean assignmentAllowed() {
return this == NOT_PREFERRED || this == YES;
return this.compareTo(NOT_PREFERRED) >= 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

case YES -> YES;
case NOT_PREFERRED -> NOT_PREFERRED;
// TODO: this
case YES, NOT_PREFERRED -> YES;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here.

The TODO comment isn't right, but presumably this is temporary.

// the final decision is NO (no node to move the shard to) and we are not in explain mode, return a cached version
return CACHED_CANNOT_MOVE_DECISION;
} else {
assert ((targetNode == null) == (moveDecision != AllocationDecision.YES));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

: Explanations.Rebalance.CANNOT_REBALANCE_CANNOT_ALLOCATE;
case THROTTLE -> Explanations.Rebalance.CLUSTER_THROTTLE;
case YES -> {
case YES, NOT_PREFERRED -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

case NO -> canRemainNotPreferred() ? Explanations.Move.NOT_PREFERRED_TO_NO : Explanations.Move.NO;
case YES, NOT_PREFERRED -> Explanations.Move.YES;
case THROTTLED -> Explanations.Move.THROTTLED;
case NO -> Explanations.Move.NO;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here

@DiannaHohensee DiannaHohensee added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team and removed needs:triage Requires assignment of a team area label labels Jan 5, 2026
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DiannaHohensee
Copy link
Contributor Author

We decided to fix the problem instead of rolling it back / undoing it. Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue serverless-linked Added by automation, don't add manually Team:Distributed Coordination Meta label for Distributed Coordination team v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants