Error on retrieving write buffer from log stream #10141
Comments
Triage: It shouldn't be a severe issue; the cluster should recover from this. It might still indicate an underlying problem.
This occurred again on |
Occurred again in Medic benchmark 45: https://console.cloud.google.com/errors/detail/COaxsKTpuvHoOQ;service=zeebe;version=;time=P7D?project=zeebe-io
Occurred again in Medic benchmark 49. We can clearly see in the logs that the broker first becomes leader, and during the transition to leader it gets notified that Broker 1 is the leader, so it has to step down. It starts the follower transition and cancels the leader transition. The cancellation doesn't really work, or at least that is how it looks to me. The partition transition fails with an exception because the actor of the log stream is already closed.
Happened in 8.1.13 during shutdown. No impact; the shutdown completed successfully.
Here's a possible reproducer:

/*
 * Copyright Camunda Services GmbH and/or licensed to Camunda Services GmbH under
 * one or more contributor license agreements. See the NOTICE file distributed
 * with this work for additional information regarding copyright ownership.
 * Licensed under the Camunda License 1.0. You may not use this file
 * except in compliance with the Camunda License 1.0.
 */
package io.camunda.zeebe.broker.transport.commandapi;

import static org.assertj.core.api.Assertions.assertThat;

import io.camunda.zeebe.broker.system.configuration.QueryApiCfg;
import io.camunda.zeebe.engine.state.QueryService;
import io.camunda.zeebe.logstreams.impl.flowcontrol.RateLimit;
import io.camunda.zeebe.logstreams.util.ListLogStorage;
import io.camunda.zeebe.logstreams.util.SyncLogStream;
import io.camunda.zeebe.scheduler.future.ActorFuture;
import io.camunda.zeebe.scheduler.testing.ControlledActorSchedulerExtension;
import io.camunda.zeebe.stream.api.StreamClock;
import io.camunda.zeebe.transport.ServerTransport;
import java.time.Duration;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.RegisterExtension;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoSettings;
import org.mockito.quality.Strictness;

@MockitoSettings(strictness = Strictness.STRICT_STUBS)
final class CommandApiServiceImplTest {
  @RegisterExtension
  private final ControlledActorSchedulerExtension scheduler =
      new ControlledActorSchedulerExtension();

  @Mock private ServerTransport transport;
  @Mock private QueryService queryService;

  @Test
  void shouldThrowException() {
    // given
    final var logStream =
        SyncLogStream.builder()
            .withActorSchedulingService(scheduler.scheduler())
            .withClock(StreamClock.system())
            .withMaxFragmentSize(1024 * 1024)
            .withLogStorage(new ListLogStorage())
            .withPartitionId(1)
            .withWriteRateLimit(RateLimit.disabled())
            .withLogName("test")
            .build();
    final var service =
        new CommandApiServiceImpl(transport, scheduler.scheduler(), new QueryApiCfg());
    scheduler.submitActor(service);
    scheduler.workUntilDone();

    // when - `onBecomingLeader` enqueues a call to the underlying actor, but the log stream
    // is closed before the actor is scheduled (via workUntilDone), so an exception is thrown
    final ActorFuture<?> result = service.onBecomingLeader(1, 1, logStream, queryService);
    logStream.close();
    scheduler.workUntilDone();

    // then
    assertThat(result).failsWithin(Duration.ofSeconds(1));
  }
}

As for solving the issue, we have two options:
Additionally, in |
Solution 2 is probably the most general and will likely be useful for other issues, while solution 1 is more or less valid only for this specific situation. However, solution 2 will require being more careful about which exceptions are thrown, etc.
Regarding the last point, the current code creates a future by hand and completes it inside `actor.call`:

public ActorFuture<Void> onBecomingLeader(
    final int partitionId,
    final long term,
    final LogStream logStream,
    final QueryService queryService) {
  final CompletableActorFuture<Void> future = new CompletableActorFuture<>();
  actor.call(
      () -> {
        leadPartitions.add(partitionId);
        queryHandler.addPartition(partitionId, queryService);
        serverTransport.subscribe(partitionId, RequestType.QUERY, queryHandler);
        final var logStreamWriter = logStream.newLogStreamWriter();
        commandHandler.addPartition(partitionId, logStreamWriter);
        serverTransport.subscribe(partitionId, RequestType.COMMAND, commandHandler);
        future.complete(null);
      });
  return future;
}

If the first future is eliminated, the second one (the one returned by the actor) is always completed, also in case of exceptional errors, since the actor itself completes it.
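To make the difference between the two futures concrete, here is a minimal sketch that uses only JDK types. It is not Zeebe code: a single-threaded `ExecutorService` stands in for the actor, `CompletableFuture` stands in for `ActorFuture`, and `newWriterOnClosedStream`, `handMadeFuture` and `returnedFuture` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FutureCompletionSketch {

  /** Stand-in for creating a writer on a closed log stream: it always throws. */
  static Object newWriterOnClosedStream() {
    throw new IllegalStateException("log stream is closed");
  }

  /** Hand-made future that is completed only on the happy path. */
  static CompletableFuture<Void> handMadeFuture(final ExecutorService actor) {
    final CompletableFuture<Void> future = new CompletableFuture<>();
    actor.submit(
        () -> {
          newWriterOnClosedStream(); // throws, so the next line is never reached
          future.complete(null);
        });
    return future; // nobody ever completes this future when the task throws
  }

  /** Future produced by the submission itself: a thrown exception fails it. */
  static CompletableFuture<Void> returnedFuture(final ExecutorService actor) {
    return CompletableFuture.runAsync(() -> newWriterOnClosedStream(), actor);
  }

  public static void main(final String[] args) {
    final ExecutorService actor = Executors.newSingleThreadExecutor();
    try {
      final CompletableFuture<Void> stuck = handMadeFuture(actor);
      final CompletableFuture<Void> failed = returnedFuture(actor);
      try {
        failed.join();
      } catch (final CompletionException e) {
        System.out.println("returned future failed: " + e.getCause().getMessage());
      }
      // Both tasks ran in order on the same thread, so the first task has already finished.
      System.out.println("hand-made future done: " + stuck.isDone());
    } finally {
      actor.shutdownNow();
    }
  }
}
```

In this sketch the caller of `handMadeFuture` would wait forever, while the caller of `returnedFuture` sees the failure and can react to it.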
When a transition to leader is cancelled, the logStream will be closed, so creating a Writer throws an exception. The returned future now fails when such an error happens instead of never completing. Moreover, the writer is instantiated as the first step, so in case of errors no subscriptions to the other components are made. closes #10141
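The effect of instantiating the writer first can be sketched with plain Java. Everything here is a hypothetical stand-in rather than the actual `CommandApiServiceImpl` code: `FakeTransport` only records subscriptions, and the `Supplier` plays the role of `logStream.newLogStreamWriter()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class WriterFirstSketch {

  /** Hypothetical stand-in for the server transport: it only records subscriptions. */
  static final class FakeTransport {
    final List<String> subscriptions = new ArrayList<>();

    void subscribe(final String requestType) {
      subscriptions.add(requestType);
    }
  }

  /** Old ordering: a subscription is registered before the writer is created. */
  static void subscribeThenCreateWriter(
      final FakeTransport transport, final Supplier<Object> newWriter) {
    transport.subscribe("QUERY");
    newWriter.get(); // throws if the log stream is already closed
    transport.subscribe("COMMAND");
  }

  /** New ordering: the step that can fail runs before any side effect. */
  static void createWriterThenSubscribe(
      final FakeTransport transport, final Supplier<Object> newWriter) {
    newWriter.get(); // throws if the log stream is already closed
    transport.subscribe("QUERY");
    transport.subscribe("COMMAND");
  }

  public static void main(final String[] args) {
    final Supplier<Object> closedStream =
        () -> {
          throw new IllegalStateException("log stream is closed");
        };

    final FakeTransport oldOrder = new FakeTransport();
    try {
      subscribeThenCreateWriter(oldOrder, closedStream);
    } catch (final IllegalStateException e) {
      System.out.println("old ordering left: " + oldOrder.subscriptions); // [QUERY]
    }

    final FakeTransport newOrder = new FakeTransport();
    try {
      createWriterThenSubscribe(newOrder, closedStream);
    } catch (final IllegalStateException e) {
      System.out.println("new ordering left: " + newOrder.subscriptions); // []
    }
  }
}
```

With the old ordering a failed transition leaves a dangling subscription behind; with the writer created first there is nothing to clean up.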
…23371) ## Description When a transition to leader is cancelled, the logStream will be closed, so creating a Writer throws an exception. To avoid concurrency issues, the `CommandApiServiceImpl` is no longer a `PartitionListener` but a `PartitionTransitionStep`: this should avoid the problem, since transition steps are executed serially. ## Related issues closes #10141
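The idea behind serial transition steps can be sketched as follows. This is a simplified model under stated assumptions, not Zeebe's actual `PartitionTransitionStep` API: the `Step` interface, the `step` factory and `runSerially` are hypothetical, and cancellation is modelled with a plain `AtomicBoolean`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

final class SerialTransitionSketch {

  /** Hypothetical, simplified step; not Zeebe's actual PartitionTransitionStep interface. */
  interface Step {
    String name();

    CompletableFuture<Void> transitionTo();
  }

  /** Builds a step that runs the given action synchronously. */
  static Step step(final String name, final Runnable action) {
    return new Step() {
      @Override
      public String name() {
        return name;
      }

      @Override
      public CompletableFuture<Void> transitionTo() {
        action.run();
        return CompletableFuture.completedFuture(null);
      }
    };
  }

  /** Runs the steps one after another; once cancelled, the remaining steps are skipped. */
  static CompletableFuture<List<String>> runSerially(
      final List<Step> steps, final AtomicBoolean cancelled) {
    final List<String> executed = new ArrayList<>();
    CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
    for (final Step step : steps) {
      chain =
          chain.thenCompose(
              ignored -> {
                if (cancelled.get()) {
                  return CompletableFuture.<Void>completedFuture(null);
                }
                executed.add(step.name());
                return step.transitionTo();
              });
    }
    return chain.thenApply(ignored -> executed);
  }

  public static void main(final String[] args) {
    final AtomicBoolean cancelled = new AtomicBoolean(false);
    final List<Step> steps =
        List.of(
            step("LogStream", () -> cancelled.set(true)), // a cancellation arrives here
            step("CommandApi", () -> {}));
    System.out.println(runSerially(steps, cancelled).join()); // prints [LogStream]
  }
}
```

Because each step only starts after the previous one has finished, the command API step never observes a log stream that is being closed concurrently; in this model it is simply skipped once the transition is cancelled.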
When a transition to leader is cancelled, the logStream will be closed, so creating a Writer throws an exception. The returned future now fails when such an error happens instead of never completing. Moreover, the writer is instantiated as the first step, so in case of errors no subscriptions to the other components are made. closes #10141 (cherry picked from commit a00267f)
… change is cancelled (#23775) # Description Backport of #23371 to `stable/8.6`. relates to #10141 original author: @entangled90
… change is cancelled (#23774) # Description Backport of #23371 to `stable/8.5`. relates to #10141 original author: @entangled90
… change is cancelled (#23824) # Description Backport of #23774 to `stable/8.4`. relates to #23371 #10141 original author: @backport-action
Describe the bug
On becoming leader, the LogStream seems to be closed already and throws an exception when the CommandAPI tries to create a new writer.
To Reproduce
Not 100% sure, but it looks like it is related to concurrent transitions: becoming leader and going inactive.
In general, I can't really see an impact, since the cluster was deleted almost immediately afterward.
After the error we also see a lot of #8606, which might be related.
Expected behavior
No error, and handling the case more gracefully.
Log/Stacktrace
Full Stacktrace
Logs: https://drive.google.com/drive/u/0/folders/1DKM8gPL92xcdWeVZNfLQYWEdCHKGij4r
Error group: https://console.cloud.google.com/errors/detail/COaxsKTpuvHoOQ;service=zeebe;time=P7D?project=camunda-cloud-240911
Environment: