[server][da-vinci-client] Mark a helix replica to error state during ingestion error #1337

majisourav99 · 2024-11-22T17:30:43Z

Mark a helix replica to error state during ingestion error

If an ingestion task encounters an error, it only marks the CV as ERROR, and that state is never remediated. This PR makes use of a new API that Helix added to set a replica to the ERROR state. The replica will later be picked up by the error replica reset task, ErrorPartitionResetTask, which will attempt to recover it.

How was this PR tested?

GHCI

Does this PR introduce any user-facing changes?

No. You can skip the rest of this section.
Yes. Make sure to explain your proposed changes and call out the behavior change.

sushantmane · 2024-11-25T17:41:44Z

Can we ensure that not all replicas for the current version are marked as ERROR, so we can still serve traffic with some replicas (even if they are stale)?

majisourav99 · 2024-11-26T17:37:30Z

> Can we ensure that not all replicas for the current version are marked as ERROR, so we can still serve traffic with some replicas (even if it's stale)

Discussed offline: this PR will reset the replica when we are marking the CV as ERROR as well, so there is no extra risk of the read path having no replica to serve reads.
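To make the coupling described in this reply concrete, here is a minimal sketch, not the PR's actual code. `StatusSink`, `ErrorReporter`, and `reportIngestionError` are hypothetical names; the only detail taken from the diff is the `!isDaVinciClient` guard. The assumption is that a single method reports the ingestion error, so the Helix replica state is only ever touched alongside a CV ERROR.

```java
import java.util.ArrayList;
import java.util.List;

class CoupledErrorReportSketch {
  /** Hypothetical sink for a status update; not a Venice or Helix interface. */
  interface StatusSink {
    void markError(String resource, int partition);
  }

  static final class ErrorReporter {
    private final boolean isDaVinciClient;
    private final StatusSink customizedView;
    private final StatusSink helixReplicaState;

    ErrorReporter(boolean isDaVinciClient, StatusSink customizedView, StatusSink helixReplicaState) {
      this.isDaVinciClient = isDaVinciClient;
      this.customizedView = customizedView;
      this.helixReplicaState = helixReplicaState;
    }

    /** Single entry point: the replica state changes only together with a CV ERROR. */
    void reportIngestionError(String resource, int partition) {
      customizedView.markError(resource, partition);
      // Mirrors the !isDaVinciClient guard in the diff: only servers mark the replica.
      if (!isDaVinciClient) {
        helixReplicaState.markError(resource, partition);
      }
    }
  }

  public static void main(String[] args) {
    List<String> cv = new ArrayList<>();
    List<String> replica = new ArrayList<>();
    StatusSink cvSink = (r, p) -> cv.add(r + "_" + p);
    StatusSink replicaSink = (r, p) -> replica.add(r + "_" + p);

    new ErrorReporter(false, cvSink, replicaSink).reportIngestionError("store_v1", 0);
    new ErrorReporter(true, cvSink, replicaSink).reportIngestionError("store_v1", 1);

    System.out.println("CV errors: " + cv + ", replica errors: " + replica);
  }
}
```

In this sketch the server-side reporter updates both sinks, while the Da Vinci reporter updates only the CV sink.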

gaojieliu

Left a few comments.

gaojieliu · 2024-12-03T19:57:22Z

clients/da-vinci-client/src/main/java/com/linkedin/davinci/helix/HelixParticipationService.java

@@ -79,7 +79,7 @@ public class HelixParticipationService extends AbstractVeniceService
  private final String clusterName;
  private final String participantName;
  private final String zkAddress;
-  private final StoreIngestionService ingestionService;
+  private StoreIngestionService ingestionService = null;


Why this change?

gaojieliu · 2024-12-03T19:59:42Z

...ts/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java

@@ -525,6 +533,14 @@ public StoreIngestionTask(
    }
    this.batchReportIncPushStatusEnabled = !isDaVinciClient && serverConfig.getBatchReportEOIPEnabled();
    this.parallelProcessingThreadPool = builder.getAAWCWorkLoadProcessingThreadPool();
+    this.zkHelixAdmin = Lazy.of(() -> new ZKHelixAdmin(zkAddress));


Having one ZKHelixAdmin per SIT seems very expensive; I think we should share a single ZKHelixAdmin among all the SITs.
I bet each ZKHelixAdmin instance creates its own ZK client internally.
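A rough sketch of what sharing could look like, under stated assumptions: `ReplicaAdmin`, `Memoized`, and `IngestionTask` below are hypothetical stand-ins (not the Helix or Venice types), and the counter stands in for the cost of opening a ZK connection. The idea is to build one lazily initialized client at the service level and hand the same supplier to every task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class SharedAdminSketch {
  /** Hypothetical stand-in for the admin client; NOT the real Helix interface. */
  interface ReplicaAdmin {
    void markError(String resource, String partition);
  }

  /** Thread-safe memoizing supplier: the delegate runs at most once. */
  static final class Memoized<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private volatile T value;

    Memoized(Supplier<T> delegate) {
      this.delegate = delegate;
    }

    @Override
    public T get() {
      T local = value;
      if (local == null) {
        synchronized (this) {
          local = value;
          if (local == null) {
            local = delegate.get();
            value = local;
          }
        }
      }
      return local;
    }
  }

  /** Hypothetical stand-in for a store ingestion task (SIT). */
  static final class IngestionTask {
    private final String resource;
    private final Supplier<ReplicaAdmin> admin;

    IngestionTask(String resource, Supplier<ReplicaAdmin> admin) {
      this.resource = resource;
      this.admin = admin;
    }

    void onIngestionError(String partition) {
      // The client is built on the first error only, then reused by every task.
      admin.get().markError(resource, partition);
    }
  }

  public static void main(String[] args) {
    AtomicInteger clientsCreated = new AtomicInteger();
    List<String> marked = new ArrayList<>();
    Supplier<ReplicaAdmin> shared = new Memoized<ReplicaAdmin>(() -> {
      clientsCreated.incrementAndGet(); // stands in for opening a ZK connection
      return (resource, partition) -> marked.add(resource + "_" + partition);
    });

    new IngestionTask("store_v1", shared).onIngestionError("0");
    new IngestionTask("store_v2", shared).onIngestionError("3");

    System.out.println(clientsCreated.get() + " client(s), marked " + marked);
  }
}
```

Both tasks go through the same supplier, so the factory runs once no matter how many tasks report errors. A task that never fails never triggers client creation at all.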

gaojieliu · 2024-12-03T21:12:52Z

...ts/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java

+    // Set the replica state to ERROR so that the controller can attempt to reset the partition.
+    if (!isDaVinciClient) {
+      zkHelixAdmin.get()
+          .setPartitionsToError(


This is to set EV state as ERROR, right?

I saw you mentioned that this logic would only mark the EV state as ERROR when the CV is in ERROR state, so how does the current logic achieve that?
Only the following logic today decides whether to propagate ERROR to CV/EV IIUC.

majisourav99 force-pushed the helixError branch from 287ec66 to 038191f Compare November 25, 2024 04:14

Sourav Maji added 8 commits December 2, 2024 16:01

[da-vinci-client] Mark replica to error state on ingestion error

9a7e2b8

fix tests

d1b8b0c

added integ test

7185f45

added integ test

daffe27

added integ test

470e13a

added integ test

cf7a265

added integ test

ef5f5f9

fixed compile error

2ab455a

majisourav99 force-pushed the helixError branch from 3330617 to 2ab455a Compare December 3, 2024 00:09
