
Search indices can become out-of-sync with the SQL state #8295

Closed

tkdrahn opened this issue Jun 23, 2023 · 7 comments
Labels: bug (Bug report), stale

@tkdrahn (Contributor) commented Jun 23, 2023

Describe the bug

There seem to be a couple of scenarios where the search/graph indices can become out-of-sync with the SQL database:

  1. If the MCL publish fails, entityService increments a metric and moves on without rolling back the SQL persist.
  2. If something fails during the Elasticsearch updates, the mae-consumer increments a metric and skips past the MCL event.

To Reproduce
These scenarios could happen any time there is a transient/network-level error with the Elasticsearch or Kafka cluster, which is not uncommon.

Expected behavior

  • In the 1st case, I would expect entityService to roll back the SQL update and return a failure code to the client. If the client was the mce-consumer, I would expect the mce-consumer to keep retrying the failed message (don't commit offsets) until the transient issue goes away and processing succeeds. Ideally the entire persist + MCL publish is atomic (both succeed or both fail).

  • In the 2nd case, I would expect the mae-consumer to keep retrying the failed message (don't commit offsets) until the transient issue goes away and processing succeeds.

  • Alternatively, an easier solution might be to automatically call the restoreIndices GMS endpoint when one of these failures is detected.

Screenshots
N/A

Desktop (please complete the following information):
N/A

Additional context
N/A

@tkdrahn tkdrahn added the bug Bug report label Jun 23, 2023
@tkdrahn changed the title from "A short description of the bug" to "Search indices can become out-of-sync with the SQL state" Jun 23, 2023
@david-leifker (Collaborator) commented:
For the first case, the Kafka producer callback was a community submission to track/log these failures, but it is an async callback that fires after the transaction has been committed; rolling back at that point is too late. Immediately calling restoreIndices (if there is a Kafka issue) would not succeed, since it ultimately also writes to Kafka, and logging to a failure topic would fail for the same reason. I agree that ideally the database transaction and the production of the message should be tied together. With some possible performance implications, the transaction could be rolled back if producing the message were synchronous. @RyanHolstien thoughts?
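The synchronous-producer idea can be sketched as follows. This is an illustrative Python sketch with hypothetical names (the actual GMS code is Java, and `persist_and_publish`, `Transaction`, and `PublishError` are stand-ins, not DataHub's API): publish the MCL synchronously before committing, so a failed produce rolls the SQL transaction back and the two stores cannot diverge.

```python
# Hypothetical sketch: tie the SQL commit to a synchronous Kafka produce
# so both succeed or both fail.
class PublishError(Exception):
    pass

class Transaction:
    """Minimal stand-in for a SQL transaction."""
    def __init__(self):
        self.state = "open"
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "rolled_back"

def persist_and_publish(txn, publish_mcl):
    """Publish synchronously *before* committing; roll back on failure."""
    try:
        publish_mcl()      # blocking send, e.g. waiting on the producer future
    except PublishError:
        txn.rollback()     # SQL state never diverges from the index
        raise
    txn.commit()

# A failed publish leaves the transaction rolled back:
txn = Transaction()
def failing_publish():
    raise PublishError("broker unreachable")
try:
    persist_and_publish(txn, failing_publish)
except PublishError:
    pass
print(txn.state)  # rolled_back
```

The cost, as noted above, is that every write now waits on the broker acknowledgement, which is the performance implication of making the produce synchronous.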

For the second case, there is retry logic with exponential-backoff options for the write to Elasticsearch; however, the retries are not infinite, so ultimately there may be missing document updates in Elasticsearch. Each hook is responsible for handling its own exceptions before throwing them. It is possible to have a mix of hooks that succeed and fail, and each may need to be handled differently. An argument could be made that the hooks should be separate consumers, in which case this would be better handled.
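The bounded-retry behavior described above can be sketched like this (illustrative Python, not the actual mae-consumer code, which is Java; `retry_with_backoff` and its parameters are hypothetical). The key point is the exhaustion branch: once retries run out, the exception propagates and the document update is lost.

```python
import time

def retry_with_backoff(op, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Retry `op` with exponential backoff; re-raise once retries run out."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise  # bounded retries: this is where an update can be lost
            sleep(base_delay * (2 ** attempt))

# Simulated transient failure that clears on the third attempt:
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "indexed"

print(retry_with_backoff(flaky_write, sleep=lambda d: None))  # indexed
```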

@RyanHolstien (Collaborator) commented:
I think as long as it's configurable, making the producer synchronous and rolling back the transaction will work. Note: it's still not perfect; if we're planning for uncommon scenarios, the rollback itself could also fail.

@tkdrahn (Contributor, author) commented Jul 19, 2023

Thanks for taking a look at this; you both bring up good points.

For #1 - brainstorming here, and this could start to get complex. One idea is to write an extra row/column as part of the SQL transaction indicating there is a pending MCL publish to complete. This way the entity and the pending flag are both persisted atomically. Once the Kafka produce succeeds, it would remove that pending flag. If the Kafka produce fails, then some background process could periodically check for entities in a pending status and restore indices on those.
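The pending-flag idea above is essentially the transactional-outbox pattern. A minimal sketch using an in-memory SQLite database (hypothetical table names, not DataHub's actual schema): the entity row and the pending marker are written in one atomic transaction, so a crash between the SQL commit and the Kafka produce leaves a durable record for a background sweeper to replay (e.g. via restoreIndices).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (urn TEXT PRIMARY KEY, aspect TEXT)")
conn.execute("CREATE TABLE pending_mcl (urn TEXT PRIMARY KEY)")

def persist(urn, aspect):
    with conn:  # one atomic transaction for both writes
        conn.execute("INSERT OR REPLACE INTO entity VALUES (?, ?)", (urn, aspect))
        conn.execute("INSERT OR REPLACE INTO pending_mcl VALUES (?)", (urn,))

def mark_published(urn):
    with conn:  # clear the flag once the Kafka produce succeeds
        conn.execute("DELETE FROM pending_mcl WHERE urn = ?", (urn,))

def pending():
    """What a background sweeper would query periodically."""
    return [row[0] for row in conn.execute("SELECT urn FROM pending_mcl")]

persist("urn:li:dataset:1", "{...}")
# If the Kafka produce fails here, the flag survives for the sweeper:
print(pending())        # ['urn:li:dataset:1']
mark_published("urn:li:dataset:1")
print(pending())        # []
```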

For #2 - I suppose configuring numRetries to a very large number effectively accomplishes what we want. Would you have any concerns with allowing a -1 value in that field to indicate we want to keep retrying indefinitely until successful?
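The proposed -1 sentinel could be handled with a small guard like this (hypothetical semantics sketched in Python; `should_retry` and the meaning of numRetries here are assumptions, not existing configuration):

```python
def should_retry(attempt, num_retries):
    """Hypothetical semantics: num_retries == -1 means retry forever;
    otherwise allow at most num_retries retries after the first attempt."""
    return num_retries < 0 or attempt < num_retries

print(should_retry(10_000, -1))  # True  (retry indefinitely)
print(should_retry(3, 3))        # False (retries exhausted)
```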

@github-actions (bot) commented:
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@github-actions github-actions bot added the stale label Aug 19, 2023
@tkdrahn (Contributor, author) commented Aug 19, 2023

I believe this is still an issue

@github-actions github-actions bot removed the stale label Aug 20, 2023
@github-actions (bot) commented:
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@github-actions github-actions bot added the stale label Sep 19, 2023
@github-actions (bot) commented:
This issue was closed because it has been inactive for 30 days since being marked as stale.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 20, 2023