
Search indices can become out-of-sync with the SQL state #8295

Closed

tkdrahn opened this issue Jun 23, 2023 · 7 comments
Labels: bug (Bug report), stale

@tkdrahn (Contributor) commented Jun 23, 2023

Describe the bug

There seem to be a couple of scenarios where the search/graph indices can become out-of-sync with the SQL database:

  1. If the MCL publish fails, entityService increments a metric and moves on without rolling back the SQL persist.
  2. If something fails during the Elasticsearch updates, the mae-consumer increments a metric and skips past the MCL event.

To Reproduce
These scenarios could happen any time there is a transient/network-level error with the Elasticsearch or Kafka cluster, which is not uncommon.

Expected behavior

  • In the 1st case, I would expect entityService to roll back the SQL update and return a failure code to the client. If the client was the mce-consumer, I would expect the mce-consumer to keep retrying the failed message (don't commit offsets) until the transient issue goes away and processing succeeds. Ideally the entire persist + MCL publish is atomic (both succeed or both fail).

  • In the 2nd case, I would expect the mae-consumer to keep retrying the failed message (don't commit offsets) until the transient issue goes away and processing succeeds.

  • Alternatively, an easier solution might be to automatically call the restoreIndices GMS endpoint when one of these failures is detected.

Screenshots
N/A

Desktop (please complete the following information):
N/A

Additional context
N/A

@tkdrahn tkdrahn added the bug Bug report label Jun 23, 2023
@tkdrahn changed the title from "A short description of the bug" to "Search indices can become out-of-sync with the SQL state" Jun 23, 2023
@david-leifker (Collaborator) commented:
For the first case, the Kafka producer callback was a community submission to track/log these failures, but it is an async callback that fires after the transaction has been committed; rolling back at that point is too late. Immediately calling restoreIndices (if there is a Kafka issue) would not succeed, since it ultimately also writes to Kafka, and logging to a failure topic would fail for the same reason. I agree that ideally the database transaction and the production of the message should be tied together. With some possible performance implications, the transaction could be rolled back if producing the message were synchronous. @RyanHolstien thoughts?
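The synchronous-producer idea can be sketched as follows. This is an illustrative Python sketch with hypothetical names (the actual GMS code is Java, and `persist_and_publish`, `Transaction`, and `PublishError` are stand-ins, not DataHub's API): publish the MCL synchronously before committing, so a failed produce rolls the SQL transaction back and the two stores cannot diverge.

```python
# Hypothetical sketch: tie the SQL commit to a synchronous Kafka produce
# so both succeed or both fail.
class PublishError(Exception):
    pass

class Transaction:
    """Minimal stand-in for a SQL transaction."""
    def __init__(self):
        self.state = "open"
    def commit(self):
        self.state = "committed"
    def rollback(self):
        self.state = "rolled_back"

def persist_and_publish(txn, publish_mcl):
    """Publish synchronously *before* committing; roll back on failure."""
    try:
        publish_mcl()      # blocking send, e.g. waiting on the producer future
    except PublishError:
        txn.rollback()     # SQL state never diverges from the index
        raise
    txn.commit()

# A failed publish leaves the transaction rolled back:
txn = Transaction()
def failing_publish():
    raise PublishError("broker unreachable")
try:
    persist_and_publish(txn, failing_publish)
except PublishError:
    pass
print(txn.state)  # rolled_back
```

The cost, as noted above, is that every write now waits on the broker acknowledgement, which is the performance implication of making the produce synchronous.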

For the second case, there is retry logic with exponential-backoff options for the write to Elasticsearch; however, the retries are not infinite, so ultimately there may be missing document updates in Elasticsearch. Each hook is responsible for handling its own exceptions before throwing them. It is possible to have a mix of hooks that succeed and fail, and each may need to be handled differently. An argument could be made that the hooks should be separate consumers, in which case this would be better handled.
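The bounded-retry behavior described above can be sketched like this (illustrative Python, not the actual mae-consumer code, which is Java; `retry_with_backoff` and its parameters are hypothetical). The key point is the exhaustion branch: once retries run out, the exception propagates and the document update is lost.

```python
import time

def retry_with_backoff(op, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Retry `op` with exponential backoff; re-raise once retries run out."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            if attempt == max_retries:
                raise  # bounded retries: this is where an update can be lost
            sleep(base_delay * (2 ** attempt))

# Simulated transient failure that clears on the third attempt:
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "indexed"

print(retry_with_backoff(flaky_write, sleep=lambda d: None))  # indexed
```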

@RyanHolstien (Collaborator) commented:
I think as long as it's configurable, making the producer synchronous and rolling back the transaction will work. Note: it's still not perfect; if we're planning for uncommon scenarios, the rollback itself could also fail.

@tkdrahn (Contributor, author) commented Jul 19, 2023

Thanks for taking a look at this; you both bring up good points.

For #1 - brainstorming here, and this could start to get complex. One idea is to write an extra row/column as part of the SQL transaction indicating there is a pending MCL publish to complete. This way the entity and the pending flag are both persisted atomically. Once the Kafka produce succeeds, it would remove that pending flag. If the Kafka produce fails, then some background process could periodically check for entities in a pending status and restore indices on those.
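The pending-flag idea above is essentially the transactional-outbox pattern. A minimal sketch using an in-memory SQLite database (hypothetical table names, not DataHub's actual schema): the entity row and the pending marker are written in one atomic transaction, so a crash between the SQL commit and the Kafka produce leaves a durable record for a background sweeper to replay (e.g. via restoreIndices).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (urn TEXT PRIMARY KEY, aspect TEXT)")
conn.execute("CREATE TABLE pending_mcl (urn TEXT PRIMARY KEY)")

def persist(urn, aspect):
    with conn:  # one atomic transaction for both writes
        conn.execute("INSERT OR REPLACE INTO entity VALUES (?, ?)", (urn, aspect))
        conn.execute("INSERT OR REPLACE INTO pending_mcl VALUES (?)", (urn,))

def mark_published(urn):
    with conn:  # clear the flag once the Kafka produce succeeds
        conn.execute("DELETE FROM pending_mcl WHERE urn = ?", (urn,))

def pending():
    """What a background sweeper would query periodically."""
    return [row[0] for row in conn.execute("SELECT urn FROM pending_mcl")]

persist("urn:li:dataset:1", "{...}")
# If the Kafka produce fails here, the flag survives for the sweeper:
print(pending())        # ['urn:li:dataset:1']
mark_published("urn:li:dataset:1")
print(pending())        # []
```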

For #2 - I suppose configuring numRetries to a very large number effectively accomplishes what we want. Would you have any concerns with allowing a -1 value in that field to indicate we want to keep retrying indefinitely until successful?
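The proposed -1 sentinel could be handled with a small guard like this (hypothetical semantics sketched in Python; `should_retry` and the meaning of numRetries here are assumptions, not existing configuration):

```python
def should_retry(attempt, num_retries):
    """Hypothetical semantics: num_retries == -1 means retry forever;
    otherwise allow at most num_retries retries after the first attempt."""
    return num_retries < 0 or attempt < num_retries

print(should_retry(10_000, -1))  # True  (retry indefinitely)
print(should_retry(3, 3))        # False (retries exhausted)
```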

@github-actions (bot) commented:
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@github-actions github-actions bot added the stale label Aug 19, 2023
@tkdrahn (Contributor, author) commented Aug 19, 2023

I believe this is still an issue

@github-actions github-actions bot removed the stale label Aug 20, 2023
@github-actions (bot) commented:
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@github-actions github-actions bot added the stale label Sep 19, 2023
@github-actions (bot) commented:
This issue was closed because it has been inactive for 30 days since being marked as stale.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 20, 2023