Search indices can become out-of-sync with the SQL state #8295
Comments
For the first case, the Kafka producer callback was a community submission to track/log these failures, but it is an async callback that fires after the transaction has been committed. Rolling back at that point is too late; immediately calling restoreIndices (if there is a Kafka issue) would not succeed since it ultimately also writes to Kafka, and logging to a failure topic would not succeed either. I agree that ideally the database transaction and the production of the message are tied together. With some possible performance implications, the transaction could be rolled back if producing the message were synchronous. @RyanHolstien, thoughts?

For the second case, there is retry logic with exponential backoff options for the write to Elasticsearch; however, the retries are not infinite, so ultimately there may be missing document updates in Elasticsearch. Each hook is responsible for handling its own exceptions prior to throwing them. It is possible that there is a mix of hooks that succeed and fail, and each one may need to be handled differently. An argument could be made that the hooks should be separate consumers, in which case this would be better handled.
I think as long as it's configurable, it will work to make the producer function synchronously and roll back the transaction. Note: it's still not perfect; since we're planning for uncommon scenarios, the rollback itself could also fail.
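To make the trade-off concrete, here is a rough sketch of what a synchronous publish inside the transaction could look like. It is illustrative only: the class and method names are made up (not DataHub's actual EntityService code), and it assumes the surrounding framework commits or rolls back the open database transaction depending on whether this method throws.

```java
// Hypothetical sketch: produce the MCL synchronously inside the same transaction,
// so a send failure can still trigger a rollback of the SQL write.
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SynchronousMclEmitter {
  private final Producer<String, byte[]> producer;

  public SynchronousMclEmitter(Producer<String, byte[]> producer) {
    this.producer = producer;
  }

  // Assumes the caller has an open DB transaction that commits only if this returns normally.
  public void persistAndEmit(Runnable sqlWrite, ProducerRecord<String, byte[]> mcl) {
    try {
      sqlWrite.run();                               // 1. write the aspect row
      producer.send(mcl).get(10, TimeUnit.SECONDS); // 2. block until the broker acks the MCL
      // 3. the surrounding framework commits the transaction only after both succeed
    } catch (Exception e) {
      // a send failure surfaces while the transaction is still open, so the caller
      // rolls back instead of committing a row whose MCL was never published
      throw new RuntimeException("MCL publish failed, rolling back SQL write", e);
    }
  }
}
```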
Thanks for taking a look at this... you both bring up good points.

For #1 - brainstorming here, and this could start to get complex... one idea is to write an extra row/column as part of the SQL transaction indicating there is a pending MCL publish to complete. This way we can ensure the entity and the pending flag are both persisted atomically. Once the Kafka produce succeeds, it could remove that pending flag. If the Kafka produce fails, then some background process could periodically check for entities in a pending status and restore indices on those?

For #2 - I suppose configuring numRetries to a really large number effectively accomplishes what we want. Would you have any concerns with allowing a
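For reference, a minimal sketch of the pending-flag idea in #1 (essentially a transactional-outbox-style pattern). The table name `metadata_pending_mcl`, the columns, and the method names are hypothetical, purely to illustrate the shape of the approach, not part of DataHub's schema.

```java
// Hypothetical sketch of the "pending MCL publish" flag described above.
import java.sql.Connection;
import java.sql.PreparedStatement;

public class PendingMclFlag {

  // Inside the same transaction that persists the aspect: record a pending publish.
  public void markPending(Connection txn, String urn, String aspect) throws Exception {
    try (PreparedStatement ps = txn.prepareStatement(
        "INSERT INTO metadata_pending_mcl (urn, aspect, created_on) VALUES (?, ?, NOW())")) {
      ps.setString(1, urn);
      ps.setString(2, aspect);
      ps.executeUpdate();
    }
  }

  // After the Kafka produce is acked: clear the flag.
  public void clearPending(Connection conn, String urn, String aspect) throws Exception {
    try (PreparedStatement ps = conn.prepareStatement(
        "DELETE FROM metadata_pending_mcl WHERE urn = ? AND aspect = ?")) {
      ps.setString(1, urn);
      ps.setString(2, aspect);
      ps.executeUpdate();
    }
  }

  // A periodic background job could then scan metadata_pending_mcl for rows older than
  // some threshold and call restoreIndices (or re-emit the MCL) for those urns.
}
```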
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
I believe this is still an issue |
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
This issue was closed because it has been inactive for 30 days since being marked as stale. |
Describe the bug
There seem to be a couple of scenarios where the search/graph indexes can become out-of-sync with the SQL database:
1. The entity/aspect is persisted to SQL, but the subsequent Kafka produce of the MCL fails. The producer callback only logs the failure after the transaction has already been committed, so the indexes never see the update.
2. The mae-consumer receives the MCL, but the write to Elasticsearch fails even after the configured retries are exhausted, so the document update is lost.
To Reproduce
These scenarios seem like they could happen any time there is a transient/network-level error with the Elasticsearch or Kafka cluster, which is not uncommon.
Expected behavior
In the 1st case, I would expect entityService to roll back the SQL update and return a failure code to the client. If the client was the mce-consumer, I would expect the mce-consumer to keep retrying the failed message (don't commit offsets) until the transient issue goes away and processing succeeds. Ideally the entire persist + MCL publish is atomic (both succeed or both fail).
In the 2nd case, I would expect the mae-consumer to keep retrying the failed message (don't commit offsets) until the transient issue goes away and processing succeeds.
Alternatively, an easier solution might be to automatically call the restoreIndices GMS endpoint when one of these failures is detected.
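To illustrate the retry expectation in the 2nd case, here is a rough sketch of a consumer loop that only commits an offset after the index write succeeds. This is plain kafka-clients pseudocode, not DataHub's actual Spring consumer; the handler interface, class name, and single-partition simplification are assumptions, and the topic name is shown for illustration.

```java
// Hypothetical sketch: never commit an offset past a message that was not applied to
// the index, so transient Elasticsearch failures are retried instead of dropped.
// Assumes enable.auto.commit=false and, for simplicity, a single assigned partition.
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RetryingMaeLoop {

  interface MclHandler {
    void apply(byte[] mcl) throws Exception; // writes to Elasticsearch / graph
  }

  public static void run(KafkaConsumer<String, byte[]> consumer, MclHandler handler) {
    consumer.subscribe(Collections.singletonList("MetadataChangeLog_Versioned_v1"));
    while (true) {
      ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, byte[]> record : records) {
        TopicPartition tp = new TopicPartition(record.topic(), record.partition());
        try {
          handler.apply(record.value());
          // commit exactly this record's offset, and only after the write succeeded
          consumer.commitSync(
              Collections.singletonMap(tp, new OffsetAndMetadata(record.offset() + 1)));
        } catch (Exception transientFailure) {
          // rewind to the failed record so the next poll retries it
          consumer.seek(tp, record.offset());
          break;
        }
      }
    }
  }
}
```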
Screenshots
N/A
Desktop (please complete the following information):
N/A
Additional context
N/A