update by query and max_docs and conflicts=proceed #63671
Comments
Pinging @elastic/es-distributed (:Distributed/Reindex)
I agree, there looks to be an issue when we truncate the hits due to `max_docs`. I am a bit torn on what the right semantics should be, though. I think the choice is between:
1. counting version conflicts against `max_docs`, so a request may return having updated fewer than `max_docs` documents, or
2. with `conflicts=proceed`, continuing past conflicts until `max_docs` documents have actually been updated or the matching documents run out.
My intuition was towards the 1st option before reading your description here. I am curious if you have more input on this? I believe the Kibana issue should resolve itself, since I assume they retry and run that update periodically, and the version conflicts should then no longer occur, since the tasks that conflicted on the previous run would no longer appear in the search result on a second run?
I think the second choice is what we meant to do when implementing it. I think I made a mistake when I implemented it and sort of bumbled into doing the first thing instead.
I think it will indeed resolve itself without us. They have some other issue around sorting by score that means they are a bit stuck, but I think that is not related to this.
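For readers following the thread, here is a minimal sketch of the kind of request under discussion, written against the Python `elasticsearch` client. The index name, query, script, and field names are placeholders, and exact parameter passing may differ between client versions.

```python
# Sketch of an update-by-query call that combines max_docs with
# conflicts=proceed. Index, query, script, and field names are made up.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.update_by_query(
    index="my-index",             # placeholder index
    conflicts="proceed",          # skip version conflicts instead of aborting
    max_docs=10,                  # cap on the number of documents to update
    body={
        "query": {"term": {"status": "idle"}},
        "script": {
            "source": "ctx._source.status = params.s",
            "params": {"s": "claimed"},
        },
    },
)

# Under the semantics the thread settles on, "updated" should reach max_docs
# whenever at least max_docs matching, non-conflicting documents exist;
# "version_conflicts" reports the conflicts that were skipped along the way.
print(resp["updated"], resp["version_conflicts"])
```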
Thanks for looking into this y'all!
It's an interesting issue, as it has different effects in different situations. Long term we're hoping to solve this by adding coordination between Kibana nodes, so that they can split the work in a more reliable and sustainable fashion. Sadly though, we're still quite far from that (as Kibana instances aren't aware of each other atm), and have to rely on ES to coordinate.
The second option would definitely be our preferred approach, but we only represent one POV, and I can totally see the 1st also being valid.
I think it would violate the principle of least astonishment if there are cases where a given update by query call could have updated max_docs documents but does not because of conflicts. So if this can happen as the code currently stands, I consider that to be a bug.
…laiming process (#89415) This is a first step in attempting to address the overzealous shifting we've identified in TM. It [turns out](elastic/elasticsearch#63671) `version_conflicts` don't always count against `max_docs`, so in this PR we correct the `version_conflicts` returned by updateByQuery in TaskManager to only count the conflicts that _may_ have counted against `max_docs`. This correction isn't necessarily accurate, but it will ensure we don't shift if we are in fact managing to claim tasks.
…laiming process (#89415) (#89540) This is a first step in attempting to address the overzealous shifting we've identified in TM. It [turns out](elastic/elasticsearch#63671) `version_conflicts` don't always count against `max_docs`, so in this PR we correct the `version_conflicts` returned by updateByQuery in TaskManager to only count the conflicts that _may_ have counted against `max_docs`. This correction isn't necessarily accurate, but it will ensure we don't shift if we are in fact managing to claim tasks.
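The correction described in these commits could look roughly like the following. This is a sketch only; the helper name and exact formula are assumptions rather than the actual Kibana TaskManager code.

```python
def conflicts_that_may_count_against_max_docs(updated: int,
                                              version_conflicts: int,
                                              max_docs: int) -> int:
    """Hypothetical helper: keep only the version conflicts that could have
    consumed part of the max_docs budget. If the successful updates already
    filled the budget, none of the conflicts prevented claiming work."""
    remaining_budget = max(0, max_docs - updated)
    return min(version_conflicts, remaining_budget)

# max_docs=10, 10 docs updated, 7 conflicts reported by update_by_query:
# none of the conflicts actually cost us a claim.
assert conflicts_that_may_count_against_max_docs(10, 7, 10) == 0

# max_docs=10, only 4 updated, 7 conflicts: at most 6 of those conflicts
# could have counted against the budget.
assert conflicts_that_may_count_against_max_docs(4, 7, 10) == 6
```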
Hey team, we have now merged a mechanism into Task Manager which essentially tries to shift the time at which a Kibana node polls for tasks if it's experiencing a high rate of `version_conflicts`. There are two things to keep in mind about this solution:
Longer term we hope to introduce some smarter coordination (such as long-term ownership of tasks reducing conflicts, or even a leader node that coordinates the work), but these are still far off. Addressing this bug should reduce the conflicts we experience, enabling us to push that glass ceiling higher, which is why addressing this would be very valuable to us. I hope this helps :)
I've been taking a look into this and I have a couple of tests that reproduce the issue. A second scenario that leads to this problem is when all the matching documents are updated concurrently. This is trickier to solve: we could retry the search request, but AFAIK we don't have any guarantee that the process would make progress in scenarios with high contention. I'm not sure if we want to tackle the second scenario, wdyt @henningandersen?
@fcofdez about the first scenario, I believe we already fetch the docs from source - keeping them around for a while does not seem problematic (in that case they should lower the scroll size). About the second scenario, I think we should not repeat the search. We should do one scroll search only, and if we get to the end we are done. The "guarantee" we give is that we try to process docs that were present when we received the operation. But any docs that appear concurrently with the update by query are not guaranteed to be included. We can therefore safely return if we get to the end. The client cannot tell whether extra docs appeared before or after ES sent back the response anyway. Repeating the search could lead to seeing the same docs multiple times, and if the update script is not idempotent (like a counter or add to an array), that could accumulate infinitely.
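A small illustration of the idempotency concern raised above; the counter update below is a plain-Python stand-in for a non-idempotent Painless script such as a counter increment.

```python
# Why re-running the search is risky when the update script is not
# idempotent. apply_update stands in for something like
# "ctx._source.counter += 1" in Painless.

doc = {"counter": 0}

def apply_update(source: dict) -> None:
    source["counter"] += 1  # non-idempotent: each application changes state

# If a document is seen once during a single scroll, it is updated once.
apply_update(doc)

# If update by query repeated the search and saw the same document again,
# the counter would keep growing even though the caller issued one request.
apply_update(doc)

assert doc["counter"] == 2  # applied twice instead of once
```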
…equests (#71430) In update by query requests where max_docs < size and conflicts=proceed, we weren't using the remaining documents from the scroll response in cases where there were conflicts and the successful updates in the first bulk request were fewer than max_docs. This commit addresses that problem and uses the remaining documents from the scroll response instead of requesting a new page. Closes #63671
…equests (#71931) In update by query requests where max_docs < size and conflicts=proceed, we weren't using the remaining documents from the scroll response in cases where there were conflicts and the successful updates in the first bulk request were fewer than max_docs. This commit addresses that problem and uses the remaining documents from the scroll response instead of requesting a new page. Closes #63671 Backport of #71430
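In pseudocode terms, the fixed behaviour these commits describe might look like the sketch below. It is purely illustrative Python (the real change lives in the Java update-by-query machinery), and `send_bulk` is a hypothetical callback standing in for the bulk indexing step.

```python
def process_scroll_page(hits, max_docs, already_updated, conflicts_proceed, send_bulk):
    """Illustrative sketch: keep consuming the current scroll page until the
    max_docs budget is filled or the page is exhausted, instead of requesting
    a new page after the first truncated bulk request."""
    remaining = list(hits)
    updated = already_updated
    while remaining and updated < max_docs:
        budget = max_docs - updated
        batch, remaining = remaining[:budget], remaining[budget:]
        successes, conflicts = send_bulk(batch)  # hypothetical bulk callback
        updated += successes
        if conflicts and not conflicts_proceed:
            break  # abort on conflict unless conflicts=proceed
        # with conflicts=proceed, loop again over the leftover hits from this
        # same scroll page rather than fetching a new one
    return updated, remaining

# Toy usage: even-numbered hits succeed, odd-numbered hits hit a version
# conflict; with conflicts=proceed the budget of 10 is still filled.
fake_bulk = lambda batch: (sum(1 for h in batch if h % 2 == 0),
                           sum(1 for h in batch if h % 2 == 1))
print(process_scroll_page(list(range(20)), 10, 0, True, fake_bulk))  # -> (10, [19])
```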
elastic/kibana#80371 seems to be having a problem with update_by_query and max_docs. I'm not 100% sure exactly what is going on, but they expect the combination of `max_docs=10` and `conflicts=proceed` to continue running update by query until it manages to update 10 documents. This seems pretty reasonable and it seems like something the code is trying to do. I might have written that code, but I honestly don't remember at this point. Anyway, it looks to me like we at least have an issue where we'll only attempt the first `max_docs` of each bulk response. That would be fine without `conflicts=proceed`, but with it we probably should check whether we should move on to the next few docs in the bulk response.
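To make the reported behaviour concrete, the truncation described in the issue amounts to something like the following sketch. It is illustrative Python rather than the actual code, and `send_bulk` is again a hypothetical stand-in for the bulk step.

```python
def buggy_page_handling(hits, max_docs, send_bulk):
    """Sketch of the behaviour the issue describes: only the first max_docs
    hits of each scroll page are ever attempted, so version conflicts inside
    that slice permanently reduce the number of updated documents, even with
    conflicts=proceed."""
    batch = hits[:max_docs]           # truncate to the first max_docs hits
    successes, conflicts = send_bulk(batch)
    return successes                  # leftover hits on this page are never tried

# With max_docs=10 and 3 conflicts in the first slice, only 7 docs end up
# updated even though plenty of other matching docs were available.
fake_bulk = lambda batch: (len(batch) - 3, 3)
print(buggy_page_handling(list(range(100)), 10, fake_bulk))  # -> 7
```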