ReadNextAsync goes crazy and loads all documents from the collection instead of just executing the query. #2313
Comments
Hi @darthkurak, can you try upgrading to 3.17? There is the following fix which I believe will address your problem. If you still see the issue, can you please capture the CosmosDiagnostics? Fyi: @timsander1
We did page.Diagnostics.ToString() and it blew up our production 🗡️
Will let you know about the SDK update.
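For anyone following along, capturing the per-page diagnostics typically looks something like the sketch below (assumed v3 FeedIterator usage; the RU threshold, container, and logging are illustrative, not the reporter's actual code):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class QueryDiagnosticsCapture
{
    // Drain a query and capture CosmosDiagnostics only for expensive pages,
    // so the (potentially large) diagnostics string is not logged on every call.
    public static async Task RunAsync(Container container, QueryDefinition query)
    {
        FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(query);

        while (iterator.HasMoreResults)
        {
            FeedResponse<dynamic> page = await iterator.ReadNextAsync();

            // Illustrative threshold; on SDK versions before the 3.17.1/3.18.0-preview fix,
            // ToString() on query diagnostics could hit the concurrency bug discussed above.
            if (page.RequestCharge > 1000)
            {
                Console.WriteLine(page.Diagnostics.ToString());
            }
        }
    }
}
```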
This PR seems to fix the concurrency bug: #2314. Is it still broken in 3.17?
Looks like it. We made the change at the same time. We bumped up the SDK and added diagnostics to the logs.
Even in this PR it looks like it wasn't completed.
@bchong95 Could you comment on the concurrency issue in the query diagnostics? Based on #2097 (comment) it looks like this is not behaving as expected?
@darthkurak If you have the full stack trace of the concurrency error, it would greatly help us debug or try to reproduce it.
@ealsur
@j82w We did update to 3.17, but still observe the same issue.
@darthkurak can you please contact support so a support ticket is created? That will ensure the issue is properly prioritized.
I've created a ticket. How can I pass you the request ID?
Anecdotal: I don't have any analytics at that level using 3.17.0, but I am observing perf degradation over time. More significantly, calls start to freeze after about 2.5h of running within an app I've deployed that has 2 V3 clients (pointed at different containers) and a V2 CFP client. I have good confidence this hang and perf degradation was not happening when using V2. 🤔 The problem is I'll need to backport a lib to V2 to prove that, and that's not happening this week or next.
@darthkurak I was able to find the ticket. Can you try using 3.18.0-preview and capture the CosmosDiagnostics? It fixes the dictionary concurrency issue. There will be a 3.17.1 release later today which will also contain the fix. @bartelink you are correct about the fixes. Here is the PR to update the changelog. Is there any way you can share a repro of the issue? Is it a memory leak?
@j82w Thanks. Just to be clear: my 'report' should be considered wild speculation for now. I'm not currently running the ingestion process, so I can't even chart memory growth atm. The other thing that's impeding my progress toward proper issue reporting is that I'm having to mix V2 CFP usage with usage of the V3 client as we wait on #1765
Query: "SELECT TOP 10 VALUE root FROM root WHERE ((true OR (root[\"User\"][\"IsDeleted\"] = false)) AND StringEquals(root[\"User\"][\"AllegroData\"][\"Email\"], \"some@mail\", true)) ORDER BY root[\"_ts\"] DESC"} Here is the diagnostics formatted. I only see the SDK doing 2 page requests in direct mode.
Is it related to #1486?
This should already be fixed by #2168
@neildsh The user is on 3.17.1, which already contains that PR. Could this then be a different case?
Hi @ealsur, the symptom of loading all documents was seen when the user was on 3.16. Then the user upgraded to 3.17.1, for which we see only two requests being sent out from the SDK. However, they ran into the concurrent modification of diagnostics issue, which led @j82w to recommend that they use the 3.18.0-preview.
Sorry, but I'm a bit confused by this conversation. What is the conclusion?
Hi @darthkurak, I think a few things are happening here: The search performed is very expensive since it is case insensitive. Also, on the portal, it looks like we are loading only one document and not actually draining the continuation fully. That would explain why it is cheaper... it could be that the portal is getting "lucky" when it hits the partitions in sequence and the very first partition has a matching result, and we don't drain the continuation after that. OTOH, the SDK hits all partitions in parallel, so you have to pay for all those calls. I would recommend trying an exact match on the email instead of the case-insensitive string comparison. I think we should see a significant improvement here.
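To illustrate the recommendation, a parameterized exact-match variant of the query in this thread could look like this sketch (only the query shape comes from the thread; the helper names are assumptions):

```csharp
using Microsoft.Azure.Cosmos;

public static class EmailQueries
{
    // Case-insensitive version (expensive): StringEquals(..., true) cannot be
    // served by the index as efficiently, so the backend reads more documents.
    public static QueryDefinition CaseInsensitive(string email) =>
        new QueryDefinition(
                "SELECT TOP 10 VALUE root FROM root " +
                "WHERE StringEquals(root[\"User\"][\"AllegroData\"][\"Email\"], @email, true) " +
                "ORDER BY root[\"_ts\"] DESC")
            .WithParameter("@email", email);

    // Exact-match version (as recommended above): plain equality is index-friendly.
    public static QueryDefinition ExactMatch(string email) =>
        new QueryDefinition(
                "SELECT TOP 10 VALUE root FROM root " +
                "WHERE root[\"User\"][\"AllegroData\"][\"Email\"] = @email " +
                "ORDER BY root[\"_ts\"] DESC")
            .WithParameter("@email", email);
}
```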
I know that the search is very expensive. Normally it costs us around 800-900 RU. But there are a few searches (exactly the same) which cost 50K RUs, and from the client side the SDK generates 31 calls to Cosmos. So it doesn't look like normal behavior.
@darthkurak the diagnostics you provided only show 2 network Direct + TCP calls.
Hi @darthkurak, as @j82w suggested, can you please gather the full diagnostics for the requests that you mentioned? When you say the searches are "exactly the same", do you mean that exactly the same query (including parameters) costs 800-900 RU sometimes and 50k RU at other times? Did you consider my recommendation above to use an exact match instead of a case-insensitive match? That will help a lot to reduce the latency and cost.
@j82w We can't change to a singleton right now. We have 3 instances on purpose, as each of them manages a separate resource. Maybe one of them is redundant, so we will fix this soon. Good to know, we will optimize it. But this is another case; I see no connection with the reported problem.
You can see that ReadNextAsync for the first page takes 72s, but the prefetch actually takes 4.5s, and there is nothing more in the diagnostics logs which would show where the rest of the 68s is spent. I think that your diagnostics logs somehow do not capture this edge case, because our independent logging, based on a custom RequestHandler (https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.cosmos.requesthandler?view=azure-dotnet) and shown in the AppInsights screenshot included above, reports many more requests which each also take 4.5 seconds and sum up to this 72s. Also, the diagnostics logs you formatted for the first page do not match the one which I pasted (wrong copy-paste?).
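For context, independent per-request logging via the RequestHandler extensibility point referenced above might look roughly like this (a minimal sketch; the handler name and logging sink are illustrative, not the reporter's actual code):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Minimal sketch of per-request logging through a custom RequestHandler.
public class RequestLoggingHandler : RequestHandler
{
    public override async Task<ResponseMessage> SendAsync(
        RequestMessage request, CancellationToken cancellationToken)
    {
        var stopwatch = Stopwatch.StartNew();
        ResponseMessage response = await base.SendAsync(request, cancellationToken);
        stopwatch.Stop();

        // In the real setup this would be tracked as an Application Insights dependency.
        Console.WriteLine(
            $"{request.Method} {request.RequestUri} -> {response.StatusCode}, " +
            $"RU: {response.Headers.RequestCharge}, elapsed: {stopwatch.ElapsedMilliseconds} ms");

        return response;
    }
}

// Registration (the handler runs once per SDK request):
// var options = new CosmosClientOptions();
// options.CustomHandlers.Add(new RequestLoggingHandler());
// var client = new CosmosClient(connectionString, options);
```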
If they each have unique resources, then it's not an issue to have 3 clients. The scenario that is a problem is when the SDK client is getting re-created for the same resource. We've seen users create a new client for each Cosmos operation, which causes issues since the caches have to get warmed for each new instance. Just to clarify, the times in the diagnostics are in milliseconds, not seconds.
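For reference, the usual way to avoid re-creating clients for the same resource is to register one CosmosClient per account as a singleton; a minimal sketch assuming a typical .NET DI setup (not code from this thread):

```csharp
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.DependencyInjection;

public static class CosmosClientRegistration
{
    // Register a single CosmosClient for the lifetime of the app so its
    // connection and metadata caches are warmed only once.
    public static IServiceCollection AddCosmos(
        this IServiceCollection services, string connectionString)
    {
        return services.AddSingleton(_ => new CosmosClient(
            connectionString,
            new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct }));
    }
}
```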
I know, but 59751.2997 ms gives us 60 seconds, right? :)
Any update? :)
Hi @darthkurak , thank you for following up. The backend logs will always show us all the requests. However, we do need the activityId to look them up. For these two activityIds that you shared initially:
We did see a bunch of continuations on the backend. Because the search is case insensitive, we were not able to push it to the index as efficiently; this results in more documents being loaded on the backend and hence the high RU cost. The backend is combing through all the documents until it finds the one match. This does not seem like an SDK issue. Since you already have an ICM open on the backend, let's follow up there to make the query more efficient. OTOH, if the claim is that this only happens some of the time for exactly one query, we can investigate that as well. However, I am going to close this issue here, since it does not look like an SDK issue at this point.
@neildsh Where could we (the community) track the fix/improvement done on the backend side if it's not in this PR?
@neildsh Yes, exactly the same query (with the same parameters) results most of the time in 700-800 RUs and one call to the backend. But sometimes it takes 50k RUs and multiple calls to the backend. It seems like there is still a misunderstanding of the problem here. I have evident logs which tell exactly what calls were made. If this is on the backend side, why do I see in the AppInsights logs 31 calls to the Cosmos instance produced by 2 calls of ReadNextAsync?
Hi @YohanSciubukgian, backend improvements aren't tracked here, but we'll blog about significant ones here: https://devblogs.microsoft.com/cosmosdb/ @darthkurak, if you're still seeing the 31 calls to Cosmos DB for that query, could you share example diagnostics that show the 31 calls? Based on the diagnostics shared above, I only see 2 calls. For improving StringEquals performance, have you tried the tip here? https://docs.microsoft.com/azure/cosmos-db/troubleshoot-query-performance#improve-string-system-function-execution
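One common pattern in this area (an illustrative assumption here, not necessarily the exact guidance in the linked doc) is to persist a case-normalized copy of the field and filter on it with plain equality, which the index can serve directly:

```csharp
using Microsoft.Azure.Cosmos;

// Illustrative document shape: "EmailLower" is a hypothetical denormalized,
// lowercased copy of User.AllegroData.Email written at ingestion time.
public class UserDocument
{
    public string id { get; set; }
    public string Email { get; set; }
    public string EmailLower { get; set; }
}

public static class NormalizedEmailQuery
{
    public static QueryDefinition Build(string email)
    {
        // Exact equality on the normalized field avoids the case-insensitive
        // StringEquals(..., true) call; a composite index on (EmailLower, _ts)
        // may be needed to serve the ORDER BY efficiently.
        return new QueryDefinition(
                "SELECT TOP 10 VALUE c FROM c WHERE c.EmailLower = @email ORDER BY c._ts DESC")
            .WithParameter("@email", email.ToLowerInvariant());
    }
}
```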
Hi everyone.
I have a nasty case for you, which we observe in our production code. It's random, it's crazy, and we are out of ideas why this is happening. We need your help, as it seems to have its source inside the SDK itself.
Take a seat and enjoy.
Executed query:
SELECT TOP 10 VALUE root FROM root WHERE ((true OR (root["User"]["IsDeleted"] = false)) AND StringEquals(root["User"]["AllegroData"]["Email"], "somestandard_nothingspecial@email.here", true)) ORDER BY root["_ts"] DESC
Executed in portal:
Executed on our production:
FeedIterator.ToListAsync: PageCount: "1", RuUsage: "1971.7099999999998", PageNumber: 0
FeedIterator.ToListAsync: PageCount: 0, RuUsage: 46456.38, PageNumber: 1
FeedIterator make more than one roundtrip! RuUsage: 48428.09, Query: "query from point 1"
So, one ReadNextAsync call produces most of these dependencies, resulting in 46k RuUsage.
In this wrong execution, the sum of Metric-retrievedDocumentCount across the dependencies equals the document count of the collection, meaning all documents from the collection are loaded on the Cosmos server. On the other hand, Metric-outputDocumentCount always has one or zero results (as expected from the query).
This happens randomly, a couple of times a day, always for the same query but with different parameters.
It takes time (see the picture from point 2).
SDK versions
Microsoft.Azure.Cosmos (3.16)
Microsoft.Azure.DocumentDB.Core (2.13.1)
ToListAsync code:
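The exact helper isn't reproduced in this thread; roughly, it drains the FeedIterator page by page and logs RU usage per page, along these lines (a sketch; the names and log format echo the log lines above but are assumptions, not the actual production code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class FeedIteratorExtensions
{
    // Drains the iterator page by page, logging RU usage per page,
    // which is what produces the "FeedIterator.ToListAsync" log lines above.
    public static async Task<List<T>> ToListAsync<T>(this FeedIterator<T> iterator)
    {
        var results = new List<T>();
        double totalRu = 0;
        int pageNumber = 0;

        while (iterator.HasMoreResults)
        {
            FeedResponse<T> page = await iterator.ReadNextAsync();
            totalRu += page.RequestCharge;

            Console.WriteLine(
                $"FeedIterator.ToListAsync: PageCount: {page.Count}, " +
                $"RuUsage: {page.RequestCharge}, PageNumber: {pageNumber}");

            results.AddRange(page);
            pageNumber++;
        }

        if (pageNumber > 1)
        {
            Console.WriteLine($"FeedIterator made more than one roundtrip! RuUsage: {totalRu}");
        }

        return results;
    }
}
```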
Do you have any ideas why this is happening? Is it possible that something is wrong with the StringEquals function? Some edge case where it doesn't work properly?