Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design: Query ServiceInterop dependency fallback #3225

Closed
ealsur opened this issue May 26, 2022 · 10 comments · Fixed by #3226
Closed

Design: Query ServiceInterop dependency fallback #3225

ealsur opened this issue May 26, 2022 · 10 comments · Fixed by #3226
Assignees
Labels
discussion-wanted Need a discussion on an area

Comments

@ealsur
Copy link
Member

ealsur commented May 26, 2022

Currently the V3 SDK, when running on Windows x64, will attempt to use the ServiceInterop.DLL (reference https://docs.microsoft.com/en-us/azure/cosmos-db/sql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#use-local-query-plan-generation).

If the DLL is not present or one of its dependencies, the SDK throws:

Unhandled exception. System.DllNotFoundException: Unable to load DLL 'Microsoft.Azure.Cosmos.ServiceInterop.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
   at Microsoft.Azure.Documents.ServiceInteropWrapper.CreateServiceProvider(String configJsonString, IntPtr& serviceProvider)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.Initialize()
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.TryGetPartitionedQueryExecutionInfoInternal(String querySpecJsonString, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPartitionProvider.TryGetPartitionedQueryExecutionInfo(String querySpecJsonString, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount)
   at Microsoft.Azure.Cosmos.CosmosQueryClientCore.TryGetPartitionedQueryExecutionInfoAsync(SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, Boolean requireFormattableOrderByQuery, Boolean isContinuationExpected, Boolean allowNonValueAggregateQuery, Boolean hasLogicalPartitionKey, Boolean allowDCount, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPlanHandler.TryGetQueryPlanAsync(SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, QueryFeatures supportedQueryFeatures, Boolean hasLogicalPartitionKey, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Query.Core.QueryPlan.QueryPlanRetriever.GetQueryPlanWithServiceInteropAsync(CosmosQueryClient queryClient, SqlQuerySpec sqlQuerySpec, ResourceType resourceType, PartitionKeyDefinition partitionKeyDefinition, Boolean hasLogicalPartitionKey, ITrace trace, CancellationToken cancellationToken)
   at Microsoft.Azure.Cosmos.Query.Core.ExecutionContext.CosmosQueryExecutionContextFactory.TryCreateCoreContextAsync(DocumentContainer documentContainer, CosmosQueryContext cosmosQueryContext, InputParameters inputParameters, ITrace trace, CancellationToken cancellationToken)

This would normally be fixed by making sure the deployment process correctly copies all DLLs included in the SDK Nuget pacakge.

The problem is, some customers have solutions that run on Windows x64 and cannot add the ServiceInterop DLL (for whatever reason), in those cases, they currently have no work-around.

There are two alternatives to this address this:

Automatic fallback

If the DLL is not available, fallback to Gateway, but in order to allow customers to understand the problem, leave something in the Diagnostics only in the case where the application is running on Windows x64 (not on Linux or x86).

If the app runs on Windows x64, but the DLL cannot be used, a new node in the Diagnostics will help customers understand the problem in latency, and we automatically fallback to Gateway.

Example of Diagnostics node that states that a Query Plan from Gateway was done due to Service Interop not being available:

{
	"name": "Gateway QueryPlan",
	"id": "1ffd15dd-6d0f-418d-93e6-9fbbd0d75065",
	"start time": "09:36:54:025",
	"duration in milliseconds": 27.1973,
	"data": {
		"ServiceInterop unavailable": "True"
	}
},

PROs:

  • Customers are used to leverage Diagnostics already, they can see the effect of the missing DLL on Windows.
  • Automatic analysis can discover it and produce insight.
  • No more exceptions

CONs:

  • Potentially bloats Diagnostics with for each Query.
  • Discovery is on the Diagnostics, not on an Exception that might be more visible.

Continue to throw but provide options

Provide a CosmosClientOptions configuration that users can leverage to select how they want to interact with ServiceInterop. Something similar to what we have in V2 ConnectionPolicy.QueryPlanGenerationMode, reference: https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.documents.client.connectionpolicy.queryplangenerationmode?view=azure-dotnet#microsoft-azure-documents-client-connectionpolicy-queryplangenerationmode.

Default behavior would be to throw, users can opt-in to automatically allow Gateway fallback.

PROs:

  • Users can opt-in the automatic fallback explicitly

CONs:

  • The configuration can be set even on environments that don't apply (like Linux), so the question of "Should I change this setting or not?" will need explaining.
  • Documentation to understand which is the correct value to use is required

Related to #2366

@ealsur ealsur self-assigned this May 26, 2022
@ealsur ealsur added the discussion-wanted Need a discussion on an area label May 26, 2022
@sourabh1007
Copy link
Contributor

  1. IMO, automatic fallback is better way. We can catch this exception and add a message saying "Hence falling back to gateway" and along with it, we can have a flag somewhere is diagnostics where it will say if this fallback happened or not.

  2. We can leave it as it is, have a proper documentation which will say what all dependencies are required in order to use serviceInterop and catch and rethrow this exception with the link of that documentation.

@ealsur
Copy link
Member Author

ealsur commented May 26, 2022

The problem is, some customers have solutions that run on Windows x64 and cannot add the ServiceInterop DLL (for whatever reason), in those cases, throwing and pointing to documentation does not help them.

@sourabh1007
Copy link
Contributor

I don't know if it makes sense or not...Can we identify during client initialization if service interop is available or not (may be by running some metadata query). and then set a flag in SDK which will say, don't even try to load serviceinterop...and in request diagnostics client configuration will say this instance is running without serviceinterop library.
Like this, we won't need this failover mechanism at all.

@ealsur
Copy link
Member Author

ealsur commented May 26, 2022

Exactly the solution 1. In that case, we detect the ServiceInterop is not available before hand, and we don't even attempt to use it, and we fall back to Gateway (automatic fallback) but leave a "stamp" on Diagnostics to explain why.

@j82w
Copy link
Contributor

j82w commented May 27, 2022

I believe the explicit model is the better customer solution.

  1. Implicit model is what the v2 SDK had. It caused a lot of CRIs because people did not know they should copy the service interop and other dlls. It would fail silently in the background and cause high latency CRIs. That is the reason v3 SDK always verifies it exists. Reverting back to the implicit model seems like we are going back to model that we know causes CRIs.
  2. It guarantees that the service interop will not be used. There is no need to worry about if the assembly load, but doesn't execute or other edge cases. It also avoids the unnecessary overhead and logs of failing to load the service interop when the customer knows by design it will not work.
  3. It gives an option if there is an issue with the service interop to bypass it as a mitigation like when there is the memory leak.
  4. Customers that want the service interop can require it so there application will fail so they can avoid silent regression where the SDK goes back to gateway.

@ealsur
Copy link
Member Author

ealsur commented May 27, 2022

There are scenarios where this configuration option might not be viable to take. Integrations like Logic Apps where the customer simply has no access to them or Functions.

Adding an option where the default is the current behavior could potentially break those customers without a workaround.

Implicit model is what the v2 SDK had. It caused a lot of CRIs because people did not know they should copy the service interop and other dlls. It would fail silently in the background and cause high latency CRIs. That is the reason v3 SDK always verifies it exists. Reverting back to the implicit model seems like we are going back to model that we know causes CRIs.

This is mainly because we did not expose any information about what was going on + V2 SDK had multiple bugs regarding Query Plan (like executing the Query on GW if Query Plan was obtained from GW).

I propose adding the Diagnostics to indicate why we went to GW for the Query Plan, Diagnostics can be analyzed by automation and they can also be read and explained by users.

@j82w
Copy link
Contributor

j82w commented May 31, 2022

I'm in favor of automatic fall back with the option to change the behavior for the additional flexibility. It gives customers the ability to verify it's loading correctly without bloating/parsing the diagnostics.

@ealsur
Copy link
Member Author

ealsur commented May 31, 2022

But even on automatic fallback scenario (if the customer does not change the option) how do they discover why they went to GW? In V2 we had to ask Customers to enable tracing to find out, why not leverage the Diagnostics?

@j82w
Copy link
Contributor

j82w commented May 31, 2022

The option should not really be needed once the query pass through work is done. All the single pk queries won't need or use the service interop. They will just send query to backend. Only cross partition queries will need to get query plan call. My main concern is we don't want customer's parsing and relying on the parsed json as it might have breaking changes.

@ealsur
Copy link
Member Author

ealsur commented May 31, 2022

I agree, I wouldn't expect them to parse the diagnostics either. But the intent is the same as any other scenario where we use the Diagnostics: If the customer is experiencing higher than expected latency, they share the Diagnostics, we analyze and can say (like the cases where the problem is network latency), by looking at the extra Datum, if they are failing to load the DLL and that is the cause. Similar to what we would do analyzing Traces, but Traces is not a reliable source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion-wanted Need a discussion on an area
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants