[Internal] Design Docs: Adds Design Document for Client Telemetry #3590

Merged on Jun 9, 2023 (22 commits)
73 changes: 72 additions & 1 deletion docs/observability.md
@@ -33,4 +33,75 @@ flowchart TD
OtherLogic --> GetResponse(Get Response for the request)
SendResponse --> OperationCall

```

## Send telemetry from SDK to service (Private Preview)

### Introduction
When you opt in, the CosmosDB SDK collects the aggregated telemetry data listed below and sends it to the Azure CosmosDB service every 10 minutes.
1. Operation (CRUD API) latencies and Request Units (RUs).
2. Metadata cache (e.g., CollectionCache) miss statistics
3. Client system usage (during an operation):
* CPU usage
* Memory Usage
* Thread Starvation
* Network Connections Opened (only TCP Connections)
4. The 10 slowest network interactions

> Note: We don't collect any personally identifiable information (PII) as part of this feature.

### Benefits
The telemetry data collected when this feature is enabled allows us to identify and address potential issues early. This improves the support experience and means that some issues can be resolved before they affect your application. In short, customers with this feature enabled can expect a smoother and more reliable experience.

### Impact of enabling this feature
* _Latency_: Customers should not see any impact on latency.
* _Total RPS_: The impact depends on, among other factors, the infrastructure hosting the application that uses the SDK, but it should not exceed 10%.
* _Any other impact_: The collectors need around 18 MB of in-memory storage to hold the data. This footprint is constant: it does not grow, no matter how much data is collected.
* _Benchmark numbers_: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Performance.Tests/Contracts/BenchmarkResults.json
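
To illustrate why a histogram-backed collector can have a constant memory footprint, here is a minimal Python sketch. It is not the SDK's implementation: the class name `FixedBucketHistogram` and the bucket boundaries are hypothetical, and the SDK's actual histogram type may differ. The point is only that recording a value increments a pre-allocated counter, so storage depends on the number of buckets and not on the number of datapoints.

```python
import bisect


class FixedBucketHistogram:
    """Hypothetical fixed-size histogram: memory is set once, at construction."""

    def __init__(self, upper_bounds_ms):
        # Inclusive upper bound of each bucket, in milliseconds (assumed values).
        self.upper_bounds = sorted(upper_bounds_ms)
        # One counter per bucket plus one overflow bucket for larger values.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def record(self, value_ms):
        # Find the first bucket whose upper bound is >= value_ms.
        index = bisect.bisect_left(self.upper_bounds, value_ms)
        self.counts[index] += 1

    def total(self):
        return sum(self.counts)


histogram = FixedBucketHistogram([1, 10, 100])
for latency_ms in [0.5, 1, 5, 50, 500, 5000]:
    histogram.record(latency_ms)
print(histogram.counts)
```

However many values are recorded, `len(histogram.counts)` stays at four in this example.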

### Components

**Telemetry Job:** Background task which collects the data and sends it to the Azure CosmosDB service every 10 minutes.
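
As a rough illustration of such a periodic background task, the Python sketch below runs a flush callback on a fixed interval from a daemon thread. The class `TelemetryJob`, the `flush` callback, and the choice to swallow exceptions are all assumptions made for this example; the SDK's own job is not shown in this document.

```python
import threading


class TelemetryJob:
    """Hypothetical background job that calls flush() every interval_seconds."""

    def __init__(self, flush, interval_seconds=600.0):  # 600 s = 10 minutes
        self._flush = flush
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout (time to flush)
        # and True once stop() has been called.
        while not self._stop.wait(self._interval):
            try:
                self._flush()
            except Exception:
                # Assumption: a telemetry failure should not crash the host app.
                pass
```

A caller would construct it with a function that drains the collectors and sends the payload, call `start()` once at client creation, and call `stop()` on disposal.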

**Collectors:** In-memory storage which keeps the telemetry data collected during an operation. There are three types of collectors:
* _Operational Data Collector_: It keeps operation-level latencies and request units.
* _Network Data Collector_: It keeps all the metrics related to network or TCP calls. It has its own sampler, which samples in only the slowest TCP calls for a particular replica.
* _Cache Data Collector_: It keeps all the cache call latencies. Currently, only the collection cache is covered.
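
One way to keep only the slowest calls per replica is a bounded min-heap, sketched below in Python. This is an assumption about how such a sampler could work, not the SDK's code: `SlowestCallSampler`, its `keep` parameter, and the replica keys are hypothetical names, and whether the limit of 10 applies per replica or overall is not specified here.

```python
import heapq
from collections import defaultdict


class SlowestCallSampler:
    """Hypothetical sampler that retains the `keep` slowest calls per replica."""

    def __init__(self, keep=10):
        self.keep = keep
        # replica -> min-heap of (latency_ms, sequence, call_info)
        self._heaps = defaultdict(list)
        self._sequence = 0  # unique tie-breaker so call_info is never compared

    def record(self, replica, latency_ms, call_info=None):
        self._sequence += 1
        entry = (latency_ms, self._sequence, call_info)
        heap = self._heaps[replica]
        if len(heap) < self.keep:
            heapq.heappush(heap, entry)
        elif latency_ms > heap[0][0]:
            # Slower than the fastest retained call: evict that one.
            heapq.heapreplace(heap, entry)

    def slowest(self, replica):
        """Return retained latencies for a replica, slowest first."""
        heap = self._heaps.get(replica, [])
        return [latency for latency, _, _ in sorted(heap, reverse=True)]


sampler = SlowestCallSampler(keep=3)
for latency_ms in [5, 1, 9, 7, 3]:
    sampler.record("replica-1", latency_ms)
print(sampler.slowest("replica-1"))
```

Each replica holds at most `keep` entries, which keeps the sampler's memory bounded in the same spirit as the histograms.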

**Get VM Information**:

- Azure VM: [Azure Instance Metadata](https://learn.microsoft.com/azure/virtual-machines/instance-metadata-service?tabs=windows) call.
- Non-Azure VM: We don't collect any information other than a VMID, which will be a GUID or a hashed machine name.
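
The non-Azure fallback could look roughly like the Python sketch below. The hash algorithm (SHA-256), the `hashedMachineName:` and `uuid:` prefixes, and the function name `non_azure_vm_id` are assumptions chosen for illustration; this document does not specify them. The Azure Instance Metadata call itself is omitted.

```python
import hashlib
import socket
import uuid


def non_azure_vm_id(machine_name=None):
    """Return a hashed machine name, or a random GUID if no name is available."""
    name = machine_name if machine_name is not None else socket.gethostname()
    if name:
        # Hash the name so the raw host name never leaves the machine.
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
        return "hashedMachineName:" + digest
    # No usable machine name: fall back to a freshly generated GUID.
    return "uuid:" + str(uuid.uuid4())
```

Hashing gives a stable identifier across restarts on the same machine, while the GUID fallback is only stable if the caller caches it.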

**Processor**: It fetches all the data from the collectors, divides it into small chunks (<2MB), and sends each chunk to the Azure CosmosDB service.
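
The chunking step can be sketched as follows in Python. This is a simplified illustration under stated assumptions: records are taken to be JSON-serializable dictionaries, each chunk is a JSON array, and `chunk_records` is a hypothetical helper, not an SDK function. Only the "<2MB" limit comes from the description above.

```python
import json

MAX_CHUNK_BYTES = 2 * 1024 * 1024  # the "<2MB" limit stated above


def chunk_records(records, max_bytes=MAX_CHUNK_BYTES):
    """Group records into JSON array payloads, each smaller than max_bytes."""
    chunks = []
    current = []
    current_size = 2  # bytes for the enclosing "[" and "]"
    for record in records:
        encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
        # A comma is needed before every element except the first.
        extra = len(encoded) + (1 if current else 0)
        if current and current_size + extra >= max_bytes:
            # Adding this record would reach the limit: close the chunk.
            chunks.append(b"[" + b",".join(current) + b"]")
            current = []
            current_size = 2
            extra = len(encoded)
        if current_size + extra >= max_bytes:
            raise ValueError("a single record exceeds the chunk limit")
        current.append(encoded)
        current_size += extra
    if current:
        chunks.append(b"[" + b",".join(current) + b"]")
    return chunks
```

Each returned chunk is a complete JSON document, so it can be sent and parsed independently of the others.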

```mermaid
flowchart TD
subgraph TelemetryJob[Telemetry Background Job]
subgraph Storage[In Memory Storage or Collectors]
subgraph NetworkDataCollector[Network Data Collector]
TcpDatapoint(Network Request Datapoint) --> NetworkHistogram[(Histogram)]
DataSampler(Sampler)
end
subgraph DataCollector[Operational Data Collector]
OpsDatapoint(Operation Datapoint) --> OperationHistogram[(Histogram)]
end
subgraph CacheCollector[Cache Data Collector]
CacheDatapoint(Cache Request Datapoint) --> CacheHistogram[(Histogram)]
end
end
subgraph TelemetryTask[Telemetry Task Every 10 min]
CacheAccountInfo(Cached Account Properties) --> VMInfo
VMInfo(Get VM Information) --> CollectSystemUsage
CollectSystemUsage(Record System Usage Information) --> GetDataFromCollector
end
subgraph Processor
GetDataFromCollector(Fetch Data from Collectors) --> Serializer
Serializer(Serialize and divide the Payload) --> SendCTOverHTTP(Send Data over HTTP to Service)
end
Storage --> |Get Aggregated data|GetDataFromCollector
end
```

### Limitations
1. Azure Active Directory (AAD) support is not available.