Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding diagnostic context #1062

Merged
merged 54 commits into from
Jan 21, 2020
Merged

Adding diagnostic context #1062

merged 54 commits into from
Jan 21, 2020

Conversation

j82w
Copy link
Contributor

@j82w j82w commented Nov 27, 2019

Pull Request Template

Description

This PR adds a diagnostics core object. It has an inner class to create a scope which is labeled. Each scope track the start time and the duration. This allows each layer of the SDK to tracked. The outer diagnostics core also has a basic overview if there is retries.

Key design decisions:

  1. Avoided doing any to string to avoid any cost if the diagnostics is not read
  2. I decided to do a single outer class with a single list. The more detailed version would of been each scope has a list of attributes and a list of inner scopes. This creates a lot of objects when most are not necessary. Having a single out class with a single list seems to give the same amount of information without the extra overhead.

Additional diagnostics being added:

  1. Serialization time
  2. Each handler in the pipeline time
  3. Partition key extraction time
  4. Retries
  5. UserAgent with system info and SDK version

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
{
    "Summary": {
        "StartUtc": "2020-01-16T14:56:58.8108353Z",
        "ElapsedTime": "00:00:00.5060634",
        "UserAgent": "cosmos-netstandard-sdk/3.5.1|3.4.2|34440|X64|Microsoft Windows 10.0.18363 |.NET Core 4.6.28008.01|"
    },
    "Context": [
        {
            "Id": "ItemStream",
            "ElapsedTime": "00:00:00.5060634"
        },
        {
            "Id": "ItemSerialize",
            "ElapsedTime": "00:00:00.0105719"
        },
        {
            "Id": "ExtractPkValue",
            "ElapsedTime": "00:00:00.0837094"
        },
        {
            "Id": "RequestInvokerHandler",
            "ElapsedTime": "00:00:00.4104417"
        },
        {
            "Id": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
            "ElapsedTime": "00:00:00.3957540"
        },
        {
            "Id": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
            "ElapsedTime": "00:00:00.3957334"
        },
        {
            "Id": "TransportHandler",
            "ElapsedTime": "00:00:00.3957263"
        },
        {
            "Id": "PointOperationStatistics",
            "ActivityId": "2092bc00-74a0-4232-b4e4-f6ee78e5094d",
            "StatusCode": 201,
            "SubStatusCode": 0,
            "RequestCharge": 11.05,
            "RequestUri": "dbs/02642de4-61b4-448a-a43f-33fab121507a/colls/fd68dd23-fda9-4e95-8804-7b24bd43c2b4",
            "RequestSessionToken": null,
            "ResponseSessionToken": "0:-1#139",
            "ClientRequestStats": {
                "RequestStartTimeUtc": "2020-01-16T14:56:58.9211256Z",
                "RequestEndTimeUtc": "2020-01-16T14:56:59.2953629Z",
                "RequestLatency": "00:00:00.3742373",
                "IsCpuOverloaded": false,
                "NumberRegionsAttempted": 1,
                "ResponseStatisticsList": [
                    {
                        "ResponseTime": "2020-01-16T14:56:59.2953629Z",
                        "ResourceType": 2,
                        "OperationType": 0,
                        "StoreResult": "StorePhysicalAddress: rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/, LSN: 139, GlobalCommittedLsn: -1, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 201, SubStatusCode: 0, RequestCharge: 11.05, ItemLSN: -1, SessionToken: -1#139, UsingLocalLSN: False, TransportException: null"
                    }
                ],
                "AddressResolutionStatistics": [
                    {
                        "StartTime": "2020-01-16T14:56:59.0865654Z",
                        "EndTime": "2020-01-16T14:56:59.1111528Z",
                        "TargetEndpoint": "https://127.0.0.1:8081//addresses/?$resolveFor=dbs%2f1Z0-AA%3d%3d%2fcolls%2f1Z0-AKb4bUM%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
                    }
                ],
                "SupplementalResponseStatistics": [],
                "FailedReplicas": [],
                "RegionsContacted": [
                    "https://127.0.0.1:8081/"
                ],
                "ContactedReplicas": [
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/",
                    "rntbd://127.0.0.1:10253/apps/DocDbApp/services/DocDbServer5/partitions/a4cb4951-38c8-11e6-8106-8cdcd42c33be/replicas/1p/"
                ]
            }
        }
    ]
}

Performance results:
// Original no diagnostics
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=3.0.100
[Host] : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
MediumRun : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT

Job=MediumRun IterationCount=15 LaunchCount=2
WarmupCount=10

Method Mean Error StdDev StdErr Min Q1 Median Q3 Max Op/s Gen 0 Gen 1 Gen 2 Allocated
CreateItem 96.65 us 1.271 us 1.863 us 0.346 us 93.68 us 95.22 us 96.51 us 97.64 us 102.5 us 10,346.4 4.8828 1.7090 - 30.88 KB

// Original with Diagnostics
Job=MediumRun IterationCount=15 LaunchCount=2
WarmupCount=10

Method Mean Error StdDev StdErr Min Q1 Median Q3 Max Op/s Gen 0 Gen 1 Gen 2 Allocated
CreateItem 101.0 us 1.77 us 2.47 us 0.48 us 98.14 us 98.89 us 100.8 us 102.5 us 108.0 us 9,896.8 5.3711 2.1973 - 32.92 KB

// With changes no diganostics
Job=MediumRun IterationCount=15 LaunchCount=2
WarmupCount=10

Method Mean Error StdDev StdErr Min Q1 Median Q3 Max Op/s Gen 0 Gen 1 Gen 2 Allocated
CreateItem 97.04 us 1.169 us 1.713 us 0.318 us 94.10 us 96.01 us 96.91 us 97.65 us 102.1 us 10,305.5 5.1270 1.9531 - 31.62 KB

// With changes and diagnostics
Job=MediumRun IterationCount=15 LaunchCount=2
WarmupCount=10

Method Mean Error StdDev StdErr Min Q1 Median Q3 Max Op/s Gen 0 Gen 1 Gen 2 Allocated
CreateItem 108.4 us 1.27 us 1.82 us 0.34 us 105.0 us 107.1 us 108.6 us 109.5 us 113.3 us 9,221.4 7.0801 2.6855 - 43.33 KB

@j82w j82w added the Diagnostics Issues around diagnostics and troubleshooting label Nov 27, 2019
@j82w j82w self-assigned this Nov 27, 2019
@kirankumarkolli
Copy link
Member

Can you also please include CP sample as well?

@j82w j82w mentioned this pull request Jan 16, 2020
Jake Willey and others added 2 commits January 16, 2020 06:47
* Wired batch and bulk with diagnostic context.

* fixed comments

* Fixed bulk test
@j82w j82w changed the title Adding initial diagnostics scope Adding diagnostic context Jan 16, 2020
@kirankumarkolli
Copy link
Member

        this.ConnectionPolicy = ConnectionPolicy.Default;

Lets please create a new policy every time.
Static instance might hold the references of all clients in-case of multi client scenarios.


Refers to: Microsoft.Azure.Cosmos/src/DocumentClient.cs:699 in ccdaf97. [](commit_id = ccdaf97, deletion_comment = False)


if (this.ElapsedTime.HasValue)
{
jsonWriter.WriteValue(this.ElapsedTime.Value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elapsed is set only on Dispose(), which Write is consumed before. How will it be set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the possibility of async background tasks that may still be running or possibility of a scope not getting disposed of correctly in an exception scenario. Either way it's better to gracefully handle the scenario than throw an exception.


internal override void WriteJsonObject(JsonWriter jsonWriter)
{
foreach (CosmosDiagnosticWriter writer in this.contextList)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For queries this might get really high. What's the V2 story?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this becomes to large HA will need to provide a way to summarize the content. The current design to v2 has 2 different properties diagnostic string and query metrics. Query metrics does have some ability to be summarized, but diagnostic string does not.

@j82w j82w merged commit 36307e8 into master Jan 21, 2020
@j82w j82w deleted the users/jawilley/diagnostics/scope branch January 21, 2020 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Diagnostics Issues around diagnostics and troubleshooting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants