feat: store well defined metrics as times-series data streams #9730

Closed
wants to merge 20 commits into from

Conversation

kruskall
Member

@kruskall kruskall commented Dec 2, 2022

Motivation/summary

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Related issues

Closes #9649

@mergify
Contributor

mergify bot commented Dec 2, 2022

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.x is the label to automatically backport to the 7.x branch.
  • backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 2, 2022
@apmmachine
Contributor

apmmachine commented Dec 2, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2023-02-13T08:12:01.033+0000

  • Duration: 21 min 49 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@apmmachine
Contributor

apmmachine commented Dec 2, 2022

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │              bench.out              │
                                  │        sec/op        │    sec/op     vs base               │
FetchAndAdd/FetchFromCache-12               46.15n ± ∞ ¹   41.15n ± ∞ ¹  -10.83% (p=0.008 n=5)
geomean                                     69.01n         62.27n         -9.77%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ build/main/bench.out │              bench.out              │
                                  │         B/op         │    B/op      vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                  │ build/main/bench.out │              bench.out              │
                                  │      allocs/op       │  allocs/op   vs base                │
geomean                                                ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextResetContentEncoding/empty-12                   136.1n ± ∞ ¹   122.1n ± ∞ ¹  -10.29% (p=0.008 n=5)
ContextResetContentEncoding/uncompressed-12            161.5n ± ∞ ¹   145.4n ± ∞ ¹   -9.97% (p=0.008 n=5)
geomean                                                915.8n         968.4n         +5.74%
¹ need >= 6 samples for confidence interval at level 0.95

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
geomean                                                           ³                  +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                           ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out           │
             │        sec/op        │   sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │           bench.out           │
                 │        sec/op        │    sec/op     vs base         │
¹ need >= 6 samples for confidence interval at level 0.95

                 │ build/main/bench.out │            bench.out             │
                 │         B/op         │     B/op       vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                 │ build/main/bench.out │           bench.out            │
                 │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │             bench.out              │
                        │        sec/op        │    sec/op     vs base              │
AggregateTransaction-12           82.79n ± ∞ ¹   77.36n ± ∞ ¹  -6.56% (p=0.008 n=5)
¹ need >= 6 samples for confidence interval at level 0.95

                        │ build/main/bench.out │           bench.out            │
                        │         B/op         │    B/op      vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                        │ build/main/bench.out │           bench.out            │
                        │      allocs/op       │  allocs/op   vs base           │
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  624.6n         593.3n        -5.01%
¹ need >= 6 samples for confidence interval at level 0.95

               │ build/main/bench.out │               bench.out               │
               │         B/op         │     B/op       vs base                │
Process-12              9.245Ki ± ∞ ¹   9.176Ki ± ∞ ¹  -0.75% (p=0.016 n=5)
geomean                             ³                  -0.38%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

               │ build/main/bench.out │              bench.out              │
               │      allocs/op       │  allocs/op   vs base                │
geomean                             ³                +0.00%               ³
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                            │ build/main/bench.out │              bench.out              │
                                            │        sec/op        │    sec/op     vs base               │
WriteTransaction/json_codec_big_tx-12                 9.292µ ± ∞ ¹   4.894µ ± ∞ ¹  -47.33% (p=0.008 n=5)
ReadEvents/json_codec/0_events-12                     352.7n ± ∞ ¹   310.3n ± ∞ ¹  -12.02% (p=0.008 n=5)
ReadEvents/json_codec_big_tx/0_events-12              346.6n ± ∞ ¹   315.5n ± ∞ ¹   -8.97% (p=0.016 n=5)
ReadEvents/nop_codec/0_events-12                      339.1n ± ∞ ¹   308.6n ± ∞ ¹   -8.99% (p=0.008 n=5)
ReadEvents/nop_codec_big_tx/0_events-12               336.5n ± ∞ ¹   306.9n ± ∞ ¹   -8.80% (p=0.016 n=5)
ReadEvents/nop_codec_big_tx/1000_events-12            978.0µ ± ∞ ¹   893.8µ ± ∞ ¹   -8.61% (p=0.032 n=5)
IsTraceSampled/sampled-12                             76.76n ± ∞ ¹   68.49n ± ∞ ¹  -10.77% (p=0.008 n=5)
IsTraceSampled/unsampled-12                           79.13n ± ∞ ¹   71.05n ± ∞ ¹  -10.21% (p=0.008 n=5)
IsTraceSampled/unknown-12                             414.2n ± ∞ ¹   373.3n ± ∞ ¹   -9.87% (p=0.008 n=5)
geomean                                               30.58µ         29.36µ         -3.99%
¹ need >= 6 samples for confidence interval at level 0.95

                                            │ build/main/bench.out │               bench.out                │
                                            │         B/op         │      B/op       vs base                │
WriteTransaction/json_codec_big_tx-12                3.687Ki ± ∞ ¹    3.686Ki ± ∞ ¹  -0.03% (p=0.008 n=5)
ReadEvents/nop_codec_big_tx/100_events-12            244.5Ki ± ∞ ¹    244.7Ki ± ∞ ¹  +0.05% (p=0.032 n=5)
geomean                                              31.39Ki          31.43Ki        +0.16%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                            │ build/main/bench.out │              bench.out               │
                                            │      allocs/op       │  allocs/op    vs base                │
geomean                                                144.7          144.7        +0.00%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
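
The "vs base" column in the report above is the relative change between the two timings. As a minimal sketch (the helper names `deltaPercent` and `formatDelta` are illustrative and not part of benchstat), the percentages for two of the reported rows can be recomputed like this:

```go
package main

import "fmt"

// deltaPercent returns the relative change from base to current, in percent.
// A negative value means the new measurement is smaller (faster) than base.
func deltaPercent(base, current float64) float64 {
	return (current - base) / base * 100
}

// formatDelta renders the change with an explicit sign and two decimals,
// similar to how the report prints its "vs base" column.
func formatDelta(base, current float64) string {
	return fmt.Sprintf("%+.2f%%", deltaPercent(base, current))
}

func main() {
	// Values copied from the report: FetchAndAdd/FetchFromCache-12 (ns/op).
	fmt.Println(formatDelta(46.15, 41.15))
	// Values copied from the report: WriteTransaction/json_codec_big_tx-12 (µs/op).
	fmt.Println(formatDelta(9.292, 4.894))
}
```

Note that every row in the report carries the footnote "need >= 6 samples for confidence interval", so the deltas are computed from only five samples per side and should be read with that in mind.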

Contributor

@simitt simitt left a comment

My understanding is that the index_mode needs to be set per index template, and the fields are used to create dimensions. We need to look into which fields should be used for a dimension, and which metrics should have a time_series_metric definition.

Other things such as look-ahead time might also need to be considered.

This is supposed to be a PoC, and we need to test the implications for the APM UI. Did you mean to open this PR as ready for review? I suggest putting it into draft until everything is figured out.
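
To make the pieces mentioned in this review concrete, here is a minimal sketch of an index template body that sets `index.mode` per template, marks fields as dimensions, and tags a metric with `time_series_metric`. It is not taken from this PR: the index pattern, the `example.request.count` field, the chosen dimensions, and the look-ahead value are all assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// field describes one mapped field in the illustrative template.
type field struct {
	Type      string `json:"type"`
	Dimension bool   `json:"time_series_dimension,omitempty"`
	Metric    string `json:"time_series_metric,omitempty"`
}

// buildTemplate returns an index template body that enables time series mode.
// dims become keyword dimensions and are also used as the routing path;
// metrics are added as-is with their time_series_metric type.
func buildTemplate(pattern string, dims []string, metrics map[string]field, lookAhead string) map[string]any {
	props := map[string]any{}
	for _, d := range dims {
		props[d] = field{Type: "keyword", Dimension: true}
	}
	for name, f := range metrics {
		props[name] = f
	}
	return map[string]any{
		"index_patterns": []string{pattern},
		"data_stream":    map[string]any{},
		"template": map[string]any{
			"settings": map[string]any{
				"index.mode":            "time_series",
				"index.routing_path":    dims,
				"index.look_ahead_time": lookAhead, // placeholder value, needs tuning
			},
			"mappings": map[string]any{"properties": props},
		},
	}
}

func main() {
	tpl := buildTemplate(
		"metrics-example-*",                         // hypothetical pattern
		[]string{"service.name", "transaction.type"}, // assumed dimensions
		map[string]field{
			"example.request.count": {Type: "long", Metric: "counter"}, // hypothetical metric
		},
		"2h",
	)
	out, err := json.MarshalIndent(tpl, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Which fields actually belong in the dimension list, and which metrics get a `time_series_metric` type, is exactly the open question raised in the comment above; the sketch only shows where each decision would be expressed.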

@kruskall
Member Author

kruskall commented Dec 5, 2022

> My understanding is that the index_mode needs to be set per index template, and the fields are used to create dimensions. We need to look into which fields should be used for a dimension, and which metrics should have a time_series_metric definition.
>
> Other things such as look-ahead time might also need to be considered.
>
> This is supposed to be a PoC, and we need to test the implications for the APM UI. Did you mean to open this PR as ready for review? I suggest putting it into draft until everything is figured out.

Thanks for sharing! 🙇

I think I misunderstood how this should work. I'll read more of the docs about it and update the PR.

@kruskall kruskall marked this pull request as draft December 5, 2022 12:55
@kruskall kruskall force-pushed the feat/tsds-metric branch 2 times, most recently from bb72beb to a9d756a Compare December 15, 2022 08:57
@mergify
Contributor

mergify bot commented Dec 15, 2022

This pull request is now in conflicts. Could you fix it @kruskall? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feat/tsds-metric upstream/feat/tsds-metric
git merge upstream/main
git push upstream feat/tsds-metric

Remove index.sort.field for internal metrics:
illegal_argument_exception: [illegal_argument_exception] Reason: [index.mode=time_series]
is incompatible with [index.sort.field]
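
The error quoted in this commit message could also be caught before a template is sent to Elasticsearch. The following is only a sketch of such a pre-flight check, assuming settings are held as a flat string map; `validateSettings` and `dropSortField` are hypothetical helpers, not functions from apm-server.

```go
package main

import (
	"errors"
	"fmt"
)

// errSortWithTimeSeries mirrors the reason given in the Elasticsearch error.
var errSortWithTimeSeries = errors.New("index.mode=time_series is incompatible with index.sort.field")

// validateSettings reports an error when time series mode is combined
// with an explicit index.sort.field setting.
func validateSettings(settings map[string]string) error {
	if settings["index.mode"] != "time_series" {
		return nil
	}
	if _, ok := settings["index.sort.field"]; ok {
		return errSortWithTimeSeries
	}
	return nil
}

// dropSortField returns a copy of settings without index.sort.field,
// which is the fix the commit message describes for internal metrics.
func dropSortField(settings map[string]string) map[string]string {
	out := make(map[string]string, len(settings))
	for k, v := range settings {
		if k != "index.sort.field" {
			out[k] = v
		}
	}
	return out
}

func main() {
	settings := map[string]string{
		"index.mode":       "time_series",
		"index.sort.field": "@timestamp", // illustrative value
	}
	fmt.Println(validateSettings(settings))
	fmt.Println(validateSettings(dropSortField(settings)))
}
```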
@kruskall kruskall marked this pull request as ready for review December 16, 2022 01:08
Contributor

@simitt simitt left a comment

@kruskall lmk once this should be reviewed again. I know that there is currently a Kibana blocker, but we also discussed focusing on the dimensions.

@simitt
Contributor

simitt commented Jan 17, 2023

@kruskall we discussed that (almost) all of the fields that are part of the transaction metrics aggregation key should be part of the dimensions. I find the following fields from the key missing in this PR:

faasColdstart          
faasName               
hostOSPlatform         
kubernetesPodName      
cloudRegion            
cloudAvailabilityZone  
cloudAccountID         
cloudAccountName       
cloudMachineType       
cloudProjectID         
cloudProjectName       	
serviceNodeName        
transactionName        
transactionResult      
transactionType        
eventOutcome           
faasTriggerType        
hostHostname           
hostName               
containerID            
traceRoot 

Can you explain why you excluded these fields from the TSDB key? If in doubt, I'd start by adding all the fields from the aggregation key to the dimensions and see whether that causes any functional or performance issues.

Since the goal of this PoC is not to merge the switch to TSDB but to evaluate potential UI or performance issues, please update as soon as possible so that the evaluation work can start.
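
The comparison described above, aggregation-key fields versus fields already declared as dimensions, is a plain set difference. A minimal sketch follows; the function name and the `serviceName` entry are assumptions for illustration, while the other field names are taken from the list in this comment.

```go
package main

import "fmt"

// missingDimensions returns the aggregation-key fields that are not yet
// declared as dimensions, preserving the order of keyFields.
func missingDimensions(keyFields, dimensions []string) []string {
	have := make(map[string]struct{}, len(dimensions))
	for _, d := range dimensions {
		have[d] = struct{}{}
	}
	var missing []string
	for _, k := range keyFields {
		if _, ok := have[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// A shortened, partly hypothetical aggregation key.
	keyFields := []string{"serviceName", "transactionName", "transactionType", "eventOutcome"}
	// Fields assumed to be declared as dimensions already.
	dimensions := []string{"serviceName"}
	fmt.Println(missingDimensions(keyFields, dimensions))
}
```

Running a check like this against the real aggregation key would make it easy to confirm that a later revision covers every field.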

@kruskall
Member Author

> @kruskall we discussed that (almost) all of the fields that are part of the transaction metrics aggregation key should be part of the dimensions. I find the following fields from the key missing in this PR:
>
> faasColdstart
> faasName
> hostOSPlatform
> kubernetesPodName
> cloudRegion
> cloudAvailabilityZone
> cloudAccountID
> cloudAccountName
> cloudMachineType
> cloudProjectID
> cloudProjectName
> serviceNodeName
> transactionName
> transactionResult
> transactionType
> eventOutcome
> faasTriggerType
> hostHostname
> hostName
> containerID
> traceRoot
>
> Can you explain why you excluded these fields from the TSDB key? If in doubt, I'd start by adding all the fields from the aggregation key to the dimensions and see whether that causes any functional or performance issues.
>
> Since the goal of this PoC is not to merge the switch to TSDB but to evaluate potential UI or performance issues, please update as soon as possible so that the evaluation work can start.

@simitt I was trying to test the changes progressively. Unfortunately, Rally is uploading the corpora, which takes an absurd amount of time on slow connections. I've added all the dimensions, but I'm unable to get any numbers at the moment.

@simitt
Contributor

simitt commented Feb 20, 2023

Closing this for now until the blocker with limiting the number of dimensions is closed.

@simitt simitt closed this Feb 20, 2023
@StephanErb

> Closing this for now until the blocker with limiting the number of dimensions is closed.

@simitt @salvatore-campagna @felixbarny now that elastic/elasticsearch#93564 is resolved and the TSDB dimension limit is gone, would it be possible to reopen this and bring TSDB support to APM?

@kruskall kruskall deleted the feat/tsds-metric branch April 15, 2024 08:37
Labels
backport-skip Skip notification from the automated backport with mergify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PoC: store well defined metrics as times-series data streams
4 participants