-
-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support distributed trace sampling #52
base: master
Are you sure you want to change the base?
Support distributed trace sampling #52
Conversation
Hey! Thanks a lot for all the effort you put into this PR! ❤️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to look through this PR before realizing that I should focus on the spandex
PR first. Just committing the comments that I had already collected here.
@@ -52,9 +52,11 @@ defmodule SpandexDatadog.MixProject do | |||
defp deps do | |||
[ | |||
{:msgpax, "~> 2.2.1 or ~> 2.3"}, | |||
{:spandex, "~> 3.0"}, | |||
{:spandex, github: "TheRealReal/spandex", branch: "distributed-sampling-priority"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I should focus on getting https://github.com/spandex-project/spandex/pull/142/files merged first, eh? 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the spandex changes are simpler and enable spandex_datadog. All the dirty work is done here.
* Change default `priority` to `nil` (not set) instead of 1 | ||
* Do not send metric for `priority` unless set | ||
* Do not set `x-datadog-sampling-priority` header unless priority is set | ||
* Use `sample_rate` tag to set `_dd1.sr.eausr` value | ||
* Use `rule_sample_rate` tag to set `_dd.rule_psr` without default (previously hard coded to 1.0) | ||
* Use `rate_limiter_rate` tag to set `_dd.limit_psr`, without default (previously hard coded to 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if you know whether these breaking changes are necessary or if you're just proposing them. If we could avoid making breaking changes in Spandex, that would obviously be preferred, but if the current behavior is incorrect in terms of what Datadog expects, then we can definitely move toward fixing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are necessary for us to have control so we can reduce our Datadog bill :-).
Forcing the span analytics on results in broken traces for us, i.e. we are sending (and paying for) spans even when the original trace was sampled out and not retained. The span analytics and rule settings are also obsolete with the new Datadog server-based ingestion controls.
Co-authored-by: Greg Mefford <greg@gregmefford.com>
Trying this out, but it doesn't seem to work. Not sure why, but we're still seeing 100% of spans being ingested for "unknown" reason which I assume is that the agent didn't think it could sample. @reachfh does this work for you? |
@zachdaniel I've verified the changes in this PR to see how it would affect our trace ingestion and created a demo app for testing. Here's how to set it up and documentation on some findings (including a bug): https://github.com/kamilkowalski/spandex_sampling_test#readme So it seems the sampling works when viewing live traces in Datadog, but it also seems Ingestion Control constantly reports 100% of traces being ingested even when the rate is 0.5. Maybe we're not sending some necessary statistics or headers? The library is not being recognized by Datadog - could that be a factor? |
Okay, Ingestion Control reporting weird numbers is actually documented behavior1:
Footnotes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your awesome work @reachfh! I'm not one of the maintainers but I went through the changes and left some comments. We're interested in getting some form of sampling working in Spandex so if there's any way I can help just let me know.
- name: Run tests | ||
run: | | ||
mix test | ||
mix format --check-formatted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mix format
doesn't need deps to be installed and could even run in a separate job.
mix local.rebar --force | ||
mix local.hex --force |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebar3 and Hex are installed by erlef/setup-beam
by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
uses: actions/cache@v2 | ||
with: | ||
path: deps/ | ||
key: deps-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deps
directory only contains dependency source files, no need to depend on anything else than mix.lock
:
key: deps-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }} | |
key: deps-${{ hashFiles('mix.lock') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
@@ -1,7 +1,7 @@ | |||
# Elixir CircleCI 2.0 configuration file | |||
# | |||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be removed now that there's a GitHub Actions workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't use CircleCI, so it's fine with me.
`priority` value `AUTO_REJECT` (0) indicates that the trace has not been | ||
selected for sampling. | ||
""" | ||
def auto_reject, do: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def auto_reject, do: 1 | |
def auto_reject, do: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
@@ -0,0 +1,44 @@ | |||
defmodule SpandexDatadog.RateSampler do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking this could implement a more generic Sampler
behaviour, and it should be possible to set this sampler as the one used for all traces when doing application-side sampling. This is how dd-trace-rb
does it. We could then implement more sampling strategies. Otherwise, every time you start a trace, you will need to run this sampler and set the priority. In our services, traces can be started in HTTP endpoints, gRPC endpoints and Kafka consumers - that's a lot of places to update. Even if it's a middleware, I don't see a reason why there shouldn't be a way to configure a global sampler for all traces starting in the service.
I no longer have access to the TheRealReal repo which was used to create this PR. I can make another PR, or feel free to make whatever changes you like to this one. |
This PR configures distributed trace sampling according to the new Datadog ingestion controls.
Datadog now uses the
x-datadog-sampling-priority
header to determine whether a trace should be sampled.There are four values:
MANUAL_KEEP
(2) indicates that the application wants to ensure that a trace is sampled, e.g. if there is an error.AUTO_KEEP
(1) indicates that a trace has been selected for samplingAUTO_REJECT
(0) indicates that the trace has not been selected for samplingMANUAL_REJECT
(-1) indicates that the application wants a trace to be droppedBefore this PR, when initializing a distributed trace via headers, if the
x-datadog-sampling-priority
header was not set, a default value of 1 was used. This effectively forced all traces to be sampled. Now, if the header is not present, the priority is not set (default isnil
).Distributed traces can set the priority of the trace as follows:
x-datadog-sampling-priority
header on the inbound request, propagating it to downstream http requests (handled by this module).plug
) and set the priority to 1 on the current trace (seeSpandex.Tracer.update_priority/2
)The new module
SpandexDatadog.RateSampler
supports sampling a proportion of traces based on the randomtrace_id
.In addition, this PR changes the sample rate parameters which are set when sending a trace to Datadog.
Most of these are now obsolete with the new Datadog ingest handling, and only the distributed sampling priority should be used. In the past, values were hard coded to 1.0. Now the default is to not send these metrics, but they can be configured via span tags.
The
x-datadog-origin
tag is set when traces are started by Datadog RUM and synthetic traces. This PR propagates that header via baggage.Breaking Changes:
priority
tonil
(not set) instead of 1priority
unless setx-datadog-sampling-priority
header unless priority is setsample_rate
tag to set_dd1.sr.eausr
valuerule_sample_rate
tag to set_dd.rule_psr
without default (previously hard coded to 1.0)rate_limiter_rate
tag to set_dd.limit_psr
, without default (previously hard coded to 1.0)Features:
agent_sample_rate
tag to set_dd.agent_psr
without default (was 1.0)priority
x-datadog-origin
headerlib/spandex_datadog/constants.ex
Config
instead ofMix.Config
inconfig
files