Using refinery and otel-collector to route traces based on content (add dataset to inmemory collector cache key) #269

Closed
tr-fteixeira opened this issue Jun 9, 2021 · 3 comments

Comments

@tr-fteixeira
Contributor

I started the discussion on the Pollinators Slack, here.

I'm trying to use Refinery as one of the tools to achieve "trace routing" or "trace multiplexing". It's an unusual use case, but here is the context of the ask.

This might be an unconventional use case, and it seems to run into problems here because Refinery uses only the trace ID to group spans.

Details 👇
Context:
We use Istio and a shared ingress gateway for multiple applications/environments/teams (each mapping to its own dataset).
This means I have to split the destination of the traces based on the content they carry (not doable in Istio alone, or in any single app).

[Screenshot attached]

What am I trying to do?
Get those Istio traces to the correct dataset, by means of:
Shared Istio gateway -> otel-collector (fan out to two exporters, one per dataset) -> Refinery (RulesBasedSampler drops traces belonging to other environments/namespaces) -> Honeycomb
This effectively duplicates all traces generated by Istio and tags each copy with a different dataset.
After that, the thought was to use Refinery rules to drop the copies going to the wrong dataset (a rough rules sketch is below).
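
For reference, this is roughly the kind of Refinery rules config I had in mind for the drop step. It's only a sketch: the dataset name, the namespace value, and the k8s.namespace.name field are placeholders for whatever attributes our spans actually carry, not our real config.

dataset-one:
  Sampler: RulesBasedSampler
  rule:
    - name: drop traces that belong to other namespaces
      drop: true
      condition:
        - field: k8s.namespace.name
          operator: "!="
          value: namespace-one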

What ends up happening?
With a single exporter, it works great. When multiple exporters are enabled, traces don't get evaluated properly: some are kept and some are dropped, seemingly at random.

What I think the problem is:
(No batching is enabled in the OTel collector.)
When Refinery collects the spans in memory and collates them to form a trace here, only the TraceID is taken into account, so it sometimes combines both exporters' data, each destined for a different dataset, into a single trace, making the sampling decision look wrong.

The actual questions
Does this sound right, or did I get it all wrong? 😃
Was this a conscious decision, or am I trying to use the tool for an unexpected use case?

Possible solutions

  • Adding sp.Dataset as a prefix to the cache object key should fix it (a rough sketch follows after this list).
  • Using a separate Refinery deployment for each exporter

I would prefer option 1, but wanted to hear your thoughts before working on a PR 😃
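
To make option 1 concrete, here is the rough shape of what I mean. This is a sketch of the idea, not a patch against the actual collector code; cacheKey is a hypothetical helper name.

package collect

import "github.com/honeycombio/refinery/types"

// Hypothetical sketch: instead of collating spans purely by TraceID, prefix
// the cache key with the dataset the span was sent to, so the two copies of
// a trace (one per exporter/dataset) are kept apart and sampled independently.
func cacheKey(sp *types.Span) string {
	return sp.Dataset + "/" + sp.TraceID
}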

@tr-fteixeira tr-fteixeira changed the title Using refinery and otel-collector to route traces based on content (add dataset to inmemory collectore cache key) Using refinery and otel-collector to route traces based on content (add dataset to inmemory collector cache key) Jun 22, 2021
@paulosman

paulosman commented Jul 21, 2021

Hi @tr-fteixeira - sorry for the slow response on this.

I'm going to try and summarize your use case. Can you please tell me if I got it right or if I'm still misunderstanding something? I'm going to try and remove components specific to your architecture that don't impact the use case:

You have the same traces going through an OTel Collector. The OTel collector has two exporters, each sending the traces to a Refinery instance, but with different Datasets specified. So it would look something like this:

exporters:
  otlp/honeycombOne:
    endpoint: "refinery.yourco.com:9090"
    headers:
      "x-honeycomb-team": "s3cret"
      "x-honeycomb-dataset": "dataset-one"
  otlp/honeycombTwo:
    endpoint: "refinery.yourco.com:9090"
    headers:
      "x-honeycomb-team": "s3cret"
      "x-honeycomb-dataset": "dataset-two"

So at this point, you're essentially tee'ing your traces, sending the same data to two different datasets.

In Refinery, you want to sample these at different rates or based on different rules, so you'd like to look at the dataset and make a rule-based sampling decision based on that value. For example, you may want to sample all traces sent to dataset-one at 1/5 and all traces sent to dataset-two at 1/10.
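
Roughly, the Refinery rules config for that per-dataset split could look something like this (just a sketch to anchor the example; your real rules would likely be more involved):

dataset-one:
  Sampler: DeterministicSampler
  SampleRate: 5

dataset-two:
  Sampler: DeterministicSampler
  SampleRate: 10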

Because Refinery makes sampling decisions based on trace IDs alone, you're going to end up with a seemingly arbitrary mix of sampling at 1/5 and 1/10 in each dataset, which will result in broken traces in both.

Is that correct? Please let me know if I've misunderstood or misstated anything.

If that's the case, I'm trying to think of edge cases where encoding the dataset could create a problem. I don't think there are any, but it will require some thought and/or testing. I'm concerned about how traces are sent to other nodes, what ends up being set in Redis, etc, etc.

Sampling on TraceID was an intentional design choice since trace IDs are meant to be unique. I'm hesitant to encode Dataset as an arbitrary second attribute just to guarantee uniqueness, but I do understand the need (I think?). This is definitely a "there be dragons" kind of situation, so please do be patient with us as we think it through :-)

@tr-fteixeira
Contributor Author

Hey @paulosman, thanks for taking a look at it. Yes, you got it right, that is the use case. At least, that's the only way I could think of to achieve what I needed.

By that I mean: if we go up an abstraction level, I am looking for the ability to send specific traces from the same source system to different datasets, based on rules over trace/span content, any way I can =). There are some discussions on OTel collector routing, but nothing quite there yet.

If you want any help with testing, or any clarifications, let me know (here or on Pollinators 👍).

@vreynolds
Contributor

Summary from Slack: we decided against changing the Refinery caching keys, given that this is a fringe use case for Refinery and fits better as a collector concern. Potential candidates for achieving this in the collector: the trace filter processor or the routing processor.
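
For reference, a rough sketch of what the routing processor approach could look like. The attribute name and exporter/pipeline names are placeholders, and whether you can route on a span/resource attribute versus a request header depends on the processor version, so check the collector-contrib docs before relying on this.

processors:
  routing:
    from_attribute: k8s.namespace.name
    default_exporters:
      - otlp/honeycombOne
    table:
      - value: namespace-two
        exporters:
          - otlp/honeycombTwo

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [routing]
      exporters: [otlp/honeycombOne, otlp/honeycombTwo]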
