Add Otel/STEF project #2492

Open
wants to merge 5 commits into
base: main
from

tigrannajaryan
Member

Otel/STEF (Sequential Tabular Encoding Format) is a new data format and network protocol for OpenTelemetry data.

For the target use-cases Otel/STEF outperforms both OTLP and Otel Arrow (phase 1): Otel/STEF is smaller and/or faster.

See stef.md for details: benchmarks and comparisons to other formats, links to prototypes and description of project goals.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/stef-project branch from f9c88e7 to 33208b4 Compare December 16, 2024 20:51
Contributor

@jsuereth jsuereth left a comment

I believe there is a legitimate need in OpenTelemetry for a stateful protocol (or even something as simple as adding dictionaries to OTLP itself).

As discussed before, I'm supportive of investigating STEF and bringing it to a usable state.

However - I think we need to understand our end-game here. When we evaluate "why not Arrow" or "Why STEF" or "Why not OTLP", I think this proposal is still lacking our primary goal/scope. It has projects, but not the implications of delivering those projects.

I'd like to make sure we align on where this will be used. I called out two areas I think could dramatically improve from some low-level, stateful, efficient protocols. I don't think these are the only targets, but I'm also not sure these were on your radar either.

Let's confirm we agree on end-state then I'm in.

A draft specification and a prototype implementation of Otel/STEF is attached to this
proposal.

### Goals, objectives, and requirements
Contributor

I think there should be two additional goals here (possibly one if you phrase it well).

  • We should evaluate an efficient file-based protocol for direct-export from API with out-of-band "Collector-style" SDK. That is, imagine an SDK implementation that can serialize events out-of-band quickly and efficiently, and offers more resilience on process-death. This is what I was working towards with https://github.com/jsuereth/otlp-mmap/ and I think if we invest in a stateless protocol, this should be a use case it can support.
  • Providing guidance to eBPF based telemetry extraction. If we are able to define structures and buffers and efficient stateful communication, we might be able to provide a good set of primitives for eBPF based event-extraction (complementary to my first bullet point). This is an avenue I think may be worth exploring.

Member Author

We should evaluate an efficient file-based protocol for direct-export from API with out-of-band "Collector-style" SDK.

That's certainly something we can extend STEF to do. A mmap-ed ring buffer of STEF frames can be that. Since STEF optionally allows full state resets between frames, it essentially has a stateless mode built-in (with resets happening every frame).

Providing guidance to eBPF based telemetry extraction.

I am not sure I understand this one.
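
The remark above about resetting state between frames can be illustrated with a toy dictionary encoder. This is a minimal sketch, not the STEF wire format: `dictEncoder`, `encodeFrame`, and the `new:`/`ref:` text output are hypothetical names chosen for illustration. It shows only that clearing the dictionary before every frame makes each frame decodable on its own, at the cost of re-sending strings.

```go
package main

import "fmt"

// dictEncoder is a hypothetical string dictionary: the first time a string is
// seen it is emitted in full and assigned an index; later occurrences are
// emitted as the index only.
type dictEncoder struct {
	index map[string]int
}

func newDictEncoder() *dictEncoder {
	return &dictEncoder{index: map[string]int{}}
}

// reset drops all accumulated state, so the next frame is self-contained.
func (e *dictEncoder) reset() {
	e.index = map[string]int{}
}

// encodeFrame returns a textual stand-in for an encoded frame: "new:<s>" for a
// first occurrence, "ref:<i>" for a repeat. If resetFirst is true the
// dictionary is cleared before encoding.
func (e *dictEncoder) encodeFrame(values []string, resetFirst bool) []string {
	if resetFirst {
		e.reset()
	}
	out := make([]string, 0, len(values))
	for _, v := range values {
		if i, ok := e.index[v]; ok {
			out = append(out, fmt.Sprintf("ref:%d", i))
			continue
		}
		e.index[v] = len(e.index)
		out = append(out, "new:"+v)
	}
	return out
}

func main() {
	frame := []string{"http.method", "GET", "http.method", "POST"}

	// Stateful: the second frame consists of references into state built by
	// the first frame, so it cannot be decoded without it.
	stateful := newDictEncoder()
	fmt.Println(stateful.encodeFrame(frame, false))
	fmt.Println(stateful.encodeFrame(frame, false))

	// Reset on every frame: both frames are identical and self-contained.
	perFrame := newDictEncoder()
	fmt.Println(perFrame.encodeFrame(frame, true))
	fmt.Println(perFrame.encodeFrame(frame, true))
}
```

With a reset on every frame, a reader can start at any frame boundary, which is the property a ring buffer of frames would need.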

Member Author

I think there should be two additional goals here (possibly one if you phrase it well).

I added a more generic goal that says that the project should evaluate additional use-cases.


Project non-goals:

- We do not plan to offer Otel/STEF as a general-purpose replacement for OTLP or for Otel
Contributor

I think I disagree with this. I understand the concern around limiting investment and engineering resources. However - I think you need to better clarify the target use case in this situation. Is this something only the Go SDK would be able to use with a collector? Is this something for just Collector->Collector communication?

That isn't really answered in this proposal, and I think it's critical.

My $.02 is that if we invest in a stateless protocol it should be an Open Standard (e.g. Arrow/parquet) or have an OpenTelemetry use case not well served by existing open standards and OTLP.

Here, you demonstrate that STEF can outperform arrow in efficiency for transmitting telemetry data. Where in open-telemetry do we use that?

TL;DR; this should have a targeted set of use cases for the protocol.

Member Author

Is this something only the Go SDK would be able to use with a collector? Is this something for just Collector->Collector communication?

Technically, nothing prevents any language SDK from having a STEF exporter. I am only making this claim to avoid placing additional burden on language maintainers. Should we (Otel) find it desirable to have STEF protocol support in SDKs, that's certainly doable.

I think the highest ROI is going to be in Collector->Collector or Collector->backend communication and that's why I suggest to start with that.

Can we extend the project to allow the use-cases you mentioned? Absolutely, and I would love to see that happen. I am just intentionally defining the initial scope to be small enough that we can deliver it quickly.

I agree that the direct-export use case you mentioned is also very interesting (not just for performance but also for crash-resilience reasons).

I am happy to extend the scope if you think you (or some other Otel contributor) could invest time in that extended scope.

Member Author

have an OpenTelemetry use case not well served by existing open standards and OTLP.

This is my motivation. OTLP is inefficient size-wise. Otel Arrow Phase 1 is better size-wise, but it is quite expensive CPU-wise, and even smaller sizes are possible (as the benchmarks show). STEF significantly advances the performance of our network protocols on multiple dimensions, with relatively small investment.

The first use case is as is described in the proposal: smaller wire size (network cost savings), less cpu consumption (compute cost savings) by the Collector.

Do you think additional explanation is needed for this use case? Or do you want the additional use cases to be described (e.g. the mmap-ed direct-export)?

Contributor

I think we should showcase where stateful protocols are viable (e.g. how does it work through a load balancer?).

While I think stateful protocols have a lot of good use cases, from a broad sense - I have concerns that we may be optimising network overhead at the expense of inflexible network architecture and memory overhead in storing dictionaries from N clients.

I.e. for small, contextual cases this protocol is amazing and should be optimised for such cases. In broad, highly distributed, cases, we may still need to invest in OTLP optimisations.

To be clear - I think STEF shows a lot of potential and is worth investing in. I want to be explicit in this proposal where we think we see the biggest benefit and where the trade-offs fall off. Let's have a target architecture in mind for benchmarks, comparison and 'success'. I think you have that in your head (and were able to elucidate when we talked about this), but it's just not written in the proposal.
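
The two concerns raised in this comment (memory for dictionaries from N clients, and behaviour behind a load balancer) can be made concrete with a toy receiver. This is a sketch under stated assumptions, not STEF's actual decoder: `entry`, `receiver`, `decode`, and `retained` are hypothetical names, and the wire items are reduced to "define a string" or "reference an earlier string".

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical wire item: either a new string definition or a
// reference to a string defined earlier on the same connection.
type entry struct {
	isRef bool
	ref   int
	value string
}

// receiver keeps one dictionary per connection, which is what any
// dictionary-based stateful protocol needs on the receiving side.
type receiver struct {
	dicts map[string][]string // connection ID -> strings defined so far
}

func newReceiver() *receiver { return &receiver{dicts: map[string][]string{}} }

var errUnknownRef = errors.New("reference to a string this receiver never saw")

// decode resolves a frame against the dictionary of the given connection.
func (r *receiver) decode(conn string, frame []entry) ([]string, error) {
	out := make([]string, 0, len(frame))
	for _, e := range frame {
		if !e.isRef {
			r.dicts[conn] = append(r.dicts[conn], e.value)
			out = append(out, e.value)
			continue
		}
		d := r.dicts[conn]
		if e.ref < 0 || e.ref >= len(d) {
			return nil, errUnknownRef
		}
		out = append(out, d[e.ref])
	}
	return out, nil
}

// retained reports how many dictionary strings are held across all connections.
func (r *receiver) retained() int {
	n := 0
	for _, d := range r.dicts {
		n += len(d)
	}
	return n
}

func main() {
	defs := []entry{{value: "service.name"}, {value: "checkout"}}
	refs := []entry{{isRef: true, ref: 0}, {isRef: true, ref: 1}}

	a, b := newReceiver(), newReceiver()

	// Three clients send the same two strings: receiver a stores them three times.
	for i := 0; i < 3; i++ {
		if _, err := a.decode(fmt.Sprintf("client-%d", i), defs); err != nil {
			panic(err)
		}
	}
	fmt.Println("strings retained by a:", a.retained())

	// A follow-up frame routed to receiver b, which never saw the definitions.
	_, err := b.decode("client-0", refs)
	fmt.Println("decode on b:", err)
}
```

In this toy model retained state grows with the number of connections even when the clients send identical strings, and a frame that lands on a different receiver than its predecessors fails to decode unless the stream is pinned to one receiver or state is reset.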

Member Author

@jsuereth I completely agree with you, we need answers to that. I think this line in the list of goals touches that:

Publish benchmark-justified guidelines on applicability of Otel/STEF vs OTLP vs Otel Arrow.

If this is not enough I can call it out more explicitly.

Member Author

@tigrannajaryan tigrannajaryan Dec 17, 2024

Added more in 88ede99

Member

@yurishkuro yurishkuro left a comment

I am supportive, but would like to see more clarity on the goals, design principles, and trade-offs. For example, this could be explicitly limited to and optimized for a wire transmission protocol at the expense of other usage patterns, but I did not find these goals clearly stated.


SIG meeting to be scheduled once the project is approved.

## FAQ
Member

Has there been a comparative analysis done of other existing protocols besides Arrow?

Member Author

Benchmarks include comparison with OTLP, Parquet and Otel Arrow. If you think there are other interesting formats to compare to we can add it to the project goals.

SIG meeting to be scheduled once the project is approved.

## FAQ

Member

In the spec I did not see a list of design principles for the STEF protocol. Does such a list exist? What is the protocol optimized for? For instance, this doc illustrates speed and size benchmarks, but do they come with trade-offs? What about memory layout, zero-copy capabilities, ability to append data, efficiency of query execution against the data, etc.?

Member Author

Added design principles here 2aa7487

Member Author

@tigrannajaryan tigrannajaryan Dec 17, 2024

Let me know if you would like more details.

@brancz brancz left a comment

I'm a bit sceptical. Arrow has extension types; how much have they been explored? It sounds like this format is in philosophy 90% identical to Arrow. This is intended as a mostly internal protocol (since only Go is expected to have an implementation), so even the weirder things we've seen in Arrow would be doable (e.g. use a binary array and store custom bytes in it).

Much like @yurishkuro, I'd like to see more on tradeoffs.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/stef-project branch from ca01f2d to 24716f9 Compare December 17, 2024 15:22
@tigrannajaryan
Member Author

I'm a bit sceptical. Arrow has extension types; how much have they been explored?

I think this is a question for Otel Arrow SIG.

It sounds like this format is in philosophy 90% identical to Arrow. This is intended as a mostly internal protocol (since only Go is expected to have an implementation), so even the weirder things we've seen in Arrow would be doable (eg. use a binary array and store custom bytes in them).

Correct, in many ways it is similar to Arrow.

Much like @yurishkuro, I'd like to see more on tradeoffs.

Added a section with design principles that explains the tradeoffs.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/stef-project branch from bd6c898 to 88ede99 Compare December 17, 2024 21:35
Comment on lines +12 to +13
Otel/STEF targets a narrower niche than OTLP or Otel Arrow and is more efficient
for that niche. Otel/STEF is optimized for payload size and fast serialization
Member

What about the ecosystem and interoperability?

| batch_size: 1024 | 24764 (total: 1.1 MB) | 5622 (x 4.40) (total: 242 kB) | 10773 (x 2.30) (total: 463 kB) |
| batch_size: 2048 | 39325 (total: 865 kB) | 9209 (x 4.27) (total: 203 kB) | 17808 (x 2.21) (total: 392 kB) |
| batch_size: 4096 | 64824 (total: 713 kB) | 15501 (x 4.18) (total: 170 kB) | 29421 (x 2.20) (total: 324 kB) |
| batch_size: 16384 | 196877 (total: 591 kB) | 38376 (x 5.13) (total: 115 kB) | 86299 (x 2.28) (total: 259 kB) |
Member

What do you think about updating these to show the result of using OTLP + various compression algorithms? gzip is the default, but I wrote benchmarks for a variety of others and had good results. I wonder how Otel/STEF and Otel Arrow + Stream Mode stack up.
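
A comparison of this kind could start from a small helper that reports compressed size and the "x N.NN" ratio used in the table above. The following is a minimal sketch using only Go's standard `compress/gzip`; `gzipSize` and `ratio` are hypothetical helper names, and the payload is a synthetic, highly repetitive stand-in rather than real OTLP data, so its numbers say nothing about the benchmarks in this PR.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// gzipSize returns the number of bytes payload occupies after gzip compression
// at the given level (for example gzip.DefaultCompression).
func gzipSize(payload []byte, level int) (int, error) {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return 0, err
	}
	if _, err := w.Write(payload); err != nil {
		return 0, err
	}
	// Close flushes any buffered data and writes the gzip footer.
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// ratio expresses how many times smaller the compressed form is.
func ratio(raw, compressed int) float64 {
	return float64(raw) / float64(compressed)
}

func main() {
	// Synthetic stand-in for a telemetry batch; replace with real serialized data.
	payload := []byte(strings.Repeat(`{"name":"http.server.duration","unit":"ms"}`, 1000))

	levels := []int{gzip.BestSpeed, gzip.DefaultCompression, gzip.BestCompression}
	for _, level := range levels {
		n, err := gzipSize(payload, level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("level %d: %d -> %d bytes (x %.2f)\n",
			level, len(payload), n, ratio(len(payload), n))
	}
}
```

Other algorithms would need third-party packages, so they are left out here; the same `ratio` helper applies to any of them.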

@austinlparker
Member

Heya, super interesting stuff. I have a couple of questions as a GC member --

  1. Could you help me understand how this project fits into the project goals, overall? I appreciate that 'stateful OTLP' is useful for many reasons, and I do see that this offers improvements vs. OTLP/Arrow, but given the balance of work that we're trying to tackle in 2025 I'm trying to understand where this fits in.
  2. Is the existence of this a result of OTLP/Arrow not being suitable? I agree with some points upthread (e.g., having a stateful 'collector-like' sdk buffer), but it also feels like this being scoped down to just Go/Collector would make it more difficult to achieve that goal.

Appreciate any responses to these points.

8 participants