
Adjust the collector semantics #180

Closed
huntc opened this issue Jul 27, 2020 · 28 comments
Labels: release:after-ga (Not required before GA release, and not going to work on before GA), spec:logs, spec:metrics, spec:traces

Comments

@huntc

huntc commented Jul 27, 2020

Looking at the export function for, say, metrics (same for tracing), we have:

service MetricsService {
  // For performance reasons, it is recommended to keep this RPC
  // alive for the entire life of the application.
  rpc Export(ExportMetricsServiceRequest) returns (ExportMetricsServiceResponse) {}
}

Given that a client in this context is the application producing the metrics and therefore making the request, would it not be more correct to have the request declared as "import" (not export) i.e.:

service ImportMetricsService {
  // For performance reasons, it is recommended to keep this RPC
  // alive for the entire life of the application.
  rpc Import(ImportMetricsServiceRequest) returns (ImportMetricsServiceResponse) {}
}

Additionally, would it be reasonable to also keep the existing MetricsService service (renamed) so that the collector can become the client and the application producing the metrics becomes the server? In this instance, the service would stream responses of metrics:

service ExportMetricsService {
  rpc Export(ExportMetricsServiceRequest) returns (stream ExportMetricsServiceResponse) {}
}

message ExportMetricsServiceRequest {
}

message ExportMetricsServiceResponse {
  // A stream of ResourceMetrics.
  // For data coming from a single resource this array will typically contain one
  // element. Intermediary nodes (such as OpenTelemetry Collector) that receive
  // data from multiple origins typically batch the data before forwarding further and
  // in that case this array will contain multiple elements.
  repeated opentelemetry.proto.metrics.v1.ResourceMetrics resource_metrics = 1;
}

I've used the above export-style scenario in situations where I don't know what collector I'm going to use at development time and prefer to choose at runtime (if any exporter at all). Think of resource-constrained devices deployed at the edge behind poor networks. In this scenario, I'd use tools like buf to query the application by logging onto the edge device, to observe what's going on in real time.

@bogdandrutu
Member

This starts a very interesting discussion about which service initiates the connection and which service exposes a port. In the current format the port is exposed only by the collector, which is easier for discovery and configuration. I can see where this idea is coming from, but I need to think about all the ups and downs.

@huntc
Author

huntc commented Jul 29, 2020

Happy to jump on a call if it helps the discussion. I'm UTC+10.

@bogdandrutu
Member

bogdandrutu commented Jul 29, 2020

Also, this has implications as long as we support deltas or raw measurements, because the client (application) needs to buffer all data not yet exported until the collector comes and asks for it.

I'm not saying which one is better, but all of this needs to be considered.

Try to join the metrics SIG meeting when it is scheduled at an Australia-friendly time; I think it is 4 pm PST.

@huntc
Author

huntc commented Jul 29, 2020

Also, this has implications as long as we support deltas or raw measurements, because the client (application) needs to buffer all data not yet exported until the collector comes and asks for it.

My "opinionated" view is that telemetry is provided on a "best-effort" basis. Drop it, don't buffer it. Here's a sample exporter I'm writing that takes on this view: titanclass/reactive-streams-telemetry#6

Try to join the metrics SIG meeting when it is scheduled at an Australia-friendly time; I think it is 4 pm PST.

Will do.

@tigrannajaryan
Member

tigrannajaryan commented Jul 29, 2020

In this instance, the service would stream responses of metrics:

gRPC streams are not load balancer friendly (they are not automatically rebalanced), and this is the reason we use gRPC unary calls instead of streams.

Additionally, we support an HTTP transport for OTLP too. A requirement for streaming would make HTTP impossible, or we would have to resort to complicated solutions like long polling or WebSockets, which is undesirable since plain HTTP 1.1 is beneficial due to its simplicity.

@tigrannajaryan
Member

This starts a very interesting discussion about which service initiates the connection and which service exposes a port.

On push vs pull metric protocols: https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/

Also, this has implications as long as we support deltas or raw measurements, because the client (application) needs to buffer all data not yet exported until the collector comes and asks for it.

Just to clarify this: the direction of connection establishment does not necessarily coincide with the direction of request initiation. For transports that support bidirectional, full-duplex operation, either side can be the requesting side regardless of who performed the initial connection. This is the case for gRPC streams or for WebSockets (both are full duplex). However, for a plain HTTP 1.1 connection it is logically mandatory that the side that establishes the connection is the requesting party (this can be circumvented by unconventional usage of the HTTP transport, but that is typically not possible, or is difficult, with most high-level HTTP clients).
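To make that concrete, here is a purely hypothetical sketch (the service and message names are invented for illustration and are not part of OTLP): with a bidirectional stream, the application can be the gRPC client that dials the collector, while the collector acts as the requesting side over the established stream.

service HypotheticalMetricsPullService {
  // The application (the gRPC client) establishes the connection and opens the
  // stream. The collector (the gRPC server) then initiates each exchange by
  // sending a PullCommand on the response stream; the application answers with
  // MetricsData messages on the request stream.
  rpc Attach(stream MetricsData) returns (stream PullCommand) {}
}

message PullCommand {
}

message MetricsData {
  repeated opentelemetry.proto.metrics.v1.ResourceMetrics resource_metrics = 1;
}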

@huntc I think this proposal is leading us in the wrong direction for OTLP. We have already declared OTLP traces as stable. OTLP traces already define the direction of connection establishment and the transport to use (gRPC unary and HTTP 1.1). I believe we need to keep the protocol consistent, and metrics should behave the same way.

Unless there is clear evidence that the current OTLP approach is not working, I would refrain from such changes in OTLP itself. If there is a need to collect metrics in a different, special way, then design a specialized protocol and support it in the Collector and in the SDKs. BTW, there is already a polling metrics protocol - Prometheus - which may or may not be good enough for the use case (and I would want to see more about the use case that is driving this proposal).

@huntc
Author

huntc commented Jul 29, 2020

At the very least, if we could consider renaming the service to correctly reflect the action, then that might be useful.

As for streaming and load balancers, I'm not getting that. This interface already declares that long-lived connections are required.

Seems to me that having the app initiate a connection is the right thing to do. However, I don’t see any downside in supporting both the import and export service interfaces as possibilities.

I’m actually using the export (push) approach and this interface for an Android Service implementation. Android apps request an export and telemetry is streamed from another app hosted on the same device.

@huntc
Author

huntc commented Jul 30, 2020

Thinking of this a bit more... perhaps the connection instigator and requester should be highlighted as being either one and the same or different, and we keep the interface as-is (although streamed responses should be permissible). For example:

  • connector == app, requester == app (what it is today)
  • connector == app, requester == collector
  • connector == collector, requester == app
  • connector == collector, requester == collector

@tigrannajaryan
Member

tigrannajaryan commented Jul 30, 2020

As for streaming and load balancers, I'm not getting that. This interface already declares that long-lived connections are required.

I think the comment is outdated and is a remnant from the time we were using streams (that was the case in the very first version of the protocol). gRPC will likely still use long-lived connections for unary calls, but we don't need to do anything special about it. It is gRPC's concern, not ours.

The problem with load balancers is the following: gRPC-aware load balancers understand gRPC unary requests and are capable of load balancing them (sending them to different targets), and they make the balancing decision for each individual unary request. For gRPC streams the load balancing decision is made for the entire stream. This means that if you keep the stream open (which you normally do), the individual requests inside that stream will not be load balanced. This can lead to unbalanced loads on the destinations if you have streams with varying amounts of requests. We have seen this actually happening in production. It can be mitigated by closing/reopening streams, but that complicates the implementation. Because of this we chose unary gRPC.
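As an illustrative fragment only (these are not OTLP definitions), the difference is in where the balancing decision happens:

service UnaryStyleMetricsService {
  // A gRPC-aware load balancer can pick a different backend for every call.
  rpc Export(ExportMetricsServiceRequest) returns (ExportMetricsServiceResponse) {}
}

service StreamStyleMetricsService {
  // The backend is chosen once, when the stream is opened; every message sent
  // on that stream afterwards goes to the same backend.
  rpc Export(stream ExportMetricsServiceRequest) returns (ExportMetricsServiceResponse) {}
}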

@tigrannajaryan
Member

Thinking of this a bit more... perhaps the connection instigator and requester should be highlighted as being either one and the same or different, and we keep the interface as-is (although streamed responses should be permissible). For example:

* connector == app, requester == app (what it is today)

* connector == app, requester == collector

* connector == collector, requester == app

* connector == collector, requester == collector

Only the first and last are reasonably transport agnostic. The second and third are only possible for a narrower set of transports, and I don't think they are a good approach for OTLP, which we want to be available everywhere (e.g. anywhere HTTP 1.1 works).

@bogdandrutu added the release:required-for-ga (Must be resolved before GA release, or nice to have before GA), spec:logs, spec:metrics and spec:traces labels on Jul 30, 2020
@bogdandrutu
Member

Marked this as required for GA because a decision is required before GA.

@huntc
Author

huntc commented Jul 30, 2020

I think the comment is outdated and is a remnant from the time we were using streams (that was the case in the very first version of the protocol). gRPC will likely still use long-lived connections for unary calls, but we don't need to do anything special about it. It is gRPC's concern, not ours.

I disagree. Getting the types right is important. If a streaming activity is desired, then it should be conveyed in the types. We're talking only about the gRPC interfaces here, not the HTTP ones. The IDL pertains to gRPC only.

@huntc
Author

huntc commented Jul 31, 2020

The problem with load balancers is the following: gRPC-aware load balancers understand gRPC unary requests and are capable of load balancing them (sending them to different targets), and they make the balancing decision for each individual unary request. For gRPC streams the load balancing decision is made for the entire stream. This means that if you keep the stream open (which you normally do), the individual requests inside that stream will not be load balanced. This can lead to unbalanced loads on the destinations if you have streams with varying amounts of requests. We have seen this actually happening in production. It can be mitigated by closing/reopening streams, but that complicates the implementation. Because of this we chose unary gRPC.

The existing gRPC IDL comment states:

// For performance reasons, it is recommended to keep this RPC
// alive for the entire life of the application.

You're already declaring that you desire long-lived connections for gRPC.

Also, I'm keen to understand the role that a load balancer plays in a collector's network topology from your perspective. My experience has been that a collector typically resides on the same host as the processes that it collects data from; I've even seen agents deployed as sidecars to the processes within the same container. Having scanned the collector docs, I see no mention of load balancers.

@huntc
Author

huntc commented Jul 31, 2020

Only the first and last are reasonably transport agnostic. The second and third are only possible for a narrower set of transports, and I don't think they are a good approach for OTLP, which we want to be available everywhere (e.g. anywhere HTTP 1.1 works).

The IDL being discussed here is purely for gRPC, not HTTP.

@tigrannajaryan
Member

The existing gRPC IDL comment states:

// For performance reasons, it is recommended to keep this RPC
// alive for the entire life of the application.

You're already declaring that you desire long-lived connections for gRPC.

The comment is incorrect. AFAIK there is nothing you need to do to keep a unary gRPC connection long-lived. Historical context: this comment is copied from the OpenCensus protocol definition, which indeed uses streams: https://github.com/census-instrumentation/opencensus-proto/blob/54c9a4c53e1fa422b92e3ae1cd6c72e181f744db/src/opencensus/proto/agent/trace/v1/trace_service.proto#L45
The purpose of the comment was to make sure the stream was not closed. This is not applicable to unary calls because there is nothing to open or close in unary communication. The comment is meaningless for OTLP because it does not use streams. We simply forgot to delete it when we changed from streams to unary.
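For reference, the linked OpenCensus service is stream-based, roughly like this (paraphrased; see the linked file for the exact definition):

service TraceService {
  // OpenCensus keeps a single long-lived bidirectional stream open for the
  // life of the application, which is what the copied comment was about.
  rpc Export(stream ExportTraceServiceRequest) returns (stream ExportTraceServiceResponse) {}
}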

Also, I'm keen to understand the role that a load balancer plays in a collector's network topology from your perspective. My experience has been that a collector typically resides on the same host as the processes that it collects data from; I've even seen agents deployed as sidecars to the processes within the same container. Having scanned the collector docs, I see no mention of load balancers.

Collectors can be used as agents (as you describe) but also as standalone, horizontally scaled services that are placed behind a load balancer. We have done this extensively in production in the past and had a lot of pain with gRPC streaming causing imbalances in the system. The current OTLP design is based on those past learnings, including this particular pain point.

You can read https://github.com/open-telemetry/oteps/pull/35/files for more context on how OTLP was designed. There is a section that discusses streams, unary, OpenCensus, etc.

@tigrannajaryan
Member

The IDL being discussed here is purely for gRPC, not HTTP.

Correct; however, it is desirable to have reasonably consistent behavior for both transports. There can be differences, but we should aim for consistency where we can. Using different connection directions for OTLP/gRPC and OTLP/HTTP is, in my opinion, highly undesirable.

@huntc
Author

huntc commented Jul 31, 2020

The IDL being discussed here is purely for gRPC, not HTTP.

Correct; however, it is desirable to have reasonably consistent behavior for both transports. There can be differences, but we should aim for consistency where we can. Using different connection directions for OTLP/gRPC and OTLP/HTTP is, in my opinion, highly undesirable.

I see differing interfaces for HTTP and gRPC as reasonable. One shouldn't be shoehorned into the other, as this will lead to compromise. The same goes for GraphQL, Android IPC or whatever. The IDL here is useful for communicating gRPC semantics only, recognising that the metrics and tracing IDL can be used readily with other transports.

@bogdandrutu
Member

@huntc I think your use case seems reasonable in some scenarios, but overall it will not necessarily improve usability in the general case. Because of this, I would suggest that you come up with a proposal for a different protocol and we can support that as well, but I think we should also support the current protocol.

So as a follow-up, I think you should try to define a new protocol in the experimental directory and maybe prototype it in one language. Does that sound fair?

@bogdandrutu added the release:after-ga (Not required before GA release, and not going to work on before GA) label and removed the priority:p1 and release:required-for-ga (Must be resolved before GA release, or nice to have before GA) labels on Aug 29, 2020
@huntc
Author

huntc commented Aug 29, 2020

@huntc I think your use case seems reasonable in some scenarios, but overall it will not necessarily improve usability in the general case. Because of this, I would suggest that you come up with a proposal for a different protocol and we can support that as well, but I think we should also support the current protocol.

So as a follow-up, I think you should try to define a new protocol in the experimental directory and maybe prototype it in one language. Does that sound fair?

Thanks for asking. I’ve probably muddied the waters here by asking for a number of things:

  • Can we rename export to import to more accurately reflect the data flow?
  • Can we introduce a new Export?

In both cases, I feel we should specify the reply as a stream given that stream semantics are available. This is a gRPC interface specification for the request/reply.

How about we look at renaming export to import (the service is issuing a request to the collector to import its telemetry) and declare the reply as a stream? We can look at anything else separately. Would that make sense? I’m pushing this because I don’t think we have specified the existing interface correctly. Perhaps I’m missing something though.
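For concreteness, the shape I have in mind is roughly the following (a sketch only, reusing the existing request/response messages under new names):

service ImportMetricsService {
  // The application asks the collector to import (collect) its telemetry.
  // Declaring the reply as a stream conveys the long-lived nature of the call
  // in the types rather than in a comment.
  rpc Import(ImportMetricsServiceRequest) returns (stream ImportMetricsServiceResponse) {}
}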

@bogdandrutu
Member

bogdandrutu commented Aug 30, 2020

@huntc I think I see things differently. The application (service) is making a request to the collector to export its telemetry data. I don't see the current protocol as an import, because it is not the collector who initiates the request.

So I read the current protocol as: the application exports its telemetry data to the collector.

Can we introduce a new Export?

Yes. As mentioned, we can start with a concrete proposal of how this new protocol should look. For the moment we can keep it in the experimental directory until we better understand the requirements.

In both cases, I feel we should specify the reply as a stream given that stream semantics are available. This is a gRPC interface specification for the request/reply.

Can you explain why the current protocol would benefit from having a response stream? Right now the application makes an export request with the data and just waits for one response to acknowledge that the data was received. I don't see a reason to have a stream there; maybe I am missing something.

@bogdandrutu
Member

Also, to set the right expectations: it would be very hard to change the current protocol unless we prove something is completely wrong (which we may, but we need to have a good argument). Adding a new protocol to support your proposal is much easier.

@huntc
Author

huntc commented Aug 30, 2020

@huntc I think I see things differently. The application (service) is making a request to the collector to export its telemetry data. I don't see the current protocol as an import, because it is not the collector who initiates the request.

So I read the current protocol as: the application exports its telemetry data to the collector.

The exporter is asking the collector to collect its telemetry. What the collector then does with the telemetry is the collector’s business, including aggregation, dropping elements etc.

Can we introduce a new Export?

Yes. As mentioned, we can start with a concrete proposal of how this new protocol should look. For the moment we can keep it in the experimental directory until we better understand the requirements.

I think it'd be difficult to offer a new export request if we don't rename the existing one to "import". The new export would be for a collector requesting telemetry from an exporter.

In both cases, I feel we should specify the reply as a stream given that stream semantics are available. This is a gRPC interface specification for the request/reply.

Can you explain why the current protocol would benefit from having a response stream? Right now the application makes an export request with the data and just waits for one response to acknowledge that the data was received. I don't see a reason to have a stream there; maybe I am missing something.

I don't see the need to wait for a response when we can stream until the collector back-pressures by leveraging what’s built into gRPC.

I also see that streaming more accurately states the intent of the API and is stronger than the existing API doc comment encouraging long-lived connections.
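A rough sketch of what I mean (illustrative only): if the request side is also declared as a stream, the application can keep sending batches and rely on gRPC's built-in HTTP/2 flow control to slow it down when the collector falls behind, instead of waiting on a per-request acknowledgement.

service ImportMetricsService {
  // The application streams telemetry batches; gRPC/HTTP2 flow control
  // back-pressures the sender when the collector cannot keep up, and the
  // response stream can carry acknowledgements when needed.
  rpc Import(stream ImportMetricsServiceRequest) returns (stream ImportMetricsServiceResponse) {}
}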

@huntc
Author

huntc commented Aug 30, 2020

Also, to set the right expectations: it would be very hard to change the current protocol unless we prove something is completely wrong (which we may, but we need to have a good argument). Adding a new protocol to support your proposal is much easier.

I suspect that this is the real reason why my suggestions here are being resisted. ;-)

@tigrannajaryan
Member

Also, to set the right expectations: it would be very hard to change the current protocol unless we prove something is completely wrong (which we may, but we need to have a good argument).

@huntc To reinforce this: we cannot change the current protocol the way that you ask, because what you ask for is a breaking change. The protocol is already approved, declared stable, and signed off for the Traces portion. Any changes must be fully backward compatible, in the sense that any pair of nodes implementing the old and the new version of the protocol MUST interoperate correctly. Your suggestion does not meet this criterion. As Bogdan says, the best way to move forward is a complete proposal that takes into account OpenTelemetry's stance that breaking changes are no longer allowed in the Traces portion.

Such proposals are typically done as an OTEP, see https://github.com/open-telemetry/oteps

I suggest that we close this issue and continue the discussion in the scope of the OTEP if/when such an OTEP is created. If you haven't already done so, please also read the existing protocol OTEPs before making the proposal:
https://github.com/open-telemetry/oteps/blob/master/text/0035-opentelemetry-protocol.md
https://github.com/open-telemetry/oteps/blob/master/text/0099-otlp-http.md
https://github.com/open-telemetry/oteps/blob/master/text/0122-otlp-http-json.md

@huntc closed this as completed on Aug 31, 2020
@huntc
Author

huntc commented Aug 31, 2020

Closed then on the basis that these specifications have reached a final status.

May I suggest that we start versioning components as at least 1.0 when they reach this state? That way, expectations can be set up front. Thanks.

@tigrannajaryan
Member

I believe the 1.0 label is going to be applied when we make the official OpenTelemetry GA release. For now, the "Stable" label here https://github.com/open-telemetry/opentelemetry-proto#maturity-level serves the purpose and has a specific, defined meaning in OpenTelemetry.

@jsuereth
Contributor

Not to revive a dead thread, but @huntc would you want to work on a streaming gRPC protocol definition?

I'd like to explore a push variant that would accompany the pull variant you propose here.

Also, hi! Been a long time, fun to see us both in the same community again.

@tigrannajaryan
Member

I posted above, but just as a reminder: gRPC streams are not load balancer friendly out of the box, and LB-friendliness is a design goal for OTLP (see https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/protocol/design-goals.md).

I personally spent quite some time designing the OTLP transport and have several experimental implementations using gRPC streams (including ones which fix the LB-friendliness issue), which I rejected.

Of course I encourage everyone to explore the alternatives. Please feel free to do your own experiments; I will be glad to learn if you discover something new :-)

@jsuereth it may be useful to describe what needs you have that you think would be better fulfilled by a streaming protocol. The solution may not necessarily be gRPC-based.
