A72: OpenTelemetry Tracing #389

Merged Jun 20, 2024 (36 commits)
Changes from 2 commits
Commits
1d5b48c: start (YifeiZhuang, Jul 20, 2023)
67e34e8: rename to A67 (YifeiZhuang, Aug 22, 2023)
392a533: Merge remote-tracking branch 'upstream/master' into ot-tracing (YifeiZhuang, Aug 22, 2023)
7f668a7: rename A72 (YifeiZhuang, Aug 22, 2023)
0041641: minor fixes (YifeiZhuang, Aug 24, 2023)
7a6e596: add email thread discussion, and text changes from ejona (YifeiZhuang, Oct 11, 2023)
6bc027e: Apply suggestions from code review (YifeiZhuang, Dec 19, 2023)
5bb3823: Merge branch 'ot-tracing' of https://github.com/YifeiZhuang/proposal … (YifeiZhuang, Dec 19, 2023)
5dbbc13: minor fix (YifeiZhuang, Jan 4, 2024)
7b48df4: Meeting AI 01/08/2024 (YifeiZhuang, Jan 9, 2024)
8631cee: Fill up C++ sections (yashykt, Jan 17, 2024)
528dd82: Reviewer comments (yashykt, Jan 18, 2024)
cf24140: Merge pull request #2 from yashykt/ot-tracing (YifeiZhuang, Jan 18, 2024)
517f340: trace info, custom binary header validation (YifeiZhuang, Jan 19, 2024)
8a0f8f7: re-structure the content (YifeiZhuang, Mar 19, 2024)
942b9b6: separate c++ comment (YifeiZhuang, Mar 19, 2024)
eea0571: minor change (YifeiZhuang, Mar 19, 2024)
a2c420e: rephrase (YifeiZhuang, Mar 19, 2024)
a39af0d: Merge branch 'master' of https://github.com/grpc/proposal into ot-tra… (YifeiZhuang, Mar 20, 2024)
50565d6: add go API. Minor fix comment and reference (YifeiZhuang, Mar 20, 2024)
28d9f51: re-structure (YifeiZhuang, Mar 21, 2024)
2d89763: improve languages, fix tracing info uncompressed message (YifeiZhuang, Mar 21, 2024)
032beee: fix compression and message sizes, minor language fixes (YifeiZhuang, Mar 22, 2024)
0275820: decision on reporting compressed/uncompressed message size, and seq id (YifeiZhuang, Mar 27, 2024)
3e26373: clarify the second message for compressed/uncompressed message name, … (YifeiZhuang, Apr 5, 2024)
3885cd8: scheme for the name of uncompressed message size outbound, and compre… (YifeiZhuang, Apr 10, 2024)
c866de5: Add Python draft API (XuanWang-Amos, Apr 10, 2024)
a346c24: Fix format issue (XuanWang-Amos, Apr 10, 2024)
66b40d4: change go/java api, remove mesage type, rename event attributs (YifeiZhuang, Jun 4, 2024)
41c2ed2: Merge branch 'master' of https://github.com/grpc/proposal into ot-tra… (YifeiZhuang, Jun 4, 2024)
68443a0: Merge branch 'ot-tracing' of https://github.com/YifeiZhuang/proposal … (YifeiZhuang, Jun 4, 2024)
a199618: implementation may not know it needs compression whe reporting the fi… (YifeiZhuang, Jun 5, 2024)
8afa317: improve compression message name and add example (YifeiZhuang, Jun 17, 2024)
78f3d71: fix typo (YifeiZhuang, Jun 17, 2024)
9a219d3: refer to event name, update oc/ot field name mapping, and fix typo (YifeiZhuang, Jun 18, 2024)
06ec130: move use of propagator before c++ example section (YifeiZhuang, Jun 20, 2024)
126 changes: 62 additions & 64 deletions A72-open-telemetry-tracing.md
@@ -48,25 +48,8 @@ tracing instrumentation.
We will add tracing functions to the grpc-open-telemetry plugin, along with OpenTelemetry
metrics [gRFC A66][A66]. Internally, the tracing functionality will be implemented
using existing gRPC infrastructure such as interceptors and stream tracers.

-#### Propagator Wire Format
-gRPC OpenTelemetry will use the existing OpenTelemetry propagators API for context propagation
-by encoding them in metadata, for the following benefits:
-1. Full integration with OpenTelemetry APIs that is easier for users to reason about.
-2. Make it possible to plugin other propagators that the community supports.
-3. Flexible API that allows clean and simple migration paths to a different propagator.
-
-In order for the propagator to perform injecting and extracting spanContext value
-from the carrier, which is the Metadata in gRPC, languages will
-implement Getter and Setter corresponding to the propagator type.
-Currently, OpenTelemetry propagator API only supports `TextMapPropagator`,
-that is to send string key/value pairs between the client and server.
-Therefore, adding Getter and Setter is to implement the TextMap carrier interface:
-`TextMapCarrier` (For C++/Go), or `TextMapGetter`/`TextMapSetter` (For Java). (see
-pseudocode in section [Migrate to OpenTelemetry](#migrate-to-opentelemetry--cross-process-networking-concerns)).
-
-The APIs to enable and configure OpenTelemetry tracing are different among
-languages due to different underlying infrastructures.
+The APIs to enable and configure OpenTelemetry tracing are different among
+languages due to different underlying infrastructures.

### Java
In Java, it will be part of global interceptors, so that the interceptors are
@@ -152,7 +135,7 @@ type TraceOptions struct {
	TraceProvider trace.TracerProvider
}

-// DialOption returns a dial option which enables OpenCensus instrumentation
+// DialOption returns a dial option which enables OpenTelemetry instrumentation
// code for a grpc.ClientConn.
//
// Client applications interested in instrumenting their grpc.ClientConn should
@@ -164,8 +147,7 @@ type TraceOptions struct {
// TraceOption. Client side has retries, so a Unary and Streaming Interceptor are
// registered to handle per RPC traces, and a Stats Handler is registered to handle
// per RPC attempt trace. These three components registered work together in
-// conjunction, and do not work standalone. It is not supported to use this
-// alongside another stats handler dial option.
+// conjunction, and do not work standalone.
func DialOption(to TraceOptions) grpc.DialOption {}

// ServerOption returns a server option which enables OpenTelemetry
@@ -178,14 +160,13 @@ func DialOption(to TraceOptions) grpc.DialOption {}
// Using this option will always lead to instrumentation, however in order to
// use the data a SpanExporter must be registered with the TraceProvider option.
// Server side does not have retries, so a registered Stats Handler is the only
-// option that is returned. It is not supported to use this alongside another
-// stats handler server option.
+// option that is returned.
func ServerOption(to TraceOptions) grpc.ServerOption {}
```

## Tracing Information
RPCs on the client side may undergo retry attempts, whereas on the server side,
-they do not. gRPC records both per-call tracing details (at the parent span)
+they do not. gRPC records both per-call tracing details (on the parent span)
and per-attempt tracing details (on the attempt span) on the client side.
On the server side, there are only per-call traces. With the new OpenTelemetry
plugin we will produce the following tracing information during an RPC lifecycle:
@@ -205,57 +186,75 @@ On attempt span:
boolean value indicating whether the stream is undergoing a transparent retry.
* If the RPC experienced load balancer pick delay, add an Event with the name
"Delayed LB pick complete" upon creation of the stream on the transport.
-* When the application sends an outbound message to the transport, add an Event
+* When the application sends an outbound message to the transport, add Event(s)
+(whether this is a single event or two separate events for the compressed and
+uncompressed message sizes depends on the implementation; the same applies below)
with name "Outbound message sent" and the following attributes:

Member:
I think we talked about this earlier, but "sent" in "Outbound message sent" is confusing. Can we drop it?
For C++, this would otherwise mean that we would first see -
"Outbound message sent" and then "Outbound message compressed"

Contributor Author:
> "Outbound message sent" and then "Outbound message compressed"

It looks fine. "sent" means this trace happens somewhere between the application and the wire, and we made that vague and general (a decision made in a discussion). Are you confused because compression should not happen after "sent"?

* key `message.event.type` with string value "SENT".
-* key `message.message.id` with integer value of the seq no. The seq no. is a sequence
-of integer numbers starting from 0 to identify sent messages within the stream, the same below.
+* key `message.message.id` with integer value of the seq no. The seq no. indicates
+the order of the sent messages on the attempt (i.e., it starts at 0 and is
+incremented by 1 for each message sent), the same below.

Member:
how was the namespacing for "message.message.id" decided?

Contributor Author:
It was from the Otel java opencensus shim. Languages can be different.
I think whenever possible, we can make the name generated from the shim and the one generated directly from otel be consistent. This is not happening because the naming in our census needs improvements. And I am fine with that. @markdroth @ejona86

(https://github.com/open-telemetry/opentelemetry-java/blob/main/opencensus-shim/src/main/java/io/opentelemetry/opencensusshim/SpanConverter.java#L24-L27)

Member:
I don't think we have to copy the shim's behavior.

Contributor Author:
I proposed new names. Let's move forward and fix/vote in API stabilization.

* key `message.event.size.uncompressed` with integer value of uncompressed
message size. The size is the total attempt message bytes without encryption,
not including grpc or transport framing bytes, the same below.
-* If any compression, key `message.event.size.compressed` with integer value
-of compressed message size.
-* When an inbound message has been received from the transport, add an Event
-with name "Inbound message read" and the following attributes:
-* key `message.event.type` with String value "RECEIVED".
-* key `message.message.id` with integer value of the seq no.
-* key `message.event.size.compressed` with integer value of wire message size.
-* If any compression, key `message.event.size.uncompressed` with integer value
-of uncompressed message size.
+* If compression is needed, add key `message.event.size.compressed` with integer
+value of compressed message size.
+* When an inbound message has been received from the wire, add Event(s) with name
+"Inbound message read" and the following attributes:
+* key `message.event.type` with string value "RECEIVED".
+* key `message.message.id` with integer value of the seq no. The seq no. indicates
+the order of the received messages on the attempt (i.e., it starts at 0 and is
+incremented by 1 for each message received), the same below.
+* key `message.event.size.compressed` with integer value of wire message size.
+* If the message needs decompression, add key `message.event.size.uncompressed`
+with integer value of uncompressed message size.
* When the stream is closed, set RPC status and end the attempt span.
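
The outbound-message bullets above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: `Event` and `attemptRecorder` are hypothetical stand-ins for an OpenTelemetry span, the sketch models the single-event variant, and only the event name and attribute keys are taken from the text.

```go
package main

// Event is a hypothetical stand-in for a span event; a real plugin would
// add the event to an OpenTelemetry span instead.
type Event struct {
	Name       string
	Attributes map[string]any
}

// attemptRecorder is a hypothetical holder for per-attempt state: the next
// sequence number for sent messages and the events recorded so far.
type attemptRecorder struct {
	nextSentSeq int64
	Events      []Event
}

// OutboundMessage records one "Outbound message sent" event.
// A negative compressedSize means the message was not compressed, so the
// compressed-size attribute is left out.
func (r *attemptRecorder) OutboundMessage(uncompressedSize, compressedSize int64) {
	attrs := map[string]any{
		"message.event.type":              "SENT",
		"message.message.id":              r.nextSentSeq, // starts at 0
		"message.event.size.uncompressed": uncompressedSize,
	}
	if compressedSize >= 0 {
		attrs["message.event.size.compressed"] = compressedSize
	}
	r.nextSentSeq++ // incremented by 1 for each message sent
	r.Events = append(r.Events, Event{Name: "Outbound message sent", Attributes: attrs})
}
```

Calling `OutboundMessage(1024, 512)` and then `OutboundMessage(16, -1)` on a fresh recorder yields two events with seq no. 0 and 1, and only the first carries `message.event.size.compressed`.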

At the server:
-* When the application sends an outbound message to the transport, add an Event
+* When the application sends an outbound message to the transport, add Event(s)
with name "Outbound message sent" and the following attributes:
* key `message.event.type` with string value "SENT".
* key `message.message.id` with integer value of the seq no.
* key `message.event.size.uncompressed` with integer value of uncompressed
message size.
-* If any compression, key `message.event.size.compressed` with integer value
-of compressed message size.
-* When an inbound message has been read from the transport, add an Event
-with name "Inbound message read" and the following attributes:
+* If compression is needed, add key `message.event.size.compressed` with integer
+value of compressed message size.
+* When an inbound message has been received from the wire, add Event(s) with name
+"Inbound message read" and the following attributes:
* key `message.event.type` with string value "RECEIVED".
* key `message.message.id` with integer value of the seq no.
* key `message.event.size.compressed` with integer value of wire message size.
-* If any compression, key `message.event.size.uncompressed` with integer value
-of uncompressed message size.
+* If the message needs decompression, add key `message.event.size.uncompressed`
+with integer value of uncompressed message size.
* When the stream is closed, set the RPC status and end the span.
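
For inbound messages the rule is mirrored: the wire (compressed) size is always reported, and the uncompressed size only when the message had to be decompressed. A hedged Go sketch of that rule, where `inboundAttrs` and its parameters are hypothetical names and only the attribute keys come from the list above:

```go
package main

// inboundAttrs builds the attributes of one "Inbound message read" event.
// seq is the 0-based order of the received message, wireSize is the size as
// read from the wire, and uncompressedSize is reported only when the message
// needed decompression. The function and parameter names are hypothetical.
func inboundAttrs(seq, wireSize, uncompressedSize int64, decompressed bool) map[string]any {
	attrs := map[string]any{
		"message.event.type":            "RECEIVED",
		"message.message.id":            seq,
		"message.event.size.compressed": wireSize,
	}
	if decompressed {
		attrs["message.event.size.uncompressed"] = uncompressedSize
	}
	return attrs
}
```

An implementation that reports the two sizes as separate events would split this map across two events with the same seq no.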

### Limitations
-Note that C++ is missing the seq no. information due to lack of transport support.
-While it's not critical, we can include these information if users request it in the future.
+The timestamps on the Events that report compressed/uncompressed message sizes
+are neither accurate nor useful as absolute times; they only give a relative order
+with respect to other Events.
+We can tighten the timing in the future if users find this information critical.

-Java has an issue of reporting decompressed message size upon receiving messages,
-as a workaround, on the client parent span and server span:
-* When the uncompressed size of some inbound data is revealed, add an Event
-with name "Inbound message read" and the following attributes:
-* key `message.event.type` with string value "RECEIVED".
-* key `message.message.id` with integer value of the seq no.
-* key `message.event.size.uncompressed` with integer value
-of uncompressed message size.
+Java has an open issue with reporting the uncompressed message size when a message
+is received: the size only becomes known later, during deserialization. Therefore, on
+the client, Java reports the uncompressed size of incoming messages only on the parent
+span, not on the attempt span.

-## Migrate from OpenCensus to OpenTelemetry
+## Propagator Wire Format
+gRPC OpenTelemetry will use the existing OpenTelemetry propagators API for context propagation,
+encoding the context in metadata, for the following benefits:
+1. Full integration with OpenTelemetry APIs that is easier for users to reason about.
+2. Make it possible to plug in other propagators that the community supports.
+3. Flexible API that allows clean and simple migration paths to a different propagator.
+
We will have OpenTelemetry propagator APIs for context propagation.
+For the propagator to inject the spanContext value into, and extract it from,
+the carrier (the Metadata in gRPC), each language will
+implement a Getter and Setter corresponding to the propagator type.
+Currently, the OpenTelemetry propagator API only supports `TextMapPropagator`,
+which sends string key/value pairs between the client and server.
+Therefore, adding the Getter and Setter means implementing the TextMap carrier interface:
+`TextMapCarrier` (For C++/Go), or `TextMapGetter`/`TextMapSetter` (For Java). (see
+pseudocode in section [Migration to OpenTelemetry: Cross-process Networking Concerns](#migration-to-opentelemetry--cross-process-networking-concerns)).
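
To make the carrier idea concrete, here is a minimal Go sketch of a metadata-backed TextMap carrier. It is an illustration under assumptions, not the plugin's code: `textMapCarrier` redeclares the method set of OpenTelemetry Go's `propagation.TextMapCarrier` (`Get`, `Set`, `Keys`) so the example compiles without third-party modules, and `metadataCarrier` is a hypothetical type modeled on gRPC-Go's metadata shape (a map from lower-case keys to string slices).

```go
package main

import (
	"sort"
	"strings"
)

// textMapCarrier mirrors the method set of OpenTelemetry Go's
// propagation.TextMapCarrier; it is redeclared locally for this sketch.
type textMapCarrier interface {
	Get(key string) string
	Set(key, value string)
	Keys() []string
}

// metadataCarrier is a hypothetical adapter over gRPC-style metadata:
// keys are stored in lower case and each key may hold several values.
type metadataCarrier map[string][]string

// Get returns the first value stored for key, or "" if there is none.
func (m metadataCarrier) Get(key string) string {
	vals := m[strings.ToLower(key)]
	if len(vals) == 0 {
		return ""
	}
	return vals[0]
}

// Set replaces any existing values for key with the single given value.
func (m metadataCarrier) Set(key, value string) {
	m[strings.ToLower(key)] = []string{value}
}

// Keys lists the stored keys in sorted order.
func (m metadataCarrier) Keys() []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// Compile-time check that the adapter satisfies the carrier interface.
var _ textMapCarrier = metadataCarrier{}
```

With such an adapter, a propagator's inject step on the client amounts to calls like `md.Set("traceparent", value)`, and the extract step on the server reads the same key back with `md.Get("traceparent")`; the header name here is only an example of a TextMap key.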

+## Migration from OpenCensus to OpenTelemetry

### gRPC OpenCensus API
The existing gRPC OpenCensus tracing APIs in grpc-census plugin are different between
@@ -276,7 +275,7 @@ where during migration one will use OpenCensus and the other will use OpenTeleme

Here are the suggested solutions for both use cases.

-### Migrate to OpenTelemetry: Cross-process Networking Concerns
+### Migration to OpenTelemetry: Cross-process Networking Concerns
When users first introduce gRPC OpenTelemetry, for the time window when the
gRPC client and server have mixed plugins of OpenTelemetry and OpenCensus,
spanContext cannot propagate directly due to the different header names and wire formats.
@@ -295,17 +294,16 @@ different from the binary header that gRPC currently uses. The future roadmap
to support binary propagators at OpenTelemetry is unclear. So, gRPC will use
propagator API in TextMap format with an optimization path (Go and Java) to work
around the lack of binary propagator API to support `grpc-trace-bin`. In fact,
-TextMap propagator does not show visible performance impact for C++, which is
-the most sensitive language to performance, based on internal micro benchmarking.
-Therefore, gRPC will only support propagating `grpc-trace-bin` in TextMap propagator.
-Only one `grpc-trace-bin` header will be sent for a single RPC as long as only one of
-OpenTelemetry or OpenCensus is enabled for the channel.
+TextMap propagator is a viable alternative to the existing binary format in gRPC
+in terms of performance, based on internal C++ micro benchmarking of the W3C TextMap
+propagator. If this poses a performance problem for users, we can consider
+implementing an alternative API in C++, see [Rationale](#rationale).
+Only one `grpc-trace-bin` header will be sent for a single RPC as long as only
+one of OpenTelemetry or OpenCensus is enabled for the channel.
A `grpc-trace-bin` formatter implementation for OpenTelemetry is
needed in each language, which can be similar to the OpenCensus implementation.
Go already has community support for that.
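
As a rough idea of what such a `grpc-trace-bin` formatter does, the Go sketch below encodes and decodes a span context. The byte layout is an assumption taken from the OpenCensus binary trace-context encoding (a version byte, then tagged trace ID, span ID, and trace options fields, 29 bytes in total); the text above does not spell it out. `spanContext`, `encodeTraceBin`, and `decodeTraceBin` are hypothetical names, and the decoder is stricter than a production one would need to be.

```go
package main

import "errors"

// spanContext is a hypothetical, minimal span context for this sketch.
type spanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Flags   byte // trace options, e.g. bit 0 = sampled
}

// encodeTraceBin serializes sc using the assumed OpenCensus binary layout:
// [0]=version 0, [1]=field 0 + 16-byte trace ID, [18]=field 1 + 8-byte span ID,
// [27]=field 2 + 1-byte trace options.
func encodeTraceBin(sc spanContext) []byte {
	b := make([]byte, 29)
	b[0] = 0 // version
	b[1] = 0 // trace ID field ID
	copy(b[2:18], sc.TraceID[:])
	b[18] = 1 // span ID field ID
	copy(b[19:27], sc.SpanID[:])
	b[27] = 2 // trace options field ID
	b[28] = sc.Flags
	return b
}

// decodeTraceBin parses a value produced by encodeTraceBin and rejects
// anything that does not match the assumed layout exactly.
func decodeTraceBin(b []byte) (spanContext, error) {
	var sc spanContext
	if len(b) != 29 || b[0] != 0 || b[1] != 0 || b[18] != 1 || b[27] != 2 {
		return sc, errors.New("malformed grpc-trace-bin value")
	}
	copy(sc.TraceID[:], b[2:18])
	copy(sc.SpanID[:], b[19:27])
	sc.Flags = b[28]
	return sc, nil
}
```

A round trip through `encodeTraceBin` and `decodeTraceBin` returns the original span context, which is the property a per-language formatter needs so that OpenCensus and OpenTelemetry peers can read each other's header.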



#### GrpcTraceBinPropagator and TextMapGetter/Setter in Java/Go
The pseudocode below demonstrates `GrpcTraceBinPropagator` and the corresponding
gRPC Getter/Setter with an optimization path.
@@ -555,7 +553,7 @@ a new propagator. An example migration path can be:
2. Configure the client with the desired new propagators and to drop the old propagator.
3. Make the server only accept the new propagators and complete the migration.

-### Migrate to OpenTelemetry: In Binary
+### Migration to OpenTelemetry: In Binary
The OpenCensus [shim](https://github.com/open-telemetry/opentelemetry-java/tree/main/opencensus-shim)
(currently available in Java, Go, Python) allows binaries that have a mix of
OpenTelemetry and OpenCensus dependencies to export trace spans from both