Add semantic conventions for errors #427
Conversation
cc: @kbrockhoff

| Attribute name | Type | Notes and examples | Required? |
| :------------- | :--------- | :----------------------------------------------------------- | :-------- |
| error.type | String | The type of the error. E.g. "Exception", "OSError" | Yes |
| error.message | String | The error message. E.g. `"Division by zero"`, `"Can't convert 'int' object to str implicitly"` | Yes |
| error.stack | Array\<String\> | A stack trace formatted as an array of strings. The stackframe at which the exception occurred should be the first entry. E.g. `["Exception in thread \"main\" java.lang.RuntimeException: Test exception", " at com.example.GenerateTrace.methodB(GenerateTrace.java:13)", " at com.example.GenerateTrace.methodA(GenerateTrace.java:9)", " at com.example.GenerateTrace.main(GenerateTrace.java:5)" ]` | No |
I like how simple this is, but it does not expose information about file:line etc. in a structured format for backends which may use it for grouping and/or filtering. And given the fact that most languages (at least JS, Java, Go, Python, Ruby & PHP) provide access to stack info in a structured/already parsed way, we will end up with:
- Languages/runtimes providing stack info in a structured way like `{file: string, line: number, func: string}`
- OpenTelemetry requiring encoding that stack info as an array of strings in a language-native human-readable format
- Backends having to parse those strings to extract the information they need

That is unneeded extra work both in OpenTelemetry and backends. So my 2 cents are that `error.stack` should have some internal structure like in this proposal. Perhaps by default it should be an array of JSON-encoded strings like this:

[
  "{file: \"GenerateTrace.java\", line: 13, func: \"com.example.GenerateTrace.methodB\"}",
  "{file: \"GenerateTrace.java\", line: 9, func: \"com.example.GenerateTrace.methodA\"}"
]

And as a fallback, if a string can't be parsed as JSON, it is a string in the human-readable format you proposed.
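To make the suggestion concrete, here is a minimal Java sketch of what emitting such JSON-encoded frame strings could look like. The helper name is hypothetical, and it uses plain `String.format` rather than a JSON library, so it does not handle escaping of special characters:

```java
// Hypothetical helper: renders each frame as a JSON-encoded string in the
// {file, line, func} shape proposed above. Not part of any spec or API.
static String[] framesAsJsonStrings(Throwable t) {
    StackTraceElement[] frames = t.getStackTrace();
    String[] out = new String[frames.length];
    for (int i = 0; i < frames.length; i++) {
        StackTraceElement f = frames[i];
        // getFileName() may be null and getLineNumber() may be negative
        // (e.g. for native frames); a real implementation would omit those fields.
        out[i] = String.format(
            "{\"file\": \"%s\", \"line\": %d, \"func\": \"%s.%s\"}",
            f.getFileName(), f.getLineNumber(), f.getClassName(), f.getMethodName());
    }
    return out;
}
```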
Defining an internal structure makes sense if languages have a mechanism (other than string parsing) to access the data. If folks can chime in with what fields are readily available for stackframes, perhaps we can come up with a set of required and optional attributes to collect.

For Ruby we have Thread::Backtrace::Location. The `line` and `file` are readily available, but `func` is not.
There are existing systems that solved this problem a long time ago. We don't need to reinvent the wheel. E.g. https://docs.sentry.io/development/sdk-dev/event-payloads/stacktrace/
I suppose that I have some reservations about the cost and complexity of collecting data at such a granular level. Most (if not all) of the relevant information is available in a stack trace formatted as an array of strings (or a newline-delimited one). If we added the language as an attribute, backends would have the information they need to parse the relevant data, and spare OpenTelemetry users from having to gather it. While this does increase the complexity for the backend, the trade-off is simplicity and improved performance for OpenTelemetry users.
If we do go the route of more granular data, I think we should have actual structures for collecting it, and possibly separate APIs. That might make OTEP-69 a better fit.
@mwear It would be helpful if you describe the complexity (ideally write some code), even for one language of your choice like Ruby. I insist that practically for most languages there is no significant difference between generating `at com.example.GenerateTrace.methodA(GenerateTrace.java:9)` and `{func: "com.example.GenerateTrace.methodA", file: "GenerateTrace.java", line: 9}`. Or the difference will be the same as between `sprintf` and `json_encode`. And it is not that hard to use `sprintf` to generate JSON if you are worried about micro-optimizations. Overall the requirement to use JSON is imposed by the OpenTelemetry spec - it is not specific to this use case. IMHO OpenTelemetry could be more flexible and let exporters decide what to do in this case instead of fitting data structures for protobuf usage. E.g. JSON or MsgPack handles it without any problems.
@yurishkuro all those error reporting services reinvent their own wheels to solve exactly the same task. And Sentry is not a particularly good example - it is a very old wheel with some legacy.

I would start with something very simple like:
- file - required
- line - required
- func - required
- column - optional, for JS

It is self-documented and covers 99% of use cases. I intentionally omit:
- language-specific things like module, class, namespace etc. - they can be stored in the func attribute and are not interesting enough to complicate things
- frame vars aka context-locals - AFAIK only supported by Python & Ruby, and they double/triple the stack size
- iOS and symbolication in general - too complex and too specific
- in_app, abs_path, raw_function - specific to Sentry
- source_version - it is not clear why every frame needs this information and how to obtain it
@vmihailenco - I think that stack frame addressing may be a bit more varied than that. In Java you need both a class and a function to fully qualify a function unit, and you won't have a column. In some cases a frame will not have a file or line number - I believe that an exception thrown by a system framework in .NET would be an example of the latter.
In Java I see StackTraceElement, which has:
- getMethodName
- getClassName, which can be used together with ClassUtil.getFilepath to obtain the filepath
- getLineNumber

That information can be nicely fitted into the `{func, file, line}` structure I suggested.

Regarding "In some cases a frame will not have a file or number": ack, but that is not a very common or useful case - debugging becomes very complicated without such basic info. But we could omit missing fields and use only `{func}` when file:line is not available.
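As a rough illustration of that mapping (the helper and the map-based frame representation are assumptions, not anything specified), a Java sketch could look like:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: map a StackTraceElement onto the proposed
// {func, file, line} structure, falling back to {func} only
// when file/line information is unavailable.
static Map<String, Object> toFrame(StackTraceElement e) {
    Map<String, Object> frame = new HashMap<>();
    frame.put("func", e.getClassName() + "." + e.getMethodName());
    if (e.getFileName() != null) {
        frame.put("file", e.getFileName());
    }
    if (e.getLineNumber() >= 0) {
        frame.put("line", e.getLineNumber());
    }
    return frame;
}
```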
That was actually just the first case I thought of, and that case is indeed useful: you may not go to the source to view the location, but a system framework frame is as useful to troubleshooting as an application source location. I believe a C/C++ stack trace won't have line numbers when debugging symbols have been stripped, either. I think that, to @yurishkuro's point, having a "simple" tuple is liable to back us into a corner, and there is some prior work that could inform a more extensible approach.
Well, the idea that there is some prior work with good ideas has been repeated several times already - what are those ideas, and how do they solve the C++-without-debugging-symbols use case?
AFAIK, in C++ you often have at least unwind information for exception handling, so you can get the stack in the form of a list of return addresses (e.g. using libunwind). With that, you record the base address of the module into which that address points, so you get something like `0xfff12345 (libfoobar.so+0xabc12)`. At the backend, you may have (detached) debugging symbols for libfoobar.so available, so you can now combine that information and symbolize the module+offset pair to get something like `0xfff12345 (libfoobar.so!MyClass::myFunction(int,long)+0xa1, MyClass.cpp:356)`.
You say you needed arrays for these conventions, but I can't see any advantages of the array over a newline-separated string.
I was (incorrectly) under the assumption that there were length limits for attribute values and that a collection would be treated differently, in that the length limits would apply to each entry rather than the value as a whole.
While it's true that OpenTelemetry doesn't specify length limits, backends and alternate SDK implementations might choose to do so. Having a stack trace stored as an array might make it easier to avoid truncation in this situation, so I still think the idea has some merit. Aside from that, I think a newline delimited string and an array of strings are roughly equivalent, and transforming between the two representations is fairly trivial.
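For what it's worth, the transformation mentioned above really is trivial; here is a sketch of both directions in Java (the helper names are made up):

```java
// Hypothetical conversion helpers between the two candidate representations.
static String toNewlineDelimited(String[] frames) {
    return String.join("\n", frames);
}

static String[] toFrameArray(String newlineDelimited) {
    return newlineDelimited.split("\\R"); // "\R" matches any line terminator
}
```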
A couple of attributes that will be important, and which are covered in open-telemetry/oteps#69, are a unique hash for the particular exception and another unique hash for the exception type and the line in the code where it was thrown. This is important for deduplication and aggregation.

Also, I still think it is important to support more kinds of errors besides application exceptions. For example, core dumps from IoT devices, which would get sent after the device reboots.
There also needs to be an attribute which contains specific input parameter values. This is supported by several backend systems. It helps in debugging exceptions caused by unexpected data inputs.
@kbrockhoff could you explain why the hashes can't be computed in the backend?
In some cases, the only way to effectively calculate the identity of an exception object is by its address in memory. The only alternative I can think of that would work on the backend would be if the file and line number of the entire nested exception chain, plus a full list of local variable values from the stack, were recorded.
Address of what in memory? That sounds deployment-dependent. At Uber we have an internal system that performs deduping of exceptions, all in the backend, by hashing the stacktrace.
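A sketch of what such backend deduping could look like in Java (the hash algorithm and the choice of fields fed into it are assumptions; a real system would also guard against cycles in the cause chain):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical dedup key: hash the exception types and frames of the cause chain.
static String stackHash(Throwable t) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    for (Throwable cause = t; cause != null; cause = cause.getCause()) {
        md.update(cause.getClass().getName().getBytes(StandardCharsets.UTF_8));
        for (StackTraceElement frame : cause.getStackTrace()) {
            md.update(frame.getClassName().getBytes(StandardCharsets.UTF_8));
            md.update(frame.getMethodName().getBytes(StandardCharsets.UTF_8));
            md.update(Integer.toString(frame.getLineNumber()).getBytes(StandardCharsets.UTF_8));
        }
    }
    // Render the digest as a hex string.
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString();
}
```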
Hashing of the stacktrace works fine for identifying multiple triggers of the same exception/event, but does not help in identifying multiple reports of the same invocation or event. For example, if auto-instrumentation of a method call records the exception, which is then also recorded by tracing instrumentation higher up in the call stack.
Hashing of the stacktrace could be important when information is dropped. For example, a Java stacktrace with nested causes can run into the hundreds of lines, so it's not uncommon to limit the data to the nearest so many frames and drop the rest; retaining a hash of the full exception would allow determining whether two now-identical stack traces started out that way.

On the other hand, creating a hash of the exception is going to require serializing the whole exception in order to hash it, which might not be needed if you are only going to send a few frames.
A string representation, whether it's an array of strings or newline-separated, does encode data in a way that can be troublesome to parse later on. In most languages I'm familiar with, the language offers access to the stack at the time of exception handling; it would seem that putting the stack frames into a normalized form would make sense at that point, rather than requiring a language-dependent parse stage on the backend.
@@ -0,0 +1,19 @@
# Semantic conventions for errors

This document defines semantic conventions for recording errors.
Reading the semantic conventions here (especially `error.type`), it seems that these conventions are specifically geared towards recording thrown exceptions rather than any kind of error (cf. #306 which, despite the title, contains quite a bit of interesting discussion about the issue).
Here is my attempt at extending this PR to support logs. I wonder whether this is a step in the right direction or not...

@vmihailenco I would encourage you to open a PR for your attempt as a third option to consider. I think between this, your approach and OTEP-69 we have reasonable options to choose from. We may end up with a combination of approaches, but having these here to drive the conversation forward is useful.
@@ -16,4 +16,5 @@ their types.

| :------------- | :--------- | :----------------------------------------------------------- | :-------- |
| error.type | String | The type of the error (its fully-qualified class name, if applicable). E.g. "java.net.ConnectException", "OSError" | Yes |
| error.message | String | The error message. E.g. `"Division by zero"`, `"Can't convert 'int' object to str implicitly"` | Yes |
| error.language | String | The language from which the error was generated. E.g. "Java", "Python", "Ruby". | Yes |
We already have recommendations for how to spell language names in the confusingly named resource semantic conventions for the telemetry library: https://github.com/open-telemetry/opentelemetry-specification/blob/3bbed37ba65df027721a6a3ab62f62dfc7fcf467/specification/data-resource-semantic-conventions.md#library
Language should be captured at the SDK level. Do we need it repeated on errors?
This is all SDK. Do you mean resource level?
Yes, I mean resource that identifies the SDK.
Good point @yurishkuro. I think we can assume that the backend can get this information from the resource identifying the application. We shouldn't have to repeat it on every error. I removed `error.language` for now. If it turns out that we need it, I can add it back in later.
There are cases like Jython, or C# calling into native code...
In fact, Visual Studio shows the language for each frame in a Stack trace, so you can get as complicated as you want here.
their types.

| Attribute name | Type | Notes and examples | Required? |
| :------------- | :--------- | :----------------------------------------------------------- | :-------- |
@mwear thanks for tagging this PR in open-telemetry/oteps#86.

The proposed fields are already part of the OpenTracing specification. Please refer to https://github.com/opentracing/specification/blob/master/semantic_conventions.md#log-fields-table

I like the discussion on error.stack. We should also add the tag below, from https://github.com/opentracing/specification/blob/master/semantic_conventions.md#span-tags-table:

| error | bool | true if and only if the application considers the operation represented by the Span to have failed |
If this were added, I'd rather call it `fatal_error`, `unrecoverable` or similar. It would still be an error that has occurred, even after being able to recover from it by retrying or using another fallback.
@tigrannajaryan suggested that we use `log.severity` to indicate how severe the exception is. I would prefer `log.severity` in this case.
Thanks @jmacd. Small correction: log data model uses SeverityNumber. Would be useful to keep the name consistent here too (and I am not sure what casing we want to use here - log data model does not mandate a particular casing).
I went ahead and wrote some ruby code for this proposal and #432 to see how they compare.

Example for #427:

```ruby
begin
  buggy_method # raises an exception when called
rescue => e
  span.add_event(name: 'error',
                 attributes: {
                   'error.type' => e.class,
                   'error.message' => e.message,
                   'error.stack' => e.backtrace
                 })
end
```

Example for #432:

```ruby
begin
  buggy_method # raises an exception when called
rescue => e
  span.add_event(name: 'error',
                 attributes: {
                   'log.type' => e.class,
                   'log.severity' => 'error',
                   'log.message' => e.message,
                   'frame.file' => e.backtrace_locations[0].absolute_path,
                   'frame.line' => e.backtrace_locations[0].lineno,
                   'frame.func' => e.backtrace_locations[0].label,
                   'frame.stack' => stack_to_json(e)
                 })
end

def stack_to_json(e)
  JSON.dump(
    e.backtrace_locations.map do |loc|
      { 'frame.file' => loc.absolute_path, 'frame.line' => loc.lineno, 'frame.func' => loc.label }
    end
  )
end
```

#427 has merits for its simplicity. It does not cover logs as #432 does, although it could be abstracted for that purpose. I like that #432 works generically for logs, with an error being a specific type of log. There is more work to transform the backtrace into JSON. I also noticed that

Both of these approaches seem viable and I'd be 👍 on either. It'd be interesting to see how this plays out in other languages.
I want to add that I like the schema in https://cloud.google.com/trace/docs/reference/v2/rest/v2/StackTrace that you pointed out. If we're going with a structured representation, that looks good.

The 50-line version (Cloud Trace Java) does not include causes or limit stack depth. The 150-line version (X-Ray) does include causes and limit stack depth. Which do you prefer? I think I would prefer to include causes; sometimes I believe they're necessary to categorize an error stack. We should compare these two functions with the builtin

The limit on stack depth is pretty important, as a complete stack in a common Java framework can easily add 10s of KB of annotation if we are not careful. I looked up the Go implementation of

That's optimized in a way not available to the user. I'd like to leave this task to the runtime. I'm still worried for the user, regardless of the implementation. Including a complete stack can be dangerous if, for some reason, suddenly every request encounters an error. Will the SDK need to implement limits on error stack rate? (I am reminded of https://www.sigops.org/s/conferences/sosp/2009/papers/glerum-sosp09.pdf)
Yeah would prefer to have causes (with a limit on the entire stack payload). That being said, it's true the difference in overhead for having cause vs not is pretty huge, so leaving it out at least in a first version that only had the handled exception would be pretty good bang for buck I think.
Yup, the complexity of that seems to indeed be trimming out common frames. If we include causes, this would be a nice optimization to copy into any description of the stack trace in the model, but also maybe not in an initial version. Luckily, once it's implemented once for a language it should be stable for a long time while they implement the myriads of framework instrumentation :)
API-wise, Go's https://golang.org/pkg/runtime/#CallersFrames We'd need to run a benchmark to compare the performance of that vs
Yeah, this is definitely an issue - to clarify, I guess this is more about including the stack trace in any form, structured or not, though structured would probably have a larger impact. I agree this should either be opt-out or even opt-in. I suspect sites where this could really be an issue, due to serving huge traffic, will have sophisticated setups like log correlation in place, and the importance of the stack trace in the span goes down. The sites that only look at tracing independently, though, can get a lot out of seeing the stack trace. Ideally the number of sites not using correlation goes down over time, but I think we have a long way to go for that.
I would not want to lose the cause chain. Adding to the X-Ray and GCP examples from above, here's an example from Glowroot that has both common frame trimming (which really helps reduce payload) and an overall frame limit:
I'm still trying to compare the CPU and memory allocation cost of recording structured stacks as compared to using a string encoding. The contention was that coding this structure by allocating a bunch of objects and then calling a protobuf encoder adds a lot of cost compared with sending a string form. This shifts the complexity and risk to the server side, and it relies on a stable format across time.

OpenJDK's printStackTrace writes:

and it's followed by a lengthy explanation of how to parse the output. I believe letting the server parse the stack is a better arrangement from a total operational cost standpoint.

I wonder if I looked at

I would put safety above cost in this discussion.
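For context on what that server-side parsing involves, here is a rough Java sketch (the `Frame` record and the regular expression are assumptions; a production parser would also need to handle `Caused by:`, `Suppressed:`, and `... N more` lines):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parsed-frame type, not part of any spec.
record Frame(String func, String file, Integer line) {}

class StackTextParser {
    // Matches the common OpenJDK frame format, e.g.
    //   at com.example.GenerateTrace.methodB(GenerateTrace.java:13)
    private static final Pattern FRAME =
        Pattern.compile("^\\s*at\\s+(.+)\\(([^:)]+)(?::(\\d+))?\\)$");

    static List<Frame> parse(String stackText) {
        List<Frame> frames = new ArrayList<>();
        for (String line : stackText.split("\\R")) {
            Matcher m = FRAME.matcher(line);
            if (m.matches()) {
                Integer lineNo = m.group(3) == null ? null : Integer.valueOf(m.group(3));
                frames.add(new Frame(m.group(1), m.group(2), lineNo));
            }
            // Non-matching lines (exception message, "(Native Method)"-only
            // variants, etc.) are simply skipped in this sketch.
        }
        return frames;
    }
}
```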
I agree safety is the most important - it's the main reason I want to use structs :) Even I appreciate the concern about the missing checks / features in our reference implementations. But I don't think that's a reason to assume we can't get it right - we're doing a good job of learning from prior art and going through the details. Serializing a stack isn't such a complex issue that this is untenable, and once we have it down right it should be forwards compatible through any changes in the runtime. I don't see a big reason to expect our code to be less safe than a runtime.

I spiked a change to marshall exceptions into the proto and compare with string: https://github.com/anuraaga/opentelemetry-java/blob/exception-serialize/exporters/otlp/src/jmh/java/io/opentelemetry/trace/ExceptionDescriptionBenchmark.java As with all benchmarks, there's always a chance it's missing something; happy to adjust with any feedback. FWIW, I tried a default

From the numbers, it seems that proto is a little faster, maybe due to the reduced amount of whitespace / words like

One thing I noticed with

Let me know if any other info can help here (though I can't commit to writing similar code in other languages).
We're thinking about "safety" differently.
To me, safety means not crashing or deadlocking the caller. I think your concern is more about "reliability". The safety issue in this case is that a malicious Throwable would cause an infinite loop or a stack overflow, and one of the reasons I prefer to use a builtin stack-to-string method is that we should expect the language runtime to be safer. We were looking at OpenJDK code, but another implementation of

There are many risks in calling out to a protocol-buffer implementation, but we're already factoring some of this risk in when we export data. Still, adding a bunch of memory allocations creates a risk of OOMing and otherwise destabilizing the application. Builtin stack-to-string methods generally are optimized in ways that the user does not have access to. For example, in Java

I sense that a lot of demand for this "feature" is essentially coming from vendors who are trying to simplify their services (I suspect there is a tiny amount of cost-shifting here too: why pay for something if you can ask your users to pay for it?)

Lastly, let's ask: is reliability a real problem when parsing a stack from the builtin stack-to-string function? I will argue that this concern is over-inflated. The stack trace format of most language runtimes is stable in time, precisely because it is meant to be parsed, and language implementors know that users are already parsing these stacks. Java 7 changed their stacktrace format in a backward-compatible way when they introduced Suppressed exceptions. It is a significant amount of work to implement these parsers, but I do not expect a great deal of ongoing maintenance will be needed. This functionality will be owned by the community and contributed to the opentelemetry collector. When problems are discovered, the parsers will be improved. Vendors and/or users can operate the otel-collector as a service for stacktrace parsing. I believe the total cost of maintaining stack parsers will be lower than the total cost of maintaining structured stack exporters in every language.

Note that extracting and parsing stacks from unstructured text is a more difficult task than parsing a stack from a structured field (where the length is delimited for you), but this is not what we're talking about. Note that users who are operating custom runtimes will be forced to output stacks in a well-known text format when annotating for OTel, given my proposal. I would recommend they re-use one of the supported formats.

Lastly, I can see that I'm arguing an uphill battle here. I would be willing to compromise with both solutions. Let users send stacks as strings, which gives us the safer, more-performant option. Let us also specify the structure of a stack and allow libraries to implement the structured representation too. Then we'll let users choose: the less-risky, less-expensive option that does require a collector? Or the more-risky, more-expensive option that does not require a collector?
The only thing I'll respond to here - the rest seems like mostly rehash, with an inappropriate cost-shifting allegation.

In a text format, changing the way that the text is output, with different text and whitespace, is a backwards-incompatible change. A parser may do fine with it, or may not, and unfortunately that's generally only known after the change has been made.

This seems fine to me. There was some interest in this with concern about CSV; I would still argue the structured version should be proto, like the rest of our structure. I think we've added our input as well as we can. Committee, can we get some steering? @carlosalberto @bogdandrutu @yurishkuro @SergeyKanzhelev
I don't see the point about cost shifting as inappropriate. One of the major techniques in reducing the cost of instrumentation is to defer work. It's cheaper to defer stacktrace parsing until it's needed.

Cost and safety are both important, and I still think the concern about reliability is over-inflated. When stacktrace parsing fails, the result is still a human-readable stacktrace in its natural representation. The user is still able to read their data; it's only an automated error monitoring system that will fail at the point stacktraces do not parse. System operators would of course be advised to test their error monitoring systems before deploying a major new language runtime release. I expect stacktrace formats to evolve, just not quickly with respect to OTel collector releases. When problems arise they'll be fixed long before responsible system operators are deploying those language runtimes in production, I would assert.

About the benchmarks: I am not a Java or JMH expert, but I want to be sure that you are not microbenchmarking only the CPU cost and that your benchmark factors in GC. I believe you can use the THROUGHPUT BenchmarkMode for this? The
I would assume structs are going to be way cheaper memory-wise, because they get to just point to all of the existing JVM-provided string constants that represent class/method names, and don't have to concatenate them all into a separate, very large string. I can review the benchmark if that would help the discussion.
Ah, didn't realize you were treating the free-form output and the automated monitoring / code-jump type features separately. Agree that there aren't really any compatibility concerns for the free-form output to a user. One thing to keep in mind is that a collector will generally be infrastructure, probably managed by a different team than the business logic. If a team wants to try out a new language that isn't even supported by their infrastructure, it's nice that this would still work as best as it can without requiring the infrastructure team to deploy an updated collector; it somewhat leaves the destiny of the language in the hands of the language itself, not the infrastructure.

Thanks, happy to tweak as needed. AFAIK, GC is being reflected (using something like

Looking at the GC stats, there seems to be some small difference, but over different runs I think I've seen both versions take more GC time than the other, and it seems like noise.

I should have linked to the code too, but it does at least intend to eliminate common frames; I didn't see anything out of the ordinary when comparing the output of the

https://github.com/anuraaga/opentelemetry-java/blob/exception-serialize/api/src/main/java/io/opentelemetry/trace/StackTrace.java#L25
I suggest supporting both text and structured formats. There may be languages where structured access to stack frames is not available and text is the only option. When structured access is available, I'd be surprised if it's not possible to write serialization into a structured format (in a pre-allocated buffer) that is just as efficient as the runtime's toString. Also, efficiency is an interesting topic given that we're talking about thrown exceptions; one could argue that it's not necessary to optimize the error conditions.

FWIW, the safety argument does not sound very convincing to me. There is plenty of prior art from libraries that capture very rich stack traces (e.g. https://github.com/getsentry/sentry#official-sentry-sdks). I don't think it's as hard of a problem as @jmacd is implying.

I am also very interested in having the ability to represent application state (like the local vars) for each frame. Plain text stack trace output won't be able to represent them, but a structured format can be further extended, if Frame is its own struct like in https://cloud.google.com/trace/docs/reference/v2/rest/v2/StackTrace or in Sentry Go.
Presumably the structured protocol would be a part of OTLP. I am concerned that the amount of work implied by supporting structured exceptions is large, and I would be much happier with this plan if I knew there were committed resources willing to add this support in the collector and all the languages. Since I don't think that commitment exists, can we agree to make a string-valued

@yurishkuro points to the solution: a hand-coded implementation that encodes structured data can be as safe and fast as the built-in stack-to-string method. That's a lot of work (more than writing a parser, probably 😀), and this has been my concern all along, that we're getting in the way of a GA release.
Agree that we should get

I'm quite interested in the topic of structured exceptions and am interested in helping drive it through; my spike contains much of the code change that would be required in proto + java, though it is by no means cleaned up yet. But I'm very hesitant to continue on this, as I don't see any reason it won't result in another wall of defensiveness. When even benchmarks can't produce a thought that "hey, maybe the performance issue isn't so clear after all" in the very end (it took no hand-coded encoding for proto to perform generally better than the string, and that's with buffer recycling for the string and not for the proto; that doesn't guarantee the benchmark is ideal, but it's saying something), I can't imagine working on this topic resulting in anything more than burnout... Will just go ahead and wait on the sidelines :)
@anuraaga I'm sorry that you see this as a wall of defensiveness. It appears to be just me. If we are not going to block GA on this feature request, I'm feeling less defensive, although I'm still not convinced that a structured representation improves life for the user (structured stacks are less readable, meaning the user needs more tooling to study their data).

I raised a number of points related to performance, because it seems like there is a common argument that a structured approach is going to be faster or at least not slower. I've pointed out that the Java
I reviewed the benchmark. The struct-based benchmark does have the same safety check as

And the benchmark does show the struct-based approach being faster (I ran it locally to validate). [Note: the JMH results do not show GC time normalized per operation, so you need to do some division to compare those numbers.]
Thank you @trask. I don't think it's worth continuing to debate the benchmark. It's curious that the text version is allocating more memory than the structured version, but I'm not qualified to scrutinize them any further.
See my comment above:
In the end, the structured form is copied into a byte array, and the text form is copied into a string. Each of the underlying strings should be copied once, so that doesn't sound like the explanation I'm looking for. Either you serialize the underlying strings (copying them once) with protobuf encoding bits in between, or you serialize the underlying strings (copying them once) with syntactic markers in between. Anyway, I shouldn't be discussing Java performance.
and then the text form (string) still has to be copied (again) into a byte array (or streamed directly over the wire, but you could do that with the structured form also)
The point was that each string should be copied once in both cases. In the benchmark, the string representation is being copied twice, once into a

My guess is that the difference in memory allocation rate has something not obvious to do with
Is it OK to rename this PR to "Add semantic conventions for exceptions"? Errors are a more generic topic, and we have several PRs & issues with similar names.
I created another, updated version of this proposal in #697, PTAL.
Closing. This work was addressed in #697.
Having a way to record errors and annotate spans with that information is essential functionality that is currently missing from OpenTelemetry.
There is currently OTEP-69 that proposes typed events and includes structures for recording errors.
Since it was authored, array-value attributes have been added to the specification. This addition makes it possible to record errors using the existing event type by representing a stack trace as an array of strings, where each entry corresponds to a stackframe.
This PR doesn't intend to compete with OTEP-69, but offers another approach to be considered. Hopefully, between this proposal and OTEP-69 we can begin to drive a discussion on how to record errors forward.
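For example, a minimal Java sketch of producing such an attribute value from a caught exception (the helper is illustrative, not part of the proposed conventions):

```java
import java.util.Arrays;

// Illustrative only: build the proposed error.stack value, one string per stackframe.
static String[] errorStack(Throwable t) {
    return Arrays.stream(t.getStackTrace())
        // StackTraceElement.toString() yields e.g.
        // "com.example.GenerateTrace.methodB(GenerateTrace.java:13)"
        .map(StackTraceElement::toString)
        .toArray(String[]::new);
}
```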