[Schema] Converting values that cause conflicts #3497
We discussed this in the Semantic Convention WG. Suggestions from there:
Specifically, if you have the following attributes:
If you have a
That is, attribute1 does NOT overwrite attribute2, regardless of how the attributes are ordered. This refers specifically to the generated OTLP. Additionally, you could leave |
Resolves: open-telemetry#3497

Alternates Considered
=====================
Instead of requiring overwriting of the input data, we could say that the transformation SHOULD be aborted if a conflict is detected, or that the transformation SHOULD be ignored if a conflict is detected. The chosen approach (overwriting) seems to be the best balance between maintaining a reasonable outcome (keep the more important data, discard the less important data) and minimal implementation complexity (aborting and reporting errors is more complicated and more costly).

Why "SHOULD" and not "MUST"?
============================
There are significant tradeoffs involved in what happens in the schema transform, and we believe there may be valid use cases where, after careful consideration, a different approach may be taken (e.g. reject and surface the problem to the end user). We do not want to prohibit these other approaches. The goal of this change is merely to have a reasonable default interpretation of the schema transformation rules.
What exactly does this mean? Do you suggest that the input attributes are preserved and the SchemaURL is not updated? What happens with transforms that were already applied? Do you have to roll them back? This is very expensive to implement. I don't think this is the best approach. I submitted a PR where I recommend overwriting. IMO this is the best default approach. Rejecting is complex and expensive. It requires performing the transformation on a copy of the data, since it may be aborted at any time, or being able to roll back already-applied changes. This is very expensive unless you have some very efficient copy-on-write data structure or something like that. I don't think we should do it. Overwriting is much simpler. You just apply it on the original data, knowing that there is never a need to abort the changes. IMO, this should be the philosophy of all schema transforms (existing or future proposals). I left the door open for implementations that want to carry the burden of much more complex approaches by using a SHOULD clause instead of MUST. |
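To make the overwrite rule concrete, here is a minimal sketch in Go. It assumes attributes live in a flat `map[string]any`; the helper is hypothetical, not the collector's actual pdata API.

```go
package main

import "fmt"

// renameOverwrite applies a schema "rename" transform in place, following the
// overwrite rule: if the target name already exists, its old value is simply
// discarded. Because the change is applied directly to the original data,
// there is never a need to abort or roll back.
func renameOverwrite(attrs map[string]any, from, to string) (conflict bool) {
	v, ok := attrs[from]
	if !ok {
		return false // nothing to rename
	}
	_, conflict = attrs[to] // a pre-existing value at the target name is a conflict
	attrs[to] = v           // overwrite unconditionally
	delete(attrs, from)
	return conflict
}

func main() {
	attrs := map[string]any{"host": "spool", "host.name": "spool.example.com"}
	conflict := renameOverwrite(attrs, "host", "host.name")
	fmt.Println(attrs, conflict) // map[host.name:spool] true
}
```

Because the mutation is unconditional, the transform can run directly on the original data with no copy and no abort path, which is exactly the simplicity argument above.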
I think this is a bad outcome. The resulting data complies with neither the old schema nor the new schema. What do we set the SchemaURL to in this case? |
We can remove the schema URL from invalid telemetry, or add a way to signal that we could not apply the transformation on given data. I think having a way to signal this is essential for this and other similar problems, and it should be a prerequisite to removing the moratorium on transformations. |
To summarize some options (there are probably more possibilities):

**Overwrite**
Pros
Cons

**Ignore, Skip**
Pros
Cons

**Preserve and Overwrite**
Before overwriting, preserve the value of the conflicting attribute in some other attribute (this can be a bag of such attributes, or it can follow some renaming rules).
Pros
Cons

[UPDATE 2023-05-17] A possible way to preserve the conflicting attribute is to rename it, appending a suffix equal to the source schema version number. For example, when converting the data from 1.0.0 to 1.1.0 we would append the suffix `_1.0.0`. Given the input:

```json
{
  "attributes": { "host": "spool", "host.name": "spool.example.com" }
}
```

the output data will become:

```json
{
  "attributes": { "host.name": "spool", "host.name_1.0.0": "spool.example.com" }
}
```

This approach allows applying multiple version transformations while preserving all conflicting attributes. It also captures, in a human-readable form, which "previous" version of the schema the original data was at when the conflict was detected, which can be a useful hint for understanding the supposed semantics of the original data. Another possible way to preserve the conflicting attributes would be to put them all as sub-keys under some well-known attribute; however, it is currently not allowed to have nested attributes (see #376).

**Detect and Reject/Rollback**
Pros
Cons |
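As a concrete illustration of "Preserve and Overwrite", a minimal Go sketch under the same assumptions as before (flat attribute map, hypothetical helper name):

```go
package main

import "fmt"

// renamePreserving implements the "Preserve and Overwrite" option: before
// overwriting a conflicting target attribute, its old value is kept under a
// name suffixed with the source schema version, so no data is lost while the
// output still complies with the target schema.
func renamePreserving(attrs map[string]any, from, to, fromVersion string) {
	v, ok := attrs[from]
	if !ok {
		return // nothing to rename
	}
	if old, conflict := attrs[to]; conflict {
		// Preserve the conflicting value, e.g. "host.name" -> "host.name_1.0.0".
		attrs[to+"_"+fromVersion] = old
	}
	attrs[to] = v
	delete(attrs, from)
}

func main() {
	attrs := map[string]any{"host": "spool", "host.name": "spool.example.com"}
	renamePreserving(attrs, "host", "host.name", "1.0.0")
	fmt.Println(attrs) // map[host.name:spool host.name_1.0.0:spool.example.com]
}
```

Applied to the example above, `renamePreserving(attrs, "host", "host.name", "1.0.0")` produces exactly the output shown.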
I think that's worse than overwriting a non-compliant attribute. To me, the offender who incorrectly uses a custom attribute name should be punished; instead we opt for collateral damage and lose the information about the schema :-) |
An application developer (e.g. an early adopter) who has added an attribute that is now used in their alerts, dashboards and business reports should have all of that broken because some automation in their vendor's ingestion pipeline decided the schema is more important? |
A combination of Overwrite+Detect is also possible. Detect that a conflicting attribute exists, rename it to something that preserves it and then overwrite. This way there is no loss of data, but we still forcefully make the data compliant with the target schema version. |
Renaming an attribute set by the application for the sake of compliance with the spec, and breaking their tooling in the process, does not look reasonable. |
If some important UX depends on compliance with the schema, such users will suffer from not getting that experience and will go and fix it themselves. |
Yes, I think so. The spec clearly says how to use custom attributes, and if you follow the rules nothing like that can happen (I don't see what it has to do with being an early adopter). If they don't follow the rules, there are consequences. Obviously, this is just one opinion; if we don't find one good approach that we all like, then perhaps this needs to be a configurable option. |
Also, a reminder to make sure we are aware of the context of the discussion: the data loss is only applicable to use-cases where the data is actually transformed before storing, e.g. in a Collector transformer processor or equivalent "before-storage" transformer. Schemas also can be used for query-time transformations where there is no data loss. The "overwrite" rule merely defines how the queries are rewritten. |
Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else. |
It does not tell me I cannot add |
and as an early adopter I'm free to follow the advice here "The name may be generally applicable to applications in the industry. In that case consider submitting a proposal to this specification to add a new name to the semantic conventions, and if necessary also to add a new namespace." and create |
I think you are right; that page is not well written. I know that the intent was precisely to avoid conflicts between custom attributes and Otel names. We need to fix it to make it clearer and use normative language. |
For the Overwrite approach, the consequence for not following the rules is data-loss. For the Skip/Ignore approach, the consequence for not following the rules is data that is only partly schema compliant. In my opinion, the latter is more user-friendly and less invasive. Many of the OTel users I work with would be very critical of their telemetry data being dropped because they don't follow certain "recommendations". |
We drop data elsewhere too if it does not fit the requirements, e.g. we drop attributes if the count exceeds the configured limit. We record the fact of dropping, but don't keep a copy of the dropped data. In that case too, the consequence of not following the rules is data loss. If it is acceptable for setting attributes in the API, then I think it is similarly acceptable for transformations in the Collector. That said, I don't think that's the best we can do. It seems to me "Preserve and Overwrite" may be a better option. We would preserve the data (albeit moved to a different place, but readily accessible) and also end up with a consistent shape of the output data, complying with the target schema. What does everyone think about this approach? |
@open-telemetry/specs-approvers any thoughts on the possible options outlined above? |
"Preserve and Overwrite" is the option that makes the most sense to me, either with an explicit bag of attributes or just with a count of the overwritten attributes. Producing non-compliant data as "Ignore" would do seems like a no-go to me, and it is unclear to me how we would implement "Detect and Reject/Rollback". |
I updated the "Preserve and Overwrite" option above to show how we can possibly preserve the conflicting data. I am inclined towards this solution; it seems the best available of the 4 options listed so far. |
"Preserve and Override" changes the name of an attribute, which can break user dashboards, alerts, and queries. Given that schema transformation may be part of the vendor ingestion pipeline, users won't even control it. I.e. this change would be breaking for existing user infra build on top of this data. When conversion happens inside user pipeline, investigating why this data was changed and why everything is broken will be difficult. So it seems, we're ready to break user tooling for the sake of compliance with later version of otel schema. To me, the benefit does not outweigh the risk and I'm still on "Ignore, Skip" + warn. |
Thinking more about it made me realize that schema transformation should, by default, preserve original attributes to avoid breaking existing custom tooling built on top of such attributes - #3510 |
I don't have a strong opinion on this (one could also argue that vendors are the ones that need to handle that), but I feel this is independent of this issue; we should discuss this issue assuming the current state of the spec, which AIUI is that original attributes are removed. |
the difference is that the current behavior happens in-process; it's stable and easy to discover with tests. We also give users the ability to control this behavior by configuring limits. In the case of schema transformation, the behavior becomes unstable: someone updates to the latest version (either in a processor inside the user's collector pipeline, or in the vendor's pipeline) and suddenly your data is dropped or moved. This is never discovered by tests because it happens much later and out-of-process. I suggest the following in #3510
|
@lmolkova From what I understand this is the same as the "Ignore, Skip" approach described above. Can you clarify what you suggest should happen with the SchemaURL if there is a collision? Do we update the SchemaURL or keep the old value? Also, how do we address the fact that the data as a whole, after such a partially applied transformation, does not comply with any Otel semconv version, neither the old one nor the new one? |
Correct.
A few options:
If we cannot provide reasonable defaults, stability guarantees, performance, and observability into edge cases for schema transformation, it means it should just stay experimental and not be used for stable conventions. |
I spent some time getting a better feel for what it would take to implement the "Rollback" option and what the impact would be. I wrote a toy implementation and ran some benchmarks with rollback-ability enabled and disabled. Toy source code here: https://github.com/tigrannajaryan/telemetry-schema/blob/main/schema/compiled/common_actions.go The schema converter creates a changelog that can be used for rolling the changes back (for whatever reason, including if conflicts are detected). Here are the benchmarking results:
The first two lines are schema conversion of 100 spans, with and without rollback capability enabled (but without doing any rollback; this just creates the changelog). It tests only the success case, when there are no conflicts and there is no need to roll back. I believe this is the most important case to care about from a performance perspective. As you can see, enabling the changelog adds about 40µs (per 100 spans) and slows down the conversion process by roughly 1.4 times. As a reference point I also included benchmarks for marshalling/unmarshalling the exact same data, which is roughly what a Collector with an otlp receiver, a schema processor and an otlp exporter would do. As a rough number, enabling the changelog would add about 5% of overall CPU time, which is not insignificant, but I don't think it is a dealbreaker. Furthermore, I think there is room to improve the performance of the changelog if needed (at the cost of complexity and maintainability). Also, take the benchmark with a huge grain of salt. It is just one scenario, with one particular schema conversion. We can validate this further, but for now it is enough for me to get a very approximate sense of expected performance. So, given all the above, in my opinion, performance is not going to be the decisive factor. At the moment I am more concerned with the complexity of implementing the changelog and the opportunity to introduce bugs. IMO, it is not terribly complicated, but it is still relatively easy to make a mistake, and especially since it is in code that is exercised rarely (rollbacks will be rare) the bugs introduced may be hard to notice. To be clear: I would prefer the "Rollback" option since it results in the best functional outcome IMO: no corrupt data belonging to no schema, a clear ability to see the errors if needed, etc. I would like an opinion of @open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers @open-telemetry/collector-maintainers @open-telemetry/collector-contrib-maintainer on this. |
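For readers who want the gist without reading the toy source: a much-simplified sketch of the changelog mechanism in Go, assuming a flat attribute map (the linked implementation differs in detail). Every mutation records the key's prior state, and Rollback replays the records in reverse order.

```go
package main

import "fmt"

// change remembers one key's state before a mutation.
type change struct {
	key    string
	hadOld bool
	oldVal any
}

// changeLog wraps an attribute map and records undo information.
type changeLog struct {
	attrs   map[string]any
	entries []change
}

// Set mutates an attribute, remembering its previous state.
func (c *changeLog) Set(key string, val any) {
	old, had := c.attrs[key]
	c.entries = append(c.entries, change{key, had, old})
	c.attrs[key] = val
}

// Delete removes an attribute, remembering its previous state.
func (c *changeLog) Delete(key string) {
	if old, had := c.attrs[key]; had {
		c.entries = append(c.entries, change{key, true, old})
		delete(c.attrs, key)
	}
}

// Rollback undoes all recorded mutations, newest first.
func (c *changeLog) Rollback() {
	for i := len(c.entries) - 1; i >= 0; i-- {
		e := c.entries[i]
		if e.hadOld {
			c.attrs[e.key] = e.oldVal
		} else {
			delete(c.attrs, e.key)
		}
	}
	c.entries = c.entries[:0]
}

func main() {
	cl := &changeLog{attrs: map[string]any{"host": "spool"}}
	cl.Set("host.name", "spool") // rename, step 1
	cl.Delete("host")            // rename, step 2
	cl.Rollback()                // conflict detected elsewhere: undo everything
	fmt.Println(cl.attrs)        // map[host:spool]
}
```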
I worry about rollbacks because it feels like silently failing, but with more steps. I also worry about the cognitive burden on developing schemas, since it could be possible to add a conflict without realising it. If this is deployed on a collector gateway, so data mutations happen on egress to the vendor, but a conflict causes the data to be rolled back, then my expected visualisations and alerts are broken if I expect the data to be of a fixed version. Not sure of the likelihood of this scenario, but I can see the chance of it happening increasing during the transition from manual instrumentation to a hybrid with auto instrumentation. |
@tigrannajaryan , let me bring this to the sig. |
@MovieStoreGuy thanks, please do. |
The level of complexity required to support the rollback mechanism seems excessive to me, considering that such conflicts should be infrequent occurrences. Additionally, if I configure the schema transform processor to generate a specific schema version, I would expect the processor to make its best attempt to translate the data to that version. Otherwise, I would opt to use a different processor and modify the data accordingly. I believe we can produce the data as close to the required state as possible. Even if we encounter a conflict, we can utilize the value set by the user, which may not precisely adhere to the specification but will, at least, have the correct attribute name. We might encounter this situation regardless in the following scenario:
In that case, translation isn't triggered, and we simply update the schema from version |
What are you trying to achieve?
When building the schema processor for the collector, it was noted that there is no defined behaviour for what to expect if there is an existing value that would be overridden.
What did you expect to see?
I've currently made the assumption that schema conversions take priority over existing values, and that an error is returned which the collector can consume to emit a metric (or log) of how many conflicts were found, and report that (see the sketch after this issue description).
Additional context.
Context: open-telemetry/opentelemetry-collector-contrib#17020 (comment)
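A hypothetical Go sketch of that assumed behaviour, again over a flat `map[string]any`: conversions take priority over existing values, and the conflict count is returned for the collector to report. The names and signatures are illustrative, not the schema processor's actual API.

```go
package main

import "fmt"

// rename is one schema transformation step: attribute `from` becomes `to`.
type rename struct{ from, to string }

// applyRenames applies renames in order, letting conversions win over
// pre-existing values, and counts conflicts so the caller can surface
// them as a metric or log.
func applyRenames(attrs map[string]any, renames []rename) (conflicts int) {
	for _, r := range renames {
		v, ok := attrs[r.from]
		if !ok {
			continue // source attribute absent; nothing to do
		}
		if _, exists := attrs[r.to]; exists {
			conflicts++ // an existing value is about to be overwritten
		}
		attrs[r.to] = v
		delete(attrs, r.from)
	}
	return conflicts
}

func main() {
	attrs := map[string]any{"host": "spool", "host.name": "spool.example.com"}
	n := applyRenames(attrs, []rename{{from: "host", to: "host.name"}})
	fmt.Printf("%v, %d conflict(s)\n", attrs, n)
}
```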