From bd0177d4aa53b590ed5e75c29690df5bcc380b10 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 8 Sep 2020 22:03:33 -0700 Subject: [PATCH 01/33] Add error flagging proposal --- text/trace/0000-error_flagging.md | 89 +++++++++++++++++++++++++++++++ 1 file changed, 89 insertions(+) create mode 100644 text/trace/0000-error_flagging.md diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md new file mode 100644 index 000000000..10c5bca90 --- /dev/null +++ b/text/trace/0000-error_flagging.md @@ -0,0 +1,89 @@ +# Error Flagging with Status Codes +This proposal adds two status codes explicitly for use as overrides by the end user, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. + +## Motivation +Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelmetry, the error flag is called "status codes." + +However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is dependent on the situation. + +There is one major exception. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate both when a span should and should not be counted as an error. +A second exception is supporting analysis tools which require explicit error flagging in the data which they receive. In this case, an operator must be able to apply an error flagging schema at some point during the OTLP data processing pipeline. 
+ +## Explanation +The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today. + +### Status Codes +The following status codes are added to our current schema. + +* `DEFAULT` No status has been set. Any errors must be detected by the analysis tool. (This replaces `OK`.) +* `ERROR` Instrumentation has marked a span as an error. (This replaces `UNKNOWN`.) +* `OK_OVERRIDE` The user has provided an override. The span should NOT be flagged as an error, regardless of other analysis. +* `ERROR_OVERRIDE` The user has provided an override. The span SHOULD be flagged as an error, regardless of other analysis. + +(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal .) + +`OK_OVERRIDE` and `ERROR_OVERRIDE` are special status codes. These are explicit overrides provided by the end user, and should never be set by shared instrumentation. They should only be set by the application developer (via application code), or by the operator (via the collector). + +Analysis tools are free to disregard status codes, in favor of their own approach to error analysis. However it is strongly suggested that analysis tools handle `OK_OVERRIDE` and `ERROR_OVERRIDE`, as these are explicitly set by the end user and contain valuable information. + + +### Error Mapping Schema +As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. + +### Error Processor +The collector will provide a processor and a configuration language to make adjustments to this error mapping schema. This provides the flexibility and customization needed for real world scenarios. 
+ +### Semantic conventions +`error.message` - A description of the error to be displayed in the analysis tool. This attribute may be set on both spans and events. This is optional, but useful when the error does not map to an exception, or an existing semantic convention. It is not necessary to set this for exceptions or our standard error mapping. + +### Convenience methods +As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality. + + +## Internal details +Except for the renaming of two status codes, this proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge. + +**OK is renamed to DEFAULT** + Using the term “OK” as the default implies more meaning than we intend. The span is not necessarily OK - it simply has not triggered our standard error mapping. The default status code should be renamed to `DEFAULT` to avoid confusion. + + Note: I intentionally avoided terms like "unset" as it may imply to users that they are required to set the status code, which is not the intention. + +**UNKNOWN is renamed to ERROR** + In the new schema, ERROR is the primary status code for reporting errors. Especially in a reduced list of status codes, the term "unknown" is vague and may accidentally imply a meaning to users which we do not intend. + + +## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS? +First of all, every tracing system to-date comes with a default set of errors. No system requires that end users start completely from scratch. So... be calm!! Have faith!! + +While flagging errors can be a subjective decision, it is true that many semantic conventions qualify as an error. 
By providing a default mapping of semantic conventions to errors, we ensure compatibility with existing analysis tools (e.g. Jaeger), and provide guidance to users and future implementers. + +Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch. + +Note that analysis tools are free to disregard Span Status, and do their own error analysis. For these systems, the only Status codes of import are `OK_OVERRIDE` and `ERROR_OVERRIDE`. + +Removing the need to explicitly set span status, and instead have it + +If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect. + +The minimal number of status codes would be `DEFAULT`, `ERROR`, `OK_OVERRIDE` and `ERROR_OVERRIDE`. `ERROR` will be used to differentiate between standard errors applied by instrumentation and overrides provided by the end user. + +## Remind me why we need status codes again? +Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. This reduces overhead and is a benefit for many systems. + +Again, the status codes may be customized by the operator during the telemetry pipeline, in order to add and suppress errors. + +## Open questions +If we add error processing to the collector, it is unclear what the overhead would be. + +It is also unclear what the cost is for backends to scan for errors on every span, without a hint from instrumentation that an error might be present. 
+ +## Prior art and alternatives +In OpenTracing, the lack of a Collector and error mapping schema proved to be unwieldy. It placed a burden on instrumentation plugin authors to set the flag correctly, and led to an explosion of non-standardized configuration options in every plugin just to adjust the default error flagging. This in turn placed a configuration burden on application developers. + +An alternative is the `error.hint` proposal, paired with the removal of status code. This would work, but essentially provides the same mechanism provided in this proposal, only with a large number of breaking changes. It also does not address the need for user overrides. + +## Future Work + +The inclusion of status codes and error mappings help the opentelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. + +In the future, OpenTelemtry may add a control plane which allows dynamic configuration of the error mapping schema. From 54121cd263b94d4f4ea17166e26763c9975e6c21 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 8 Sep 2020 23:44:35 -0700 Subject: [PATCH 02/33] removed half finished sentence --- text/trace/0000-error_flagging.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 10c5bca90..732260b37 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -61,9 +61,7 @@ Obviously, all systems are different, and users will want to adjust error report Note that analysis tools are free to disregard Span Status, and do their own error analysis. For these systems, the only Status codes of import are `OK_OVERRIDE` and `ERROR_OVERRIDE`. 
-Removing the need to explicitly set span status, and instead have it - -If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect. +If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect for our current needs. The minimal number of status codes would be `DEFAULT`, `ERROR`, `OK_OVERRIDE` and `ERROR_OVERRIDE`. `ERROR` will be used to differentiate between standard errors applied by instrumentation and overrides provided by the end user. From 725cbc006ee50e5d090d93bc255bfc1dff34c776 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 8 Sep 2020 23:49:07 -0700 Subject: [PATCH 03/33] Remove error.message, it is redundant with the status message. --- text/trace/0000-error_flagging.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 732260b37..ac4acca56 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -33,9 +33,6 @@ As part of the specification, OpenTelemetry provides a canonical mapping of sema ### Error Processor The collector will provide a processor and a configuration language to make adjustments to this error mapping schema. This provides the flexibility and customization needed for real world scenarios. -### Semantic conventions -`error.message` - A description of the error to be displayed in the analysis tool. This attribute may be set on both spans and events. This is optional, but useful when the error does not map to an exception, or an existing semantic convention. It is not necessary to set this for exceptions or our standard error mapping. 
- ### Convenience methods As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality. From ae86379bdf1ad3ef3985986c779662866c343ccf Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 8 Sep 2020 23:51:10 -0700 Subject: [PATCH 04/33] whitespace Co-authored-by: Reiley Yang --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index ac4acca56..e963c4e02 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -81,4 +81,4 @@ An alternative is the `error.hint` proposal, paired with the removal of status c The inclusion of status codes and error mappings help the opentelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. -In the future, OpenTelemtry may add a control plane which allows dynamic configuration of the error mapping schema. +In the future, OpenTelemtry may add a control plane which allows dynamic configuration of the error mapping schema. 
From 74dd7f50f1f234af4342cbbfb5fae30b5b176689 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 8 Sep 2020 23:51:31 -0700 Subject: [PATCH 05/33] whitespace Co-authored-by: Reiley Yang --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index e963c4e02..5c32dfd20 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -20,7 +20,7 @@ The following status codes are added to our current schema. * `OK_OVERRIDE` The user has provided an override. The span should NOT be flagged as an error, regardless of other analysis. * `ERROR_OVERRIDE` The user has provided an override. The span SHOULD be flagged as an error, regardless of other analysis. -(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal .) +(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal.) `OK_OVERRIDE` and `ERROR_OVERRIDE` are special status codes. These are explicit overrides provided by the end user, and should never be set by shared instrumentation. They should only be set by the application developer (via application code), or by the operator (via the collector). 
From 0d34ea11b2c55bbe66d7432714940d387c9db4e4 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 8 Sep 2020 23:55:01 -0700 Subject: [PATCH 06/33] clarify convenience methods are in a separate package --- text/trace/0000-error_flagging.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 5c32dfd20..7487ef272 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -35,6 +35,8 @@ The collector will provide a processor and a configuration language to make adju ### Convenience methods As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality. + +Note that these convenience methods simply wire together multiple API calls. They should live in a helper package, and should not be directly added to existing API interfaces. Given how many semantic conventions we have, there will be a pile of them. ## Internal details From d951674c3b06252e42521dc392efcee17aab6105 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:46:35 -0700 Subject: [PATCH 07/33] whitespace Co-authored-by: Steve Flanders --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 7487ef272..33811e1af 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -13,7 +13,7 @@ A second exception is supporting analysis tools which require explicit error fla The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today. ### Status Codes -The following status codes are added to our current schema. 
+The following status codes are added to our current schema. * `DEFAULT` No status has been set. Any errors must be detected by the analysis tool. (This replaces `OK`.) * `ERROR` Instrumentation has marked a span as an error. (This replaces `UNKNOWN`.) From c814974849ac651ef6253d333bb0ab610cc44198 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:47:33 -0700 Subject: [PATCH 08/33] whitespace Co-authored-by: Steve Flanders --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 33811e1af..25e896a71 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -20,7 +20,7 @@ The following status codes are added to our current schema. * `OK_OVERRIDE` The user has provided an override. The span should NOT be flagged as an error, regardless of other analysis. * `ERROR_OVERRIDE` The user has provided an override. The span SHOULD be flagged as an error, regardless of other analysis. -(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal.) +(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal.) `OK_OVERRIDE` and `ERROR_OVERRIDE` are special status codes. These are explicit overrides provided by the end user, and should never be set by shared instrumentation. They should only be set by the application developer (via application code), or by the operator (via the collector). 
From e787f948d4244f3ddef003d23780beed7a83c098 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:48:22 -0700 Subject: [PATCH 09/33] use correct RFC language Co-authored-by: Steve Flanders --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 25e896a71..c9abbd48c 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -24,7 +24,7 @@ The following status codes are added to our current schema. `OK_OVERRIDE` and `ERROR_OVERRIDE` are special status codes. These are explicit overrides provided by the end user, and should never be set by shared instrumentation. They should only be set by the application developer (via application code), or by the operator (via the collector). -Analysis tools are free to disregard status codes, in favor of their own approach to error analysis. However it is strongly suggested that analysis tools handle `OK_OVERRIDE` and `ERROR_OVERRIDE`, as these are explicitly set by the end user and contain valuable information. +Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD handle `OK_OVERRIDE` and `ERROR_OVERRIDE`, as these are explicitly set by the end-user and contain valuable information. 
### Error Mapping Schema From 3a25b4ca8e4caff95ff4e8c70808e5776a45eaa2 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:52:54 -0700 Subject: [PATCH 10/33] spelling Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index c9abbd48c..b16f74a6c 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -2,7 +2,7 @@ This proposal adds two status codes explicitly for use as overrides by the end user, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. ## Motivation -Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelmetry, the error flag is called "status codes." +Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes." However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is dependent on the situation. 
From 4a42c3aaf960ba5de1ef54701787ec110a6ca60c Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:53:11 -0700 Subject: [PATCH 11/33] spellingUpdate text/trace/0000-error_flagging.md Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index b16f74a6c..6931abae8 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -83,4 +83,4 @@ An alternative is the `error.hint` proposal, paired with the removal of status c The inclusion of status codes and error mappings help the opentelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. -In the future, OpenTelemtry may add a control plane which allows dynamic configuration of the error mapping schema. +In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the error mapping schema. From d204bc7b907e287cddf8594f57e800772f1b23a2 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:53:24 -0700 Subject: [PATCH 12/33] capitalization Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 6931abae8..fdbee74ff 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -70,7 +70,7 @@ Status codes provide a low overhead mechanism for checking if a span counts agai Again, the status codes may be customized by the operator during the telemetry pipeline, in order to add and suppress errors. 
## Open questions -If we add error processing to the collector, it is unclear what the overhead would be. +If we add error processing to the Collector, it is unclear what the overhead would be. It is also unclear what the cost is for backends to scan for errors on every span, without a hint from instrumentation that an error might be present. From f551758886f736b45f564d78d9916022a6ecc4a3 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 11:53:38 -0700 Subject: [PATCH 13/33] Capitalization Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com> --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index fdbee74ff..7efcc896b 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -81,6 +81,6 @@ An alternative is the `error.hint` proposal, paired with the removal of status c ## Future Work -The inclusion of status codes and error mappings help the opentelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. +The inclusion of status codes and error mappings help the OpenTelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the error mapping schema. 
From 44b1fa4bf1712d761a79e6fd46f87f9f8c1246e8 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 13:34:14 -0700 Subject: [PATCH 14/33] error mapping -> status mapping --- text/trace/0000-error_flagging.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 7efcc896b..3e4fdbff5 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -27,11 +27,11 @@ The following status codes are added to our current schema. Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD handle `OK_OVERRIDE` and `ERROR_OVERRIDE`, as these are explicitly set by the end-user and contain valuable information. -### Error Mapping Schema +### Status Mapping Schema As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. -### Error Processor -The collector will provide a processor and a configuration language to make adjustments to this error mapping schema. This provides the flexibility and customization needed for real world scenarios. +### Status Processor +The collector will provide a processor and a configuration language to make adjustments to this status mapping schema. This provides the flexibility and customization needed for real world scenarios. ### Convenience methods As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality. @@ -43,7 +43,7 @@ Note that these convenience methods simply wire together multiple API calls. 
The Except for the renaming of two status codes, this proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge. **OK is renamed to DEFAULT** - Using the term “OK” as the default implies more meaning than we intend. The span is not necessarily OK - it simply has not triggered our standard error mapping. The default status code should be renamed to `DEFAULT` to avoid confusion. + Using the term “OK” as the default implies more meaning than we intend. The span is not necessarily OK - it simply has not triggered our standard status mapping. The default status code should be renamed to `DEFAULT` to avoid confusion. Note: I intentionally avoided terms like "unset" as it may imply to users that they are required to set the status code, which is not the intention. @@ -75,12 +75,12 @@ If we add error processing to the Collector, it is unclear what the overhead wou It is also unclear what the cost is for backends to scan for errors on every span, without a hint from instrumentation that an error might be present. ## Prior art and alternatives -In OpenTracing, the lack of a Collector and error mapping schema proved to be unwieldy. It placed a burden on instrumentation plugin authors to set the flag correctly, and led to an explosion of non-standardized configuration options in every plugin just to adjust the default error flagging. This in turn placed a configuration burden on application developers. +In OpenTracing, the lack of a Collector and status mapping schema proved to be unwieldy. It placed a burden on instrumentation plugin authors to set the error flag correctly, and led to an explosion of non-standardized configuration options in every plugin just to adjust the default error flagging. This in turn placed a configuration burden on application developers. An alternative is the `error.hint` proposal, paired with the removal of status code. 
This would work, but essentially provides the same mechanism provided in this proposal, only with a large number of breaking changes. It also does not address the need for user overrides. ## Future Work -The inclusion of status codes and error mappings help the OpenTelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. +The inclusion of status codes and status mappings help the OpenTelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. -In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the error mapping schema. +In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the status mapping schema. From bb1bfb61051ede34d658c5ac1da56d4196c4c564 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 13:34:30 -0700 Subject: [PATCH 15/33] whitespace --- text/trace/0000-error_flagging.md | 1 + 1 file changed, 1 insertion(+) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 3e4fdbff5..a87730263 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -7,6 +7,7 @@ Error reporting is a fundamental use case for distributed tracing. While we pref However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is dependent on the situation. There is one major exception. 
Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate both when a span should and should not be counted as an error. + A second exception is supporting analysis tools which require explicit error flagging in the data which they receive. In this case, an operator must be able to apply an error flagging schema at some point during the OTLP data processing pipeline. ## Explanation From ccc0f5ead0ae05f2d308eb10b196c93b0ee57ee2 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 9 Sep 2020 15:08:12 -0700 Subject: [PATCH 16/33] replace override status codes with user_override boolean --- text/trace/0000-error_flagging.md | 41 +++++++++---------------------- 1 file changed, 12 insertions(+), 29 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index a87730263..28acb0bde 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -4,29 +4,22 @@ This proposal adds two status codes explicitly for use as overrides by the end u ## Motivation Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes." -However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is dependent on the situation. +However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? 
The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages. -There is one major exception. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate both when a span should and should not be counted as an error. +There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. -A second exception is supporting analysis tools which require explicit error flagging in the data which they receive. In this case, an operator must be able to apply an error flagging schema at some point during the OTLP data processing pipeline. +While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what consitutes an error. In order to accomplish this, there must be a way to differntiate between errors flagged by instrumentation, and errors flagged by the end user. ## Explanation The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today. 
+### `span.user_override(boolean)` +The `user_override` method indicates that the end user has confirmed that the status code is correct. When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis. + +Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when `user_override` is set, as it is a communication from the end-user and contains valuable information. + ### Status Codes -The following status codes are added to our current schema. - -* `DEFAULT` No status has been set. Any errors must be detected by the analysis tool. (This replaces `OK`.) -* `ERROR` Instrumentation has marked a span as an error. (This replaces `UNKNOWN`.) -* `OK_OVERRIDE` The user has provided an override. The span should NOT be flagged as an error, regardless of other analysis. -* `ERROR_OVERRIDE` The user has provided an override. The span SHOULD be flagged as an error, regardless of other analysis. - -(Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal.) - -`OK_OVERRIDE` and `ERROR_OVERRIDE` are special status codes. These are explicit overrides provided by the end user, and should never be set by shared instrumentation. They should only be set by the application developer (via application code), or by the operator (via the collector). - -Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD handle `OK_OVERRIDE` and `ERROR_OVERRIDE`, as these are explicitly set by the end-user and contain valuable information. 
- +Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal. ### Status Mapping Schema As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. @@ -41,15 +34,7 @@ Note that these convenience methods simply wire together multiple API calls. The ## Internal details -Except for the renaming of two status codes, this proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge. - -**OK is renamed to DEFAULT** - Using the term “OK” as the default implies more meaning than we intend. The span is not necessarily OK - it simply has not triggered our standard status mapping. The default status code should be renamed to `DEFAULT` to avoid confusion. - - Note: I intentionally avoided terms like "unset" as it may imply to users that they are required to set the status code, which is not the intention. - -**UNKNOWN is renamed to ERROR** - In the new schema, ERROR is the primary status code for reporting errors. Especially in a reduced list of status codes, the term "unknown" is vague and may accidentally imply a meaning to users which we do not intend. +This proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge. ## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS? @@ -59,11 +44,9 @@ While flagging errors can be a subjective decision, it is true that many semanti Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. 
Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch. -Note that analysis tools are free to disregard Span Status, and do their own error analysis. For these systems, the only Status codes of import are `OK_OVERRIDE` and `ERROR_OVERRIDE`. - -If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect for our current needs. +Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when `user_override` is set. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when `user_override` is set, as it represents a subjective decision made by either the operator or application developer. -The minimal number of status codes would be `DEFAULT`, `ERROR`, `OK_OVERRIDE` and `ERROR_OVERRIDE`. `ERROR` will be used to differentiate between standard errors applied by instrumentation and overrides provided by the end user. +If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect for our current needs. The minimal number of status codes would be `OK` and `ERROR`. ## Remind me why we need status codes again? Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. This reduces overhead and is a benefit for many systems. 
From 241e2aa1e06cb74bc1a45d065c2a9951f74dd7f5 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Thu, 10 Sep 2020 10:03:05 -0700 Subject: [PATCH 17/33] rewrite based on error WG feedback --- text/trace/0000-error_flagging.md | 37 +++++++++++++++++++------------ 1 file changed, 23 insertions(+), 14 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 28acb0bde..25d5f9b8f 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -8,21 +8,32 @@ However, there is confusion over the mapping of semantic conventions to status c There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. -While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what consitutes an error. In order to accomplish this, there must be a way to differntiate between errors flagged by instrumentation, and errors flagged by the end user. +While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what constitutes an error. In order to accomplish this, there must be a way to differentiate between errors flagged by instrumentation, and errors flagged by the end user. 
## Explanation The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today. - -### `span.user_override(boolean)` -The `user_override` method indicates that the end user has confirmed that the status code is correct. When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis. - -Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when `user_override` is set, as it is a communication from the end-user and contains valuable information. ### Status Codes -Note that our current status codes include a long list of error types. We may choose to keep them, change them, or drop them in favor of a single `ERROR` code. How many error types we have is not relevant to this proposal. +Currently, OpenTelemetry does not have a use case for differentiating between different types of errors. However, this use case may appear in the future. For now, we would like to reduce the number of status codes, and then add them back in as the need becomes clear. We would also like to differentiate between status codes which have not been +set, and an explicit OK status set by an end user. + +* `UNSET` is the default status code. +* `ERROR` represents all error types. +* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for nominal operations. + +### `Status Source` +The Status Source field identifies the origin of the status code on the span. 
This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. These statuses When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis. + +* `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. +* `APPLICATION` identifies statuses set by the application developer, within their application code. +* `OPERATOR` identifies statuses which have been altered during data egress. + +Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `APPLICATION` or `OPERATOR`, as it is a communication from the end-user and contains valuable information. ### Status Mapping Schema -As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. +As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. + +Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA. ### Status Processor The collector will provide a processor and a configuration language to make adjustments to this status mapping schema. This provides the flexibility and customization needed for real world scenarios. 
@@ -34,7 +45,7 @@ Note that these convenience methods simply wire together multiple API calls. The ## Internal details -This proposal is backwards compatible with existing code, protocols, and the OpenTracing bridge. +This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status codes enums from the current OTLP protocol, and the rewriting of the small number of instrumentation plugins that were making use of them. ## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS? @@ -44,14 +55,12 @@ While flagging errors can be a subjective decision, it is true that many semanti Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch. -Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when `user_override` is set. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when `user_override` is set, as it represents a subjective decision made by either the operator or application developer. - -If we really hate the current canonical status codes, most may be removed and added back in later. I do suggest we keep the status codes that map to network failures, and I agree that the rest are a bit suspect for our current needs. The minimal number of status codes would be `OK` and `ERROR`. +Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when Status Source is set. 
However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when Status Source is set, as it represents a subjective decision made by either the operator or application developer. ## Remind me why we need status codes again? -Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. This reduces overhead and is a benefit for many systems. +Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. It is an inexpensive and low cardinality approach to track multiple types of error budgets. This reduces overhead and could be a benefit for many systems. -Again, the status codes may be customized by the operator during the telemetry pipeline, in order to add and suppress errors. +However, adding in an existing set of error types without first clearly defining their use and how they might be set has caused confusion. If the status codes are not set consistently and correctly, then the resulting error budgeting will not be useful. So we are consolidating all error types into a single ERROR type, to avoid this situation. We may add more error types back in if we can agree on their use cases and a method for applying them consistently. ## Open questions If we add error processing to the Collector, it is unclear what the overhead would be. From db506d655f18e201225fa2530ff8835de895c60c Mon Sep 17 00:00:00 2001 From: Ted Young Date: Thu, 10 Sep 2020 10:56:42 -0700 Subject: [PATCH 18/33] indicate that status source is a new field. 
--- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 25d5f9b8f..ef5e8fe50 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -22,7 +22,7 @@ set, and an explicit OK status set by an end user. * `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for nominal operations. ### `Status Source` -The Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. These statuses When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis. +A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. These statuses When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis. * `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. 
* `APPLICATION` identifies statuses set by the application developer, within their application code. From 7e5ef2b30204ce08605bb12ddbd0e2715262167b Mon Sep 17 00:00:00 2001 From: Ted Young Date: Thu, 10 Sep 2020 10:58:14 -0700 Subject: [PATCH 19/33] remove some garbage --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index ef5e8fe50..c817b1441 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -22,7 +22,7 @@ set, and an explicit OK status set by an end user. * `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for nominal operations. ### `Status Source` -A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. These statuses When using OTLP, this will set the `user_override` field. When setting status codes via the collector or application code, `user_override` can be set to ensure that the span status is not re-interpreted by further analysis. +A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. * `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. 
All instrumentation plugins shipped with OpenTelemetry use this status code. * `APPLICATION` identifies statuses set by the application developer, within their application code. From 6fb4bcbb951138faf8b0608afaeab3c5679a491b Mon Sep 17 00:00:00 2001 From: Ted Young Date: Thu, 10 Sep 2020 14:31:15 -0700 Subject: [PATCH 20/33] Consolidate OPERATOR and APPLICATION into USER --- text/trace/0000-error_flagging.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index c817b1441..4f6c1b11b 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -25,10 +25,9 @@ set, and an explicit OK status set by an end user. A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. * `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. -* `APPLICATION` identifies statuses set by the application developer, within their application code. -* `OPERATOR` identifies statuses which have been altered during data egress. +* `USER` identifies statuses set by the end user, either in application code or the collector. -Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `APPLICATION` or `OPERATOR`, as it is a communication from the end-user and contains valuable information. +Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. 
However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the end-user and contains valuable information. ### Status Mapping Schema As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. From 639e6612594135ea047a1d174adba420d9111a05 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Thu, 10 Sep 2020 14:32:31 -0700 Subject: [PATCH 21/33] spelling Co-authored-by: Reiley Yang --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 4f6c1b11b..5bcb550ce 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -2,7 +2,7 @@ This proposal adds two status codes explicitly for use as overrides by the end user, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. ## Motivation -Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes." +Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes". 
However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages. From 0c8899bd99fe2467d611f8ddb1d2896bbaa67b3e Mon Sep 17 00:00:00 2001 From: Ted Young Date: Thu, 10 Sep 2020 15:16:14 -0700 Subject: [PATCH 22/33] nominal -> normal --- text/trace/0000-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index 5bcb550ce..c01de6c14 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -19,7 +19,7 @@ set, and an explicit OK status set by an end user. * `UNSET` is the default status code. * `ERROR` represents all error types. -* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for nominal operations. +* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for normal operations. ### `Status Source` A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. 
From ec77065808f85e405ce20d1612d36293cdf28d80 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Fri, 11 Sep 2020 12:21:49 -0700 Subject: [PATCH 23/33] markdownlint --- text/trace/0000-error_flagging.md | 57 ++++++++++++++++++------------- 1 file changed, 34 insertions(+), 23 deletions(-) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0000-error_flagging.md index c01de6c14..8b123d25c 100644 --- a/text/trace/0000-error_flagging.md +++ b/text/trace/0000-error_flagging.md @@ -1,19 +1,23 @@ # Error Flagging with Status Codes + This proposal adds two status codes explicitly for use as overrides by the end user, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. - + ## Motivation + Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes". - + However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages. - + There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. 
OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what constitutes an error. In order to accomplish this, there must be a way to differentiate between errors flagged by instrumentation, and errors flagged by the end user. - + ## Explanation + The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today. ### Status Codes + Currently, OpenTelemetry does not have a use case for differentiating between different types of errors. However, this use case may appear in the future. For now, we would like to reduce the number of status codes, and then add them back in as the need becomes clear. We would also like to differentiate between status codes which have not been set, and an explicit OK status set by an end user. @@ -22,57 +26,64 @@ set, and an explicit OK status set by an end user. * `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for normal operations. ### `Status Source` + A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. 
* `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. * `USER` identifies statuses set by the end user, either in application code or the collector. Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the end-user and contains valuable information. - + ### Status Mapping Schema + As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA. - + ### Status Processor + The collector will provide a processor and a configuration language to make adjustments to this status mapping schema. This provides the flexibility and customization needed for real world scenarios. - + ### Convenience methods + As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality. Note that these convenience methods simply wire together multiple API calls. They should live in a helper package, and should not be directly added to existing API interfaces. Given how many semantic conventions we have, there will be a pile of them. - - + ## Internal details + This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. 
The only potential exception is the removal of status codes enums from the current OTLP protocol, and the rewriting of the small number of instrumentation plugins that were making use of them. - - + ## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS? + First of all, every tracing system to-date comes with a default set of errors. No system requires that end users start completely from scratch. So... be calm!! Have faith!! - + While flagging errors can be a subjective decision, it is true that many semantic conventions qualify as an error. By providing a default mapping of semantic conventions to errors, we ensure compatibility with existing analysis tools (e.g. Jaeger), and provide guidance to users and future implementers. - + Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch. - + Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when Status Source is set. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when Status Source is set, as it represents a subjective decision made by either the operator or application developer. - + ## Remind me why we need status codes again? + Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. It is an inexpensive and low cardinality approach to track multiple types of error budgets. This reduces overhead and could be a benefit for many systems. 
- + However, adding in an existing set of error types without first clearly defining their use and how they might be set has caused confusion. If the status codes are not set consistently and correctly, then the resulting error budgeting will not be useful. So we are consolidating all error types into a single ERROR type, to avoid this situation. We may add more error types back in if we can agree on their use cases and a method for applying them consistently. - + ## Open questions + If we add error processing to the Collector, it is unclear what the overhead would be. - + It is also unclear what the cost is for backends to scan for errors on every span, without a hint from instrumentation that an error might be present. - + ## Prior art and alternatives + In OpenTracing, the lack of a Collector and status mapping schema proved to be unwieldy. It placed a burden on instrumentation plugin authors to set the error flag correctly, and led to an explosion of non-standardized configuration options in every plugin just to adjust the default error flagging. This in turn placed a configuration burden on application developers. - + An alternative is the `error.hint` proposal, paired with the removal of status code. This would work, but essentially provides the same mechanism provided in this proposal, only with a large number of breaking changes. It also does not address the need for user overrides. - + ## Future Work - + The inclusion of status codes and status mappings help the OpenTelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. - + In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the status mapping schema. 
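The status model this patch describes (three status codes plus a Status Source field) can be sketched roughly as follows. This is an illustration only: the proposal does not define a wire format or API, so the Python names `StatusCode`, `StatusSource`, `SpanStatus`, and `counts_against_error_budget` are hypothetical, and the policy shown is just one way an analysis tool might honor statuses set by `USER`.

```python
from dataclasses import dataclass
from enum import Enum


class StatusCode(Enum):
    UNSET = "UNSET"  # default; instrumentation leaves this for normal operations
    ERROR = "ERROR"  # represents all error types
    OK = "OK"        # explicitly marked error-free; only end users should set this


class StatusSource(Enum):
    INSTRUMENTATION = "INSTRUMENTATION"  # default source: shared libraries and frameworks
    USER = "USER"                        # set by the end user, in application code or the collector


@dataclass
class SpanStatus:
    # Hypothetical container; field names are assumptions, not part of the proposal.
    code: StatusCode = StatusCode.UNSET
    source: StatusSource = StatusSource.INSTRUMENTATION


def counts_against_error_budget(status: SpanStatus, tool_detected_error: bool) -> bool:
    """One possible analysis-tool policy (an assumption, not mandated by the proposal).

    A status set by USER is treated as final. Otherwise the status code from
    instrumentation is combined with the tool's own error analysis.
    """
    if status.source is StatusSource.USER:
        if status.code is StatusCode.OK:
            return False
        if status.code is StatusCode.ERROR:
            return True
    # Instrumentation followed a generic schema, so the tool may add its own findings.
    return status.code is StatusCode.ERROR or tool_detected_error
```

With this policy, a span marked `OK` by `USER` is never counted as an error, even if the tool's own analysis disagrees, while a span left as `UNSET` by instrumentation is still open to the tool's interpretation.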
From 788844d3f9aaf9b0f25e2cf2e4801f84e5f3cdb2 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Fri, 11 Sep 2020 12:22:48 -0700 Subject: [PATCH 24/33] Add PR number to file name --- text/trace/{0000-error_flagging.md => 0136-error_flagging.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename text/trace/{0000-error_flagging.md => 0136-error_flagging.md} (100%) diff --git a/text/trace/0000-error_flagging.md b/text/trace/0136-error_flagging.md similarity index 100% rename from text/trace/0000-error_flagging.md rename to text/trace/0136-error_flagging.md From ed50989e2e864ed6ed739048c75b6ca4a1192af4 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Fri, 11 Sep 2020 12:27:26 -0700 Subject: [PATCH 25/33] more lint --- text/trace/0136-error_flagging.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index 8b123d25c..ca3816eec 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -8,7 +8,7 @@ Error reporting is a fundamental use case for distributed tracing. While we pref However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages. -There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. 
In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. +There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what constitutes an error. In order to accomplish this, there must be a way to differentiate between errors flagged by instrumentation, and errors flagged by the end user. @@ -36,7 +36,7 @@ Analysis tools MAY disregard status codes, in favor of their own approach to err ### Status Mapping Schema -As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. +As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA. @@ -66,7 +66,7 @@ Note that analysis tools MAY disregard Span Status, and do their own error analy ## Remind me why we need status codes again? 
-Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. It is an inexpensive and low cardinality approach to track multiple types of error budgets. This reduces overhead and could be a benefit for many systems. +Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. It is an inexpensive and low cardinality approach to track multiple types of error budgets. This reduces overhead and could be a benefit for many systems. However, adding in an existing set of error types without first clearly defining their use and how they might be set has caused confusion. If the status codes are not set consistently and correctly, then the resulting error budgeting will not be useful. So we are consolidating all error types into a single ERROR type, to avoid this situation. We may add more error types back in if we can agree on their use cases and a method for applying them consistently. From a6de9ccde7085414f751413f557606e9451314d0 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 15 Sep 2020 14:05:52 -0700 Subject: [PATCH 26/33] clarify end users Co-authored-by: Armin Ruech --- text/trace/0136-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index ca3816eec..459ac62b0 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -30,7 +30,7 @@ set, and an explicit OK status set by an end user. A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. * `INSTRUMENTATION` is the default source. 
This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. -* `USER` identifies statuses set by the end user, either in application code or the collector. +* `USER` identifies statuses set by application developers or operators, either in application code or the collector. Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the end-user and contains valuable information. From 19a3d237206e7d85de123b6026bab11460e1e511 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 15 Sep 2020 14:06:26 -0700 Subject: [PATCH 27/33] clarify end user Co-authored-by: Armin Ruech --- text/trace/0136-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index 459ac62b0..01593d73e 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -32,7 +32,7 @@ A new Status Source field identifies the origin of the status code on the span. * `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. * `USER` identifies statuses set by application developers or operators, either in application code or the collector. -Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the end-user and contains valuable information. +Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. 
However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the application developer or operator and contains valuable information. ### Status Mapping Schema From cc8f30509461badbe70c70dd743a934c16619d94 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 15 Sep 2020 15:50:21 -0700 Subject: [PATCH 28/33] clarify terms, update old intro --- text/trace/0136-error_flagging.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index 01593d73e..6fec1e25b 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -1,14 +1,16 @@ # Error Flagging with Status Codes -This proposal adds two status codes explicitly for use as overrides by the end user, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. +This proposal reduces the number of status codes to three, a new field to identify status codes set by application developers and operators, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. + +Note: the term **end user** is defined as the application developers and operators of the system running opentelemetry. The term **instrumentation** refers to [instrumentation libraries](/~https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/glossary.md#instrumentation-library) for common code shared between different systems, such as web frameworks and database clients. ## Motivation -Error reporting is a fundamental use case for distributed tracing. 
While we prefer that error flagging occurs within analysis tools, and not within instrumentation plugins, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes". +Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes". -However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation packages. +However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation. -There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. 
OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation plugins. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. +There is one other missing piece, required for proper error flagging. Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what constitutes an error. In order to accomplish this, there must be a way to differentiate between errors flagged by instrumentation, and errors flagged by the end user. @@ -52,7 +54,7 @@ Note that these convenience methods simply wire together multiple API calls. The ## Internal details -This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status codes enums from the current OTLP protocol, and the rewriting of the small number of instrumentation plugins that were making use of them. +This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status codes enums from the current OTLP protocol, and the rewriting of the small number of instrumentation that were making use of them. ## BUT ERRORS ARE SUBJECTIVE!! 
HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS? From 25321074d98d14cbef98117c54ea2bd7b1bf72d2 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 15 Sep 2020 15:51:58 -0700 Subject: [PATCH 29/33] fix intro --- text/trace/0136-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index 6fec1e25b..24210cef1 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -1,6 +1,6 @@ # Error Flagging with Status Codes -This proposal reduces the number of status codes to three, a new field to identify status codes set by application developers and operators, and proposes a canonical mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. +This proposal reduces the number of status codes to three, adds a new field to identify status codes set by application developers and operators, and adds a mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. Note: the term **end user** is defined as the application developers and operators of the system running opentelemetry. The term **instrumentation** refers to [instrumentation libraries](/~https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/glossary.md#instrumentation-library) for common code shared between different systems, such as web frameworks and database clients. 
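To make the relationship between the baseline mapping and end-user adjustments concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the proposal does not specify the mapping itself or the collector's configuration language, so the 5xx rule, the function names, the routes, and the `peer.service` value are hypothetical. The attribute keys `http.status_code` and `http.route` are used only as examples of semantic conventions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Attributes = Dict[str, object]
Status = Tuple[str, str]  # (status code, status source)

# A rule returns a status code when it applies to a span, or None when it does not.
OverrideRule = Callable[[Attributes], Optional[str]]


def default_status(attributes: Attributes) -> Status:
    """Hypothetical baseline mapping: HTTP 5xx becomes ERROR, all else stays UNSET."""
    http_status = attributes.get("http.status_code")
    if isinstance(http_status, int) and http_status >= 500:
        return ("ERROR", "INSTRUMENTATION")
    return ("UNSET", "INSTRUMENTATION")


def apply_overrides(attributes: Attributes, rules: List[OverrideRule]) -> Status:
    """Apply operator rules first; any match is recorded with source USER."""
    for rule in rules:
        code = rule(attributes)
        if code is not None:
            return (code, "USER")
    return default_status(attributes)


def missing_item_is_error(attributes: Attributes) -> Optional[str]:
    # Example of adding an error: a 404 on this (hypothetical) route is a real failure.
    if attributes.get("http.status_code") == 404 and attributes.get("http.route") == "/items/{id}":
        return "ERROR"
    return None


def flaky_dependency_is_ok(attributes: Attributes) -> Optional[str]:
    # Example of suppressing an error: 503s from this (hypothetical) peer are expected.
    if attributes.get("http.status_code") == 503 and attributes.get("peer.service") == "batch-cache":
        return "OK"
    return None
```

The design point is that the baseline mapping always produces a status with source `INSTRUMENTATION`, while anything an operator rule decides is stamped `USER`, so downstream tools can tell a generic default from a deliberate decision.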
From 7de13e9d2b931535a78b740f83cade5efa5338e7 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 15 Sep 2020 15:56:47 -0700 Subject: [PATCH 30/33] clarify the meaning of normal Co-authored-by: Justin Foote --- text/trace/0136-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index 24210cef1..e6f624c98 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -25,7 +25,7 @@ set, and an explicit OK status set by an end user. * `UNSET` is the default status code. * `ERROR` represents all error types. -* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for normal operations. +* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for operations that do not generate an error. ### `Status Source` From 1537cc39cd4a8de78e48d7a2d01321a4ecfdfe60 Mon Sep 17 00:00:00 2001 From: Ted Young Date: Tue, 15 Sep 2020 16:01:58 -0700 Subject: [PATCH 31/33] clarify status mapping --- text/trace/0136-error_flagging.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index e6f624c98..5a4efa74b 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -38,7 +38,9 @@ Analysis tools MAY disregard status codes, in favor of their own approach to err ### Status Mapping Schema -As part of the specification, OpenTelemetry provides a canonical mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. 
+As part of the specification, OpenTelemetry provides a mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. + +This will help ensure our instrumentation is consistent across languages, when errors relate to a cross-langauge concept, such as a database protocol. Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA. From 2b9e943dc904003d4e56e62fadd250b866d5205d Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 16 Sep 2020 12:54:48 -0700 Subject: [PATCH 32/33] Update text/trace/0136-error_flagging.md Co-authored-by: Armin Ruech --- text/trace/0136-error_flagging.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index 5a4efa74b..e35253162 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -2,7 +2,7 @@ This proposal reduces the number of status codes to three, adds a new field to identify status codes set by application developers and operators, and adds a mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. -Note: the term **end user** is defined as the application developers and operators of the system running opentelemetry. The term **instrumentation** refers to [instrumentation libraries](/~https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/glossary.md#instrumentation-library) for common code shared between different systems, such as web frameworks and database clients. +Note: The term **end user** in this document is defined as the application developers and operators of the system running OpenTelemetry. 
The term **instrumentation** refers to [instrumentation libraries](/~https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/glossary.md#instrumentation-library) for common code shared between different systems, such as web frameworks and database clients. ## Motivation From 76f0597b573ff128e49d58fc1ba444ab936caa5e Mon Sep 17 00:00:00 2001 From: Ted Young Date: Wed, 16 Sep 2020 13:01:12 -0700 Subject: [PATCH 33/33] final nits --- text/trace/0136-error_flagging.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/trace/0136-error_flagging.md b/text/trace/0136-error_flagging.md index e35253162..d55d928bf 100644 --- a/text/trace/0136-error_flagging.md +++ b/text/trace/0136-error_flagging.md @@ -40,7 +40,7 @@ Analysis tools MAY disregard status codes, in favor of their own approach to err As part of the specification, OpenTelemetry provides a mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. -This will help ensure our instrumentation is consistent across languages, when errors relate to a cross-langauge concept, such as a database protocol. +Including the correct status codes as part of our semantic conventions will help ensure our instrumentation is consistent when errors relate to a cross-language concept, such as a database protocol. Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA. @@ -56,7 +56,7 @@ Note that these convenience methods simply wire together multiple API calls. The ## Internal details -This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status codes enums from the current OTLP protocol, and the rewriting of the small number of instrumentation that were making use of them. 
+This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status code enums from the current OTLP protocol, and the rewriting of the small number of instrumentation that were making use of them. ## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS?