-
Notifications
You must be signed in to change notification settings - Fork 166
Add support for more expressive span status API #123
Conversation
|
text/0123-improve-span-status-api.md
Outdated
|
||
**Closed vs. open-ended status descriptions** - We could make status represented by a fixed set of options such as the current gRPC codes or any larger/different set. It likely is simpler for tools consuming the data to only operate on fixed options and it is pleasant to imagine that all (or most) status results will be reasonably categorized into the fixed list. However I do not expect most users to take up the responsibility of mapping their status results if OpenTelemetry/back-end tools don't provide a clear representation. Instead these users will not set status or choose options that are generic and inconsistent rendering the data useless to everyone. For a fixed list to work I believe we either have to curate a very large list of options or accept that significant portions of the developer audience will not find the status representation useful. | ||
|
||
**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Markdown syntax problem:
**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope. | |
**API using attribute semantic conventions** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope. |
(you see now why line breaks would be good? 😉 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I love reading paragraphs hunting for the changed space character! ... agreed and fixed 👍 )
text/0123-improve-span-status-api.md
Outdated
|
||
**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope. | ||
|
||
**API using Span events** - Most of the rationale for attribute semantic conventions also applies here, events are effectively another key-value store mechanism. The timestamp that is attached to an event appears to hold little value as status is probably produced at the same approximate time the span end timestamp is recorded. Similar to attribute conventions it sounds like there is precedent for storing some errors as events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the rationale for attribute semantic conventions also applies here
I see this as completely orthogonal. If we use semantic conventions, they could prescribe span attributes or event attributes or both.
If we have additional API functions, they could be defined on the span or some event (builder) object, and even if defined on the span they could create event-level or span-level data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are in agreement? Perhaps my second paragraph made it sound like events aren't semantic conventions so I've clarified in the title that events are also conventions.
I've been limiting the scope of the discussion to the tracing API so here I was attempting to distinguish whether the user provides the information via strongly typed API or provides it via loosely typed conventions. I agree that a strongly typed API could store the data internally as attributes/events and doing so probably (but not necessarily) has some implications on the SDK API for reading the data back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some text in the implementation section of the coming revision to explicitly call out the ability to store the data into attributes or events regardless of the tracing API that provides the data to us.
I would suggest an API called Span.SetStatus(...) that takes all the arguments above, and optionally overloads or default parameters that make common calls easier. For example in C#: | ||
|
||
````C# | ||
void SetStatus(Exception spanException, string statusType = "LanguageException", bool? successHint = null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capturing an exception with traceback and storing it until export is potentially extremely memory heavy (e.g. in Python the exception has a reference to a traceback, and the traceback has references to all local variables on the stack, preventing them from being GCd).
So the solution to this would probably be to call into some exporter-specific code already when calling this API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I added a comment about this issue and potential solution to the internal details section.
text/0123-improve-span-status-api.md
Outdated
|
||
**Closed vs. open-ended status descriptions** - We could make status represented by a fixed set of options such as the current gRPC codes or any larger/different set. It likely is simpler for tools consuming the data to only operate on fixed options and it is pleasant to imagine that all (or most) status results will be reasonably categorized into the fixed list. However I do not expect most users to take up the responsibility of mapping their status results if OpenTelemetry/back-end tools don't provide a clear representation. Instead these users will not set status or choose options that are generic and inconsistent rendering the data useless to everyone. For a fixed list to work I believe we either have to curate a very large list of options or accept that significant portions of the developer audience will not find the status representation useful. | ||
|
||
**API using attribute semantic conventions ** - It is also possible to do this via semantic conventions on attributes and although I think the strongly typed API is preferable I don't have that strong of an opinion. Semantic conventions are likely to have higher performance overhead, higher risk of error in key names and are less discoverable/refactorable in IDEs. The advantages are that there is some past precedent for doing this specific to http codes and new conventions can be added easily. If we go semantic conventions it does imply that Exception becomes a type that can be directly passed as an argument to SetAttribute(). Requiring the user to destructure the exception into a list of key value pairs would be overly onerous and error-prone for a common usage scenario. If desired the SDK or exporter could destructure it, but that can be determined independently from API design and I'd like to keep it out of scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you can provide a nicely typed API and still use semantic conventions. E.g. the API could just "serialize" whatever you give it into SetAttribute / CreateEvent calls. The difference is that the API between the exporter and the API is then "weakly typed".
To point at another project, the idea of "typed spans", i.e. utilities to have strongly typed APIs for semantic conventions, is at least as old as OpenTelemetry. E.g. see "Typed Spans" in the PR description of open-telemetry/opentelemetry-specification#571
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an additional advantage, the exception/error APIs could be provided as a package separate from the core API, (you could provide a function com.opentelemetry.contrib.errorutils.ErrorUtils.captureError(Span, other params)
that sets any attributes/events on the span).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've added this suggestion to the list of alternatives.
2. StatusData - A discriminated union of: | ||
- An integer, a string, or a tuple of integer and string. These options can be used for: | ||
- Enumerated status codes: For example an http status code could be represented as 404, "Not Found", or both. In the case of common status codes OpenTelemetry SDK or backend could optionally assist in filling out the remainder of a partially specified enumeration value. For enumerations that aren't well-known the community of users is responsible for determining any conventions. | ||
- Free-form error messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would a named error with a separate message be captured? I would treat these separately and not combine them in one free-form string. This would allow for searching, filtering and grouping by the error name alone but still preserve the message. A backend could, for example, display the number of all EntityNotFound
errors in one place and filter for them in another place where users can inspect the messages bearing the entities that were not found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current proposal wouldn't be able to capture this, but we could change the proposal to allow it. I've added some text to the trade-offs and mitigations section mentioning the idea. I'd be fine adding it, not sure if we'll see additional opinions about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'd add it right from the start. I'd expect users want to capture the information in a way that most closely resembles how it's reported by their respective language or framework, rather than having to combine both in one string that potentially has to be parsed/decomposed for analysis later again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original intent wasn't that users would combine strings, but that they would provide either an enumerated status code OR a message, not both. If we anticipate it is common to provide both I agree I'd rather provide an API that easily facilitates that vs. an API that encourages users to create private conventions to merge and split text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect it to be common enough to support it in the API. Often it will indeed be a type of error that can be enumerated but will not necessarily have an integer representation like we have in HTTP but rather a name of the error (like "Entity not found" or an exception name) with a (detailed) description (e.g. "The entity with ID '7583' was not found in the table 'products'" or an exception message).
Therefore I'm suggesting to have two strings called, for example, name, brief or kind and a separate string called details, message or description.
Happy to hear others' opinions on that.
I've updated the text based on the feedback, thanks for that! If you were looking for different/more changes vs. what I did please just let me know : ) |
2. StatusData - A discriminated union of: | ||
- An integer, a string, or a tuple of integer and string. These options can be used for: | ||
- Enumerated status codes: For example an http status code could be represented as 404, "Not Found", or both. In the case of common status codes OpenTelemetry SDK or backend could optionally assist in filling out the remainder of a partially specified enumeration value. For enumerations that aren't well-known the community of users is responsible for determining any conventions. | ||
- Free-form error messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'd add it right from the start. I'd expect users want to capture the information in a way that most closely resembles how it's reported by their respective language or framework, rather than having to combine both in one string that potentially has to be parsed/decomposed for analysis later again.
data hard to work with because it can no longer be treated uniformly. | ||
- **Loss of fidelity** - Mapping from a status representation that | ||
distinguishes a large number of different results to one that only | ||
distinguishes a few is inherently lossy. This prevents users from isolating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of the span status code is slightly different than the summary in this document. I see the span status code as a categorization of the status meant to augment our understanding of the span's condition. I would not characterize the span status code as a lossy transformation: I'd describe it as additive.
As an example, suppose we have an HTTP server and it encounters some kind of condition. In the HTTP domain we have well-defined status codes, which can be encoded in a non-lossy way using the http.status_code
semantic convention.
I believe one of the sources of confusion here is that in order to categorize an error into one of the canonical span status codes, the author has to take a position, and that position will not be unique. Depending on positionality, different users will categorize the various HTTP status codes differently. Accordingly, I would define the span status code as the span actor's perspective on the activity it performed, how it would categorize its own status. Other actors in a system may witness consequential failures (e.g., a client receiving a HTTP 500 code) and handle them in a way that disagrees with the self-assessment used by a particular span. E.g., a server may decide that a 500 status is maps to Internal errors, and the client that receives a 500 may decide to retry, thus suppressing the 500. That 500 could become a span event with http.status_code
and http.status_text
attributes, and still the client could have a successful retry and set the span status to OK.
Summarizing, I don't see a problem with Status code as it stands, but think we need to recognize that the Status code is not a unique position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thoughtful feedback!
Summarizing, I don't see a problem with Status code as it stands
I'm not sure how to relate your opinion on the status code concept to the OTEP overall. How do you feel about the goals outlined in the goals section of the OTEP:
a) They aren't useful and don't need to be pursued?
b) They are useful but the existing Span API already satisfies them?
c) They are useful, and the existing Span API doesn't satisfy them. However we shouldn't frame the proposal as a change to the 'Status' API because that API is already fulfilling a different important purpose
d) Something else?
I believe one of the sources of confusion here is that in order to categorize an error into one of the canonical span status codes, the author has to take a position, and that position will not be unique
That sounds like the problem I named Inconsistency in the text, yes? "If the mapping from native representation to
OpenTelemetry representation isn't well-defined then API users or SDK implementations are unlikely to choose the same mapping."
FYI in the Spec SIG call today I requested that Google provide a design document on the canonical status codes, as there was such a document written when this practice was introduced. @mtwo will follow up. |
@jmacd @bogdandrutu I read the doc that you directed me to and it describes high level concepts like how to return an error or how to check for an error, but doesn't actually include the status code proto. I'm guessing that we want both? Let me know and I'll reach out to Sanjay and will CC both of you. |
Maybe one alternative to this OTEP would be to:
|
@mtwo Yes please if you could ask Sanjay to locate a document that lays out the status codes and justifies them, probably this is what Bogdan and I remember reading many years ago. I think that would help this discussion. |
easier. For example in C#: | ||
|
||
````C# | ||
void SetStatus(Exception spanException, string statusType = "LanguageException", bool? successHint = null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple exceptions can occur during the lifetime of a span and experiencing an exception may or may not result in a non-ok status depending on the circumstances. How can multiple exceptions can handled using this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple functions returning error codes may be called during the lifetime of a span and an error return may result in a non-ok status or not.
I think it's up to the user/instrumentation to decide which exception was the semantically important one. Most likely the one that went uncaugth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @Oberon00. I also attempted to capture this concept in the scoping section above but perhaps the wording wasn't good?
The operation represented by a span might be composed of sub-operations (for
example a function that calls into other lower level functions) and each of
those sub-operations might have a notion of status or error. If those
sub-operations aren't represented with a child span then it is out of scope
how their status is captured.
Feedback suggestions Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>
Thanks for the suggestion! The first two parts sound fine to me but there isn't enough information for me to stake an opinion on the 3rd bullet. I did want to keep the spotlight on the API/data structure that developer directly observes when providing the status and I wasn't sure if you were refering to one of these options or something else?
IMO 1 or 4 could look pretty reasonable, but 2 requires the developer to write too much code. |
I like a variant of option 1: Instead of making it a primitive, we allow configuring a TracerProvider with AttributeSerializers that convert a non-primitive attribute value into one or more primitive attributes. |
@noahfalk in your comment #123 (comment) Note that the OTel-Go API already has a similar API to your (1) and (4). We're just waiting for semantic conventions to be finished. /~https://github.com/open-telemetry/opentelemetry-go/blob/9e99b441d2cb63bf7359b22860b47d8a8bbbb87d/api/trace/api.go#L126 @mtwo asked for some information from Google and we got a little bit of info from Sanjay Ghemawat. Here's a summary: Google's Status Code were first developed in 2006 for an abstract file-system interface. They were adopted for use in RPC systems several years later. In 2012, these codes were recommended for all APIs. Over the 14 years, only one new code was introduced. UNAUTHORIZED was split apart from PERMISSION_DENIED. The primary motivation for the recommendation to use canonical codes in all APIs resulted from bugs caused by translation of error codes in multi-layer systems. Google still has other error code spaces in use, but the author of an alternative error code space is required to provide a translation into the canonical error space. All of this leads me to think that we should remove Span Status as a special field and use semantic conventions for error codes. To me, asking users to specify a canonical status code when they are using another error code space raises the original problem that Google meant to solve: "bugs caused by translation of error codes". We should not be asking users to translate error code spaces. If we can ignore backwards compatibility for the time being. As a first step, I would specify that users should simply record their status in its natural code space, e.g., "hresult.status_code", "posix.status_code", "http.status_code". These conventional code spaces are standardized, so a vendor can automatically translate into canonical status codes. If the user has no natural error code space of their own, they may consider using Google's canonical error space. The problem with all of this, IMO, is that Google has done nowhere near enough to promote their error code space as a generic solution to propagating errors in a multi-layer system. When gRPC was released, public documentation emerged that talked about the meaning of the codes and how they can be used with gRPC, but there is still no public design document. We need a design document that goes into great detail about the fine distinctions between the codes. For example, how and why is UNAUTHORIZED different from PERMISSION_DENIED? Note that the first Google result for "grpc status codes" is this, a page from the Google Maps Booking API documentation. This is not a design document, though it goes into depth on FAILED_PRECONDITION vs. ABORTED vs. UNAVAILABLE and INVALID_ARGUMENT vs. FAILED_PRECONDITION vs. OUT_OF_RANGE. @mtwo if I may make a suggestion, this feels like a responsibility for the gRPC team. Is there a gRPC product manager who could dig up an old internal design document, the one that was used in 2012 to justify the universal use of these codes, sanitize it, and publish externally? |
I would be in favor of removing the dedicated status code field from Spans. For our use cases in Dynatrace so far, we haven't used it so far and instead did what you suggest in your comment, i.e., adding dedicated semantic conventions for specific kinds of errors. |
+1 to remove the explicit span status. We didn't have it in OpenTracing and I never run into a user ask for adding it. Some users at Uber did ask for more consistent application of the |
I also care about enumerated status codes (including ones like Http, HRESULT or POSIX) and error messages. I was using Exception as an example because @Oberon00 had put some focus on it in the comment I was responding to ("The most common one I can think of would be the exception semantic conventions..."). Specific to .NET I expect Exception to be the most common case but I also anticipate enumerated error codes and error messages to show up in some scenarios.
Probably implied from the OTEP that I also agree : ) To add the .NET evidence to the others - Activity, the API .NET has used for many years for distributed tracing has nothing like the current Grpc code specification of Status. To the best of my knowledge no user or telemetry tool has ever told us it would be valuable to them prior to its appearance in OpenTelemetry. |
We have "mostly approved" open-telemetry/opentelemetry-specification#697 for recording exceptions on Spans. Among other things, it was decided to decouple span status from exceptions. This means that mentioning error messages or exception objects and having them in API proposed by this PR is a contradiction to that decision. We already have technology/runtime specific semantic conventions for various status/error codes, such as I am cautiously sceptical that operations statuses from very different domains can be mapped to gRPC status codes. I was surprised in the past when discovered, that we have status codes copied from gRPC. Taking into account all of the above I will join the proposal to remove |
No objection to closing the PR and thanks for everyone's consideration and discussion : ) I'd also be glad if the generic term 'status' is no longer defined as a grpc code. I do want to call out that open-telemetry/opentelemetry-specification#697 only partially satisfied the scope of proposal here for anyone looking back on this in the future. Remaining open issues are how to capture:
I'm fine with the premise that solutions to these issues could be designed later, potentially post-GA, and potentially via semantic conventions only. |
@noahfalk will you close this PR? Or some action is needed before that could happen? |
I'm not sure where the discussion about Span Status will go, but there's interest in the Metrics specification of having a nice way to label span-like duration measurements with a status, and I think it would be nice if we could use the natural status code, not the canonical status code. See open-telemetry/opentelemetry-specification#657 (comment) |
A while back we discussed the status API and I was concerned the current API isn't expressive enough for many kinds of errors. I agreed to suggest an alternative, this OTEP. I know there has been a lot of past discussion and divergent opinion on this topic so I am not convinced there is a plan that everyone will get behind but I hope this at least contributes to moving the discussion forward.
Also I want to give a heads up that I may not have enough time to have extended discussions or revisions that something like this may deserve. With luck other contributors will also be motivated to push for improvements in this area too : )