-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Span.Status from API #706
Comments
There have been many PRs related to errors so would help if you could summarize the current state of what we have if |
Other affected areas would be:
|
@anuraaga There is no generic way to detect if a span has failed. The status is also not really suitable for this, except if you also consider things such as HTTP 404 as errors. Still, I think a generic Exceptions are not the same as errors. Often a span with at least one exception will be failed, but all of them could be handled too. Also, errors can occur without exceptions. |
Presence of |
The simplest replacement would indeed be generic |
Could we create a separate issue for discussing how to indicate errors generically, once status is removed? Or maybe #599 is the place to discuss this. |
I am afraid we cannot just remove status without offering any way to signal that this span signifies an error. The simplest way forward, assuming that we have #697 in its current form, is the following:
|
During todays Errors WG meeting it was agreed:
As a separate step to start discussion about adding severity notion to Span reusing log severity concept |
A discussion about severity on Span events is happening in #698 |
Just a top level boolean failed field should be enough to identify the Span as having an error should be sufficient as a quick check. If this is true then the span attributes can be checked for further info. The other change that needs to be made is to add a gRPC semantic convention to report gRPC status code since this will now not be in status. |
Collapsing the span status field into a single boolean seems problematic to me. The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition. This implies some configuration API that something uses to set the desired error conditions, and another API that instrumentation must use to configure itself. The instrumentation should apply as much information it can to the spans and metrics it creates; it's easier to reduce granularity of error conditions downstream than it is to increase it. |
Rather than collapsing, I think we are exploding it: Instead of a single canonical status we will have:
Collapsing every conceivable error in a fixed StatusCode was already quite bad. The error boolean is obviously not gonna cut it alone: It's just meant as a supplement, for easier detection, it does not really add that much information in itself. (Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?) |
No :) One of the benefits of having simple boolean field is ease of reading it. Without the need to parse attributes. We can, should and will add semantic conventions/more fields with more information about the error state of the given span. But without sacrificing easy and fast way for collector/backend/exporter to get very meaningful information from the span. |
Isn't that independent? If the error field/attribute is non-empty, that is equivalent to |
Why do you want that name to be in What benefit will bring such indirect reference via |
It will bring no benefit except for conserving a field name. Using a separate attribute like |
Maybe now I understand where we are talking past each other: I think that we already have different names for error detail attributes, and should continue to have and add even more different names. For example |
A! Now I get it! :) I think this is overcomplicating, but I will not object it. If nobody comment on this question here, I will raise this question during Error WG this Thursday. |
I'm happy to find a simpler way. But some way to classify errors even without knowing a particular semantic convention the span falls into would be good (although IMHO not necessarily required for GA). Ideally without forcing instrumentations to translate their natural error conditions into some enum. 😃 |
Still, my main concern is this:
Making the instrumentation responsible for knowing what constitutes an error feels like a big can of worms to me. I'd much prefer to see instrumentation add non-opinionated data to spans and metrics, and for views, exporters, and back-ends to apply their own logic to interpret that data. |
Hmm, you are right that this |
@justinfoote Do you oppose adding |
I oppose both. 😄 I like I also dislike the |
I don't think this is the direction we plan to go with our error. It simply should be a hint from the instrumentation when it thinks that something is likely an error. I think |
At the spec SIG on Tuesday I mentioned three components of error handling that need to be addressed.
We have discussed at length that there is not universal agreement on what constitutes an error for an application. I would propose that we will never reach agreement here. However, I think we can agree that if an error does happen, you want to know about it, and there should be a way to indicate this on a span. Annotating whether or not as span has had an error is a requirement, even if we can't agree on what an error is. This is what I'm calling error identification. For the second part of this issue, we do need a way for users to decide that something that has been identified as an error, should not be considered an error for them. This is what I'm referring to as error suppression. We should give some thought into how we can facilitate this requirement from the SDK side, as well as what approaches could be used for backends. |
Matt, I really like this framing, and I agree that these three components of error handling need to be addressed. The way I read this debate is something like this:
I'm arguing that they belong downstream. Also, I believe that error identification and error suppression could be defined as a single concern. There's no reason to identify errors upstream (that is: in instrumentation) only to suppress those errors downstream (possibly in an exporter or in a vendor's back-end). |
I may be talking myself out of the canonical status code. Perhaps the best design is for instrumentation to add full details of exceptions and/or status codes, without passing judgement about whether that data qualifies as an error. Any interpretation of that data (that is: flagging errors with a boolean or attempting to categorize status into an enum) belongs downstream. Exactly where downstream is something to discuss, I believe. For metrics, I think this belongs in a MetricView sort of object. For spans, I'm not sure. In a SpanProcessor? In an exporter? |
The main complication, and the core reason that error suppression needs to exist is due to auto-instrumentation. Without auto-instrumentation, an application developer would not set Instrumentation is where errors are going to encountered and it will to make the best decision it can with the information it has at the time. Often this will be catch an exception, record it, then re-throw. I don't think you can push this downstream because the instrumentation is where you will have a handle on the error and a chance to process it. I would also posit that in most cases, auto-instrumentation makes the right decision for errors. There are edge cases, and matters of interpretation where a user may want override that decision. Given that auto-instrumentation is the likely generator of false positives ( Another possibility, is that an SDK could choose to implement something such as per instrumentation library exception handlers. Custom code be registered that could mark an exception to be recorded, or ignored on a per component basis. The per-component exception handlers might be a way to combine identification and suppression into a single concern, but I'm not sure it's completely necessary, and a backend solution might suffice. I'd be interested in any other proposals on how this can handled. |
I'm going to weigh in that I don't like the flavor of the current |
Based on WG meeting, July 23rd, 2020, some changes are in order. From the backward compatibility point of view it is much easier to add things later, than to remove them. There is no consensus on boolean Based on the above we are going to remove |
@iNikem talked about this during the bug triage session, sounds like a PR to remove the span.status from api would be desirable to finish this issue. cc @open-telemetry/technical-committee from the conversations, a proposal to replace will be a follow up action separate from this issue. |
@andrewhsu I will try to prepare such PR next week. I have a concern about wire protocols. OTLP is declared stable and it has this field. OTLP->Jaeger protocol translation uses this field as well. I will reach to @tigrannajaryan and @bogdandrutu regarding these issues. |
We need to decide this by the end of the week. Next steps:
|
See proposal in https://docs.google.com/document/d/1HgUI69rBridFzCXxXuTjQrkG6jb-YcFQnZjPcyBcK1U/edit made during error WG meeting for a replacement:
Note that there was no clear unanimous agreement on this as far as I could see. However, I'd agree that this is an improvement over the current status. |
FYI this OTEP addresses this issue: open-telemetry/oteps#136 |
This issue should now be closed in favor of #965. |
Closing this issue in favor of #965. |
It was proposed in open-telemetry/oteps#123 (comment) and subsequent discussion that we should drop
Span.Status
from the spec. If there is an agreement on that, I can create a PR.The text was updated successfully, but these errors were encountered: