Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Instrumentation layers and suppressing duplicates #172

Closed
wants to merge 12 commits into from
177 changes: 177 additions & 0 deletions text/trace/0189-instrumentation-layers-suppression.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
# Instrumentation Layers and Suppression

This document describes approach for instrumentation layers, suppressing duplicate layers and unambiguously enriching spans.

## Motivation

- Provide clarity for instrumentation layers: e.g. DB calls on top of REST API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you define the term "instrumentation layer" in this document? I'm not sure if it refers to a single span, or to the whole span tree that's created by a single library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the questions about a definition of layers: there are discussing about actually modelling HTTP request with multiple spans (for e. g. retries or forwards). Should we aim to cover those cases with a suppression proposal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, for now, I assume we have one layer of HTTP spans (if we'll have more for logical + physical, I'd treat them as two different layers) - they can't carry the same information when both are enabled.

HTTP instrumentation would need configuration to emit 1) logical or 2) physical or 3) both to avoid redundancy and doubling of costs. I don't see a way to achieve it with suppression

- Give mechanism to suppress instrumentation layers for the same convention: e.g. multiple instrumented HTTP clients using each other.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there cases of multiple instrumentation HTTP clients using each other? That sounds like an odd configuration of clients. I wonder if when combining these clients, one can disable instrumentation of the other?

The DB call on top fo a REST call -- you propose to suppress the REST instrumentation, I take it? (Could the DB be configured with an un-instrumented REST client, instead?)

I wonder if it would be possible to replace InstrumentationType with InstrumentationLibrary.Name. Could you accomplish the same functional behavior, if for all spans with a non-remote parent you knew the InstrumentationLibrary of the nearest not-suppressed ancestor? It might make the proposal both simpler and more general: InstrumentationLibrary is already defined; it means you don't need an expandable enum.

Copy link
Contributor Author

@lmolkova lmolkova Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there cases of multiple instrumentation HTTP clients using each other?

  1. One case is manual/native + auto: Client libraries: Manual/programmatic and auto instrumentation duplication problem  opentelemetry-specification#1767
  2. Also, HTTP clients test to use each other (python, js, java all have one or another way to suppress HTTP instrumentation)
  • in some cases, they are pluggable, i.e. higher level don't really know if the lower level is instrumented - this is more popular in SERVER spans
  • in some cases, I guess higher-level clients instrumentation is not necessary

The DB call on top fo a REST call -- you propose to suppress the REST instrumentation, I take it? (Could the DB be configured with an un-instrumented REST client, instead?)

I suggest that

  1. DB and REST are two layers, i.e. they are both spans (and not DB attributes stamped on HTTP span)
  2. Depending on the configuration we can either
    • suppress nested HTTP (to reduce noise if users/backends don't want it) - note in this case DB span context will be propagated over HTTP
    • not suppress HTTP and have full picture

InstrumentationType with InstrumentationLibrary.Name

I'd love that, however InstrumentationLibrary.Name could be GoogleHttpClient or ApacheHttpClient while they both implement HTTP semantic conventions. The alternative could be to set InstrumentationType on the tracer resource in addition to InstrumentationLibrary.Name and version. I'd also put the semantic convention version, it follows, on the resource as well.

Copy link
Contributor Author

@lmolkova lmolkova Sep 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InstrumentationType with InstrumentationLibrary.Name

Experimented with this idea: instead of SpanBuilder.setType, I added 'type' parameter to library instrumentation info when tracer is being built:

  • it makes tracers type-aware, i.e. tracers should not emit spans of different types. Which makes sense, but a new thing.
  • it's even more subtle than the span creation option. App and lib devs would have a hard time paying attention and will end up with multi-type tracers (i.e. no suppression).

I updated the PR lmolkova/opentelemetry-java#1.

Will bring it up at the Tue Instrumentation SIG meeting.

I also entertained the idea of having a single tracer per type and returning noop tracers if one with the same type is registered - but it depends on the tracer creation time and is not stable.

- Give mechanism to enrich specific spans unambiguously: e.g. HTTP server span with routing information

## Explanation

### Spec changes proposal

- Semantic Conventions: Each span MUST follow at most one convention, specific to the call it describes.
lmolkova marked this conversation as resolved.
Show resolved Hide resolved
- Trace API: Add `SpanKey` API that
lmolkova marked this conversation as resolved.
Show resolved Hide resolved
- checks if similar span already exists on the context (e.g. `SpanKey.HTTP_CLIENT.exists(context)`)
- gets span following specific convention from the context (e.g. `SpanKey.HTTP_CLIENT.fromContextOrNull(context)`).
- Semantic Conventions: instrumentation MUST back off if span of same kind and following same contention is already exists on the context by using `SpanKey` API.
- Semantic Conventions: Client libraries instrumentation MUST make context current to enable correlation with underlying layers of instrumentation
- OTel SDK SHOULD allow suppression strategy configuration
- suppress nested by kind (e.g. only one CLIENT allowed)
- suppress nested by kind + convention it follows (only one HTTP CLIENT allowed, but outer DB -> nested HTTP is ok)
- suppress none

#### SpanKey API

SpanKey allows to

- read/write span to context
- check if specific span is on the context
- encapsulate different suppression strategies
- define known SpanKey, shared between instrumentations (static singletons): HTTP, RPC, DB, Messaging.

#### Suppression Example

- HTTP Client 1:
- Check if HTTP CLIENT span exists on the context: `SpanKey.HTTP_CLIENT.exists(ctx)`
- No HTTP client span on the context:
- start span
- store span in context: `SpanKey.HTTP_CLIENT.storeInContext(ctx, span)`
- Make `ctx` current
- Http Client 2:
- Checks if HTTP CLIENT span is already on the context: `SpanKey.HTTP_CLIENT.exists(Context.current())`
- HTTP client span is already there: do not instrument

Suppression logic is configurable and encapsulated in `SpanKey` - specific library instrumentation should not depend on configuration, e.g.:

- suppressing by kind only - context key does not distinguish conventions within kind
- `SpanKey.HTTP_CLIENT.exists` returns true if any CLIENT span is on the context
- `SpanKey.HTTP_CLIENT.fromContextOrNull` returns CLIENT span
- `SpanKey.HTTP_CLIENT.storeInContext` stores span in CLIENT span context key
- suppressing by kind + convention - context key is per convention and kind
- `SpanKey.HTTP_CLIENT.exists` returns true if HTTP CLIENT span is on the context
- `SpanKey.HTTP_CLIENT.fromContextOrNull` returns HTTP CLIENT span
- `SpanKey.HTTP_CLIENT.storeInContext` stores span in CLIENT + convention span context key
- suppressing none
- `SpanKey.HTTP_CLIENT.exists` returns false ignoring context
- `SpanKey.HTTP_CLIENT.fromContextOrNull` returns innermost HTTP CLIENT span on the context
- `SpanKey.HTTP_CLIENT.storeInContext` stores span in CLIENT + convention span context key

#### Enrichment Example

- HTTP SERVER 1 - middleware/servlet
- HTTP INTERNAL 1 - controller
- User code that wants to add event/attribute/status to HTTP SERVER span
- some internal instrumentation logic that sets route info status and exception ot anything else available after controller span starts.

Assuming user code uses current context, it will get controller INTERNAL span. In order to enrich HTTP SERVER span, users may use `SpanKey.HTTP_SERVER.fromContextOrNull`.

## Internal details

Client libraries frequently use common protocols (HTTP, gRPC, DB drivers) to perform RPC calls, which are usually instrumented by OpenTelemetry.
At the same time, client library is rarely a thin client and may need its own instrumentation to

- connect traces to application code
- provide extra context:
- duration of composite operations
- overall result of all operation
- any extra library-specific information not available on transport call span

Both, client library 'logical' and transport 'physical' spans are useful. They also rarely can be combined together because they have 1:many relationship.

So instrumentations form *layers*, where each layer follows specific convention or no convention at all. Spans that are not convention-specific (generic manual spans, INTERNAL spans) are never suppressed.

*Example*:

- HTTP SERVER span
- DB CLIENT call - 1
- HTTP CLIENT call - 1
- DNS CLIENT
- TLS CLIENT
- HTTP CLIENT call - 2

There are two HTTP client spans under DB call, they are children of DB client spans. DB spans follow DB semantics only, HTTP spans similarly only follow HTTP semantics. If there are other layers of instrumentation (TLS) - it happens under HTTP client spans.

### Duplication problem

Duplication is a common issue in auto-instrumentation:

- e.g. HTTP clients frequently are built on top of other HTTP clients, making multiple layers of HTTP spans
- Libraries may decide to add native instrumentation for common protocols like HTTP or gRPC:
- to support legacy correlation protocols
- to make better decisions failures (e.g. 404, 409)
- give better library-specific context
- support users that can't or don't want to use auto-instrumentation

So what happens in reality without attempts to suppress duplicates:

- HTTP SERVER span (middleware)
- HTTP SERVER span (servlet)
- Controller INTERNAL span
- HTTP CLIENT call - 1 (Google HTTP client)
- HTTP CLIENT call - 1 (Apache HTTP client)

#### Proposed solution

Disallow multiple layers of the same instrumentation, i.e. above picture translates into:

- HTTP SERVER span (middleware)
- Controller INTERNAL span
- HTTP CLIENT call - 1 (Google HTTP client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I ask what data is lost in this process of suppression? Is the intent that all the data that would be part of the spans that are now being surpressed, is that being discard, aggregated into the parent span?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assume that two different HTTP clients send almost identical spans (if they follow OTel HTTP spec). The duration might be slightly different, but essentially they report the same thing. So we'll keep the outer span and fully disregard the inner span.


To do so, instrumentation:

- checks if span with same kind + convention is registered on context already
- yes: backs off, never starting a span
- no: starts a span and registers it on the context

Registration is done by writing a span on the context under the key. For this to work between different instrumentations (native and auto), the API to access spans must be in Trace API.

Same mechanism can be used by users/instrumentations to enrich spans, e.g. add route info to HTTP server span (current span is ambiguous)

### Configuration

Suppression strategy should be configurable:

- backends don't always support nested CLIENT spans (extra hints needed for Application Map to show outbound connection)
- users may prefer to reduce verbosity and costs by suppressing spans of same kind

So following strategies should be supported:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'd really like to know is what information is going to be lost because of this?

For example, during an incident the suppressed data could help provide useful clues into the problematic calls to either downstreams or internal calls that would have made up part of the span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, e.g. you have DB running over HTTP.

  • CLIENT-only suppression will cause all inner CLIENT spans to be suppressed. Our conventions make each CLIENT span have enough information to identify which endpoint it calls into. DB spans have endpoint info. None of the HTTP details would be recorded (no status codes, tries, etc), only the overall result will be set on DB span. You'd choose this configuration if you only care about the logical results and don't care much about transport or if you're saving costs at all means. If you do HTTP requests without outer DB client call, they will be recorded as usual.
  • CLIENT + type suppression would keep DB and nested HTTP spans. If will only suppress duplicate HTTP client spans if multiple instrumentations work together. You'd choose this configuration if you want more details.

I understand your concern about suppression in general, what I hear from users some of them may want to sacrifice a bit of verbosity to save money. And kind + type suppression only suppresses duplicate data for those who can afford it.


- suppress all nested of same kind
- suppress all nested of same kind + convention (default?)
- suppress none (mostly for debugging instrumentation code and internal observability)

### Implementation

Here's [Instrumentation API in Java implementation](/~https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/SpanKey.java) with suppression by convention.

## Trade-offs and mitigations

Trace API change is needed to support native library instrumentations - taking dependency on unstable experimental instrumentation API (or common contrib code) is not a good option. Instrumentation API is a good temporary place until we can put it in Trace API, native instrumentation can use reflection to access `SpanKey` in instrumentation API.

## Prior art and alternatives

- Terminal context: suppressing anything below
- Exposing spans stack and allowing to walk it accessing span properties
- Suppress all nested spans of same kind
- Make logical calls INTERNAL

Discussions:

- [Client library + auto instrumentation](/~https://github.com/open-telemetry/opentelemetry-specification/issues/1767)
- [Prevent duplicate telemetry when using both library and auto instrumentation](/~https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/903)
- [Generic mechanism for preventing multiple Server/Client spans](/~https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/465)
- [Proposal for modeling nested CLIENT instrumentation](/~https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/1822)
- [SpanKind with layers](/~https://github.com/open-telemetry/opentelemetry-specification/issues/526)
- [Should instrumentations be able to interact with or know about other instrumentations](/~https://github.com/open-telemetry/opentelemetry-python-contrib/issues/369)
- [Server instrumentations should look for parent spans in current context before extracting context from carriers](/~https://github.com/open-telemetry/opentelemetry-python-contrib/issues/445)
- [CLIENT spans should update their parent span's kind to INTERNAL](/~https://github.com/open-telemetry/opentelemetry-python-contrib/issues/456)

## Open questions

- Backends need hint to separate logical CLIENT spans from physical ones
- Good default (suppress by kind or kind + convention)
- Should we have configuration option to never suppress anything