
gateway: add opentelemetry #6110

Merged

merged 1 commit into dev from ev/opentelemetry on Dec 14, 2023
Conversation

ElysaSrc
Member

This PR adds a basic OpenTelemetry implementation to the OSRD stack.

  • Basic opentelemetry in the gateway
    • Support for Jaeger
    • Support for Datadog
    • Tracing
  • Add a Jaeger "all-in-one" service to the local docker compose

[image]
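The PR's diff touches gateway/src/config.rs, so the backend selection above (Jaeger, Datadog) is presumably driven by the gateway configuration. Purely for illustration, and not the PR's actual code, a hypothetical serde-style sketch of such a config section could look like this (all names and fields invented):

use serde::Deserialize;

// Hypothetical telemetry section of the gateway config; variant and field
// names are invented for illustration and do not come from the PR.
#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum TelemetryConfig {
    // Tracing disabled.
    None,
    // Send spans to a Jaeger endpoint.
    Jaeger { endpoint: String, service_name: String },
    // Send spans to a Datadog agent.
    Datadog { endpoint: String, service_name: String },
}

In a TOML config such as docker/gateway.dev.simple.toml, this would correspond to something like a [telemetry] table with a type key selecting the backend.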

@ElysaSrc ElysaSrc requested a review from multun December 13, 2023 00:01
@ElysaSrc ElysaSrc requested a review from a team as a code owner December 13, 2023 00:01

codecov bot commented Dec 13, 2023

Codecov Report

Attention: 109 lines in your changes are missing coverage. Please review.

Comparison is base (3afe9e4) 26.54% compared to head (b6674d1) 26.56%.
Report is 8 commits behind head on dev.

Files                           Patch %   Lines
gateway/actix_proxy/src/lib.rs    0.00%   77 Missing ⚠️
gateway/src/config.rs             0.00%   31 Missing ⚠️
gateway/src/config_parser.rs      0.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                dev    #6110      +/-   ##
============================================
+ Coverage     26.54%   26.56%   +0.01%     
  Complexity     2123     2123              
============================================
  Files           931      931              
  Lines        123067   123263     +196     
  Branches       2678     2678              
============================================
+ Hits          32674    32740      +66     
- Misses        88806    88936     +130     
  Partials       1587     1587              
Flag      Coverage Δ
core      78.96% <ø> (ø)
editoast  74.69% <ø> (+0.03%) ⬆️
front      9.37% <ø> (+0.04%) ⬆️
gateway    2.55% <0.00%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown.


@ElysaSrc ElysaSrc force-pushed the ev/opentelemetry branch 3 times, most recently from 570f599 to 0eb6599 Compare December 13, 2023 08:30
@bloussou bloussou requested a review from woshilapin December 14, 2023 09:19
docker/gateway.dev.simple.toml (resolved)
gateway/actix_proxy/Cargo.toml (outdated, resolved)
gateway/src/main.rs (outdated, resolved)
Contributor

@woshilapin woshilapin left a comment

It's awesome that you tackled this 🚀

2 general remarks at this point:

  • why do we need some specific Jaeger/Datadog crates? Isn't the goal of OpenTelemetry to have a standard and not be vendor dependent? I used to connect to my OpenTelemetry collectors with only opentelemetry_otlp.
  • I see that all the spans and attributes are created manually with the rather verbose API of the opentelemetry crates. On top of that, it means the Context has to be forwarded into every function call, which doesn't make things easy. Maybe we could leverage tracing and its ecosystem to simplify all of that. With tracing, the code would look a bit like the following.
#[tracing::instrument(
    skip_all,
    fields(
        proxy.upstream_uri = self.proxy.rebase_uri(Some("ws"), unprocessed_path, query).to_string(),
        proxy.type = "ws"
    )
)]
async fn handle_websocket(
    &self,
    req: &HttpRequest,
    stream: web::Payload,
    unprocessed_path: &str,
    query: &str,
) -> Result<HttpResponse, actix_web::Error> {
    // keep the original body without modification
}

This means no more Context (I believe tracing uses thread locals to keep all the needed information). And the instrumentation-related information is no longer intertwined with the body of the function. By default, a new span is created for each #[tracing::instrument], with the function's name used as the span's name (this can be customized if needed).
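As a small illustrative companion (hypothetical function names, not code from the PR): two nested instrumented async functions produce parent/child spans without any Context argument being passed around.

// Hypothetical helpers; with `tracing`, the child span is attached to the
// parent automatically because the parent span is current while it runs.
#[tracing::instrument(skip_all, fields(upstream = %upstream))]
async fn proxy_request(upstream: &str) -> std::io::Result<()> {
    // No Context parameter: the span created below nests under `proxy_request`.
    fetch_upstream(upstream).await
}

#[tracing::instrument]
async fn fetch_upstream(upstream: &str) -> std::io::Result<()> {
    tracing::info!("connecting to upstream");
    Ok(())
}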

I might be able to help with both remarks, so reach out if you need to. I believe the second (or even both?) remarks could be addressed later (which is why I'm fine with merging as-is, since it already brings a very nice improvement over the current situation).

@ElysaSrc
Member Author

ElysaSrc commented Dec 14, 2023

  • why do we need some specific Jaeger/Datadog crates? Isn't the goal of OpenTelemetry to have a standard and not be vendor dependent? I used to connect to my OpenTelemetry collectors with only opentelemetry_otlp.

Removed the Jaeger backend, implemented the OTLP one. Kept the Datadog backend, since using OTLP with Datadog requires adding a translation service.
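For illustration only (not code from the PR, and the builder API differs between opentelemetry crate versions): a minimal OTLP tracer setup with opentelemetry_otlp might look roughly like this, the endpoint being an example value (4317 is the conventional OTLP/gRPC port).

use opentelemetry_otlp::WithExportConfig;

// Sketch of an OTLP span exporter over gRPC; assumes the tokio runtime
// and the opentelemetry_otlp/opentelemetry_sdk crates of that era.
fn init_otlp_tracer() -> Result<opentelemetry_sdk::trace::Tracer, opentelemetry::trace::TraceError> {
    opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://localhost:4317"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)
}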

  • I see that all the spans and attributes are created manually with the rather verbose API of the opentelemetry crates. On top of that, it means the Context has to be forwarded into every function call, which doesn't make things easy. Maybe we could leverage tracing and its ecosystem to simplify all of that. With tracing, the code would look a bit like the following.

Tracing relies on thread-local variables to attribute spans in an async context, which requires careful context switching on each future poll. This can be done either manually or via a macro, but either way, failing to add the context-switching instrumentation causes span misattribution.

The current method is less likely to be misused and does not involve any macro.
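To make the misattribution concern concrete (illustrative only, hypothetical function name): with tracing, a hand-built or manually spawned future has to be explicitly tied to its span through the Instrument extension trait, and forgetting that step records its events under whatever span happens to be current on the polling thread.

use tracing::Instrument;

async fn forward(upstream: &str) {
    let span = tracing::info_span!("forward", upstream);
    // Correct: the span is entered on every poll of this future.
    async {
        tracing::info!("forwarding request");
    }
    .instrument(span)
    .await;
    // Without `.instrument(span)`, the event above would be attributed to
    // whichever span is current when the future is polled.
}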

@woshilapin
Contributor

Removed the Jaeger backend, implemented the OTLP one. Kept the Datadog backend, since using OTLP with Datadog requires adding a translation service.

The advised way to do that is actually to use the OpenTelemetry Collector with one of the contributed exporters for Datadog (see here). I see some pros and one con:

  • pros
    • as an open-source project, we only depend on an open protocol and do not include code for a specific vendor
    • the OpenTelemetry Collector will play the role of an intermediate representation: imagine we later also start collecting CPU and memory from the pods, or PostgreSQL statistics. In each case, we would need to figure out how to feed Datadog. If we later end up moving from Datadog to some other solution, all of that work would have to be redone. With the OpenTelemetry Collector, only the Collector's configuration has to change to point at another vendor; the applications and services don't have to change at all.
  • cons
    • deploy and maintain one more web-service

Tracing relies on thread-local variables to attribute spans in an async context, which requires careful context switching on each future poll. This can be done either manually or via a macro, but either way, failing to add the context-switching instrumentation causes span misattribution.

The current method is less likely to be misused and does not involve any macro.

Got it! Let's not overdo it, for now!

@multun
Contributor

multun commented Dec 14, 2023

EDIT: breaking news: the Datadog agent is needed anyway, so keeping the Datadog exporter over the OTLP one is pointless. @ElysaSrc is making the change.

@multun multun added this pull request to the merge queue Dec 14, 2023
@multun multun removed this pull request from the merge queue due to a manual request Dec 14, 2023
@multun multun added this pull request to the merge queue Dec 14, 2023
Merged via the queue into dev with commit dde2caf Dec 14, 2023
18 checks passed
@multun multun deleted the ev/opentelemetry branch December 14, 2023 18:20