Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate otel metrics to libbeat monitoring #15094

Merged
merged 48 commits into from
Jan 28, 2025
Merged

Conversation

kruskall
Copy link
Member

@kruskall kruskall commented Jan 1, 2025

Motivation/summary

use otel api to record metrics and export them to beats monitoring

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

  • run apm-server
  • send data
  • go to index management and validate the monitoring index is there and monitoring data is inside it

Related issues

Related to #14488

@kruskall kruskall marked this pull request as ready for review January 1, 2025 00:50
@kruskall kruskall requested a review from a team as a code owner January 1, 2025 00:50
Copy link
Contributor

mergify bot commented Jan 1, 2025

This pull request does not have a backport label. Could you fix it @kruskall? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

Copy link
Contributor

mergify bot commented Jan 1, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 1, 2025
@kruskall
Copy link
Member Author

This is ready for review again, unfortunately I'm afraid there's no easy way to split this up

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for extracting the other PR - this is much easier to review now.

Comment on lines -633 to +636
equal, result := monitoringtest.CompareMonitoringInt(map[request.ResultID]int{
request.IDRequestCount: 2,
request.IDResponseCount: 2,
request.IDResponseErrorsCount: 1,
request.IDResponseValidCount: 1,
request.IDResponseErrorsTimeout: 1, // test data POST /intake/v2/events
request.IDResponseValidAccepted: 1, // self-instrumentation
}, intake.MonitoringMap)
assert.True(t, equal, result)
monitoringtest.ExpectContainOtelMetrics(t, reader, map[string]any{
"http.server." + string(request.IDRequestCount): 1,
"http.server." + string(request.IDResponseCount): 1,
"http.server." + string(request.IDResponseErrorsCount): 1,
"http.server." + string(request.IDResponseErrorsTimeout): 1, // test data POST /intake/v2/events
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding a comment that the self-instrumentation requests are not counted in metrics.

This is a change in behaviour. I think it's probably what users would want, but would like to hear @simitt's opinion on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we don't want to mingle the counters for user requests and self-instrumentation requests together, especially when not being able to encode a differentiator in the metadata. Seems like a bugfix to me.

@@ -79,7 +54,17 @@ func RegisterGRPCServices(
Semaphore: semaphore,
RemapOTelMetrics: true,
})
gRPCMonitoredConsumer.set(consumer)

// FIXME we should add an otel counter metric directly in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be needed now that we propagate the meterprovider instead of reusing a global meter.

I don't follow. The issue here is that:

  • the MeterProvider & Meters will survive across config reloads
  • on each config reload we will register a new callback
  • we never unregister those callbacks, so memory & CPU usage will increase with every reload

I also think adding metrics to apm-data is not as simple since we need to keep other systems in mind (that's a library).

What's difficult about that? We can add a MeterProvider to apm-data/input/otlp.ConsumerConfig, and if it's not set then we can either use the global or a noop provider.

The alternative to adding the metric to apm-data would be to find a way to unregister the callback when the server is stopped.

v.OnRegistryStart()
defer v.OnRegistryFinished()
for _, sm := range rm.ScopeMetrics {
switch {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why to use switch case here? do we expect to handle more cases in the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment, it's for consistency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligning all of these monitoring functions to use if statements also sounds fine, I was mostly addressing the inconsistency for the same kind of checks.

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kruskall. Changes LGTM, please wait for sign off from @simitt though

Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one minor question;
PR looks great and improves the readability of the metrics collection

v.OnRegistryStart()
defer v.OnRegistryFinished()
for _, sm := range rm.ScopeMetrics {
switch {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligning all of these monitoring functions to use if statements also sounds fine, I was mostly addressing the inconsistency for the same kind of checks.

{IntakePath, builder.backendIntakeHandler(meterProvider)},
{OTLPTracesIntakePath, builder.otlpHandler(otlpHandlers.HandleTraces, "apm-server.otlp.http.traces.", meterProvider)},
{OTLPMetricsIntakePath, builder.otlpHandler(otlpHandlers.HandleMetrics, "apm-server.otlp.http.metrics.", meterProvider)},
{OTLPLogsIntakePath, builder.otlpHandler(otlpHandlers.HandleLogs, "apm-server.otlp.http.logs.", meterProvider)},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you passing the metricsPrefix as an argument to the otel routes vs. hardcode them inside the handler functions for non-otel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I guess the otlp handler is shared so it must be parametrized while the intake path doesn't. We can change it so they are all consistent. I'll open a followup PR

Comment on lines -633 to +636
equal, result := monitoringtest.CompareMonitoringInt(map[request.ResultID]int{
request.IDRequestCount: 2,
request.IDResponseCount: 2,
request.IDResponseErrorsCount: 1,
request.IDResponseValidCount: 1,
request.IDResponseErrorsTimeout: 1, // test data POST /intake/v2/events
request.IDResponseValidAccepted: 1, // self-instrumentation
}, intake.MonitoringMap)
assert.True(t, equal, result)
monitoringtest.ExpectContainOtelMetrics(t, reader, map[string]any{
"http.server." + string(request.IDRequestCount): 1,
"http.server." + string(request.IDResponseCount): 1,
"http.server." + string(request.IDResponseErrorsCount): 1,
"http.server." + string(request.IDResponseErrorsTimeout): 1, // test data POST /intake/v2/events
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we don't want to mingle the counters for user requests and self-instrumentation requests together, especially when not being able to encode a differentiator in the metadata. Seems like a bugfix to me.

@kruskall kruskall enabled auto-merge (squash) January 28, 2025 18:38
@kruskall kruskall merged commit 378b60c into elastic:main Jan 28, 2025
12 checks passed
@kruskall kruskall deleted the otel-adapter branch January 28, 2025 18:47
mergify bot pushed a commit that referenced this pull request Jan 28, 2025
* Translate otel metrics to libbeat monitoring

* demo: send metrics directly and add another reader

* Revert "demo: send metrics directly and add another reader"

This reverts commit 166a717.

* lint: fix linter issues

* lint: fix linter issues

* feat: refactor code to propagate meterprovider and fix tests

* lint: fix linter issues

* refactor: remove unused files

* refactor: remove more global meters

* refactor: cleanup more unused code

* refactor: remove more unused code

* test: account for agentcfg metric in test

* test: account for agentcfg metric in test

* fix: pass meter provider in main func

* Fix LSM metrics tests

* test: fix remaining test

* lint: fix linter issues

* fix: update docappender metrics name

* test: update systemtest metric assertions

* fix: update grpc interceptor meter path

metrics were not being exported because the meter was not recognized
as apm-server meter

* fix: add back output.type=elasticsearch

* test: upate remaining systemtest

* test: remove debug print

* fix: use correct outputRegistry variable

fix panic

* fix: remove panic on err

* fix: propagate meter provider to grpc services

* lint: add meterprovider field docs

* lint: fix comment typo

* feat: pass apmotel gatherer too

* refactor: use switch pattern consistently when mapping metrics

* Update beater.go

* Update beat.go

* Update beater.go

* fix: solve compile errors and systemtest fix

* refactor: reduce diff noise

* lint: fix linter issues

* lint: fix linter issues

* Update x-pack/apm-server/main.go

Co-authored-by: Andrew Wilkins <axwalk@gmail.com>

* test: use legacy metrics for validating grpc tests

* fix: unregister callback if available

forgot to pus this

---------

Co-authored-by: Andrew Wilkins <axw@elastic.co>
Co-authored-by: Andrew Wilkins <axwalk@gmail.com>
(cherry picked from commit 378b60c)

# Conflicts:
#	internal/beatcmd/beat.go
#	internal/beater/beater.go
#	internal/beater/server.go
#	internal/beater/server_test.go
@kruskall
Copy link
Member Author

@Mergifyio backport 8.x

Copy link
Contributor

mergify bot commented Jan 31, 2025

backport 8.x

✅ Backports have been created

mergify bot added a commit that referenced this pull request Jan 31, 2025
…15440)

* Translate otel metrics to libbeat monitoring  (#15094)

* Translate otel metrics to libbeat monitoring

* demo: send metrics directly and add another reader

* Revert "demo: send metrics directly and add another reader"

This reverts commit 166a717.

* lint: fix linter issues

* lint: fix linter issues

* feat: refactor code to propagate meterprovider and fix tests

* lint: fix linter issues

* refactor: remove unused files

* refactor: remove more global meters

* refactor: cleanup more unused code

* refactor: remove more unused code

* test: account for agentcfg metric in test

* test: account for agentcfg metric in test

* fix: pass meter provider in main func

* Fix LSM metrics tests

* test: fix remaining test

* lint: fix linter issues

* fix: update docappender metrics name

* test: update systemtest metric assertions

* fix: update grpc interceptor meter path

metrics were not being exported because the meter was not recognized
as apm-server meter

* fix: add back output.type=elasticsearch

* test: upate remaining systemtest

* test: remove debug print

* fix: use correct outputRegistry variable

fix panic

* fix: remove panic on err

* fix: propagate meter provider to grpc services

* lint: add meterprovider field docs

* lint: fix comment typo

* feat: pass apmotel gatherer too

* refactor: use switch pattern consistently when mapping metrics

* Update beater.go

* Update beat.go

* Update beater.go

* fix: solve compile errors and systemtest fix

* refactor: reduce diff noise

* lint: fix linter issues

* lint: fix linter issues

* Update x-pack/apm-server/main.go

Co-authored-by: Andrew Wilkins <axwalk@gmail.com>

* test: use legacy metrics for validating grpc tests

* fix: unregister callback if available

forgot to pus this

---------

Co-authored-by: Andrew Wilkins <axw@elastic.co>
Co-authored-by: Andrew Wilkins <axwalk@gmail.com>
(cherry picked from commit 378b60c)

# Conflicts:
#	internal/beatcmd/beat.go
#	internal/beater/beater.go
#	internal/beater/server.go
#	internal/beater/server_test.go

* fix: resolve conflicts

---------

Co-authored-by: kruskall <99559985+kruskall@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@endorama endorama self-assigned this Feb 11, 2025
@endorama endorama mentioned this pull request Feb 11, 2025
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify test-plan test-plan-ok
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants