Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix instrument types in system metrics #2865

Closed
wants to merge 1 commit into from

Conversation

bjrara
Copy link
Contributor

@bjrara bjrara commented Sep 11, 2024

Description

This PR fixes the instrument types when creating the system metrics.

Metrics that are impacted:

  • system.network.connections:
    Collected from psutil.net_connections is a snapshot of the current state of network connections, thus should be delta and is not monotonic.
  • process.runtime.cpython.gc_count
    Collected from python gc is the current collection counts, thus should be delta and is not monotonic.
  • process.runtime.cpython.memory
    Collected from psutil.Process.memory_info is a snapshot of the process’s memory usage at the time of the call, thus should be delta and is not monotonic.
  • process.runtime.cpython.thread_count
    Collected from psutil.Process.num_threads is the number of threads currently used by this process (non cumulative).
  • process.open_file_descriptor.count
    Collected from psutil.Process.num_fds is the number of file descriptors currently opened by this process (non cumulative).

Change the instrument types for the metrics above to observable_gauge for correction.

Fixes #2861

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Tested locally using console exporter configured with delta temporality for all the metric types.

Does This PR Require a Core Repo Change?

  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@xrmx
Copy link
Contributor

xrmx commented Sep 12, 2024

Don't have much clue about metrics but the generated semantic conventions package provides some helper functions to creates these metrics, and they return UpDownCounter, e.g. create_process_open_file_descriptor_count

Copy link
Member

@emdneto emdneto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From specs:

I want to measure something (by reporting an absolute value):

If the measurement values are additive:

So I think the UpDownCounter is correct no?

@bjrara
Copy link
Contributor Author

bjrara commented Sep 12, 2024

@xrmx @emdneto Thanks for taking a look at the PR.

The issue is not with UpDownCounter, but the temporality used by ObservableUpDownCounter.

From the definition, ObservableUpDownCounter is cumulative, while these metrics are delta.

if isinstance(instrument, ObservableUpDownCounter):
        return _SumAggregation(
            attributes,
            instrument_is_monotonic=False,
            instrument_aggregation_temporality=(
                AggregationTemporality.CUMULATIVE
            ),
            start_time_unix_nano=start_time_unix_nano,
        )

Changing them from ObservableUpDownCounter to UpDownCounter should also fix the issue. I agree with both of you that using UpDownCounter is more accurate from the semantic convention. However, create_up_down_counter isn't asynchronous, and doesn't accept callbacks, so I choose to use ObservableGauge. If you have better suggestions on how to fix the issue properly with UpDownCounter, please let me know.

I actually want to take back my words on using UpDownCounter. By checking the specification of UpDownCounter, it's more appropriate to use when the metrics are emitted using add(n) or add(-n). But in our case, we collect absolute values from snapshots.

I understand it deviates from the semantic conventions, but I also want to ask for a second thought based on the actual collection process.

Copy link
Member

@emdneto emdneto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a second look, the issue makes sense. According to the specs, UpDownCounter is suitable because the metrics are additive and not monotonically increasing, but in practice, for our asynchronous measurements (connections, memory, thread_count, etc), the cumulative temporality of the ObservableUpDownCounter doesn't seem to fit in that situation. We want to report an absolute value (current snapshot) that is not monotonic and non-cumulative.

In java it's using ObservableGauge.

@bjrara, would you be open to adding a failing test showing how this is fixing the issue with the cumulative temporality?

There's also another issue: The specs do not reference the gc_count metric, but I totally agree with that change (from async counter to async gauge).

@xrmx
Copy link
Contributor

xrmx commented Sep 24, 2024

If there is already another language going out of spec we should open an issue in the semconv repo, no?

@bjrara
Copy link
Contributor Author

bjrara commented Sep 24, 2024

Taking a second look, the issue makes sense. According to the specs, UpDownCounter is suitable because the metrics are additive and not monotonically increasing, but in practice, for our asynchronous measurements (connections, memory, thread_count, etc), the cumulative temporality of the ObservableUpDownCounter doesn't seem to fit in that situation. We want to report an absolute value (current snapshot) that is not monotonic and non-cumulative.

In java it's using ObservableGauge.

@bjrara, would you be open to adding a failing test showing how this is fixing the issue with the cumulative temporality?

There's also another issue: The specs do not reference the gc_count metric, but I totally agree with that change (from async counter to async gauge).

Thanks for your feedback. Sure, checking how to add a failing test on the issue.

@emdneto
Copy link
Member

emdneto commented Sep 26, 2024

As discussed in SIG (09/26) -- @bjrara will try using the LastValue aggregation in a View to configure the SDK to output this metric as a gauge. If it doesn't work, we should bring the issue to the semconv SIG.

@bjrara
Copy link
Contributor Author

bjrara commented Sep 27, 2024

As discussed in SIG (09/26) -- @bjrara will try using the LastValue aggregation in a View to configure the SDK to output this metric as a gauge. If it doesn't work, we should bring the issue to the semconv SIG.

The proposed solution works for me. Thanks to @emdneto and @aabmass for the suggestions. I will update the issue with the summary why we believe the existing instrument types are right, and how to work around to report metrics to systems that only accept delta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple system metrics return negative values when using DELTA temporality.
3 participants