Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record memory usage after garbage collection #6963

Merged
merged 5 commits into from
Nov 8, 2022

Conversation

jack-berg
Copy link
Member

Per conversation in #6362.

@jack-berg jack-berg requested a review from a team October 24, 2022 17:27
.buildWithCallback(callback(poolBeans, MemoryPoolMXBean::getUsage, MemoryUsage::getMax));

meter
.upDownCounterBuilder("process.runtime.jvm.memory.usage_after_gc")
Copy link
Contributor

@rapphil rapphil Oct 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reporting this metric as absolute value might not be helpful when you have a fleet of hosts each with JVMs with different max heap sizes. I understand that you wanted to align with JMX metric conventions but I think there is more value in a normalized metric. Maybe we can add an extra metric? process.runtime.jvm.memory.usage_after_gc_percentage?

Calculating the percentage after the metric is emitted will not be trivial or feasible in all backends for the case that you want to analyze data from multiple hosts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating the percentage after the metric is emitted will not be trivial or feasible in all backends for the case that you want to analyze data from multiple hosts.

Is this really a challenge for backends? Feels like doing operations across metrics like dividing one by another is pretty standard, and probably table stakes for a metrics backend. For example, if you exported these metrics to prometheus its trivial to compute a percentage with:

(process_runtime_jvm_memory_usage_after_gc / process_runtime_jvm_memory_limit) * 100

If we buy this argument wouldn't we also want to report the percentage for usage and committed as well? And if we report percentage instead of of the absolute value, doesn't that stand frustrate people on the other side that want to analyze their absolute figures instead of relative percentages?

Another argument against reporting utilization is that we don't actually have the data to report utilization because not all memory pools report memory usage after gc, and not all memory pools report a limit. Here's the sets of pool attribute values that are reported for usage, usage_after_gc, and limit for an app running with g1 garbage collector:

  • process.runtime.jvm.memory.usage reports values for 8 pools: CodeHeap 'non-nmethods', Code Heap 'non-profiled nmethods', CodeHeap 'profiled nmethods', Compressed Class Space, G1 Eden Space, G1 Old Gen, G1 Survivor Space, Metaspace
  • process.runtime.jvm.memory.usage_after_gc reports values for 3 pools: G1 Eden Space, G1 Old Gen, G1 Survivor Space
  • process.runtime.jvm.memory.limit reports values for 5 pools: CodeHeap 'non-nmethods', Code Heap 'non-profiled nmethods', CodeHeap 'profiled nmethods', G1 Old Gen

Notice how there's not usage_after_gc or limit values for all the pools. Reporting utilization would limit the set of pools to those that have both usage_after_gc and limit values, which is only G1 Old Gen. Same argument applies to reporting utilization instead of usage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a challenge for backends? Feels like doing operations across metrics like dividing one by another is pretty standard, and probably table stakes for a metrics backend. For example, if you exported these metrics to prometheus its trivial to compute a percentage with:

If you have multiple jvms (say hundreds) reporting with different process_runtime_jvm_memory_limit because each jvm is using a different -xmx, how do you do that without manually creating a query that will match every process_runtime_jvm_memory_usage_after_gc to the respective process_runtime_jvm_memory_limit? Moreover what if the attributes that uniquely identify the metrics are not predictable?

Having said that, you made a good point about the lack of consistency in the memory pools. It does not make sense to report a normalized metric per memory pool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover what if the attributes that uniquely identify the metrics are not predictable?

If it's not possible to recognize what instance emitted particular metrics, then I'd say that your data is nonsense anyway - if you can't correlate metric instruments with each other, or with a particular deployment, the telemetry is basically useless.
I think we should operate under an assumption that resource attributes uniquely identify the metrics source - and if this is not true, this is a broader problem that needs to be fixed across the board.

trask pushed a commit that referenced this pull request Nov 2, 2022
In the 10/27 java sig we discussed that it would be valuable to
enumerate the attributes reported for memory pool and gc metrics when
different gcs are used.

I've went ahead and added a readme for the runtime metrics which
includes detailed information on the attributes reported. Note that I
also have the same data for gc metrics added in #6964 and #6963, but
will wait to add until those PRs are merged.
@trask trask added this to the v1.20.0 milestone Nov 8, 2022
@jack-berg
Copy link
Member Author

FYI, the spec PR corresponding to this has been merged. I've pushed a commit that syncs this PR with the naming in the spec.

@github-actions github-actions bot requested a review from theletterf November 8, 2022 22:20
@trask trask enabled auto-merge (squash) November 8, 2022 22:22
@trask trask merged commit 177d9cd into open-telemetry:main Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants