Record memory usage after garbage collection #6963
Conversation
.buildWithCallback(callback(poolBeans, MemoryPoolMXBean::getUsage, MemoryUsage::getMax));

meter
    .upDownCounterBuilder("process.runtime.jvm.memory.usage_after_gc")
Reporting this metric as an absolute value might not be helpful when you have a fleet of hosts, each running JVMs with different max heap sizes. I understand that you wanted to align with the JMX metric conventions, but I think there is more value in a normalized metric. Maybe we can add an extra metric, e.g. process.runtime.jvm.memory.usage_after_gc_percentage?

Calculating the percentage after the metric is emitted will not be trivial or feasible in all backends when you want to analyze data from multiple hosts.
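For illustration, a minimal sketch of what such a normalized metric could look like with the OpenTelemetry metrics API, assuming the suggested name process.runtime.jvm.memory.usage_after_gc_percentage (which is only a proposal from this thread, not part of the PR or the spec) and a single pool attribute:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;

class UsageAfterGcPercentage {

  static void register(Meter meter) {
    List<MemoryPoolMXBean> poolBeans = ManagementFactory.getMemoryPoolMXBeans();
    meter
        .gaugeBuilder("process.runtime.jvm.memory.usage_after_gc_percentage")
        .setUnit("%")
        .buildWithCallback(
            measurement -> {
              for (MemoryPoolMXBean pool : poolBeans) {
                // Usage after the last GC; null for pools not managed by a collector.
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (afterGc == null || afterGc.getMax() < 0) {
                  continue; // no usage-after-GC reading or no defined limit
                }
                measurement.record(
                    100.0 * afterGc.getUsed() / afterGc.getMax(),
                    Attributes.builder().put("pool", pool.getName()).build());
              }
            });
  }
}
```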
Reminded me of open-telemetry/opentelemetry-specification#2392 (comment)
> Calculating the percentage after the metric is emitted will not be trivial or feasible in all backends when you want to analyze data from multiple hosts.
Is this really a challenge for backends? It feels like doing operations across metrics, like dividing one by another, is pretty standard, and probably table stakes for a metrics backend. For example, if you exported these metrics to Prometheus it's trivial to compute a percentage with:

(process_runtime_jvm_memory_usage_after_gc / process_runtime_jvm_memory_limit) * 100

If we buy this argument, wouldn't we also want to report the percentage for usage and committed as well? And if we report a percentage instead of the absolute value, doesn't that stand to frustrate people on the other side who want to analyze their absolute figures instead of relative percentages?
Another argument against reporting utilization is that we don't actually have the data to report it, because not all memory pools report memory usage after GC, and not all memory pools report a limit. Here are the sets of pool attribute values that are reported for usage, usage_after_gc, and limit for an app running with the G1 garbage collector:

- process.runtime.jvm.memory.usage reports values for 8 pools: CodeHeap 'non-nmethods', CodeHeap 'non-profiled nmethods', CodeHeap 'profiled nmethods', Compressed Class Space, G1 Eden Space, G1 Old Gen, G1 Survivor Space, Metaspace
- process.runtime.jvm.memory.usage_after_gc reports values for 3 pools: G1 Eden Space, G1 Old Gen, G1 Survivor Space
- process.runtime.jvm.memory.limit reports values for 5 pools: CodeHeap 'non-nmethods', CodeHeap 'non-profiled nmethods', CodeHeap 'profiled nmethods', G1 Old Gen

Notice how there aren't usage_after_gc or limit values for all the pools. Reporting utilization would limit the set of pools to those that have both usage_after_gc and limit values, which is only G1 Old Gen. The same argument applies to reporting utilization instead of usage.
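The pool capabilities described above can be checked directly against the JMX beans; here is a small sketch (output will vary by JVM version and configured collector) that prints which pools expose a usage-after-GC reading and a limit:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class ListPoolCapabilities {
  public static void main(String[] args) {
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      // getCollectionUsage() is null for pools that are not managed by a garbage collector.
      MemoryUsage afterGc = pool.getCollectionUsage();
      // getMax() is -1 when the pool has no defined limit.
      long limit = pool.getUsage().getMax();
      System.out.printf(
          "%-35s usage_after_gc=%-5s limit=%-5s%n",
          pool.getName(), afterGc != null, limit >= 0);
    }
  }
}
```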
> Is this really a challenge for backends? It feels like doing operations across metrics, like dividing one by another, is pretty standard, and probably table stakes for a metrics backend. For example, if you exported these metrics to Prometheus it's trivial to compute a percentage with:

If you have multiple JVMs (say hundreds) reporting different process_runtime_jvm_memory_limit values because each JVM is using a different -Xmx, how do you do that without manually creating a query that matches every process_runtime_jvm_memory_usage_after_gc to the respective process_runtime_jvm_memory_limit? Moreover, what if the attributes that uniquely identify the metrics are not predictable?
Having said that, you made a good point about the lack of consistency in the memory pools. It does not make sense to report a normalized metric per memory pool.
> Moreover, what if the attributes that uniquely identify the metrics are not predictable?
If it's not possible to recognize which instance emitted particular metrics, then I'd say that your data is nonsense anyway - if you can't correlate metric instruments with each other, or with a particular deployment, the telemetry is basically useless.
I think we should operate under the assumption that resource attributes uniquely identify the metrics source - and if this is not true, this is a broader problem that needs to be fixed across the board.
In the 10/27 Java SIG we discussed that it would be valuable to enumerate the attributes reported for memory pool and GC metrics when different garbage collectors are used. I've gone ahead and added a README for the runtime metrics which includes detailed information on the attributes reported. Note that I also have the same data for GC metrics added in #6964 and #6963, but will wait to add it until those PRs are merged.
FYI, the spec PR corresponding to this has been merged. I've pushed a commit that syncs this PR with the naming in the spec.
Per conversation in #6362.