Amqp core metrics: step 1 #30583

lmolkova · 2022-08-22T18:19:08Z

Spec: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39

Adds the following metrics (as step 1):

OpenTelemetry Metrics Semantic Conventions for Azure Messaging Libraries

Context: https://gist.github.com/lmolkova/b9004307a09be788af04f05ebe22ad3c

Follows general OTel Metrics conventions and depends on OTel Messaging conventions changes

AMQP-level metric instruments

Name	Description	Units	Instrument Type (*)	Value Type	Attribute Key(s)	Attribute Values
messaging.az.amqp.producer.send.duration	Measures duration of AMQP-level send call attempt	`ms`	Histogram	Double	`net.peer.name`	Broker FQDN
					`messaging.destination` (will change)	Entity name
					`amqp.delivery_state`	Delivery state
messaging.az.amqp.management.request.duration	Duration of AMQP request-response operation	`ms`	Histogram	Double	`net.peer.name`	Broker FQDN
					`messaging.destination` (will change)	Entity name
					`amqp.status_code`	AMQP (HTTP) status code
					`amqp.operation`	AMQP request/response operation name
messaging.az.amqp.client.connections.closed	Number of closed connections.	`{connections}`	Counter	Int64	`net.peer.name`	Broker FQDN
					`amqp.error_condition`	'ok' or one of AMQP error conditions
messaging.az.amqp.client.transport.errors	Number of transport errors	`{errors}`	Counter	Int64	`net.peer.name`	Broker FQDN
					`messaging.destination`	Entity name
					`amqp.error_condition`	'ok' or one of AMQP error conditions
messaging.az.amqp.client.link.errors	Number of AMQP links closed with error	`{errors}`	Counter	Int64	`net.peer.name`	Broker FQDN
					`messaging.destination`	Entity name
					`messaging.az.entity_path` (when available)	Entity path (includes partition id or subscription). It could be generalized in scope of OTel Messaging conventions changes
					`amqp.error_condition`	'ok' or one of AMQP error conditions
messaging.az.amqp.client.session.errors	Number of AMQP sessions closed with error	`{errors}`	Counter	Int64	`net.peer.name`	Broker FQDN
					`messaging.destination`	Entity name
					`messaging.az.entity_path` (when available)	Entity path (includes partition id or subscription)
					`amqp.error_condition`	'ok' or one of AMQP error conditions
messaging.az.amqp.consumer.lag	Approximate lag between the time message was received and the time it was enqueued on broker.	`seconds`	Histogram	Double	`net.peer.name`	Broker FQDN
					`messaging.destination`	Entity name
					`messaging.az.entity_path`	Entity path (includes partition id or subscription)
messaging.az.amqp.consumer.credits.requested	Number of requested credits	`{credits}`	Counter	Int64	`net.peer.name`	Broker FQDN
					`messaging.destination`	Entity name
					`messaging.az.entity_path`	Entity path (includes partition id or subscription)
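
As a rough illustration of how a value for `messaging.az.amqp.consumer.lag` could be derived: the lag is the difference between the time the client received the message and the time the broker enqueued it, expressed in seconds. The class and method names below are hypothetical and are not taken from this PR. Clock skew between client and broker is one reason such a value can only be approximate.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the PR: computes consumer lag in seconds.
final class ConsumerLagSketch {
    private ConsumerLagSketch() { }

    // Lag = time the client received the message minus the time the broker enqueued it.
    static double lagSeconds(Instant enqueuedTime, Instant receivedTime) {
        return Duration.between(enqueuedTime, receivedTime).toMillis() / 1000.0;
    }

    public static void main(String[] args) {
        Instant enqueued = Instant.parse("2022-08-22T18:19:08Z");
        Instant received = enqueued.plusMillis(2500);
        System.out.println(lagSeconds(enqueued, received)); // prints 2.5
    }
}
```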

Scenarios

messaging.az.amqp.*.duration
- count and rate of producer send or management channel operations (peek/receive deferred, renew lock, session state, etc.)
- error rate by error code
- how long network requests take, percentiles
messaging.az.amqp.client.connections.closed
- rate of closed (and opened) connections, are there too many? How do connections close (by AMQP error condition)?
messaging.az.amqp.client.*.errors
- are there link/session/transport errors, how do links/sessions end?
messaging.az.amqp.consumer.lag
- number of received messages
- how much time message spent on broker before consumer received it?
- how far consumer is behind, are there enough consumers
- are consumers catching up on the backlog or slowing down?
messaging.az.amqp.consumer.credits.requested
- is prefetch configured correctly
- are enough messages coming? how fast does the consumer process messages?
- maybe too many messages are coming and consumer is not able to catch up

all metrics:

is it specific per namespace, entity, partition, client machine?

lmolkova · 2022-08-22T18:21:09Z

Note: ServiceBus `messaging.az.amqp.consumer.credits.requested` is not reported with this change, as ServiceBus does not use ReactorReceiver and implements its own AmqpReceiveLink.

azure-sdk · 2022-08-22T18:27:21Z

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-core-test

JonathanGiles · 2022-08-22T23:15:16Z

Good work! A while back I recall you mentioning there was a performance impact to having metrics in the codebase. Are you able to quantify this impact when metrics are enabled and also when they are disabled? Thanks!

lmolkova · 2022-08-29T16:52:06Z

@JonathanGiles, sorry I overlooked your question. Micro-benchmarks show that when metrics are disabled (and any code that collects e.g. duration is guarded), each measurement takes less than a nanosecond (disabledOptimizedMetrics). It's allocation-free.

Reporting a histogram with dynamic (cached in a map) attributes (`basicHistogramWithDynamicAttributes`) is ~60 ns and still allocation-free after attributes are created.

Benchmark                                                          Mode  Cnt   Score   Error  Units
OpenTelemetryMetricsBenchmark.basicHistogram                       avgt    6  45.928 ± 1.420  ns/op
OpenTelemetryMetricsBenchmark.basicHistogramWithCommonAttributes   avgt    6  52.500 ± 5.388  ns/op
OpenTelemetryMetricsBenchmark.basicHistogramWithDynamicAttributes  avgt    6  59.446 ± 2.097  ns/op
OpenTelemetryMetricsBenchmark.disabledNotOptimizedMetrics          avgt    6  17.774 ± 0.431  ns/op
OpenTelemetryMetricsBenchmark.disabledOptimizedMetrics             avgt    6   0.208 ± 0.006  ns/op
OpenTelemetryMetricsBenchmark.noopMeterProviderNotOptimized        avgt    6  17.974 ± 0.553  ns/op
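
To illustrate what "guarded" means for the disabled case, here is a minimal sketch. `DurationRecorder` and `GuardedTimingSketch` are hypothetical names, not the azure-core metrics API; the clock-read counter exists only to make the effect of the guard observable. The idea is that when the instrument reports itself as disabled, the code skips reading the clock and recording entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a histogram-like instrument; not the azure-core API.
interface DurationRecorder {
    boolean isEnabled();
    void record(double millis);
}

final class GuardedTimingSketch {
    // Counts clock reads so the effect of the guard is observable.
    static final AtomicInteger CLOCK_READS = new AtomicInteger();

    private GuardedTimingSketch() { }

    static long now() {
        CLOCK_READS.incrementAndGet();
        return System.nanoTime();
    }

    // Runs the operation; reads the clock and records only when the recorder is enabled.
    static void timed(DurationRecorder recorder, Runnable operation) {
        if (!recorder.isEnabled()) {
            operation.run(); // disabled path: no clock reads, nothing recorded
            return;
        }
        long start = now();
        try {
            operation.run();
        } finally {
            recorder.record((now() - start) / 1_000_000.0); // nanoseconds to milliseconds
        }
    }

    public static void main(String[] args) {
        DurationRecorder disabled = new DurationRecorder() {
            public boolean isEnabled() { return false; }
            public void record(double millis) { throw new IllegalStateException("unexpected"); }
        };
        timed(disabled, () -> { });
        System.out.println(CLOCK_READS.get()); // prints 0
    }
}
```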

I've tried to measure it with EventHubs performance tests, but they are quite unstable: they aren't CPU or memory bound, and consecutive baseline measurements can easily differ by 10-20%. With a lot of long tests and aggregations I get something like this:

disabled metric vs baseline: 99.3%
enabled metrics vs baseline: 97.4%

I also profiled perf tests with metrics enabled and during profiling metrics stacks took less than 1% of CPU time, and were even smaller with regard to allocations.

I also investigated metric collection in general using storage tests, and its impact appears to be smaller than the error interval.

JonathanGiles

Looks good - a few pieces of feedback for your consideration

...re/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/AmqpMetricsProvider.java

...azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/ReactorHandlerProvider.java

...azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/handler/SessionHandler.java

JonathanGiles · 2022-08-29T21:55:28Z

...pentelemetry/src/main/java/com/azure/core/metrics/opentelemetry/OpenTelemetryAttributes.java

+            // TODO: by GA we need to figure out default naming (when no mapping is defined)
+            // and follow otel attributes conventions if we can or make sure mapping is defined
+            // for all attributes


We should probably tackle this ~now?

curious what's the urgency?

this mapping exists in beta metrics package with no requirement to GA in the nearest future.
However, we have an ask to GA tracing plugin in Zn, where we'll do exactly the same mapping and we'll figure it out there first. Assuming we'll resource the ask, it should happen in the next few months.

And I'd really love to take it now, but we're working with OTel community on changing these exact attributes, so anything I could write might change very soon.

...ics-opentelemetry/src/main/java/com/azure/core/metrics/opentelemetry/OpenTelemetryUtils.java

conniey

LGTM! Thanks for the documentation and links in your description

conniey · 2022-08-29T23:31:04Z

...re/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/AmqpMetricsProvider.java

+    }
+
+    private TelemetryAttributes getResponseCodeAttributes(AmqpResponseCode code, String operation) {
+        int ind = code == null ? RESPONSE_CODES_COUNT - 1 : code.ordinal();


I am curious why we have the - 1 here and then + 1 when defining RESPONSE_CODES_COUNT. A comment where RESPONSE_CODES_COUNT is defined could help in the future if someone is checking this out.

thanks for the feedback, I added comments in the code!
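
For readers following the `+ 1` / `- 1` question, the pattern can be sketched as below. This is an illustration, not the PR's code: `ResponseCodeSketch` is a hypothetical enum standing in for `AmqpResponseCode`, plain strings stand in for pre-built attribute sets, and the `"error"` value for the null slot is an assumption. The count is the number of enum constants plus one, so the last array slot is reserved for "no response code" (null).

```java
import java.util.Locale;

// Hypothetical enum standing in for AmqpResponseCode; constants are illustrative.
enum ResponseCodeSketch { OK, ACCEPTED, NOT_FOUND }

final class AttributeCacheSketch {
    // One slot per enum constant, plus one extra slot at the end for null (hence "+ 1").
    static final int CODES_COUNT = ResponseCodeSketch.values().length + 1;

    // Plain strings stand in for pre-built attribute objects.
    private static final String[] CACHE = new String[CODES_COUNT];

    static {
        for (ResponseCodeSketch code : ResponseCodeSketch.values()) {
            CACHE[code.ordinal()] = code.name().toLowerCase(Locale.ROOT);
        }
        CACHE[CODES_COUNT - 1] = "error"; // assumed value for the reserved null slot
    }

    private AttributeCacheSketch() { }

    // null maps to the reserved last slot (hence "- 1"); otherwise index by ordinal.
    static String attributeFor(ResponseCodeSketch code) {
        int ind = code == null ? CODES_COUNT - 1 : code.ordinal();
        return CACHE[ind];
    }

    public static void main(String[] args) {
        System.out.println(attributeFor(ResponseCodeSketch.OK)); // prints ok
        System.out.println(attributeFor(null));                  // prints error
    }
}
```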

srnagar · 2022-08-30T06:59:49Z

...re/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/AmqpMetricsProvider.java

+    public static final String STATUS_CODE_KEY = "amqpStatusCode";
+    public static final String MANAGEMENT_OPERATION_KEY = "amqpOperation";


Do these constants have to be public or can they be private?

srnagar · 2022-08-30T07:13:03Z

...re/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/AmqpMetricsProvider.java

+        // if there was no response, state is null and indicates a network (probably) error.
+        // we don't have an enum for network issues and metric attributes cannot have arbitrary
+        // high-cardinality data, so we'll just use vague "error" for it.
+        int ind = state == null ? DELIVERY_STATES_COUNT - 1 : state.ordinal();


I am curious how this will version when we add more values to DeliveryState.DeliveryStateType. Will we then need to correlate the library version to the ordinal? DELIVERY_STATES_COUNT - 1 could be the vague "error" state in v1, but once we add a new enum value in v2, that ordinal would belong to a valid enum constant, so v1's values may look like valid response states.
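
Whether this concern applies depends on whether the ordinal itself is ever emitted, which this thread does not show. The sketch below assumes the ordinal is used only as an index into a pre-built cache and that the emitted attribute value is a string derived from the constant name. Under that assumption, adding a constant moves the reserved slot but leaves the emitted values unchanged. `StateV1`, `StateV2`, and the helper class are hypothetical and are not the PR's code.

```java
import java.util.Locale;

// Two hypothetical versions of a delivery-state enum; V2 adds a constant.
enum StateV1 { ACCEPTED, REJECTED }
enum StateV2 { ACCEPTED, REJECTED, RELEASED }

final class OrdinalVersioningSketch {
    private OrdinalVersioningSketch() { }

    // One entry per constant, named after the constant, plus a trailing "error" slot for null.
    static <E extends Enum<E>> String[] buildCache(Class<E> type) {
        E[] constants = type.getEnumConstants();
        String[] cache = new String[constants.length + 1];
        for (E c : constants) {
            cache[c.ordinal()] = c.name().toLowerCase(Locale.ROOT);
        }
        cache[constants.length] = "error";
        return cache;
    }

    // The ordinal is only an internal index; the returned string is what would be emitted.
    static <E extends Enum<E>> String valueFor(String[] cache, E state) {
        return cache[state == null ? cache.length - 1 : state.ordinal()];
    }

    public static void main(String[] args) {
        String[] v1 = buildCache(StateV1.class);
        String[] v2 = buildCache(StateV2.class);
        System.out.println(valueFor(v1, (StateV1) null));  // prints error
        System.out.println(valueFor(v2, (StateV2) null));  // prints error
        System.out.println(valueFor(v2, StateV2.RELEASED)); // prints released
    }
}
```

If the raw index were emitted instead, index 2 would mean "error" in the first version and "released" in the second, which is the ambiguity the comment describes.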

* Amqp core metrics

lmolkova requested review from conniey, anuchandy, ki1729, srnagar, JonathanGiles, haolingdong-msft, liukun-msft, weidongxu-microsoft, ZejiaJiang, alzimmermsft, vcolin7, mssfang, billwert and kasobol-msft as code owners August 22, 2022 18:19

ghost added the Azure.Core azure-core label Aug 22, 2022

Amqp core metrics

ed45b5e

lmolkova force-pushed the amqp-core-metrics-p1 branch from 4660ec5 to ed45b5e Compare August 29, 2022 17:17

JonathanGiles reviewed Aug 29, 2022

lmolkova added 2 commits August 29, 2022 15:51

review

bc1c89f

Bringing back lost perf optimizations

3a13201

conniey approved these changes Aug 29, 2022

add comments

24698a7

lmolkova enabled auto-merge (squash) August 30, 2022 02:12

lmolkova merged commit e26f981 into Azure:main Aug 30, 2022

srnagar reviewed Aug 30, 2022

Harshan01 pushed a commit to Harshan01/azure-sdk-for-java that referenced this pull request Aug 30, 2022

Amqp core metrics: step 1 (Azure#30583)

5f78811

* Amqp core metrics

vcolin7 pushed a commit to vcolin7/azure-sdk-for-java that referenced this pull request Sep 9, 2022

Amqp core metrics: step 1 (Azure#30583)

2ea5899

* Amqp core metrics

lmolkova mentioned this pull request Oct 13, 2022

[FEATURE REQ] Expose metrics from AMQP SDKs #25605

Closed
