Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amqp core metrics: step 1 #30583

Merged
merged 4 commits into from
Aug 30, 2022
Merged

Amqp core metrics: step 1 #30583

merged 4 commits into from
Aug 30, 2022

Conversation

lmolkova
Copy link
Member

@lmolkova lmolkova commented Aug 22, 2022

Spec: https://gist.github.com/lmolkova/489a2b280b8fa68e4c3780c2afaa3b39

Adds following metrics (as step 1):

OpenTelemetry Metrics Semantic Conventions for Azure Messaging Libraries

Context: https://gist.github.com/lmolkova/b9004307a09be788af04f05ebe22ad3c

Follows general OTel Metrics conventions and depends on OTel Messaging conventions changes

AMQP-level metric instruments

Name Description Units Instrument Type (*) Value Type Attribute Key(s) Attribute Values
messaging.az.amqp.producer.send.duration Measures duration of AMQP-level send call attempt ms Histogram Double net.peer.name Broker FQDN
messaging.destination (will change) Entity name
amqp.delivery_state Delivery state
messaging.az.amqp.management.request.duration Duration of AMQP request-response operation ms Histogram Double net.peer.name Broker FQDN
messaging.destination (will change) Entity name
amqp.status_code AMQP (HTTP) status code
amqp.operation AMQP request/response operation name
messaging.az.amqp.client.connections.closed Number of closed connections. {connections} Counter Int64 net.peer.name Broker FQDN
amqp.error_condition 'ok' or one of AMQP error conditions
messaging.az.amqp.client.transport.errors Number of transport errors {errors} Counter Int64 net.peer.name Broker FQDN
messaging.destination Entity name
amqp.error_condition 'ok' or one of AMQP error conditions
messaging.az.amqp.client.link.errors Number of AMQP links closed with error {errors} Counter Int64 net.peer.name Broker FQDN
messaging.destination Entity name
messaging.az.entity_path (when available) Entity path (includes partition id or subscription). It's could to be generalized in scope of OTel Messaging conventions changes
amqp.error_condition 'ok' or one of AMQP error conditions
messaging.az.amqp.client.session.errors Number of AMQP sessions closed with error {errors} Counter Int64 net.peer.name Broker FQDN
messaging.destination Entity name
messaging.az.entity_path (when available) Entity path (includes partition id or subscription)
amqp.error_condition 'ok' or one of AMQP error conditions
messaging.az.amqp.consumer.lag Approximate lag between the time message was received and the time it was enqueued on broker. seconds Histogram Double net.peer.name Broker FQDN
messaging.destination Entity name
messaging.az.entity_path Entity path (includes partition id or subscription)
messaging.az.amqp.consumer.credits.requested Number of requested credits {credits} Counter Int64 net.peer.name Broker FQDN
messaging.destination Entity name
messaging.az.entity_path Entity path (includes partition id or subscription)

Scenarios

  • messaging.az.amqp.*.duration
    • count, rate, of producer send or management channel operation (peek/receive defered, renew lock, session state, etc)
    • error rate by error code
    • how long network requests take, percentiles
  • messaging.az.amqp.client.connections.closed
    • rate of closed (and opened) connections, are there too many? How connections close (by AMQP error condition)
  • messaging.az.amqp.client.*.errors
    • are there link\sessions\transport errors, how links\session end?
  • messaging.az.amqp.consumer.lag
    • number of received messages
    • how much time message spent on broker before consumer received it?
    • how far consumer is behind, are there enough consumers
    • are consumers catching un on the backlog or slowing down?
  • messaging.az.amqp.consumer.credits.requested
    • is prefetch configured correctly
    • are enough messaging coming? how fast consumer processed messages?
    • maybe too many messages are coming and consumer is not able to catch up

all metrics:

  • is it specific per namespace, entity, partition, client machine?

@lmolkova
Copy link
Member Author

lmolkova commented Aug 22, 2022

Note, ServiceBus messaging.az.amqp.consumer.credits.requested are not reported with this change as servicebus does not use ReactorReceiver and implements it's own AmqpReceiveLink

@azure-sdk
Copy link
Collaborator

azure-sdk commented Aug 22, 2022

API change check

APIView has identified API level changes in this PR and created following API reviews.

azure-core-test

@JonathanGiles
Copy link
Member

Good work! A while back I recall you mentioning there was a performance impact to having metrics in the codebase. Are you able to quantify this impact when metrics are enabled and also when they are disabled? Thanks!

@lmolkova
Copy link
Member Author

@JonathanGiles, sorry I overlooked your question. Micro-benchmarks show that when metrics are disabled (and any code that collects e.g. duration is guarded), each measurement takes less than a nanosecond (disabledOptimizedMetrics). It's allocation-free.

Reporting a histogram with dynamic (cached in a map) attributes (`basicHistogramWithDynamicAttributes ) is ~60 ns and still allocation free after attributes are created.

Benchmark                                                          Mode  Cnt   Score   Error  Units
OpenTelemetryMetricsBenchmark.basicHistogram                       avgt    6  45.928 ± 1.420  ns/op
OpenTelemetryMetricsBenchmark.basicHistogramWithCommonAttributes   avgt    6  52.500 ± 5.388  ns/op
OpenTelemetryMetricsBenchmark.basicHistogramWithDynamicAttributes  avgt    6  59.446 ± 2.097  ns/op
OpenTelemetryMetricsBenchmark.disabledNotOptimizedMetrics          avgt    6  17.774 ± 0.431  ns/op
OpenTelemetryMetricsBenchmark.disabledOptimizedMetrics             avgt    6   0.208 ± 0.006  ns/op
OpenTelemetryMetricsBenchmark.noopMeterProviderNotOptimized        avgt    6  17.974 ± 0.553  ns/op

I've tried to measure it with EventHubs performance tests, and the problem is that they are quite unstable. They aren't CPU or memory bound. Consequent baseline measurements can differ drastically (10-20% easily). With a lot of long tests and aggregations I get something like this:

  • disabled metric vs baseline: 99.3%
  • enabled merics vs baseline: 97.4%

I also profiled perf tests with metrics enabled and during profiling metrics stacks took less than 1% of CPU time, and were even smaller with regard to allocations.

I did some investigation with metric collection in general with storage tests and it seems metric collection impact is smaller than error interval.

@lmolkova lmolkova force-pushed the amqp-core-metrics-p1 branch from 4660ec5 to ed45b5e Compare August 29, 2022 17:17
Copy link
Member

@JonathanGiles JonathanGiles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - a few pieces of feedback for your consideration

Comment on lines +44 to +46
// TODO: by GA we need to figure out default naming (when no mapping is defined)
// and follow otel attributes conventions if we can or make sure mapping is defined
// for all attributes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably tackle this ~now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what's the urgency?

this mapping exists in beta metrics package with no requirement to GA in the nearest future.
However, we have an ask to GA tracing plugin in Zn, where we'll do exactly the same mapping and we'll figure it out there first. Assuming we'll resource the ask, it should happen in the next few months.

And I'd really love to take it now, but we're working with OTel community on changing these exact attributes, so anything I could write might change very soon.

Copy link
Member

@conniey conniey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! THanks for the documentation and links in your description

}

private TelemetryAttributes getResponseCodeAttributes(AmqpResponseCode code, String operation) {
int ind = code == null ? RESPONSE_CODES_COUNT - 1 : code.ordinal();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why we have the - 1 here and then + 1 when defining RESPONSE_CODES_COUNT. A comment where RESPONSE_CODES_COUNT could help in the future if someone is checking this out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the feedback, I added comments in the code!

@lmolkova lmolkova enabled auto-merge (squash) August 30, 2022 02:12
@lmolkova lmolkova merged commit e26f981 into Azure:main Aug 30, 2022
Comment on lines +34 to +35
public static final String STATUS_CODE_KEY = "amqpStatusCode";
public static final String MANAGEMENT_OPERATION_KEY = "amqpOperation";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these constants have to be public or can they be private?

// if there was no response, state is null and indicates a network (probably) error.
// we don't have an enum for network issues and metric attributes cannot have arbitrary
// high-cardinality data, so we'll just use vague "error" for it.
int ind = state == null ? DELIVERY_STATES_COUNT - 1 : state.ordinal();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious how this will version when we add more values to DeliveryState.DeliveryStateType. Will we then need to correlate the library version to the ordinal? DELIVERY_STATES_COUNT - 1 could be the vague "error" state in v1, but when we add a new enum value, that would be a valid enum in v2, v1's values may look like valid response states.

Harshan01 pushed a commit to Harshan01/azure-sdk-for-java that referenced this pull request Aug 30, 2022
vcolin7 pushed a commit to vcolin7/azure-sdk-for-java that referenced this pull request Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Azure.Core azure-core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants