-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JDK Flight Recorder Extension #37900
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Hey. Thanks for this contribution. I wonder if it would be better to have it as an external extension hosted on the Quarkiverse Hub. @maxandersen WDYT? |
Hmmm. Weve talked about how jfr would fit with respect to tracing in past. Not sure where we would put it. Kinda feels like it could make sense to have in core if makes things more consistent/integrated. Does this link up with opentlemetry span id or something completely different? |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
I expect this extension to work on its own, or in conjunction with OpenTelemetry. For systems that already have some kind of tracing in place, the request will be linked to an X-Request-ID or TraceId/SpanId. We analyze the JFR with reference to these IDs when the time of that request becomes long. This will help in analyzing the types of problems that OpenTelemetry cannot solve. There may also be some systems where tracing is difficult to implement. In such systems, JFR alone can be used to solve performance problems. The reasons why the Request ID is needed are as follows |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @chiroito.
We need something like this but we need to discuss it at high level before we merge this.
Questions to answer:
- Does this belong in core or Quarkiverse? It's not clear now.
- How does this work from the user side? Can you please elaborate a bit on what the user can configure, where he will see the data and the format of the events?
- We need documentation.
After this we can go into the details of the PR, which is already very big.
Thank you for your reply @brunobat . Users only need to install this Extention and then start JFR. That's it, this will record JAX-RS information to JFR. Users can get the JFR dump file as usual and load it into a tool such as JDK Mission Control. I understood that the documentation we need is an introduction to these processes. Is this right? Should I document this below? /~https://github.com/quarkusio/quarkus/tree/main/docs/src/main/asciidoc |
Yes, please name it telemetry-jfr.adoc and follow a similar pattern to what we have in /~https://github.com/quarkusio/quarkus/blob/main/docs/src/main/asciidoc/opentelemetry.adoc Independently if this will leave in the core repo, we need an explanation of what it will do, which events will be recorded and where the file will be stored. I don't see any configuration classes. Would it be possible to at least create a config that disables the JFR data production at runtime? |
There's also a design problem. I would like very much to integrate this at some point: /~https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library There is also existing documentation about the JFR here: /~https://github.com/roberttoyonaga/quarkus/blob/main/docs/src/main/asciidoc/native-reference.adoc |
It seems like your goal is to create custom start/end JFR events for requests (containing the requestID). Then you want to use that time window to attribute other JFR events to that requestID? If so, how would this be possible if you have multiple concurrent requests resulting in multiple overlapping time-windows? |
I don't think there will be too much overlap. This PR is introducing new events, while the OTEL runtime metric gatherers only collect some existing JFR data and make it accessible through opentelemetry. |
@brunobat In addition to that, I will include the following at a minimum
I would also like to add a note to the JFR section of the Native Reference Guide. It is possible to create a config to disable JFR recording on this extension. |
@roberttoyonaga In order to link these start and end points, some kind of ID is needed. One common way of ID is to store the ID in the HTTP header in the traditional way, and a newer way would be to use OpenTelemetry's TraceId/SpanId. However, without these keys, the respective events of the JFR cannot be linked. |
Thanks very much for the reply, @chiroito. I would prefer to get the correlation IDs from OTel, if the extension is present. That can be activated in the processor class. It would be also interesting to write a motivation for the PR that will allow reviewers and users to better understand the work scope given that this somehow duplicates some of the work already done in OTel and that there are also ways to generate JFR events that can become available in OTel. In this case, the created JFR events could be used for fine grained debugging by analysing the JFR stream file. Am I right? |
Thank you for your reply, @brunobat I completely agree with you in the case of OTel extensions. However, it is true that not all applications that implement tracing use OTel. The tracing technology options would be
It may appear that the old tracing technology and the W3C Trace Context overlap. I am aware these use different headers, so I think each can store different values. If Quarkus only supports W3C Trace Context, tracing will be lost if Quarkus is deployed on the boundary between the old tracing technology and Trace Context. Therefore, both tracing technologies should be supported and their respective IDs should be recorded. Currently, OTel does not support the older tracing technology. So, I think this extension should not be completely dependent on OTel. Of course, if OTel is present, we will take advantage of it. How we should record IDs needs to be considered. If Quarkus does not get any of the IDs, then each record cannot be linked to each other, as mentioned earlier. In this PR, the extension records the old tracing technology ID, W3C TraceId, and SpanID. If none of these are available, Quarkus would be better provided with the ID. And this extension defines different JFR events with and without OTel. For simpler code, it may be possible to make it one event. Also, the extension should not propagate and should only record those information if it is available. If the Quarkus application uses OTel, then OTel will do the propagation, and if it uses older tracing technology, then the application or REST Server/Client extension should do it. Since it is JFR, the recorded information can be consumed in a variety of ways, including JMC / jfr commands / JFR Stream API. Of course, it can also be transferred to Grafana using Cryostat. We could watch the JAX-RS response time recorded by this extension, and if it exceeds a certain threshold, we could retrieve all the information associated with the corresponding ID. |
@chiroito This context propagation is a deep subject. With the quarkus-opentelemetry extension, apart from the default W3C headers we support many others. See: https://quarkus.io/guides/opentelemetry#propagators . I think we shouldn't try to boil the ocean here and try to support everything. We should instead focus on creating the right abstractions. One way would be to create some interface that abstracts the retrieval of the current span id. Each implementation could have a priority and we can start with a couple of implementations, lets say A propagation header detection could be set as a user configuration first and automated later. What do you think? |
If it's not too late, for easier maintenance purposes, I'd recommend moving this extension to Quarkiverse.
Quarkiverse extensions are consistent and integrated too, plus you have the freedom to release anytime you feel like, independent of the core releases. |
@gastaldi i dont think we have conclusion on wether it makes sense to have this in core or not. |
@brunobat In the future, if we create an IdGenerator that uses the ID of the old tracing technology, such as X-Request-ID, we can deploy the Quarkus application on the boundary between the old tracing technology and OTel and still link the tracings correctly. So how about the following policy?
I expect the following as a concrete implementation.
@gastaldi @maxandersen |
@chiroito here is the documentation about Quarkiverse and how to onboard a new extension: https://hub.quarkiverse.io/ IMHO the best part about having in Quarkiverse is that you can decide who can push to the repository, whereas if that lands in core everything needs to be reviewed and merged by a Quarkus core member, which may not always know the full context and have the necessary know-how to approve it. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
@MetadataDefinition | ||
@Name("io.quarkus.SpanId") | ||
@Label("Span ID") | ||
@Description("Link events that have the common Span ID") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the OpenJDK project, we have tried to avoid using the word event in labels and descriptions. You might want to do the same.
The rationale is that data may be pushed into a database or visualized in ways where the event concept is no longer visible to the user. I would drop the description, since it is pretty obvious, or keep it more abstract. i.e. "Links spans with the same ID together". Same for the TraceID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advice.
I have fixed both Trace and Span.
- /~https://github.com/quarkusio/quarkus/blob/49f38e3fcb4174fae042fdc1d136f3485a8dd43f/extensions/jfr/runtime/src/main/java/io/quarkus/jfr/runtime/TraceIdRelational.java
- /~https://github.com/quarkusio/quarkus/blob/49f38e3fcb4174fae042fdc1d136f3485a8dd43f/extensions/jfr/runtime/src/main/java/io/quarkus/jfr/runtime/SpanIdRelational.java
extensions/jfr/runtime/src/main/java/io/quarkus/jfr/runtime/config/JfrRuntimeConfig.java
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Status for workflow
|
Status for workflow
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
This PR integrates the JDK Flight Recorder into Quarkus.
It will work with Resteasy Reactive and Opentelemetry.
For blocking REST, it records a single blocking event. This is a common way of recording events in JFR.
For non-blocking REST, this will record Start and End events. This is to avoid overlapping multiple events in a non-blocking process, since multiple requests may be processed at the same time, apparently due to I/O waiting.
Also, the request ID is recorded in these events.
This assumes propagation of the ID using HTTP headers, such as the common X-Request-ID or WebLogic Server's ECID-Context.
If OpenTelemetry is enabled, the Request ID also includes its TraceId and SpanID.