Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] At startup, send all locally stored traces instead of waiting the first batch + some time #1305

Open
Apokalypt opened this issue Apr 10, 2024 · 8 comments

Comments

@Apokalypt
Copy link

Context

Explication

Right now, when a NodeJS application crashes, the library gives the possibility to call the <TelemetryClient>.flush method with the isAppCrashing option set to true. This can be done

  • manually, for example by listening the "uncaughtExceptionMonitor" event emitted by "process" and calling the method when it's emitted
  • automatically by enabling auto collect exception

This method tells the Sender class to store traces in a local file if disk retry caching is enabled (cf code).

Problem

The problem is that these files stored locally will only be sent after the first batch of traces (after a restart) and after having wait for the resendInterval (by default, set to 1 minute). You can check this yourself in the code: here
This means that if there's a major bug (leading to a crash) in the application a few seconds after start-up, the traces will never be sent, making it difficult to analyze the problem. Even if the bug doesn't arrive so quickly, this can disrupt the analysis of the engineers, who will see the logs due to the application restart but not the crash logs (which will surely arrive 1 minute later).

Implementation expected

I'd like the library to force the push of ALL local files (not just the first one) right after client startup. This would make debugging easier when the application crashes and reboots...
At the same time, it might be a good idea to make sure that these files are only deleted if the send is successful, and not before sending (this will prevent loss of traces in the event of the application crashing just after the call has started).

Proposal

If the above implementation looks right to you, I'd be very happy to work on it and create a PR to improve the behavior as explained above. Let me know what you think

@Apokalypt
Copy link
Author

Apokalypt commented Apr 10, 2024

Before creating this issue, I tried to see if I could find a way to force the trace push "manually". I found 3, but this is absolutely not a good way to do it...

1st solution

  1. Create the appinsight client, set disk retry caching to true and set auto collect exceptions to true
  2. Listen the event uncaughtExceptionMonitor
  3. When it's emitted, change the resend interval to few millisecond
  4. Force the creation of a trace (e.g. with a log, with the method trackTrace, ...)
  5. Call the flush method without isAppCrashing option
  6. Wait a few seconds to be sure that all traces were pushed (even the exception ones that may be stored previously by the library inside of a local file) and then force the application to stop

--> We force the app to stay alive in a possible unstable state

2nd solution

  1. Create the appinsight client, set disk retry caching to true and set auto collect exceptions to false
  2. Listen the event uncaughtExceptionMonitor
  3. When it's emitted, track manually the exception thanks to trackException method
  4. Call the flush method without isAppCrashing option and, in callback, force the application to stop

--> We force the app to stay alive in a possible unstable state

3rd solution

  1. Create the appinsight client and set disk retry caching to true with the interval set to few MS
  2. At startup, force the creation of a telemetry and immediately call the flush method
  3. Wait few seconds to be sure that all locally file stored has been sent and then update the interval of the disk retry caching to a more normal value

@Apokalypt
Copy link
Author

What do you think @JacksonWeber ? I'm up for doing the PR, but I'd rather have an opinion before I start working on it.

@JacksonWeber
Copy link
Contributor

@Apokalypt Before you begin any work on something like this, we're migrating away from the Application Insights 2.X SDK to an OpenTelemetry based solution that will be the foundation going forward. So if you look at the behavior of /~https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/monitor/monitor-opentelemetry and find you have the same issue there (I believe this logic is the same), it would be best to discuss the suggestion in that repository. Haven't gotten much time to investigate this suggestion, but I wanted to make you aware of that change first.

@Apokalypt
Copy link
Author

@Apokalypt Before you begin any work on something like this, we're migrating away from the Application Insights 2.X SDK to an OpenTelemetry based solution that will be the foundation going forward. So if you look at the behavior of /~https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/monitor/monitor-opentelemetry and find you have the same issue there (I believe this logic is the same), it would be best to discuss the suggestion in that repository. Haven't gotten much time to investigate this suggestion, but I wanted to make you aware of that change first.

@JacksonWeber Thanks for your answer.
I'm already aware of this major change, however I didn't know its current status (close to a stable release or not) which is why I was suggesting it on the current version. In fact, this suggestion stems mainly from a need expressed by the team where I work, which is the second reason why I'm suggesting it on version 2.x.

Nevertheless, if version 3.x is close to a stable release (and the behavior is the same/similar) I'm okay with proposing an implementation directly on 3.x. Do you have an ETA of a 3.x stable release?

@JacksonWeber
Copy link
Contributor

@Apokalypt Version 3 is expected to release this week.

@Apokalypt
Copy link
Author

Ok thx, I will take a look at the version 3.x to implement this feature 👍

@Apokalypt
Copy link
Author

Unfortunately, I don't think I'll have time to look at the implementation at the moment. Especially as I've had a quick look at the project and I have the impression that the functionality for sending missing traces following a crash has been removed... From there, I have 2 questions:

  • Will version 2.x be maintained if there's a security flaw?
  • Do you have the time to look into these features @JacksonWeber ?

@JacksonWeber
Copy link
Contributor

@Apokalypt I believe the functionality you're referring too still exists. You should be able to see how we handle persisting here.

Regarding 2.X SDK maintenance, yes, the 2.X SDK will be updated in the case of security flaws to support customers on that version.

I can work on testing this scenario and determine if we can prioritize it. I'll let you know if development begins on this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants