[FEATURE] At startup, send all locally stored traces instead of waiting the first batch + some time #1305

Apokalypt · 2024-04-10T17:13:44Z

Context

Explication

Right now, when a NodeJS application crashes, the library gives the possibility to call the <TelemetryClient>.flush method with the isAppCrashing option set to true. This can be done

manually, for example by listening the "uncaughtExceptionMonitor" event emitted by "process" and calling the method when it's emitted
automatically by enabling auto collect exception

This method tells the Sender class to store traces in a local file if disk retry caching is enabled (cf code).

Problem

The problem is that these files stored locally will only be sent after the first batch of traces (after a restart) and after having wait for the resendInterval (by default, set to 1 minute). You can check this yourself in the code: here
This means that if there's a major bug (leading to a crash) in the application a few seconds after start-up, the traces will never be sent, making it difficult to analyze the problem. Even if the bug doesn't arrive so quickly, this can disrupt the analysis of the engineers, who will see the logs due to the application restart but not the crash logs (which will surely arrive 1 minute later).

Implementation expected

I'd like the library to force the push of ALL local files (not just the first one) right after client startup. This would make debugging easier when the application crashes and reboots...
At the same time, it might be a good idea to make sure that these files are only deleted if the send is successful, and not before sending (this will prevent loss of traces in the event of the application crashing just after the call has started).

Proposal

If the above implementation looks right to you, I'd be very happy to work on it and create a PR to improve the behavior as explained above. Let me know what you think

The text was updated successfully, but these errors were encountered:

Apokalypt · 2024-04-10T17:13:54Z

Before creating this issue, I tried to see if I could find a way to force the trace push "manually". I found 3, but this is absolutely not a good way to do it...

1st solution

Create the appinsight client, set disk retry caching to true and set auto collect exceptions to true

Listen the event uncaughtExceptionMonitor

When it's emitted, change the resend interval to few millisecond

Force the creation of a trace (e.g. with a log, with the method trackTrace, ...)

Call the flush method without isAppCrashing option

Wait a few seconds to be sure that all traces were pushed (even the exception ones that may be stored previously by the library inside of a local file) and then force the application to stop

--> We force the app to stay alive in a possible unstable state

2nd solution

Create the appinsight client, set disk retry caching to true and set auto collect exceptions to false

Listen the event uncaughtExceptionMonitor

When it's emitted, track manually the exception thanks to trackException method

Call the flush method without isAppCrashing option and, in callback, force the application to stop

--> We force the app to stay alive in a possible unstable state

3rd solution

Create the appinsight client and set disk retry caching to true with the interval set to few MS

At startup, force the creation of a telemetry and immediately call the flush method

Wait few seconds to be sure that all locally file stored has been sent and then update the interval of the disk retry caching to a more normal value

Apokalypt · 2024-04-15T13:06:57Z

What do you think @JacksonWeber ? I'm up for doing the PR, but I'd rather have an opinion before I start working on it.

JacksonWeber · 2024-04-16T22:18:45Z

@Apokalypt Before you begin any work on something like this, we're migrating away from the Application Insights 2.X SDK to an OpenTelemetry based solution that will be the foundation going forward. So if you look at the behavior of /~https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/monitor/monitor-opentelemetry and find you have the same issue there (I believe this logic is the same), it would be best to discuss the suggestion in that repository. Haven't gotten much time to investigate this suggestion, but I wanted to make you aware of that change first.

Apokalypt · 2024-04-16T22:48:11Z

@Apokalypt Before you begin any work on something like this, we're migrating away from the Application Insights 2.X SDK to an OpenTelemetry based solution that will be the foundation going forward. So if you look at the behavior of /~https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/monitor/monitor-opentelemetry and find you have the same issue there (I believe this logic is the same), it would be best to discuss the suggestion in that repository. Haven't gotten much time to investigate this suggestion, but I wanted to make you aware of that change first.

@JacksonWeber Thanks for your answer.
I'm already aware of this major change, however I didn't know its current status (close to a stable release or not) which is why I was suggesting it on the current version. In fact, this suggestion stems mainly from a need expressed by the team where I work, which is the second reason why I'm suggesting it on version 2.x.

Nevertheless, if version 3.x is close to a stable release (and the behavior is the same/similar) I'm okay with proposing an implementation directly on 3.x. Do you have an ETA of a 3.x stable release?

JacksonWeber · 2024-04-18T00:21:08Z

@Apokalypt Version 3 is expected to release this week.

Apokalypt · 2024-04-18T06:54:10Z

Ok thx, I will take a look at the version 3.x to implement this feature 👍

Apokalypt · 2024-05-21T08:45:05Z

Unfortunately, I don't think I'll have time to look at the implementation at the moment. Especially as I've had a quick look at the project and I have the impression that the functionality for sending missing traces following a crash has been removed... From there, I have 2 questions:

Will version 2.x be maintained if there's a security flaw?
Do you have the time to look into these features @JacksonWeber ?

JacksonWeber · 2024-05-22T00:49:42Z

@Apokalypt I believe the functionality you're referring too still exists. You should be able to see how we handle persisting here.

Regarding 2.X SDK maintenance, yes, the 2.X SDK will be updated in the case of security flaws to support customers on that version.

I can work on testing this scenario and determine if we can prioritize it. I'll let you know if development begins on this feature.

JacksonWeber added the feature-request label Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] At startup, send all locally stored traces instead of waiting the first batch + some time #1305

[FEATURE] At startup, send all locally stored traces instead of waiting the first batch + some time #1305

Apokalypt commented Apr 10, 2024

Apokalypt commented Apr 10, 2024 •

edited

Loading

Apokalypt commented Apr 15, 2024

JacksonWeber commented Apr 16, 2024

Apokalypt commented Apr 16, 2024

JacksonWeber commented Apr 18, 2024

Apokalypt commented Apr 18, 2024

Apokalypt commented May 21, 2024

JacksonWeber commented May 22, 2024

[FEATURE] At startup, send all locally stored traces instead of waiting the first batch + some time #1305

[FEATURE] At startup, send all locally stored traces instead of waiting the first batch + some time #1305

Comments

Apokalypt commented Apr 10, 2024

Context

Explication

Problem

Implementation expected

Proposal

Apokalypt commented Apr 10, 2024 • edited Loading

1st solution

2nd solution

3rd solution

Apokalypt commented Apr 15, 2024

JacksonWeber commented Apr 16, 2024

Apokalypt commented Apr 16, 2024

JacksonWeber commented Apr 18, 2024

Apokalypt commented Apr 18, 2024

Apokalypt commented May 21, 2024

JacksonWeber commented May 22, 2024

Apokalypt commented Apr 10, 2024 •

edited

Loading