Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sleeping computer causes span timings to be incorrect in browser #852

Closed
dyladan opened this issue Mar 12, 2020 · 43 comments · Fixed by #3259
Closed

Sleeping computer causes span timings to be incorrect in browser #852

dyladan opened this issue Mar 12, 2020 · 43 comments · Fixed by #3259
Assignees
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect

Comments

@dyladan
Copy link
Member

dyladan commented Mar 12, 2020

While debugging an issue for a user here https://gitter.im/open-telemetry/opentelemetry-node?at=5e6a4b54d17593652b7c8154 it was found that while a computer is slept or hibernated, the performance.now() monotonic clock may be paused. This causes the assumption that performance.timeOrigin + performance.now() ~= Date.now() to be incorrect by some arbitrary amount of time which may be hours or days.

@dyladan dyladan added the bug Something isn't working label Mar 12, 2020
@dyladan
Copy link
Member Author

dyladan commented Mar 13, 2020

Originally reported by @nvolker

@mayurkale22
Copy link
Member

@draffensperger @obecny would like to get your expertise on this.

@dyladan
Copy link
Member Author

dyladan commented Mar 13, 2020

One solution would be to use Date.now() to capture startTime, but use performance.now() to calculate the duration of the span. This would give us still high fidelity span durations while mitigating this issue. One problem with this is that Date.now() trusts the system clock, so if the system clock ever changes, Date.now() will use the updated time. performance.timeOrigin and performance.now() are monotonic and ignore changes to the system clock, which makes them ideal for calculating durations.

@Flarna
Copy link
Member

Flarna commented Mar 16, 2020

Is this only effecting browsers or also node? As far as I know node uses also performance.timeOrigin captured at startup therefore I would expect a similar behavior. If it works in node we could take a look how they solved it there.

@obecny
Copy link
Member

obecny commented Mar 16, 2020

can you provide more details. Are spans still being created when computer sleeps or how does this really affect that ?

@dyladan
Copy link
Member Author

dyladan commented Mar 16, 2020

sleep you computer for a few hours. any spans created after you wake up have timestamps that show you a few hours in the past because the clock paused. This behavior continues until the browser is restarted.

@obecny
Copy link
Member

obecny commented Mar 16, 2020

and what system / laptop has that - as I'm working on mac and my computer sleeps every day, browser is not restarted for weeks - I think I would notice that before hmm.

@dyladan
Copy link
Member Author

dyladan commented Mar 16, 2020

The machine which had the problem was a mac using chrome. Possibly this is a sleep v hibernate issue?

@obecny
Copy link
Member

obecny commented Mar 16, 2020

ok I see it now too
new Date(performance.timeOrigin + performance.now()) is behind real time

@obecny
Copy link
Member

obecny commented Mar 16, 2020

maybe we should raise this as a chrome bug ?

@draffensperger
Copy link
Contributor

This issue of someone sleeping the computer, or leaving the tab open to go get coffee, etc. is sort of a basic limitation of measurement in the browser world. If a user starts an operation, shuts their computer, and then comes back 8 hours later and it finishes - then how long did the operation take?

Here are a few ideas for how we could deal with this:

  • Provide some documentation that this suspected to be a known issue and we suggest that people ignore large outlier spans from the client in their debugging (this is the easiest approach)
  • Attempt to do some detection in the web library itself of when the user closes the tab or moves to a different tab (see https://stackoverflow.com/questions/3888902/detect-browser-or-tab-closing), and do some handling of that, e.g. add a special tag to all spans that the browser tab was switched and possibly even end and send what spans we currently have (not sure on that).
  • Record both the high-resolution timestamp of spans (using performance.now) to make sure we aren't losing accuracy, but also record the Date.now times, and then we can do some correction computations before spans are written and if the times don't cohere, we can make a best-effort fix, and possibly add some special tag to the span to indicate that it had this monotonic clock issue.

I think it would be worth us digging more into the browser specs, docs, etc. to better understand this Chrome on Mac behavior and what was intended by the authors for performance.now. Maybe it is a Chrome bug.

When I run new Date(performance.timeOrigin+performance.now()).getTime() - new Date().getTime() in my browser, I seem to get 0.

@obecny
Copy link
Member

obecny commented Mar 16, 2020

When I run new Date(performance.timeOrigin+performance.now()).getTime() - new Date().getTime() in my browser, I seem to get 0.

just close laptop for 30 seconds, open and run it, seems like a valid bug

@obecny
Copy link
Member

obecny commented Mar 16, 2020

I think all we have to do is to keep the performance, but then when we export spans we should add delta.
const delta = new new Date().getTime() - Date(performance.timeOrigin+performance.now()).getTime()

@dyladan
Copy link
Member Author

dyladan commented Mar 16, 2020

I think you would have to add the delta at the time that you generate the timestamp. if you do it at export time the whole span will shift if it is started before the computer sleeps

@dyladan
Copy link
Member Author

dyladan commented Mar 16, 2020

@obecny do you know if this affects the CORS preflight timing?

@obecny
Copy link
Member

obecny commented Mar 16, 2020

@dyladan this might effect all, but I think we should create an issue to investigate what is affected and then fix everything, I can take it after finishing grpc

@Flarna
Copy link
Member

Flarna commented Mar 18, 2020

I tried NodeJS 13.11.0 and Firefox 74.0 on Windows 10 on my Dell E7440.
Both show no issue in case of sleep but hibernate seems to be a problem. The strange detail is that the time difference I see after 30 seconds hibernate is about 6.5 seconds and after 60 secs hibernate is's once again around 6.5 seconds.

I would not expect that NodeJs processes for servers are executed on notebooks going to hibernate frequently but AWS lambda suspends/freezes the containers used.

@dyladan
Copy link
Member Author

dyladan commented Mar 18, 2020

This is a difficult thing to test in lambda as you have so little control over the sleep

@pauldraper
Copy link
Contributor

This means there is simply no way to get true wall time on those platform.

These have the same appearance:

  1. Start span, sleeping for 30 minutes, and end.
  2. Start span, adjust system time forward 30 minutes, and end.

Though the first is 30 minutes wall time and second is not.

I think it would be worth us digging more into the browser specs, docs, etc. to better understand this Chrome on Mac behavior and what was intended by the authors for performance.now. Maybe it is a Chrome bug.

IDK, but it's a common issue. Go has the same problem.

@dyladan
Copy link
Member Author

dyladan commented Mar 24, 2020

In opentelemetry JS we use the performance timer which is not affected by adjustments to the system clock, so (2) would have the correct duration.

@dyladan
Copy link
Member Author

dyladan commented Mar 25, 2020

Here are the options so far:

  1. Document this as a known issue and move on
  2. Each time a timestamp is requested, detect if the performance timer has diverged significantly from the system clock and apply a correction
  3. Record startTime using the system clock, and calculate duration using the performance timer.

For what should be obvious reasons, I do not think we should go with (1).

On the call today we discussed options 2 and 3.

Option 2 has the advantage of having timings always accurate according to the system clock, but if the system clock changes during a span the timings could be incorrect.

Option 3 has the advantage of using a monotonic high resolution clock for duration independent from system clock changes, but may report incorrect span timings if the performance timer pauses during a span.

@pauldraper
Copy link
Contributor

Aren't (1) and (3) the same?

@dyladan
Copy link
Member Author

dyladan commented Mar 25, 2020

No. With (1) your spans may show as the whole span being in the past even if the system clock is correct, and the duration could be incorrect if the performance clock pauses during the span. With (3) you will get the span start time as the correct time (if the system clock is correct), but the duration could be incorrect if the performance timer pauses during the span.

@draffensperger
Copy link
Contributor

My suggestion would be that we try to keep the precision of the monotonic clock where possible, both for all aspects of the span, it's start/end times and it's timed events. This is because it can help people troubleshoot networking issues, or short timings in a UI to have that kind of accuracy (e.g. how long did it take to set up an SSL connection, how long did it take to render these frames, etc.)

One idea would be to focus first on detecting the condition and annotating spans about it. That way at least we are beginning to communicate the challenge to users in such a way that they could tell how often it's happening in their own data. Then in a basic way they could just exclude those spans from their queries, etc.

Then as the next phase once we have a better understanding of how often this happens and how exactly the different browser clocks relate, we could do some more effective mitigations.

Just a thought!

@dyladan
Copy link
Member Author

dyladan commented Mar 26, 2020

@draffensperger so you're suggesting allow the span timings to be incorrect, but maybe annotate spans with the offset when the system clock diverges from the performance timer by more than a second or so?

@draffensperger
Copy link
Contributor

Correct - my suggestion would be for version 1 of this work, we just detect when the spans timings are probably incorrect and we add some extra labels to the span to indicate how we think it's incorrect. For starters it could just be a likelyClientPause=true label that people could just filter out in their queries to exclude those spans that exhibit this behavior.

Maybe version 1.1 of the idea would be to add labels that communicate what we think about the pause e.g. likelyClientPauseStart and likelyClientPauseEnd or something like that.

Then in version 2, once we better understand how often this happens, and maybe do some experimentation with the performance.now clock, maybe write some design doc / RFC thing, etc. and get a solid understanding of the limitations of different browsers, time changes, users adjusting their time, etc. then we actually write code to do the correction.

I like the idea of shifting the span forward, but it feels tricky to do that well. What if the first part of a larger operation triggers an API call that finishes, but the second part (that they did after their coffee break and resumed their computer) triggers a second API call. Shifting the span forwards or backwards would dislocate it from the backend API spans that it's associated with. So I think it's hard to really win here.

But the idea would be that with the version 2, some mitigation strategy, we would still annotate every thing we know about the span as best we can. E.g. add a clientDateStartTime and clientDateEndTime labels that capture the Date.now values of the span start and end, but then the actual span start and end times might be shifted by whatever algorithm we came up with.

@dyladan
Copy link
Member Author

dyladan commented May 5, 2020

I think another user is being affected by this. This time in a kubernetes setup. It seems like for some reason the performance timer may not be reliable in his setup. I'm not familiar enough with kubernetes to know why this might be behind the scenes. Date.now() is reporting correct timestamps in his pod (currently 1588685810545), but let performance = require("perf_hooks").performance; console.log(performance.now() + performance.timeOrigin) is returning an incorrect timestamp (currently 894664825.411137).

/cc @romilpunetha

@jchamberlain
Copy link

jchamberlain commented Dec 21, 2020

Hey all, wanted to chime in since I've spent several hours highly confused as to why I couldn't see query my spans and it sounds like this might be why.

It looks to me like performance.timeOrigin is scoped per page load in Chrome, but per browser load in Firefox. That is, opening a new tab or refreshing the page gets me a new value for performance.timeOrigin in Chrome, but not in Firefox. (I'm on Firefox 84 on Mac.)

Until a few minutes ago, I hadn't restarted Firefox for a couple weeks, but had put my computer to sleep every night. As a result, all my OpenTelemtry traces had timestamps from Dec 11, and I was pulling out my hair trying to understand why I couldn't find my spans in Lightstep, Honeycomb, or Elastic. Restarting Firefox magically solved my problem.

Am I correct that this bug is probably to blame? And does that mean this bug is minor for Chrome but kinda huge for Firefox? (I.e., any Firefox user on my site will be invisible in my monitoring tool of choice unless they've recently restarted their browser.)

@ajbouh
Copy link

ajbouh commented Dec 21, 2020

I saw an issue similar to this in chrome. I ended up sampling the delta just before initializing the exporter and then adjusting starttimes before upload. I doubt this is enough to actually address all possible permutations of clockskew though :/

@dyladan
Copy link
Member Author

dyladan commented Dec 21, 2020

Am I correct that this bug is probably to blame?

Most likely

I doubt this is enough to actually address all possible permutations of clockskew though :/

Unfortunately this is also likely correct

I tried to tackle this a while ago in #1019. It is unfortunately a tricky problem. I do think it should be solved though.

@dobesv
Copy link
Contributor

dobesv commented Jan 5, 2021

I see two related but separate issues here:

  1. If the device sleeps/hibernates while the app is running, Date.getTime() and performance.timeOrigin + performance.now() can get out of sync because performance.now() only increments while the computer is awake, thus start & end times on spans show up at the wrong "wall clock time"
  2. If the device sleeps/hibernates in the middle of a span, only the end time would be affected by the first issue, resulting in a much shorter duration and earlier end time than expected

Considerations:

  • if a low resolution timer is used (e.g. Date.getTime()) then the times would be more accurate but less precise to the point of being useless for very short spans or series of short spans close together. Also if the clock is set backwards mid-operation you might get a span that ends before it starts and have to account for that
  • there's no high-resolution timer that gives accurate "wall clock" times
  • complicated logic to try and detect and compensate for drift between the performance timer and the wall clock every time a span is created is probably not desirable
  • A trace can be made up of multiple spans and ideally the spans within a single trace (or span hierarchy) will be ordered on a timeline in the same order they were executed
  • for a distributed trace it is better for the start/end times of spans to line up with their counterparts in other services
  • changes to the system clock should probably not shorten span durations or cause them to be zero or negative

Note quite sure of the solution, I suppose it is in impossible to completely solve this in general.

@dyladan
Copy link
Member Author

dyladan commented Jan 5, 2021

Note quite sure of the solution, I suppose it is in impossible to completely solve this in general.

I suspect it may be impossible to truly solve this in the general without some server to synchronize clocks. One solution is to use the low-resolution clock for start times, and the high-resolution monotonic clock for durations. This would ensure (on systems whose system time is correct) that spans have close to accurate start times and very accurate durations, while ensuring that system clock changes do not affect span timings.

@obecny
Copy link
Member

obecny commented Jan 5, 2021

I think we have been discussing this for quite long time :). I would be in favor of creating some solution and then validate it. If the only thing that is 100% up to date is new Date().getTime() then we will have to use it one way or another. We can then mix Date with performance.now calculate the deltas and return this calculation correctly. Even if we lose some microseconds it will be still much better and bullet proof comparing to what we have now. But I would be really in favour of creating some MVP as we all have some ideas how this can be done, now is probably a right time to make it happen.

@dobesv
Copy link
Contributor

dobesv commented Jan 5, 2021

Is there a way to make this configurable on the client so people can experiment?

For example, provide a clock implementation to the opentelemetry API to use to generate timestamps?

@dyladan
Copy link
Member Author

dyladan commented Jan 5, 2021

The interface for a clock implementation would have to make assumptions about the final implementation. For instance do we implement now() to get the current timestamp, or do we implement t = start(); t.end() and get durations?

@Flarna
Copy link
Member

Flarna commented Jan 5, 2021

I think we should differentiate between node and browser regarding this. I don't think we should add unneeded overhead/complexity in node because of hibernate/sleep issues effecting effectively only browsers.

Besides that I think the API should offer both, get an absolute timestamp and get a duration.

Currently users can provide timestamps for span start/end but they have no possibility via API to use the same timesource as SDK. Having a clock interface in API would improve this.
Having the possibility to install a time provider (similar as a context manager) would even allow users to use something more fancy like a clock synced with their backend.

@Oberon00
Copy link
Member

Oberon00 commented Jul 12, 2022

We are actually affected by this bug on Microsoft Azure (confirmed with Azure Functions Node.js v14.18.1 on Windows), we observe time drifts of the span start in the order of 10 seconds relative to the wall clock time. I can also reproduce the issue locally be hibernating my notebook, with Node v15.9.0 on Windows 11.

So it seems to not only affect Node 8 as well.

@Oberon00
Copy link
Member

Oberon00 commented Jul 12, 2022

I think the wall clock time could get out of sync with the high resolution timer for any number of reasons, so many that I think it can be taken for a fact that it will drift if the application is running for any longer amount of time. Examples:

Only affecting wall clock time but not high resolution timer:

  • NTP sync or other time correction
  • Leap seconds (which do affect UTC, contrary to daylight saving time changes)

Affecting high resolution timer:

  • Any kind of suspension:
    • "I closed my notebook lid" / hibernate / suspend / sleep -- this will mostly affect the browser and dev machines
    • VM being migrated to another physical host and probably other VM shenanigans you may face when running on any PaaS/FaaS product where you-knows-what may happen to the underlying VM.
    • CPU clock changes are known to cause inaccuracies in some configurations
  • Some hardware / firmware bugs & corner cases (at least historically)

It seems like performance.now is uv_hrtime: /~https://github.com/nodejs/node/blob/5fad0b93667ffc6e4def52996b9529ac99b26319/src/node_perf_common.h#L19

You found nodejs/node#17893 already, but even if the offset is calculated correctly at startup, there is just no way to solve this issue with a constant calculated-only-once offset. There is libuv/libuv#1674 for the underlying uv_hrtime API, which has been closed as stale.

uv_hrtime is implemented with QueryPerformanceCounter on Windows. /~https://github.com/libuv/libuv/blob/f250c6c73ee45aa93ec44133c9e0c635780ea741/src/win/util.c#L490-L514, which is definitely liable to the aforementioned time drift.

The Linux (actually Linux-specific, other Unix have different implementations) is more complicated: The generic Unix entrypoint has only one line /~https://github.com/libuv/libuv/blob/0b1c752b5c40a85d5c749cd30ee6811997a8f71e/src/unix/core.c#L110-L112 calling the specific part with {{UV_CLOCK_PRECISE}} /~https://github.com/libuv/libuv/blob/c40f8cb9f8ddf69d116952f8924a11ec0623b445/src/unix/linux-core.c#L121-L155. This ends up calling clock_gettime(CLOCK_MONOTONIC), which is a libc/POSIX API, described for Linux e.g. here https://man7.org/linux/man-pages/man3/clock_gettime.3.html:

This clock does not count time that the system is suspended.

On Linux there would be CLOCK_BOOTTIME which would (supposedly) solve this problem but I believe it is not accessible through Node. Also, there is no equivalent for Windows (unless you want something a bit less precise; though maybe CLOCK_BOOTTIME is also less precise than CLOCK_MONOTONIC).

You could look at how other SDKs solve this problem. Python has a different API https://docs.python.org/3/library/time.html#time.time_ns, which ends up in this implementation /~https://github.com/python/cpython/blob/0ff626f210c69643d0d5afad1e6ec6511272b3ce/Python/pytime.c#L847-L955, but that's probably not really applicable to Node.js as (an is a "a bit less precise" at least on Windows).

I think you could implement rather one to one what Java does though: There, each local root span takes a timestamp with both the most precise available wall clock time and the high resolution timer when it starts to calculate the offset. From then on, only the HR timer is used, i.e. relative offsets of local child span start times and all end times are precise. Of course, if a suspension / leap seconds / ... happens during the trace, you still have the time drift, but that is much less likely and the next trace will be correct again. This is implemented with the aptly named /~https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/AnchoredClock.java and this logic at span creation: /~https://github.com/open-telemetry/opentelemetry-java/blob/16be81aed803e15694de29c9cea25f7bcf4d77c1/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/SdkSpan.java#L151-L182

@dyladan dyladan self-assigned this Jul 29, 2022
@dyladan dyladan added the priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect label Jul 29, 2022
@guyguy333
Copy link

Is this issue fixed now that #3134 has been merged ?

@dyladan
Copy link
Member Author

dyladan commented Sep 16, 2022

It is waiting on #3259 for the release but yes

@dyladan dyladan linked a pull request Sep 16, 2022 that will close this issue
pichlermarc pushed a commit to dynatrace-oss-contrib/opentelemetry-js that referenced this issue Dec 15, 2023
* chore(deps): update dependency @types/node to v16

* fix: typescript issues
martinkuba pushed a commit to martinkuba/opentelemetry-js that referenced this issue Mar 13, 2024
* chore(deps): update dependency @types/node to v16

* fix: typescript issues
martinkuba pushed a commit to martinkuba/opentelemetry-js that referenced this issue Mar 16, 2024
* chore(deps): update dependency @types/node to v16

* fix: typescript issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect
Projects
None yet