
[sketch] Timeline redux #1098

Merged: 15 commits into snabbco:master, Apr 7, 2023
Conversation

lukego (Member) commented Feb 9, 2017

TL;DR: scroll down for the pretty screenshot.

I have a new revision of the timeline code, intended to come upstream very soon. This branch is a preview for early feedback, especially on the interface. I have rebased and split it into logical commits; I marked the branch [sketch] to reserve the right to rebase again.

Timeline is a circular event log. Calling the logger is cheap (~5 cycles) and events are timestamped with sub-nanosecond precision. The log wraps around after one million events. Instrumenting Snabb with timeline logging makes it possible to do detailed offline profiling and performance diagnostics.
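For intuition: "circular" means that entry number n simply lands in slot n mod 1,000,000, overwriting the oldest entry. A minimal Lua sketch of the idea (illustrative only, not the actual implementation):

    -- Illustrative ring log: a fixed number of slots that wrap around,
    -- so new entries silently overwrite the oldest ones.
    local ring = { size = 1e6, n = 0, slots = {} }
    function ring:log (entry)
       self.slots[self.n % self.size] = entry
       self.n = self.n + 1
    end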

Here is how it works:

Define your events

First you define an "events file" like this excerpt from core/engine.events:

    4|sleep_Hz: usec Hz
    The engine requests that the kernel suspend this process for a period of
    microseconds in order to reduce CPU utilization and achieve a fixed
    frequency of breaths per second (Hz).

    4|sleep_on_idle: usec
    The engine requests that the kernel suspend this process for a period
    of microseconds in order to reduce CPU utilization because idleness
    has been detected (a breath in which no packets were processed.)

    4|wakeup_from_sleep:
    The engine resumes operation after sleeping voluntarily.

This example defines three events. They all have log level 4 (medium), which determines how frequently they are logged to the ring. Each event can have up to six 64-bit integer arguments. The long textual descriptions are stored only once in the log file (i.e. interned.)

Instrument your code

First you create a timeline log in shared memory and load the relevant event definitions:

    timeline_log = timeline.new("engine/timeline")
    events = timeline.load_events(timeline_log, "core.engine")

Then you instrument your code with calls to log events as they happen:

    events.sleep_on_idle(sleep) -- Log message before sleeping
    C.usleep(sleep)
    events.wakeup_from_sleep()  -- Log message after sleeping

These events are timestamped with sub-nanosecond precision using the RDTSCP instruction. Here the log will show both the number of microseconds you requested to sleep (the value of the sleep variable) and the number of cycles you actually slept (the timestamp delta between the two log messages.)
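Events with multiple arguments are logged the same way, passing one value per argument in the order declared in the events file. A minimal sketch using the two-argument sleep_Hz event from the excerpt above (the sleep and target_hz variables here are illustrative):

    -- sleep_Hz was declared with two arguments (usec Hz): the requested
    -- sleep in microseconds and the target breath rate in Hz.
    events.sleep_Hz(sleep, target_hz)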

Generate your data

Now when you run Snabb the file /var/run/snabb/$pid/engine/timeline will contain your events. You can copy this file for offline inspection, either from a running process or from one that has terminated. It is updated in real time as file-backed shared memory.

The engine samples the events by randomizing the log level at the start of each breath. For 99.9% of breaths there is no logging at all. 0.1% are logged with a high level (e.g. start/end of breath), 0.01% with a medium level (e.g. individual push/pull calls), and 0.001% with a low level (e.g. logs within apps and libraries.) This is intended to produce roughly 1000 events per second and wrap the million-element log roughly every 15 minutes.

Calling the log function costs around 5 cycles on average. Most of the time logging is disabled and the call costs ~4 cycles; occasionally logging is enabled and the call costs ~44 cycles. The logger is written in assembler and looks like a simple FFI call to the JIT.

Analyze the data

The log contains one million entries. That's too many to read by hand. The idea is to perform statistical analysis of the logs and be able to compare different sets of logs (e.g. test lab data vs production environment data.) Separate tooling will be available for this.

Here is an example analysis of the cycles per packet cost depending on which app is being called, how many packets are on its input links, and which NUMA node it is scheduled on. This data was generated from a directory containing 10 timeline files from Snabb processes with varied workloads and NUMA affinities (no special preparation needed.)

[Screenshot: timeline-profile-cycles-per-packet]

Graphs like this (and many others over time) will be easy to produce from a directory of timeline files.

RDTSCP is a very useful instruction for self-benchmarking programs that
want to read the CPU timestamp counter efficiently.

See the Intel whitepaper for details:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

Use 'double' instead of 'uint64_t' for values in the timeline file.

This change is motivated by making timeline files easier to process in
R. In the future we may switch back to uint64_t for the TSC counter
and/or argument values for improved precision. The major_version file
header field can be used to avoid confusion.

The obvious downside to using doubles is that the TSC value will lose
precision as the server uptime increases (the TSC starts at zero and
increases at the base frequency of the CPU, e.g. 2GHz.) The impact
seems modest though. For example, a 2GHz CPU would only start rounding
TSC values to the nearest 128 cycles (likely quite acceptable in
practice) after approximately 18 years of operation (2^53 * 128 cycles.)
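
A quick sanity check of that figure in Lua (assuming the 2GHz base
frequency above; doubles are exact up to 2^53, and between 2^60 and
2^61 adjacent representable values are 128 apart):

    -- First TSC value rounded to the nearest 128: 2^53 * 128 = 2^60.
    local cycles  = 2^53 * 128
    local seconds = cycles / 2e9           -- at a 2GHz base frequency
    print(seconds / (365 * 24 * 3600))     --> about 18.3 years of uptime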

So storing the TSC as a double-float is definitely a kludge, but it is
unlikely to cause any real harm and is expedient for the short-term
goal of putting this code to use without getting blocked by e.g. my
lack of sophistication as an R hacker.

Resolved conflict in app.lua between adding timeline events and the new
breath topological-sort machinery.

Simplify the code and eliminate unwanted branches from the engine loop
by drawing a random timeline level from a log-uniform distribution that
mathematically favors higher log levels over lower ones.

Plucked log5() out of the air, i.e. each log level should be enabled
for 5x more breaths than the one below.

Here is how the distribution of log level choice looks in practice using
this algorithm:

    > t = {0,0,0,0,0,0,0,0,0}
    > for i = 1, 1e8 do
         local n = math.max(1,math.ceil(math.log(math.random(5^9))/math.log(5)))
         t[n] = t[n]+1
      end
    > for i,n in ipairs(t) do print(i,n) end
    1       560
    2       2151
    3       10886
    4       55149
    5       273376
    6       1367410
    7       6844261
    8       34228143
    9       171120244

Note: Lua provides only a natural-logarithm function but it is easy to
derive other bases from it (google "log change of base formula").
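For example, a log5 helper via the change-of-base formula:

    -- Change of base: log_5(x) = ln(x) / ln(5).
    local function log5 (x) return math.log(x) / math.log(5) end
    print(log5(125)) --> 3 (up to floating-point rounding)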

I suspect that it is a misfeature for the timeline to sample the
contents of packets. Do we really want user data potentially appearing
in debug logs? Removed for now.
benagricola pushed a commit to benagricola/snabb that referenced this pull request on Jun 14, 2018: "Fix incorrect status sent as alarm-notification"
eugeneia merged commit 596d6b5 into snabbco:master on Apr 7, 2023