- CPU or memory consumption testing with gperftools and pprof
- Methodology
- Analyzing with pprof
- Alternatives to gperftools
- Performance annotations
- Performance analysis with Perfetto
To use pprof
to analyze performance and memory consumption in Envoy, you can
use the built-in statically linked profiler provided by
gperftools, or dynamically link it in to a
specific place yourself.
Static linking is already available (because of a HeapProfilerDump()
call
inside
Envoy::Profiler::Heap::stopProfiler())
).
Build the static binary using bazel:
$ bazel build --define tcmalloc=gperftools //source/exe:envoy-static
To collect a heap profile, run a statically-linked Envoy with pprof
and run the binary with a CPUPROFILE
or HEAPPROFILE
environment variable, like so:
$ CPUPROFILE=/tmp/mybin.cpuprof bazel-bin/source/exe/envoy-static <args>
$ HEAPPROFILE=/tmp/mybin.heapprof bazel-bin/source/exe/envoy-static <args>
CPUPROFILE
or HEAPPROFILE
sets a location for the profiler output. (See Methodology.)
There are several other environment variables that can be set to tweak the behavior of gperftools. See https://gperftools.github.io/gperftools/ for more details.
pprof can be used to symbolize CPU and heap profiles. For example:
$ pprof -text bazel-bin/source/exe/envoy-static /tmp/mybin.cpuprof
The profiler library is automatically linked into envoy_cc_test targets.
Run a test with heap profiling enabled, like so:
$ bazel test --test_env=HEAPPROFILE=/tmp/heapprof --define tcmalloc=gperftools <test target>
Run a test with CPU profiling enabled, like so:
$ bazel test --test_env=CPUPROFILE=/tmp/cpuprof --define tcmalloc=gperftools <test target>
Note that heap checks and heap profile collection in tests have noticiable performance implications. Use the following command to collect a CPU profile from a test target with heap check and heap profile collection disabled:
$ bazel test --test_env=CPUPROFILE=/tmp/cpuprof --test_env=HEAPPROFILE= --test_env=HEAPCHECK= --define tcmalloc=gperftools <test target>
It is possible to start/stop the CPU or heap profiler programmatically.
The Gperftools CPU Profiler
is controlled by ProfilerStart()
/ProfilerStop()
, and the
Gperftools Heap Profiler
is controlled by HeapProfilerStart()
, HeapProfilerStop()
and HeapProfilerDump()
.
These functions are wrapped by Envoy objects defined in source/common/profiler/profiler.h
).
To enable profiling programmatically:
- Add a library dependency on "//source/common/profiler:profiler_lib" to your envoy_cc_library build rule.
- Use the
startProfiler
/stopProfiler
methods ofEnvoy::Profiler::Cpu
orEnvoy::Profiler::Heap
to collect a profile.
Note that startProfiler
should only be called if no other profile of that type is currently active (e.i. profilerEnabled()
returns false).
Example:
// includes
#include "source/common/profiler/profiler.h"
...
Function(...) {
if (!Profiler::Cpu::startProfiler(profile_path)) {
// Error handling
}
...
Do expensive stuff in one or more threads.
...
// Stop the profiler and dump output to the `profile_path` specified when profile was started.
Profiler::Cpu::stopProfiler();
}
To support memory leaks detection, tests are built with gperftools dependencies enabled by default.
Use HeapProfilerStart()
, HeapProfilerStop()
, and HeapProfilerDump()
to start, stop, and persist
memory dumps, respectively. Please see above for more details.
By default, bazel executes tests in a sandbox, which will be deleted together with memory dumps
after the test run. To preserve memory dumps, bazel can be forced to run tests without
sandboxing, by setting the TestRunner
parameter to local
:
bazel test --strategy=TestRunner=local ...
An alternative is to set HEAPPROFILE
environment variable for the test runner:
bazel test --test_env=HEAPPROFILE=/tmp/testprofile ...
For consistent testing, it makes sense to run Envoy for a constant amount of time across trials:
$ timeout <num_seconds> bazel-bin/source/exe/envoy <options>
Envoy will print to stdout something like:
Starting tracking the heap
And then a series of stdouts like:
Dumping heap profile to <heap file 0001> (100 MB currently in use)
Dumping heap profile to <heap file 0002> (200 MB currently in use)
...
This will generate a series of files; if you statically-linked, these are
wherever HEAPPROFILE
points to. Otherwise, they are in the current directory
by default. They'll be named something like main_common_base.0001.heap
,
main_common_base.0002.heap
, etc.
NB: There is no reason this needs to be titled main_common_base
. Whatever
flag you supply HeapProfilerStart
/ HeapProfilerDump
will become the
filename. Multiple sections of code could be profiled simultaneously by setting
multiple HeapProfilerStart()
/ HeapProfilerStop()
breakpoints with unique
identifiers.
pprof can read these heap files in a
number of ways. Most convenient for first-order inspection might be pprof -top
or pprof -text
:
$ pprof -text bazel-bin/source/exe/envoy main_common_base* | head -n5
File: envoy
Build ID: ...
Type: inuse_space
Showing nodes accounting for 6402800.62kB, 98.59% of 6494044.58kB total
Dropped ... nodes (cum <= ...kB)
More complex flame/graph charts can be generated and viewed in a browser, which is often more helpful than text-based output:
$ pprof -http=localhost:9999 bazel-bin/source/exe/envoy main_common_base*
By default Envoy is built without gperftools. In this case the same results can be
achieved for On-CPU analysis with the perf
tool. For this there is no need to tweak
Envoy's environment, you can even do measurements for an instance running in production
(beware of possible performance hit though). Simply run:
$ perf record -g -F 99 -p `pgrep envoy`
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.694 MB perf.data (1532 samples) ]
The program will store the collected sampling data in the file perf.data
. After installing
perf_to_profile this format
is also understood by recent enough versions of pprof
:
$ pprof -http=localhost:9999 /path/to/envoy perf.data
Note that to see correct function names you need to pass an Envoy binary with debug symbols retained. Its version must be the same as the version of the profiled Envoy binary. You can get it from envoyproxy/envoy:debug.
Alternatively, you can use allegro/envoy-perf-pprof which wraps the pprof setup mentioned above (installing perf_to_profile, pprof and getting the proper Envoy debug version) in a Dockerfile.
Unfortunately perf
doesn't support heap profiling analogous to gperftools
, but still
we can get some insight into memory allocations with
Brendan Gregg's tools.
You'll need to have bcc installed in your system and a
copy of FlameGraph:
$ git clone /~https://github.com/brendangregg/FlameGraph
$ sudo /usr/share/bcc/tools/stackcount -p `pgrep envoy` \
-U "/full/path/to/envoy/bazel-bin/source/exe/envoy-static:_Znwm" > out.stack
$ ./FlameGraph/stackcollapse.pl < out.stacks | ./FlameGraph/flamegraph.pl --color=mem \
--title="operator new(std::size_t) Flame Graph" --countname="calls" > out.svg
The stackcount
utility counts function calls and their stack traces using eBPF probes.
Since Envoy by default links statically to tcmalloc which provides its own implementation
of memory management functions the used uprobe looks like
/full/path/to/envoy/bazel-bin/source/exe/envoy-static:_Znwm
. The part before
the colon is a library name (a path to Envoy's binary in our case). The part after the
colon is a function name as it looks like in the output of objdump -tT /path/to/lib
,
that is mangled in our case. To get an idea how your compiler mangles the name you
can use this one-liner:
$ echo -e "#include <new>\n void* operator new(std::size_t) {} " | g++ -x c++ -S - -o- 2> /dev/null
.file ""
.text
.globl _Znwm
.type _Znwm, @function
_Znwm:
.LFB73:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE73:
.size _Znwm, .-_Znwm
.ident "GCC: (GNU) 10.2.1 20201016 (Red Hat 10.2.1-6)"
.section .note.GNU-stack,"",@progbits
WARNING: The name is going to be different on 32-bit and 64-bit platforms due to different sizes
of size_t
. Also void* operator new[](std::size_t)
is a separate function as well as malloc()
.
The latter is a C function and hence not mangled by the way.
stackcount
doesn't count how much memory is allocated, but how often. To answer the "how much"
question you could use Brendan's
mallocstacks
tool, but it works only for malloc()
calls. You need to modify it to take into
account other memory allocating functions.
In case there is a need to measure how long a code path takes time to execute in Envoy you may resort to instrumenting the code with the performance annotations.
There are two types of the annotations. The first one is used to measure operations limited by a common lexical scope. For example:
void doHeavyLifting() {
PERF_OPERATION(op);
bool success = doSomething();
if (success) {
finalizeSuccessfulOperation();
PERF_RECORD(op, "successful", "heavy lifting");
} else {
recoverFromFailure();
PERF_RECORD(op, "failed", "heavy lifting")
}
}
The recorded performance data can be dumped to stdout with a call to PERF_DUMP()
:
Duration(us) # Calls Mean(ns) StdDev(ns) Min(ns) Max(ns) Category Description
2837 22 128965 37035.5 109731 241957 successful heavy lifting
204 13 15745 2449.4 13323 20446 failed heavy lifting
The second type is performance annotations owned by a class instance. They can measure operations spanned across the instance's methods:
class CustomFilter : public Http::StreamEncoderFilter {
public:
...
// Http::StreamEncoderFilter
Http::FilterHeadersStatus encodeHeaders(Http::ResponseHeaderMap& headers,
bool end_stream) override {
PERF_OWNED_OPERATION(perf_operation_);
return Http::FilterHeadersStatus::Continue;
}
Http::FilterDataStatus encodeData(Buffer::Instance& buffer, bool end_stream) override {
if (end_stream) {
PERF_OWNED_RECORD(perf_operation_, "without trailers", "stream encoding")
}
return Http::FilterDataStatus::Continue;
}
Http::FilterTrailersStatus encodeTrailers(Http::ResponseTrailerMap&) override {
PERF_OWNED_RECORD(perf_operation_, "with trailers", "stream encoding");
return Http::FilterTrailersStatus::Continue;
}
...
private:
...
PERF_OWNER(perf_operation_);
};
Similar results can be achieved with Perfetto tracing macros enabled with
bazel --define=perf_tracing=enabled ...
Perfetto is a production-grade open-source stack for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, a library for analyzing traces using SQL and a web-based UI to visualize and explore multi-GB traces.
Currently when the Perfetto support is enabled the tracing data in binary Protobuf
format is dumped into envoy.pftrace
upon process termination. The file
can be analyzed online at https://ui.perfetto.dev/ or with a custom tool.
To generate a scoped trace event which uses C++ RAII under the hood add the
TRACE_EVENT
macro to the block of your interest:
#include "source/common/common/perf_tracing.h"
RequestDecoder& ConnectionManagerImpl::newStream(ResponseEncoder& response_encoder,
bool is_internally_created) {
TRACE_EVENT("core", "ConnectionManagerImpl::newStream"); // Begin "ConnectionManagerImpl::newStream" slice.
...
// End "ConnectionManagerImpl::newStream" slice.
}
For events that don't follow function scoping, use TRACE_EVENT_BEGIN
and
TRACE_EVENT_END
. Please be careful with these events as all events on a given
thread share the same stack. This means that it's not recommended to have a matching
pair of TRACE_EVENT_BEGIN
and TRACE_EVENT_END
markers in separate functions,
since an unrelated event might terminate the original event unexpectedly; for events
that cross function boundaries it's usually best to emit them on a separate track.
Below is an example for a trace event covering an object's life span:
#include "source/common/common/perf_tracing.h"
Http::Request::Request(int request_id)
: request_id_(request_id) {
TRACE_EVENT_BEGIN("core", "Http::Request",
perfetto::Track(request_id_, perfetto::ThreadTrack::Current()));
...
}
Http::Request::~Request() {
...
TRACE_EVENT_END("core", perfetto::Track(request_id_, perfetto::ThreadTrack::Current()));
}
Unfortunately this may lead to excessive number of tracks if they are unique for every pair of emitted events. The existing visualization tools may not work well if the number of tracks is too big. In this case the resulting trace data needs to be processed differently. Alternatively, if you are interested in benchmarking only and don't need any tracing capabilities, then you can resort to the Performance Annotation system mentioned above which supports cross-scoped events too, but doesn't require any post-processing to get a benchmark's final report.
Time-varying numeric data can be recorded with the TRACE_COUNTER
macro:
TRACE_COUNTER("extensions", "MemoryAllocated",
tcmalloc::MallocExtension::GetNumericProperty("generic.current_allocated_bytes"));
For more details please refer https://perfetto.dev/docs/instrumentation/track-events.