Uses blocking syscalls to write a page of 4096 bytes, fsync the write, read the page back in, and repeats to iterate across a large file of 256 MB. This benchmark, while obviously artificial and stressing the absolute minimum Advanced Format sector size of 4 KB instead of a larger, more efficient block size of 64 KB, is otherwise not too far removed from what some storage systems in paranoid mode might do to protect against misdirected writes or latent sector errors (LSEs) for critical data. This should be the fastest candidate for the task on Linux, faster than Zig's evented I/O since that incurs additional context switches to the userspace I/O thread.
Uses io_uring syscalls to write a page of 4096 bytes, fsync the write, read the page back in, and repeats to iterate across a large file of 256 MB. This is all non-blocking, with a pure single-threaded event loop, something that is not otherwise possible on Linux for file system I/O apart from O_DIRECT. This is how we solve #1908 and #5962 for Zig.
There are also further io_uring optimizations that we don't take advantage of here in this benchmark:
- SQPOLL to eliminate submission syscalls entirely.
- Registered file descriptors to eliminate atomic file referencing in the kernel.
- Registered buffers to eliminate page mapping in the kernel.
These were taken on a 2020 MacBook Air running Ubuntu 20.04 with a 5.7.15 kernel. Some Linux machines may further show an order of magnitude worse performance for fs_blocking.zig:
$ zig run fs_blocking.zig -O ReleaseFast
fs blocking: write(4096)/fsync/read(4096) * 65536 pages = 196608 syscalls in 7018ms
fs blocking: write(4096)/fsync/read(4096) * 65536 pages = 196608 syscalls in 7082ms
fs blocking: write(4096)/fsync/read(4096) * 65536 pages = 196608 syscalls in 7050ms
fs blocking: write(4096)/fsync/read(4096) * 65536 pages = 196608 syscalls in 7305ms
fs blocking: write(4096)/fsync/read(4096) * 65536 pages = 196608 syscalls in 7961ms
$ zig run fs_io_uring.zig -O ReleaseFast
fs io_uring: write(4096)/fsync/read(4096) * 65536 pages = 386 syscalls in 3755ms
fs io_uring: write(4096)/fsync/read(4096) * 65536 pages = 386 syscalls in 3469ms
fs io_uring: write(4096)/fsync/read(4096) * 65536 pages = 386 syscalls in 3656ms
fs io_uring: write(4096)/fsync/read(4096) * 65536 pages = 386 syscalls in 3881ms
fs io_uring: write(4096)/fsync/read(4096) * 65536 pages = 386 syscalls in 4196ms
As you can see, io_uring can drastically amortize the cost of syscalls. What you don't see here, though, is that your single-threaded application is now also non-blocking, so you could spend the I/O time doing CPU-intensive work such as encrypting your next write, or authenticating your last read, while you wait for I/O to complete. This is cycles for jam.
Uses io_uring syscalls to accept one or more TCP connections and then recv/send up to 1000 bytes per message on these connections as an echo server. This server is non-blocking and takes advantage of kernel 5.6 or higher with support for IORING_FEAT_FAST_POLL.
Note that kernels 5.7.16 and up introduce a network performance regression that is being patched. If you can't reproduce these network performance results then make sure that you are on kernel 5.7.15.
We also throw two C contenders in the ring, epoll.c and io_uring.c, and a Node.js candidate that simply does socket.pipe(socket). The Node.js candidate is intended for those coming from Node.js, and is provided only to show the full range of the spectrum, so that we can have low-level and high-level echo-server candidates. JavaScript is not a slow language in itself, and major network performance improvements for Node.js are still possible.
We use rust_echo_bench to send and receive 64-byte messages for 20 seconds, varying the number of connections, doing several runs per benchmark and taking the best of each run:
cargo run --release -- --address "localhost:3001" --number 1 --duration 20 --length 64
cargo run --release -- --address "localhost:3001" --number 2 --duration 20 --length 64
cargo run --release -- --address "localhost:3001" --number 50 --duration 20 --length 64
These were taken on a 2020 MacBook Air running Ubuntu 20.04 with a 5.7.15 kernel:
| Connections  | 1     | 2     | 50     |
|--------------|-------|-------|--------|
| epoll.c      | 31210 | 99696 | 120420 |
| io_uring.c   | 30025 | 87950 | 140668 |
| node.js      | 43694 | 52635 | 52977  |
| io_uring.zig | 31754 | 94731 | 144332 |
An echo server benchmark is a tough benchmark for io_uring because it's read-then-write, read-then-write, so for a single connection there's no opportunity for io_uring to amortize syscalls across connections. However, as your server becomes busier and the number of connections increases, io_uring should outperform epoll in most cases.
Bear in mind that these network benchmarks are not as stable across machines as the file system benchmarks, so you may get different numbers.
What is unique about io_uring here is that the same simple interface can be used for both file system and networking I/O on Linux without resorting to user-space thread pools to emulate async file system I/O.
We are also relying on the kernel to do fast polling for us thanks to IORING_FEAT_FAST_POLL, without having to use epoll. IORING_FEAT_FAST_POLL effectively combines two syscalls into a single syscall. It's more efficient to perform a single truly asynchronous read/write instead of monitoring file descriptor activity and then calling read/write.
Rough benchmarks aside, more importantly, io_uring optimizations such as registered fds and registered buffers enable a whole new way of doing I/O syscalls not possible with the blocking syscall alternatives, allowing the kernel to take long term references to internal data structures or create long term mappings of application memory, greatly reducing per-I/O overhead.
In the past, event loops took existing interfaces for blocking syscalls and made them asynchronous. But now, the io_uring interfaces are more powerful than the blocking alternatives, so this changes everything.
This means that future event loop designs may need to:
- Design for io_uring first, and fall back to older polling methods and blocking syscall threadpools second.
- Use shims for advanced io_uring features such as automatic buffer selection as fallbacks where io_uring is not available, to ensure that the event loop abstraction can expose the full interface of io_uring without compromising performance for platforms where io_uring is available.
In other words, if you want to know what an I/O interface on Linux should look like, start with io_uring, don't retrofit io_uring onto an existing event loop design, and think of how your event loop or networking library will expose features such as automatic buffer selection.