-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XDP app: driver for Linux AF_XDP sockets #1450
Conversation
e587f8c55 Merge pull request snabbco#225 from vavrusa/master 5ea3a881e bpf: add missing constants for linux 4.10 - 4.15 2c691e5a7 Merge pull request snabbco#224 from wingo/lseek-syscall-tweak ae38bdbd7 Make "offset" arg to lseek a signed integer 5cb3b6950 Merge pull request snabbco#221 from wingo/util-ls-fix 8e0874609 Promptly close util.ls() dir fd; fix bug with deleted entries 277517436 Merge pull request snabbco#220 from sileht/master 8e48fd094 linux/nl.lua: Use ndmsg struct instead of rtmsg for neighbors 3e482bc4e Merge pull request snabbco#215 from qsn/gettid 57520cce3 expose gettid db1a88e94 Merge pull request snabbco#214 from jsitnicki/sof-flags-linux 270a6e611 Add missing type for struct scm_timestamping for Linux e49232047 Add missing SCM_* constants for Linux timestamping API 60fcc6b48 Add constants for SOF_* flags for Linux 26ac34851 Merge pull request snabbco#210 from alexandergall/linux-if-ioctls 3e6d3e27c Merge pull request snabbco#211 from fperrad/deb d425a22b2 dummy changelog b27eca538 update .gitignore 92292aa4b debian files be257a7e1 debian files generated by lua-create-gitbuildpackage-layout 50a02b94b Add some SIOC{G,S}IF* ioctls for Linux ee90324 Merge pull request snabbco#209 from justincormack/osx_clock ee17863 Add CLOCK_ constants for OSX 178d244 Merge pull request snabbco#208 from justincormack/holes 61450f5 Add SEEK_DATA and SEEK_HOLE constants 9a7b584 Add memfd fnctl sealing support 99beaf5 more test fixes for memfd cc221e4 fix ctest for new fcntl changes, typowq 56c4c76 fix ctest for new fcntl changes 96073cc fix typo 7a73e8a Add more constants for fcntl, memfd 8d3034c Fix ppc64le syscall numbers for newer calls 62828f6 Merge pull request snabbco#206 from johnae/master c4002b6 The spook project is using ljsyscall. 0b266e8 Merge pull request snabbco#205 from lukego/close-fd-safely ad91aa9 Add more protection against fd double-close c8baf9e Merge pull request snabbco#202 from justincormack/dockerignore 66843c5 Use dockerignore to simplify Dockerfile 7b7211d Merge pull request snabbco#200 from justincormack/redo-dockerfile a779caa Docker Cloud does not start processes at priority 0, remove from test b85382d Rework Dockerfile and tests 24f7789 Merge pull request snabbco#199 from vavrusa/master 2ecf486 linux/constants: added new BPF map and prog types 00c1949 use Alpine 3.4 for docker build 0bcafc6 Merge pull request snabbco#196 from kbara/removecunused a4217a8 Merge pull request snabbco#195 from kbara/fixgetcpu ee67430 Remove unused variables from c.lua 4b2e0b2 Fix getcpu: the cpu argument was incorrectly given the node variable previously 214550a update changelog and rockspec for 0.12 release 21f3fd8 Merge pull request snabbco#192 from vavrusa/master 0da437f linux/bpf_prog_load: support custom kernel builds 9981190 fix missing vhangup f245114 Merge pull request snabbco#191 from vavrusa/linux-perf-open 93558c1 linux: added support for tracing/performance counters 0511fb8 Merge pull request snabbco#190 from vavrusa/master 1f141ca linux: added new constants (e.g. attach BPF to socket) 36274f3 linux/bpf: added strflag support 4fd3bd6 Merge pull request snabbco#189 from vavrusa/master aaa89cb linux: added support for eBPF e095295 linux: added new syscall numbers (up to __nr_bpf) 1e079c4 test calls container so just needs file to run d92625e define a docker compose test 5f14711 Merge pull request snabbco#188 from aperezdc/fix-if-nameindex 8915a83 Close socket immediately after error in if_nametoindex() 96c0286 use addons for travis, as learned at fosdem 71241e0 update changelog for unreleased changes 559b499 ignore audit arch constant 5d867d3 new architectures do not have open, will use openat 49d9ff9 test issues with new constants 7cd460e aarch64 audit constants for seccomp 6249e99 more docker examples dd00af9 rockspec fix d20033b rockspec for new release fb17244 update copyright years 9d87597 more Changelog tweaks before release 53856b0 typo 1881526 sometimes winsz is 0, eg if terminal not set up ab0c08d add error message a1c207e update Changelog for forthcoming 0.11 release ce12fb2 addDocker hub to README 602a2b3 fix osx fstatat to use 32 bit stat type, as cannot find how to call to get 64 bit one 8a0a6ad Now have arm machines with working seccomp d14bd38 Appears that setting maac address on lo often works 6e878a1 remove debug print from test 30e9b5b allow skip on EPERM for adjtimex bff3e90 Add strace in Docker image for convenient debugging e67fa31 Use alpine 3.3 for Docker 5148bc3 fix ipv6 tests 185c1a6 more failures with no ipv6 98cc9a2 more fixes for ipv4 only environments for netlink tests 248a935 fix bind errors in environments that do not support ipv6 0eacd64 clean up travis file 5fb71b6 switch to newer Ubuntu in travis 8235724 fix more constant checks not in headers 8dff4ef constants missing on travis db51b08 update Changelog 7540b04 add new rtnetlink values, so tests work under docker 22604b8 remove test that fails in some environments d431693 fix waitid test under docker 1595b7d fix swap test under docker 321fdd2 Now an alpine package available 984b533 Add Dockerfile, fix some of the tests that made unreasonable assumptions 9aeff88 recent osx has *at functions 18cd829 better handling for xattr errors b6bb892 freebsd 11 now has utimensat 7065b0d on freebsd/zfs chflags will fail, skip git-subtree-dir: lib/ljsyscall git-subtree-split: e587f8c55aad3955dddab3a4fa6c1968037b5c6e
Ah and I forgot to mention I only tested this on a 4.19 kernel. AFAIU there are significant improvements to XDP shipping in recent kernels. Maybe I will even be able to get the forcing the
|
(random_v4_packets generated wrong sized packets)
Some thoughts: performance is not amazing at this point. The XDP app pushes something around 1Mpps and spends about half its time in the kernel (tx "kick" is done via a Might however also have something to do with the @bjoto do you happen to know about plans to make the AF_XDP transmit path less syscall-heavy, and/or what are the requirements to be able to pass |
One thing I missed is to reclaim (i.e., not leak) packet buffers (UMEM chunks) after the XDP app is stopped. Having troubles with this as I can’t figure out how to get a consistent ring state after closing the socket. I.e., some packets left on the fill ring after closing the socket might have already been read off the rx ring (and freed by Snabb). I suppose that is the catch of using multiple sockets with different lifetimes with a single UMEM, no sound way to reclaim chunks used by individual sockets? Cc @bjoto
|
What NIC are you using? Is that a NIC with zero-copy support? As of today, only ixgbe, i40e and mlnx5 has zero-copy support. To make AF_XDP a bit less syscall heavy, there's a feature called "need wakeup": https://lore.kernel.org/bpf/1565767643-4908-1-git-send-email-magnus.karlsson@intel.com/ The idea is that you only need to do a syscall when the kernel isn't processing. DPDKs AF_XDP driver uses that. |
Hmm, I'm not familiar with the snabb internals, so I might be off here. |
I am currently using ixgbe with Intel 82599ES NICs for testing (currently on a 4.19 kernel). Not doing any special
IIUC this land(s/ed) in the 5.4 kernel?
Currently this driver registers one UMEM per socket, but all UMEMs share the same memory which is managed by Snabb’s packet freelist. Snabb also has the notion of re-configuring apps at runtime so in a nutshell what we need to be able to do is close an XDP sockets and reclaim the packets currently held by it. Reclaiming the whole UMEM region at once to too coarse for Snabb because parts of that region will be held by other components in the app network. IIUC this could be solved for us if XDP ensured that the ring cursors are synchronized before returning from
Then we could close the socket, grab the in-flight chunks remaining on the rings, and unmap the rings? Not pretty but workable. |
Hmm, and zero-copy is not working there? The support was pretty shaky for 4.19, so I'd move up some versions.
Yes!
Hmm. Thanks for clarifying. Let me think a bit, and get back! |
I found a hack that seems to work for us: since this driver uses a single UMEM (fq,cq pair) for each socket, we can maintain a pair of counters that tracks how many packets are “in-flight” on the rx/tx paths individually for each socket. Under the assumption that chunks are not reordered and the kernel doesn’t do crazy things to the descriptor rings, it seems that we can “rewind” transmit/fill operations reliably enough to reclaim just the packets that haven’t been moved to the rx and completion rings yet. I.e., if we keep book of the number of packets put into the fill ring minus the number of packets received in the rx ring, we can then use that tally to reclaim the packets Maybe. ;-) |
On the perf side, I retested with a 5.3 kernel and that shows more promising numbers around 4 Mpps duplex on a single queue. Not sure if this is a bug or if I am forgetting something, but on the 5.3 kernels I get “Device or resource busy” errors from |
This commit uses BPF object pinning to expose per-interface xskmaps to other Snabb processes so they can attach to free queues on interfaces initialized by other Snabb processes.
On the UMEM/DMA compatibility side: if the I suspect this currently fails for two reasons: 1) that range is generally not mapped at the time of the UMEM registration, 2) possibly the length of region exceeds some kernel-internal limit. @bjoto you might be able to explain what the requirements for UMEM are and why the API is the way it is. From my lack of understanding the design constraints I perceive the UMEM aspect of XDP mostly as arbitrary restrictions. Why can I not just pass any page over the XDP socket? (I.e., we are used to allocate huge pages and pass those to PCI devices that write to them via DMA, why can’t the kernel do the same?) FWIW I see the current UMEM allocator as a workable solution, because it seems useful to provide an exclusive Snabb-XDP mode that should support plenty types of applications. However, the requirement to allocate all memory upfront and the in-ablity to share packet buffers within the Snabb process group are notable downsides. And of course if would be cool to able to mix the XDP and other DMA drivers in the same application. |
Yes, unfortunately, this is a know thing. The socket release can take some time (resetting NIC and cleanup from workqueue), and it's done asynchronous. That means, that the bind() -- as you noted -- will fail until everything is cleaned up. A partial fix for this is part of the 5.4 kernel, which will enable a synchronous cleanup at a later point. I'm hoping to get that in for 5.6. |
Good! Discussed this back and forth with Magnus, and doing the per-chunk householding in the kernel is really something we'd like to avoid. IOW, if you can do that in your userspace allocator, that's great! :-) |
Yikes, I realized I've missed a lot of questions/thoughts! Apologies for the slow turnaround... Great input!
That's a big range (16TB, right). The current AF_XDP UMEM pins all pages, and maps them into kernel VA. The remap range on x86-64 is 32TB, so at least one mapping should be ok.
Ah, yes, the range need to be mapped, so I'd guess that 1 fails.
There are no other constraints than: 1. The range/mapping has to exist in userspace. 2. Not larger than 32 TB. Hugepages should work, if not, that's a bug (there are some optimizations that we haven't done for HP, but it should work).
That should work, as long as you comply to the constraints above!
|
@eugeneia What do you think about adding a feature where you can grow the UMEM, and have non-contiguous UMEM regions? So, you can start small, and then expand after bind()-point? Would that help? We're working on improving the hugepage support on the kernel side (but note that it should definitely still work!). |
I feel like for Snabb ideally there could be an option to enforce the UMEM restrictions lazily on use. Maybe I should explain how the Snabb DMA memory management works at the moment:
So if I understand correctly, while the whole region is not mapped say at the time we do the UMEM registration, the used chunks will be eventually mapped before we put them on any descriptor ring, and we do comply with all the requirements for UMEM (known userspace addresses, pinned), just not necessarily at
Sounds like that’s what we would need. I’d be happy to ponder on concrete API suggestions and play them through in my head. By the way, thanks for the thorough Q&A. Your help is much appreciated! Open source is fun! |
If we could rely on the descriptor ring state being stable after |
This fixes a bug where packets would not be transmitted unless the tx queue was full.
Just a heads up that I have integrated this driver into Vita. Seems to work. :-) I cautiously propose to include this branch in the next Snabb release as an experimental feature. |
…ctions Recent changes ([1], [2], [3]) exposed some previously internal variables and functions from core.packet. This patch restores local bindings for those within core.packet in order to maintain performance. [1] 6174563 (snabbco#1450) [2] ee4e42d (snabbco#1450) [3] 9f0694f (snabbco#1288)
This is an iteration on #1417 that (on request) swaps out
core.memory.dma_alloc
an alternative UMEM allocator to implement a zero-copy driver for AF_XDP sockets. Also: doesn’t link with any shared libraries, only dependency is the Linux kernel.Cc @dpino @bjoto
There are some major caveats (incompatible with DMA drivers, interlink) which are documented inline. I haven’t given a lot of thought about how to consolidate the UMEM restrictions with our existing DMA allocator, maybe there is a way to merge them?
Another restriction is that UMEM chunks used in AF_XDP packet descriptors are smaller that our
struct packet
. This restricts the XDP app’s MTU primarily. There are some notes on this in the code, but much of this is untested.Add tests for effective MTU of XDP app(didn’t add a test but the maximum MTU settable by viaip link
seems to be 3060)Research if/how the kernel deals with huge packets in AF_XDP sockets(seems like it does not at the moment?)References:
This PR also syncs our vendored ljsyscall with upstream and adds some BPF/XDP ljsyscall related extensions and fixes.
Cc @justincormack