
XDP app: driver for Linux AF_XDP sockets #1450

Merged: 29 commits, Jan 21, 2022
Conversation

@eugeneia (Member) commented Nov 14, 2019

This is an iteration on #1417 that (on request) swaps out core.memory.dma_alloc for an alternative UMEM allocator to implement a zero-copy driver for AF_XDP sockets. Also: it doesn’t link with any shared libraries; the only dependency is the Linux kernel.

Cc @dpino @bjoto

There are some major caveats (incompatible with DMA drivers, interlink) which are documented inline. I haven’t given a lot of thought to how to consolidate the UMEM restrictions with our existing DMA allocator; maybe there is a way to merge them?

  • Look into ways for UMEM and DMA allocator compatibility

Another restriction is that UMEM chunks used in AF_XDP packet descriptors are smaller than our struct packet. This primarily restricts the XDP app’s MTU. There are some notes on this in the code, but much of this is untested.

  • Add tests for effective MTU of XDP app (didn’t add a test, but the maximum MTU settable via ip link seems to be 3060)
  • Research if/how the kernel deals with huge packets in AF_XDP sockets (seems like it does not at the moment?)
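The chunk-size/MTU relationship above boils down to simple arithmetic: the effective MTU is whatever is left of a UMEM chunk after per-packet overhead. The chunk size and overhead figures below are illustrative assumptions (sketched in Python), not Snabb’s actual values.

```python
# Hypothetical sketch: how a fixed UMEM chunk size bounds the effective MTU.
# CHUNK_SIZE, PACKET_STRUCT_OVERHEAD, and ETH_HEADER are assumptions for
# illustration, not Snabb's real numbers.
CHUNK_SIZE = 4096               # assumed UMEM chunk size (one page)
PACKET_STRUCT_OVERHEAD = 1024   # assumed struct packet metadata/headroom
ETH_HEADER = 14                 # Ethernet header without VLAN tag

def max_mtu(chunk_size, overhead, l2_header):
    """Largest L3 payload (MTU) that fits in one chunk."""
    return chunk_size - overhead - l2_header

print(max_mtu(CHUNK_SIZE, PACKET_STRUCT_OVERHEAD, ETH_HEADER))
```

With these assumed numbers the bound lands near the ~3060 observed via ip link, but the real figure depends on the actual chunk size and struct packet layout.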


This PR also syncs our vendored ljsyscall with upstream and adds some BPF/XDP-related ljsyscall extensions and fixes.

Cc @justincormack

e587f8c55 Merge pull request snabbco#225 from vavrusa/master
5ea3a881e bpf: add missing constants for linux 4.10 - 4.15
2c691e5a7 Merge pull request snabbco#224 from wingo/lseek-syscall-tweak
ae38bdbd7 Make "offset" arg to lseek a signed integer
5cb3b6950 Merge pull request snabbco#221 from wingo/util-ls-fix
8e0874609 Promptly close util.ls() dir fd; fix bug with deleted entries
277517436 Merge pull request snabbco#220 from sileht/master
8e48fd094 linux/nl.lua: Use ndmsg struct instead of rtmsg for neighbors
3e482bc4e Merge pull request snabbco#215 from qsn/gettid
57520cce3 expose gettid
db1a88e94 Merge pull request snabbco#214 from jsitnicki/sof-flags-linux
270a6e611 Add missing type for struct scm_timestamping for Linux
e49232047 Add missing SCM_* constants for Linux timestamping API
60fcc6b48 Add constants for SOF_* flags for Linux
26ac34851 Merge pull request snabbco#210 from alexandergall/linux-if-ioctls
3e6d3e27c Merge pull request snabbco#211 from fperrad/deb
d425a22b2 dummy changelog
b27eca538 update .gitignore
92292aa4b debian files
be257a7e1 debian files generated by lua-create-gitbuildpackage-layout
50a02b94b Add some SIOC{G,S}IF* ioctls for Linux
ee90324 Merge pull request snabbco#209 from justincormack/osx_clock
ee17863 Add CLOCK_ constants for OSX
178d244 Merge pull request snabbco#208 from justincormack/holes
61450f5 Add SEEK_DATA and SEEK_HOLE constants
9a7b584 Add memfd fnctl sealing support
99beaf5 more test fixes for memfd
cc221e4 fix ctest for new fcntl changes, typowq
56c4c76 fix ctest for new fcntl changes
96073cc fix typo
7a73e8a Add more constants for fcntl, memfd
8d3034c Fix ppc64le syscall numbers for newer calls
62828f6 Merge pull request snabbco#206 from johnae/master
c4002b6 The spook project is using ljsyscall.
0b266e8 Merge pull request snabbco#205 from lukego/close-fd-safely
ad91aa9 Add more protection against fd double-close
c8baf9e Merge pull request snabbco#202 from justincormack/dockerignore
66843c5 Use dockerignore to simplify Dockerfile
7b7211d Merge pull request snabbco#200 from justincormack/redo-dockerfile
a779caa Docker Cloud does not start processes at priority 0, remove from test
b85382d Rework Dockerfile and tests
24f7789 Merge pull request snabbco#199 from vavrusa/master
2ecf486 linux/constants: added new BPF map and prog types
00c1949 use Alpine 3.4 for docker build
0bcafc6 Merge pull request snabbco#196 from kbara/removecunused
a4217a8 Merge pull request snabbco#195 from kbara/fixgetcpu
ee67430 Remove unused variables from c.lua
4b2e0b2 Fix getcpu: the cpu argument was incorrectly given the node variable previously
214550a update changelog and rockspec for 0.12 release
21f3fd8 Merge pull request snabbco#192 from vavrusa/master
0da437f linux/bpf_prog_load: support custom kernel builds
9981190 fix missing vhangup
f245114 Merge pull request snabbco#191 from vavrusa/linux-perf-open
93558c1 linux: added support for tracing/performance counters
0511fb8 Merge pull request snabbco#190 from vavrusa/master
1f141ca linux: added new constants (e.g. attach BPF to socket)
36274f3 linux/bpf: added strflag support
4fd3bd6 Merge pull request snabbco#189 from vavrusa/master
aaa89cb linux: added support for eBPF
e095295 linux: added new syscall numbers (up to __nr_bpf)
1e079c4 test calls container so just needs file to run
d92625e define a docker compose test
5f14711 Merge pull request snabbco#188 from aperezdc/fix-if-nameindex
8915a83 Close socket immediately after error in if_nametoindex()
96c0286 use addons for travis, as learned at fosdem
71241e0 update changelog for unreleased changes
559b499 ignore audit arch constant
5d867d3 new architectures do not have open, will use openat
49d9ff9 test issues with new constants
7cd460e aarch64 audit constants for seccomp
6249e99 more docker examples
dd00af9 rockspec fix
d20033b rockspec for new release
fb17244 update copyright years
9d87597 more Changelog tweaks before release
53856b0 typo
1881526 sometimes winsz is 0, eg if terminal not set up
ab0c08d add error message
a1c207e update Changelog for forthcoming 0.11 release
ce12fb2 addDocker hub to README
602a2b3 fix osx fstatat to use 32 bit stat type, as cannot find how to call to get 64 bit one
8a0a6ad Now have arm machines with working seccomp
d14bd38 Appears that setting maac address on lo often works
6e878a1 remove debug print from test
30e9b5b allow skip on EPERM for adjtimex
bff3e90 Add strace in Docker image for convenient debugging
e67fa31 Use alpine 3.3 for Docker
5148bc3 fix ipv6 tests
185c1a6 more failures with no ipv6
98cc9a2 more fixes for ipv4 only environments for netlink tests
248a935 fix bind errors in environments that do not support ipv6
0eacd64 clean up travis file
5fb71b6 switch to newer Ubuntu in travis
8235724 fix more constant checks not in headers
8dff4ef constants missing on travis
db51b08 update Changelog
7540b04 add new rtnetlink values, so tests work under docker
22604b8 remove test that fails in some environments
d431693 fix waitid test under docker
1595b7d fix swap test under docker
321fdd2 Now an alpine package available
984b533 Add Dockerfile, fix some of the tests that made unreasonable assumptions
9aeff88 recent osx has *at functions
18cd829 better handling for xattr errors
b6bb892 freebsd 11 now has utimensat
7065b0d on freebsd/zfs chflags will fail, skip

git-subtree-dir: lib/ljsyscall
git-subtree-split: e587f8c55aad3955dddab3a4fa6c1968037b5c6e
@eugeneia (Member, Author)

Ah, and I forgot to mention I only tested this on a 4.19 kernel. AFAIU there are significant improvements to XDP shipping in recent kernels. Maybe I will even be able to force the XDP_ZEROCOPY flag (my current setup refused this flag with "operation not supported").

  • Test with 5.3+ kernels

@eugeneia (Member, Author)

Some thoughts: performance is not amazing at this point. The XDP app pushes somewhere around 1 Mpps and spends about half its time in the kernel (the tx "kick" is done via a sendto(2) syscall). I am hoping that this is relaxed in 5.4 kernels.

Might however also have something to do with the XDP_ZEROCOPY flag I haven’t been able to set. I was able to do some testing with a 5.3 kernel but that one didn’t change much AFAICT.

@bjoto do you happen to know about plans to make the AF_XDP transmit path less syscall-heavy, and/or what are the requirements to be able to pass XDP_ZEROCOPY on bind(2)?

@eugeneia (Member, Author) commented Nov 19, 2019

One thing I missed is to reclaim (i.e., not leak) packet buffers (UMEM chunks) after the XDP app is stopped. I’m having trouble with this, as I can’t figure out how to get a consistent ring state after closing the socket. I.e., some packets left on the fill ring after closing the socket might have already been read off the rx ring (and freed by Snabb).

I suppose that is the catch of using multiple sockets with different lifetimes with a single UMEM: there is no sound way to reclaim chunks used by individual sockets? Cc @bjoto

  • Figure out how to reclaim UMEM from kernel after closing socket

@bjoto commented Nov 19, 2019

Some thoughts: performance is not amazing at this point. The XDP app pushes something around 1Mpps and spends about half its time in the kernel (tx "kick" is done via a sendto(2) syscall). I am hoping that this is relaxed in 5.4 kernels.

Might however also have something to do with the XDP_ZEROCOPY flag I haven’t been able to set. I was able to do some testing with a 5.3 kernel but that one didn’t change much AFAICT.

@bjoto do you happen to know about plans to make the AF_XDP transmit path less syscall-heavy, and/or what are the requirements to be able to pass XDP_ZEROCOPY on bind(2)?

What NIC are you using? Is that a NIC with zero-copy support? As of today, only ixgbe, i40e and mlx5 have zero-copy support.

To make AF_XDP a bit less syscall heavy, there's a feature called "need wakeup": https://lore.kernel.org/bpf/1565767643-4908-1-git-send-email-magnus.karlsson@intel.com/

The idea is that you only need to do a syscall when the kernel isn’t processing. DPDK’s AF_XDP driver uses that.
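The "need wakeup" idea can be modeled in a few lines: userspace publishes descriptors and only pays for the kick syscall when the kernel has flagged the ring. This is a toy Python model with invented names (TxRing, need_wakeup as a plain attribute); in the real API the flag lives in the shared ring's flags field.

```python
# Toy model of the AF_XDP "need wakeup" protocol: the kernel sets a flag on
# the ring when it has stopped processing, and userspace only issues a
# syscall (the tx "kick") when that flag is set. Names are illustrative.
class TxRing:
    def __init__(self):
        self.need_wakeup = False  # would live in shared ring memory
        self.kicks = 0            # count of sendto(2)-style kicks issued

    def kick_kernel(self):
        self.kicks += 1           # stand-in for sendto(fd, ..., MSG_DONTWAIT)

    def submit(self, n_packets):
        # After publishing descriptors, kick only if the kernel asked for it.
        if self.need_wakeup:
            self.kick_kernel()

ring = TxRing()
ring.submit(32)       # kernel still busy: no syscall needed
ring.need_wakeup = True
ring.submit(32)       # kernel idle: one kick wakes it up
print(ring.kicks)
```

The payoff is that in the steady state (kernel busy draining the ring) the transmit path costs no syscalls at all.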

@bjoto commented Nov 19, 2019

One thing I missed is to reclaim (i.e., not leak) packet buffers (UMEM chunks) after the XDP app is stopped. Having troubles with this as I can’t figure out how to get a consistent ring state after closing the socket. I.e., some packets left on the fill ring after closing the socket might have already been read off the rx ring (and freed by Snabb).

I suppose that is the catch of using multiple sockets with different lifetimes with a single UMEM, no sound way to reclaim chunks used by individual sockets? Cc @bjoto

  • Figure out how to reclaim UMEM from kernel after closing socket

Hmm, I'm not familiar with the Snabb internals, so I might be off here.
If you're using a shared UMEM (i.e., using the bind flag), you cannot be sure that the kernel is done with the entries in the fill/completion rings until the sockets are closed. Or is it another mechanism you're looking for?

@eugeneia (Member, Author)

What NIC are you using? Is that a NIC with zero-copy support? As of today, only ixgbe, i40e and mlnx5 has zero-copy support.

I am currently using ixgbe with Intel 82599ES NICs for testing (currently on a 4.19 kernel). I’m not doing any special ip-link/ethtool configuration on the devices, except setting MAC addresses and the number of channels/queues.

"need wakeup"

IIUC this land(s/ed) in the 5.4 kernel?

Hmm, I'm not familiar with the snabb internals, so I might be off here.
If you're using shared UMEM (i.e. using the bind flag), you cannot be sure that the kernel is done with the entries in the fill ring/completed until the sockets are closed. Or is it another mechanism you're looking for?

Currently this driver registers one UMEM per socket, but all UMEMs share the same memory, which is managed by Snabb’s packet freelist. Snabb also has the notion of re-configuring apps at runtime, so in a nutshell what we need to be able to do is close an XDP socket and reclaim the packets currently held by it. Reclaiming the whole UMEM region at once is too coarse for Snabb because parts of that region will be held by other components in the app network.

IIUC this could be solved for us if XDP ensured that the ring cursors are synchronized before returning from close(2). I.e.,

txq->consumer_pub = txq->consumer_priv
rxq->producer_pub = rxq->producer_priv
umem->fq->consumer_pub = umem->fq->consumer_priv
umem->cq->producer_pub = umem->cq->producer_priv
return-from-close(2)

Then we could close the socket, grab the in-flight chunks remaining on the rings, and unmap the rings? Not pretty but workable.

@bjoto commented Nov 20, 2019

What NIC are you using? Is that a NIC with zero-copy support? As of today, only ixgbe, i40e and mlnx5 has zero-copy support.

I am currently using ixgbe with Intel 82599ES NICs for testing (currently on a 4.19 kernel). Not doing any special ip-link/ethtool configuration on the devices, except setting mac addresses and number of channels/queues.

Hmm, and zero-copy is not working there? The support was pretty shaky for 4.19, so I'd move up some versions.

"need wakeup"

IIUC this land(s/ed) in the 5.4 kernel?

Yes!

Hmm, I'm not familiar with the snabb internals, so I might be off here.
If you're using shared UMEM (i.e. using the bind flag), you cannot be sure that the kernel is done with the entries in the fill ring/completed until the sockets are closed. Or is it another mechanism you're looking for?

Currently this driver registers one UMEM per socket, but all UMEMs share the same memory, which is managed by Snabb’s packet freelist. Snabb also has the notion of re-configuring apps at runtime, so in a nutshell what we need to be able to do is close an XDP socket and reclaim the packets currently held by it. Reclaiming the whole UMEM region at once is too coarse for Snabb because parts of that region will be held by other components in the app network.

IIUC this could be solved for us if XDP ensured that the ring cursors are synchronized before returning from close(2). I.e.,

txq->consumer_pub = txq->consumer_priv
rxq->producer_pub = rxq->producer_priv
umem->fq->consumer_pub = umem->fq->consumer_priv
umem->cq->producer_pub = umem->cq->producer_priv
return-from-close(2)

Then we could close the socket, grab the in-flight chunks remaining on the rings, and unmap the rings? Not pretty but workable.

Hmm. Thanks for clarifying. Let me think a bit, and get back!

@eugeneia (Member, Author)

I found a hack that seems to work for us: since this driver uses a single UMEM (fq,cq pair) for each socket, we can maintain a pair of counters that tracks how many packets are “in-flight” on the rx/tx paths individually for each socket. Under the assumption that chunks are not reordered and the kernel doesn’t do crazy things to the descriptor rings, it seems that we can “rewind” transmit/fill operations reliably enough to reclaim just the packets that haven’t been moved to the rx and completion rings yet.

I.e., if we keep a tally of the number of packets put into the fill ring minus the number of packets received on the rx ring, we can then use that tally to reclaim the packets (fq->producer-1)..(fq->producer-tally). The same goes for the tx and completion rings.

Maybe. ;-)
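The scheme above can be sketched as bookkeeping. This is a Python model with illustrative names; like the scheme itself, it assumes chunks are consumed in order and that the ring state is stable after close(2).

```python
# Sketch of the "rewind" reclaim scheme: track how many chunks are in flight
# on the fill ring (filled minus received), and on socket close reclaim
# exactly those chunks from the not-yet-received tail of the ring.
class FillRing:
    def __init__(self):
        self.entries = []   # chunk addresses, in fill order
        self.producer = 0   # total chunks ever put on the fill ring
        self.received = 0   # total chunks that came back on the rx ring

    def fill(self, chunk):
        self.entries.append(chunk)
        self.producer += 1

    def receive(self):
        chunk = self.entries[self.received]
        self.received += 1
        return chunk        # Snabb now owns (and may free) this chunk

    def reclaim_in_flight(self):
        # tally = filled - received; reclaim the entries
        # (producer - tally) .. (producer - 1), i.e. the tail the kernel
        # still holds. Chunks already received must NOT be reclaimed again.
        tally = self.producer - self.received
        return self.entries[self.producer - tally:self.producer]

fq = FillRing()
for chunk in (0x1000, 0x2000, 0x3000):
    fq.fill(chunk)
fq.receive()                    # 0x1000 came back and was freed by Snabb
print(fq.reclaim_in_flight())   # the two chunks still held by the kernel
```

The same tally-and-rewind applies to the tx/completion pair, with the roles of producer and consumer swapped.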

@eugeneia (Member, Author)

On the perf side, I retested with a 5.3 kernel and that shows more promising numbers, around 4 Mpps duplex on a single queue.

Not sure if this is a bug or if I am forgetting something, but on 5.3 kernels I get “Device or resource busy” errors from bind(2) when trying to bind to a queue repeatedly, even after the process exits. I do close(2) the socket and munmap(2) the rings, and this didn’t happen with 4.19 kernels. The queue gets freed up either by rebinding the device via /sys/bus/pci/drivers/ixgbe/{bind,unbind} or by waiting a rather long time.

This commit uses BPF object pinning to expose per-interface xskmaps to other
Snabb processes so they can attach to free queues on interfaces initialized
by other Snabb processes.
@eugeneia (Member, Author)

On the UMEM/DMA compatibility side: if the xdp_umem_reg sockopt allowed us to register the whole 0x500000000000-0x5fffffffffff range as UMEM, then I think we could do away with the UMEM allocator and use the usual Snabb DMA allocator (which mmaps physical addresses in that range). It would also need to be happy with hugepages.

I suspect this currently fails for two reasons: 1) that range is generally not mapped at the time of the UMEM registration, 2) possibly the length of the region exceeds some kernel-internal limit.

@bjoto, you might be able to explain what the requirements for UMEM are and why the API is the way it is. Lacking an understanding of the design constraints, I perceive the UMEM aspect of XDP mostly as arbitrary restrictions. Why can I not just pass any page over the XDP socket?

(I.e., we are used to allocating huge pages and passing those to PCI devices that write to them via DMA; why can’t the kernel do the same?)

FWIW, I see the current UMEM allocator as a workable solution, because it seems useful to provide an exclusive Snabb-XDP mode that should support plenty of application types. However, the requirement to allocate all memory upfront and the inability to share packet buffers within the Snabb process group are notable downsides. And of course it would be cool to be able to mix the XDP and other DMA drivers in the same application.

@bjoto commented Nov 27, 2019

On the perf side, I retested with a 5.3 kernel and that shows more promising numbers around 4 Mpps duplex on a single queue.

Not sure if this is a bug or if I am forgetting something, but on the 5.3 kernels I get “Device or resource busy” errors from bind(2) when trying to bind to a queue repeatedly, even after the process exits. I do close(2) the socket and munmap(2) the rings, and this didn’t happen with 4.19 kernels. The queue gets freed up either by rebinding the device via /sys/bus/pci/drivers/ixgbe/{bind,unbind} or waiting a rather long time.

Yes, unfortunately, this is a known issue. The socket release can take some time (resetting the NIC and cleanup from a workqueue), and it's done asynchronously. That means that bind() -- as you noted -- will fail until everything is cleaned up. A partial fix for this is part of the 5.4 kernel, which will enable a synchronous cleanup at a later point. I'm hoping to get that in for 5.6.

@bjoto commented Nov 27, 2019

I found a hack that seems to work for us: since this driver uses a single UMEM (fq,cq pair) for each socket, we can maintain a pair of counters that tracks how many packets are “in-flight” on the rx/tx paths individually for each socket. Under the assumption that chunks are not reordered and the kernel doesn’t do crazy things to the descriptor rings, it seems that we can “rewind” transmit/fill operations reliably enough to reclaim just the packets that haven’t been moved to the rx and completion rings yet.

I.e., if we keep book of the number of packets put into the fill ring minus the number of packets received in the rx ring, we can then use that tally to reclaim the packets (fq->producer-1)..(fq->producer-tally). Same goes for tx and completion rings.

Maybe. ;-)

Good! I discussed this back and forth with Magnus, and doing the per-chunk housekeeping in the kernel is really something we'd like to avoid. IOW, if you can do that in your userspace allocator, that's great! :-)

@bjoto commented Nov 27, 2019

Yikes, I realized I've missed a lot of questions/thoughts! Apologies for the slow turnaround...

Great input!

On the UMEM/DMA compatibility side: if the xdp_umem_reg sockopt allowed us to register the whole 0x500000000000-0x5fffffffffff range as UMEM then I think we could do away with the UMEM allocator and use the usual Snabb DMA allocator (which mmaps phyiscal addresses in that range). Also it would need to be happy with hugepages.

That's a big range (16 TB, right?). The current AF_XDP UMEM pins all pages and maps them into kernel VA. The remap range on x86-64 is 32 TB, so at least one mapping should be OK.
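For the record, the size of that range checks out: a quick computation confirms that 0x500000000000..0x5fffffffffff spans 16 TiB, within the stated 32 TiB remap limit.

```python
# Verify the size of the 0x500000000000..0x5fffffffffff address range.
lo, hi = 0x500000000000, 0x5fffffffffff
size_bytes = hi - lo + 1        # inclusive range
size_tib = size_bytes / 2**40   # bytes -> TiB
print(size_tib)                 # 16 TiB, half of the 32 TiB remap range
```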

I suspect this currently fails for two reasons: 1) that range is generally not mapped at the time of the UMEM registration, 2) possibly the length of region exceeds some kernel-internal limit.

Ah, yes, the range needs to be mapped, so I'd guess that reason 1 is why it fails.

@bjoto you might be able to explain what the requirements for UMEM are and why the API is the way it is. From my lack of understanding the design constraints I perceive the UMEM aspect of XDP mostly as arbitrary restrictions. Why can I not just pass any page over the XDP socket?

There are no other constraints than: 1. the range/mapping has to exist in userspace; 2. it must not be larger than 32 TB. Hugepages should work; if not, that's a bug (there are some optimizations that we haven't done for hugepages, but they should work).

(I.e., we are used to allocate huge pages and pass those to PCI devices that write to them via DMA, why can’t the kernel do the same?)

That should work, as long as you comply with the constraints above!

FWIW, I see the current UMEM allocator as a workable solution, because it seems useful to provide an exclusive Snabb-XDP mode that should support plenty of application types. However, the requirement to allocate all memory upfront and the inability to share packet buffers within the Snabb process group are notable downsides. And of course it would be cool to be able to mix the XDP and other DMA drivers in the same application.

@bjoto commented Nov 28, 2019

@eugeneia What do you think about adding a feature where you can grow the UMEM and have non-contiguous UMEM regions? So you could start small, and then expand after the bind() point? Would that help?

We're working on improving the hugepage support on the kernel side (but note that it should definitely still work!).

@eugeneia (Member, Author)

I feel like, for Snabb, ideally there could be an option to enforce the UMEM restrictions lazily on use. Maybe I should explain how the Snabb DMA memory management works at the moment:

  • We allocate (file-backed) huge pages (on demand) and map them into the process memory at 0x500000000000 | physaddr. I.e., we get the physical address (which we need to give to PCI NICs anyway) and tag it with a prefix we know is not mapped by the kernel.
  • We also allow sharing these tagged pointers with other Snabb processes which will then map the underlying pages on demand (at the same, predictable addresses).

So if I understand correctly, while the whole region is not mapped at, say, the time we do the UMEM registration, the used chunks will eventually be mapped before we put them on any descriptor ring, and we do comply with all the requirements for UMEM (known userspace addresses, pinned), just not necessarily at bind time?
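The tagged-pointer scheme from the bullets above can be sketched as follows. This is a Python illustration with a hypothetical physical address; the tag value matches the prefix quoted above, and the helper names are invented.

```python
# Sketch of the tagged-pointer scheme: a DMA page's virtual address is its
# physical address combined with a fixed tag prefix, so any process can
# recover the physical address (and remap the page at the same, predictable
# virtual address) from the pointer alone. Values are illustrative.
TAG = 0x500000000000

def virtual_address(physaddr):
    # Physical addresses are assumed to lie below the tag prefix, so
    # OR-ing in the tag is reversible.
    assert physaddr < TAG
    return TAG | physaddr

def physical_address(vaddr):
    return vaddr & ~TAG

pa = 0x123456000                  # hypothetical hugepage physical address
va = virtual_address(pa)
print(hex(va), hex(physical_address(va)))
```

Because the mapping is a pure function of the physical address, two Snabb processes that exchange such a pointer agree on where the page lives without any coordination beyond the shared tag.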

@eugeneia What do you think about adding a feature where you can grow the UMEM, and have non-contiguous UMEM regions? So, you can start small, and then expand after bind()-point? Would that help?

Sounds like that’s what we would need. I’d be happy to ponder on concrete API suggestions and play them through in my head.

By the way, thanks for the thorough Q&A. Your help is much appreciated! Open source is fun! ☺️

@eugeneia (Member, Author)

I found a hack that seems to work for us: since this driver uses a single UMEM (fq,cq pair) for each socket, we can maintain a pair of counters that tracks how many packets are “in-flight” on the rx/tx paths individually for each socket. Under the assumption that chunks are not reordered and the kernel doesn’t do crazy things to the descriptor rings, it seems that we can “rewind” transmit/fill operations reliably enough to reclaim just the packets that haven’t been moved to the rx and completion rings yet.
I.e., if we keep book of the number of packets put into the fill ring minus the number of packets received in the rx ring, we can then use that tally to reclaim the packets (fq->producer-1)..(fq->producer-tally). Same goes for tx and completion rings.
Maybe. ;-)

Good! I discussed this back and forth with Magnus, and doing the per-chunk housekeeping in the kernel is really something we'd like to avoid. IOW, if you can do that in your userspace allocator, that's great! :-)

If we could rely on the descriptor ring state being stable after close(2), that would work. However, the Achilles heel here is that if the kernel ever decides to, say, clear the fill ring descriptors on close, our scheme would break. I suppose these properties should be documented and ideally stable.

eugeneia added a commit to eugeneia/snabb that referenced this pull request Dec 11, 2019
This fixes a bug where packets would not be transmitted unless the tx queue was
full.
@eugeneia eugeneia mentioned this pull request Dec 13, 2019
@eugeneia (Member, Author)

Just a heads up that I have integrated this driver into Vita. Seems to work. :-)

I cautiously propose including this branch in the next Snabb release as an experimental feature.

eugeneia added a commit to eugeneia/snabb that referenced this pull request Jan 18, 2022
…ctions

Recent changes ([1], [2], [3]) exposed some previously internal variables
and functions from core.packet. This patch restores local bindings for
those within core.packet in order to maintain performance.

[1] 6174563 (snabbco#1450)
[2] ee4e42d (snabbco#1450)
[3] 9f0694f (snabbco#1288)
@eugeneia eugeneia merged commit 928c456 into snabbco:master Jan 21, 2022