Fixes for lwAFTR in many-worker configurations #1480

Merged (13 commits) on Jul 4, 2022

eugeneia
Member

Depends on #1473

eugeneia and others added 13 commits March 9, 2022 13:57
For each configured queue, add a stats counter rxdrop_<queue_id> that
reflects the hardware per-queue drop counter where <queue_id> is the
queue id in the ConnectX configuration as defined by the user.
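
The naming scheme described above can be sketched as follows. This is only an illustration, not the code from connectx.lua: the function name `rxdrop_counter_names` and the example queue ids are hypothetical.

```python
def rxdrop_counter_names(queue_ids):
    """Map each user-configured queue id to its stats counter name.

    Hypothetical helper: it only illustrates the rxdrop_<queue_id>
    naming scheme from the commit message.
    """
    return {qid: f"rxdrop_{qid}" for qid in queue_ids}


# Example with made-up queue ids, as a user might define them.
names = rxdrop_counter_names(["a", "b"])
```

Each resulting counter would then be refreshed from the corresponding hardware per-queue drop counter.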

# Conflicts:
#	src/apps/mellanox/connectx.lua
# Conflicts:
#	src/apps/mellanox/connectx.lua
…ling

Rename cxq.next_rx_cqeid to cxq.rx_cqcc (completion queue consumer
counter) and expand it to 32 bits. This counter is not wrapped around the
size of the CQ. It is used to calculate the value of the SW ownership
bit directly with efficient bitops, which eliminates the need for the
rx_mine member of cxq.
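
A minimal sketch of the idea, assuming the CQ size is a power of two (given here as `log_cq_size`). The function names are hypothetical and this is not the connectx.lua code. The low bits of the free-running counter select the CQ entry, and the next bit up flips on every pass over the ring, so it can serve as the expected ownership bit. Which polarity means "owned by software" is an assumption in this sketch.

```python
COUNTER_MASK = 0xFFFFFFFF  # the consumer counter is kept as a 32-bit value


def advance(cqcc):
    """Increment the consumer counter, wrapping at 32 bits (not at CQ size)."""
    return (cqcc + 1) & COUNTER_MASK


def cqe_index(cqcc, log_cq_size):
    """Slot in the CQ ring: the low log_cq_size bits of the counter."""
    return cqcc & ((1 << log_cq_size) - 1)


def sw_owner_bit(cqcc, log_cq_size):
    """Expected ownership bit: parity of the number of passes over the ring."""
    return (cqcc >> log_cq_size) & 1


def cqe_is_ready(cqe_owner_bit, cqcc, log_cq_size):
    """Assumption: a CQE is ready when its owner bit matches the expected one."""
    return cqe_owner_bit == sw_owner_bit(cqcc, log_cq_size)
```

Because the CQ size divides 2^32, the index and the pass parity both stay consistent when the 32-bit counter itself wraps, so no separate toggle state is needed.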

# Conflicts:
#	src/apps/mellanox/connectx.lua
The event queue is polled at the frequency of sync_timer(). The
following events are supported:

   * CQError. Prints the CQ number and syndrome, then aborts.

   * PortStateChange. Prints the port number and new state
     (up/down). This could be used to replace the get_port_status()
     call in sync_stats().

   * PageRequest. Allocates/deallocates the requested number of pages.
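
The event handling listed above could be sketched roughly like this. It is a simplified model, not the driver code: the `Event` and `Device` types, the field names, and the convention that a negative `num_pages` means deallocation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class EventType(Enum):
    CQ_ERROR = auto()
    PORT_STATE_CHANGE = auto()
    PAGE_REQUEST = auto()


@dataclass
class Event:
    type: EventType
    data: dict  # hypothetical payload layout


@dataclass
class Device:
    pages: int = 0                            # pages currently given to the device
    log: list = field(default_factory=list)   # stands in for printed messages


def handle_event(dev, ev):
    if ev.type is EventType.CQ_ERROR:
        # Report the CQ number and syndrome, then abort.
        raise RuntimeError(
            f"CQ error: cqn={ev.data['cqn']} syndrome={ev.data['syndrome']}")
    elif ev.type is EventType.PORT_STATE_CHANGE:
        state = "up" if ev.data["up"] else "down"
        dev.log.append(f"port {ev.data['port']} is {state}")
    elif ev.type is EventType.PAGE_REQUEST:
        # Assumption: positive count allocates, negative count deallocates.
        dev.pages += ev.data["num_pages"]


def poll_event_queue(dev, pending):
    """Drain pending events; meant to be called once per timer tick."""
    handled = 0
    while pending:
        handle_event(dev, pending.pop(0))
        handled += 1
    return handled
```

In the commit, this polling happens at the frequency of sync_timer(); here `poll_event_queue` simply stands in for one such tick.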
This works around an issue in lwaftr where workers hang on
NUMA migrations.

Possibly, this only fixes the issue on flat NUMA systems.

See 89c48fc
Currently, we create HCAs per stats request, and create independent
stats requests for each per-queue counter set.

HCAs are a limited resource, and many-queue configurations might run out
of HCAs.

As a workaround, allow counter set creation to be disabled per-queue.
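
As a rough sketch of the workaround, assume each queue entry carries a per-queue switch for counter set creation. The key name `counters`, the function name, and the `capacity` parameter are hypothetical and are not taken from the ConnectX app's configuration schema.

```python
def plan_counter_sets(queues, capacity):
    """Return the ids of queues that should get a per-queue counter set.

    queues:   list of dicts with an "id" and an optional, hypothetical
              "counters" flag (defaults to enabled).
    capacity: assumed number of counter sets the device can still provide.
    """
    enabled = [q["id"] for q in queues if q.get("counters", True)]
    if len(enabled) > capacity:
        raise ValueError(
            f"{len(enabled)} counter sets requested, capacity is {capacity}")
    return enabled


# Example: counter set creation is disabled for queue "b".
queues = [{"id": "a"}, {"id": "b", "counters": False}, {"id": "c"}]
selected = plan_counter_sets(queues, capacity=2)
```

Failing early when the enabled queues exceed the capacity is a design choice of this sketch; the actual change only states that counter set creation can be disabled per queue.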
- larger send/receive queues
- force flow control off
- limit per-queue counters to not exceed HCA capacity