Intel10G: better initialization / closing #360

javierguerragiraldez · 2015-02-03T07:18:11Z

unmaps PCI resource on device close
deallocates buffers on (vf) app close
don't (always) force sfi mode, autonegotiation handles this.
if the link doesn't come up, reset the chip and try once more.

The last point in particular made all cards on both davos and grindelwald pass all tests. I couldn't test on chur because it's all tied up in some long-term process.
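The reset-and-retry step can be pictured roughly as follows. This is only a sketch against a simulated device: `sim_nic`, its helpers and `bring_up_link` are hypothetical names for illustration and are not taken from the Intel10G driver, where the wait would poll the real link status register.

```c
#include <stdbool.h>

/* Simulated NIC, for illustration only: its link comes up once the chip
 * has been reset at least `resets_needed` times. */
struct sim_nic {
    int resets_needed;
    int resets;
    int inits;
};

static void sim_reset(struct sim_nic *nic) { nic->resets++; }
static void sim_init(struct sim_nic *nic)  { nic->inits++; }

/* Stands in for polling the link status until a timeout expires. */
static bool sim_wait_linkup(const struct sim_nic *nic)
{
    return nic->resets >= nic->resets_needed;
}

/* Initialize the device; if the link stays down, reset and try exactly
 * once more. Returns false when both attempts fail. */
bool bring_up_link(struct sim_nic *nic)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        if (attempt > 0)
            sim_reset(nic);        /* reset only before the second attempt */
        sim_init(nic);
        if (sim_wait_linkup(nic))
            return true;
    }
    return false;                  /* bounded: no endless retry loop */
}
```

Capping the loop at two attempts keeps a card that never links from hanging the caller.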

No more "flaky hardware" excuses!

lukego · 2015-02-03T10:19:17Z

Looks promising! Do you see why SnabbBot failed this on the Intel selftest?

lukego · 2015-02-03T10:38:20Z

Test case is looking really good with so many iterations!

I tested this branch (on interlaken) though and I see the same error as SnabbBot.

javierguerragiraldez · 2015-02-03T14:30:53Z

> Do you see why SnabbBot failed this on the Intel selftest?

The manyreconf() test aborts when an iteration fails to advance the counters. In this run each iteration passed almost 1M packets, but #48 saw only 4873 packets and #49 not even one, so the test failed.
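The abort condition amounts to a check like the one below. It is a minimal sketch, not the test's actual code: the function name and the idea of keeping cumulative counter samples in an array are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* counters[0] is the cumulative packet count sampled before the first
 * iteration; counters[i] is the sample taken after iteration i.
 * Returns the number of the first iteration whose counter did not
 * advance, or 0 if every iteration made progress. */
size_t first_stalled_iteration(const uint64_t *counters, size_t nsamples)
{
    for (size_t i = 1; i < nsamples; i++) {
        if (counters[i] <= counters[i - 1])
            return i;
    }
    return 0;
}
```

Under this rule a slow iteration (a few thousand packets) still passes; only an iteration with no progress at all stops the run.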

This is typically what happens when some allocated resource runs out and the packet flow grinds to a halt, meaning that I'm still not releasing or reusing everything on each reconfig.

I think this test run churns packets faster than what I've seen manually, so maybe my own tests never leaked enough to exhaust the resource within 'just' 100 iterations of 0.25 sec each.

I'll try to make it happen earlier by changing the preallocated amounts, iteration time, packet size, etc., to find exactly which resource is leaking.

lukego · 2015-02-04T07:38:21Z

src/lib/hardware/pci.c

@@ -54,9 +54,14 @@ uint32_t volatile *map_pci_resource(int fd)
  }
 }

-void close_pci_resource(int fd)
+void close_pci_resource(int fd, uint32_t volatile *addr)
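For context, passing the mapped address lets the close path unmap the region before closing the descriptor. The sketch below shows one way that could look. It assumes the whole file was mapped, so that `fstat()` reports the mapping length, and it uses a hypothetical name and an `int` return value so it is not mistaken for the function in the diff.

```c
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Unmap the region that was mapped from fd, then close fd.
 * Assumption: the mapping covers the whole file, so st_size is its length.
 * Returns 0 on success, -1 if any step failed. */
int close_pci_resource_sketch(int fd, volatile uint32_t *addr)
{
    struct stat st;
    int rc = 0;

    if (fstat(fd, &st) != 0 ||
        munmap((void *)addr, (size_t)st.st_size) != 0)
        rc = -1;
    if (close(fd) != 0)            /* always release the descriptor */
        rc = -1;
    return rc;
}
```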


I have been clarifying volatile and related issues with Mike Pall on the LuaJIT list.

Conclusions:

There is a risk that the JIT optimizations will eliminate or reorder loads and stores.

volatile is not the solution: LuaJIT ignores that. Probably we should not use it: I see a risk that we will expect it to behave in a way that it does not.

Compiler barriers are a solution. For example, immediately before reading or writing a memory mapped register we could always call a lib.c function:

void compiler_barrier() {}

and LuaJIT will flush loads/stores from registers to memory before making an FFI call.

You know, this could really be relevant to the bugs that you are seeing. The first times the NIC is initialised the code will be running interpreted. Once the init code gets hot and compiled then the behaviour could change.
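On the C side the barrier needs no body at all; the effect described above comes from the FFI call itself. The sketch below shows the barrier next to a hypothetical variation in which the register access happens inside the C call. `reg_read` and `reg_write` are illustrative names, not functions from lib.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Empty on purpose: calling it through the FFI is what matters. */
void compiler_barrier(void) {}

/* Hypothetical alternative: perform the access in C, so the load or
 * store always happens at the point of the call. */
uint32_t reg_read(volatile uint32_t *base, size_t index)
{
    return base[index];
}

void reg_write(volatile uint32_t *base, size_t index, uint32_t value)
{
    base[index] = value;
}
```

The wrapper variant costs one FFI call per register access, which may be acceptable in initialization code but would need measuring on the packet path.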

javierguerragiraldez · 2015-02-04

About the volatile keyword: it's only there to be consistent with the return value of map_pci_resource(). I agree that it's superfluous and would be better removed. Even in C it's doubtful that it would bring any benefit on an argument or return value.

About barriers in initialization code: in most of the places where order is important, the Intel docs advise inserting a wait, sometimes a few microseconds, sometimes up to a millisecond. In those cases we use C.usleep(), which should have the barrier effect as well (and it's hard to imagine a reordered or pending write lasting as long as a microsecond).
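The write-then-wait pattern, expressed in C for illustration, looks roughly like this. Both helper names are hypothetical, and the polling helper is an assumption about how a caller might confirm the result after the wait; neither is taken from the driver.

```c
#define _DEFAULT_SOURCE 1 /* exposes usleep() on glibc */
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Write a register, then give the hardware a fixed settling time. */
void write_and_wait(volatile uint32_t *reg, uint32_t value, unsigned delay_us)
{
    *reg = value;
    usleep(delay_us);
}

/* Poll until all bits in mask are set, sleeping between reads.
 * Returns false if they are still clear after max_polls reads. */
bool wait_for_bits(volatile uint32_t *reg, uint32_t mask,
                   unsigned max_polls, unsigned delay_us)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if ((*reg & mask) == mask)
            return true;
        usleep(delay_us);
    }
    return false;
}
```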

Where we could be missing some defensive coding is more in the data handling itself. Maybe the M_sf:sync_transmit() race is related to this?

Finally, I don't think the initialization code is ever compiled. Not even manyreconf() calls it tightly enough. What would be a simple way to check that?


javierguerragiraldez added 6 commits February 1, 2015 13:05

unmaps PCI resources

15ecc15

better app closing

05fd98b

remove harmful parts of initialization

5ac268d

if 500msec isn't enough, reset and retry

d3c8d9e

only recheck once, avoid infinite loops on hopeless cases

0c2797e

reenable the "many reconfigs" test

2eb4e0c

lukego reviewed Feb 4, 2015

no real free for memory.dma_alloc(), so set a reuse pool

18fe95c
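The idea named in this commit message, keeping released buffers for reuse because the allocator cannot free them, can be sketched as a small fixed-size pool. Everything below is hypothetical: the names, the capacity, and the use of `malloc` as a stand-in for an allocator that never takes memory back.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_MAX 64

/* Stand-in for an allocator with no matching free (hypothetical;
 * malloc is used only so the sketch runs anywhere). */
static void *alloc_forever(size_t size) { return malloc(size); }

struct reuse_pool {
    size_t buf_size;          /* all buffers in one pool share this size */
    size_t count;             /* number of parked buffers */
    size_t fresh_allocs;      /* times the real allocator was used */
    void  *parked[POOL_MAX];
};

/* Hand out a parked buffer if one exists, else allocate a new one. */
void *pool_get(struct reuse_pool *p)
{
    if (p->count > 0)
        return p->parked[--p->count];
    p->fresh_allocs++;
    return alloc_forever(p->buf_size);
}

/* Park a buffer for later reuse.
 * Returns 1 on success, 0 if the pool is full (the buffer is then lost). */
int pool_put(struct reuse_pool *p, void *buf)
{
    if (p->count == POOL_MAX)
        return 0;
    p->parked[p->count++] = buf;
    return 1;
}
```

With a pool like this, repeated open and close cycles stop growing the allocated total once the first set of buffers has been parked.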

lukego mentioned this pull request Feb 10, 2015

Merge: Intel10G: better initialization / closing #367

Merged

lukego merged commit 18fe95c into snabbco:master Feb 10, 2015

dpino pushed a commit to dpino/snabb that referenced this pull request Jun 9, 2016

Merge pull request snabbco#360 from Igalia/changelog-v2.9

fdc1af1

v2.9 changelog
