Description
Today I want to do a simple experiment to improve my mental model of how Snabb Switch code is just-in-time compiled into traces by LuaJIT. I am continuing with simple artificial examples like the ones in #6 and #8.
This is the code I want to look at today:
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
local t = { [0] = 1, [1] = 10 } -- this table is added since #8
for i = 1,n do
counter.add(c, t[i%2])
end
which is functionally equivalent to the code in #8: it loops one billion times and increments a counter alternatively by 1 and 10. The difference is that the decision of how much to increment the counter now depends on data (table lookup) rather than control (if
statement). This is an optimization that I have made to help the tracing JIT: I moved the variability from control (branching) into data (table lookup).
Here is how it runs:
$ perf stat -e instructions,cycles ./snabb snsh -jdump=+r,dump.txt script3.lua
15,023,389,554 instructions # 3.74 insns per cycle
4,021,039,287 cycles
Each iteration takes 4 cycles (and executes 15 instructions). This is indeed in between the version from #6 that runs in one cycle (5 instructions) and the version in #8 that runs in 15 cycles (36 instructions).
Here is what I am going to do now:
- Examine the machine code together with the LuaJIT Intermediate Representation (IR) for the loop.
- Make some simple tweaks and see what happens.
IR and mcode
Let me present three exhibits: the Intermediate Representation for the loop, the corresponding machine code, and a hand-interleaved combination of the two with some comments.
Here is the Intermediate Representation of the loop:
0030 ------------ LOOP ------------
0031 r15 int BAND 0028 +1
0032 > int ABC 0013 0031
0033 p32 AREF 0015 0031
0034 xmm7 > num ALOAD 0033
0035 r15 u64 CONV 0034 u64.num
0036 rbp + u64 ADD 0035 0025
0037 u64 XSTORE 0021 0036
0038 rbx + int ADD 0028 +1
0039 > int LE 0038 0001
0040 rbx int PHI 0028 0038
0041 rbp u64 PHI 0025 0036
Here is the machine code that was generated for that IR code:
0bcafa70 mov r15d, ebx
0bcafa73 and r15d, +0x01
0bcafa77 cmp r15d, esi
0bcafa7a jnb 0x0bca0018 ->2
0bcafa80 cmp dword [rdx+r15*8+0x4], 0xfffeffff
0bcafa89 jnb 0x0bca0018 ->2
0bcafa8f movsd xmm7, [rdx+r15*8]
0bcafa95 cvttsd2si r15, xmm7
0bcafa9a test r15, r15
0bcafa9d jns 0x0bcafaad
0bcafa9f addsd xmm7, [0x41937b60]
0bcafaa8 cvttsd2si r15, xmm7
0bcafaad add rbp, r15
0bcafab0 mov [rcx+0x8], rbp
0bcafab4 add ebx, +0x01
0bcafab7 cmp ebx, eax
0bcafab9 jle 0x0bcafa70 ->LOOP
Here is an interleaved combination of the two where I have added comments explaining my interpretation of what is going on:
0030 ------------ LOOP ------------
;; Calculate i%2
0031 r15 int BAND 0028 +1
0bcafa70 mov r15d, ebx
0bcafa73 and r15d, +0x01
;; Array Bounds Check (ABC) of t[...]
0032 > int ABC 0013 0031
0bcafa77 cmp r15d, esi
0bcafa7a jnb 0x0bca0018 ->2
0bcafa80 cmp dword [rdx+r15*8+0x4], 0xfffeffff
0bcafa89 jnb 0x0bca0018 ->2
;; Lookup array element location (AREF) and value (ALOAD)
0033 p32 AREF 0015 0031
0034 xmm7 > num ALOAD 0033
0bcafa8f movsd xmm7, [rdx+r15*8]
;; Convert array value from double float (Lua's native number format) into a uint64_t.
0035 r15 u64 CONV 0034 u64.num
0bcafa95 cvttsd2si r15, xmm7
0bcafa9a test r15, r15integ
0bcafa9d jns 0x0bcafaad
;; Add the value fromt he table (xmm7) to the counter value (rbp).
;;
;; XXX Is there duplicate work here with a second 'cvttsd2si r15,xmm7'?
;; (that converts the double float in xmm7 to an integer in r15)
0036 rbp + u64 ADD 0035 0025
0bcafa9f addsd xmm7, [0x40111b60]
0bcafaa8 cvttsd2si r15, xmm7
0bcafaad add rbp, r15
;; Store the updated counter value to memory.
0037 u64 XSTORE 0021 0036
0bcafab0 mov [rcx+0x8], rbp
;; Increment the loop index.
0038 rbx + int ADD 0028 +1
0bcafab4 add ebx, +0x01
;; Check for loop termination.
0039 > int LE 0038 0001
0bcafab7 cmp ebx, eax
0bcafab9 jle 0x0bcafa70 ->LOOP
0040 rbx int PHI 0028 0038
0041 rbp u64 PHI 0025 0036
I may well have made a mistake in this interpretation. I am also not certain whether the machine code does strictly match the IR code or to what extent it can be merged and shuffled around. I would like to understand this better because I have a fantasy that LuaJIT could automatically generate the interleaved view and that this might make traces easier for me to read.
So what jumps out from this?
- The IR and the machine code do seem to match up pretty neatly. I am glad to have both because I would have had a hard time recognizing the ABC (Array Bounds Check) just from looking at the machine code.
- The loop is quite long: 15 instructions. That is triple the length of the simpler loop from Tracing JITs and modern CPUs: Part 2 #6.
- The loop executes in four cycles. That is an average of 3.75 instructions executed per cycle. I am still really impressed with Intel CPUs.
- There seems to be some busy-work that could be optimized away: the array bounds check and the float/integer conversions.
Tweak 1: LUAJIT_NUMMODE
First I take the opportunity to try a little bit of voodoo. LuaJIT supports several number modes that can be chosen at compile time. What is a number mode? I don't really know. Mike Pall has commented that on x86_64 there are several options and some may be faster than others depending on the mix of integer and floating point operations.
Just for fun I tried them all. Turned out that compiling LuaJIT with -DLUAJIT_NUMMODE=2
improved this example significantly:
12,022,432,242 instructions # 3.98 insns per cycle
3,020,610,875 cycles
Now we are down to 3 cycles per iteration (for 12 instructions).
Here is the IR:
0030 ------------ LOOP ------------
0031 r15 int BAND 0028 +1
0032 > int ABC 0013 0031
0033 p32 AREF 0015 0031
0034 > int ALOAD 0033
0035 r15 u64 CONV 0034 u64.int sext
0036 rbp + u64 ADD 0035 0025
0037 u64 XSTORE 0021 0036
0038 rbx + int ADD 0028 +1
0039 > int LE 0038 0001
0040 rbx int PHI 0028 0038
0041 rbp u64 PHI 0025 0036
Here is the mcode:
->LOOP:
0bcafaa0 mov r15d, ebx
0bcafaa3 and r15d, +0x01
0bcafaa7 cmp r15d, esi
0bcafaaa jnb 0x0bca0018 ->2
0bcafab0 cmp dword [rdx+r15*8+0x4], 0xfffeffff
0bcafab9 jnz 0x0bca0018 ->2
0bcafabf movsxd r15, dword [rdx+r15*8]
0bcafac3 add rbp, r15
0bcafac6 mov [rcx+0x8], rbp
0bcafaca add ebx, +0x01
0bcafacd cmp ebx, eax
0bcafacf jle 0x0bcafaa0 ->LOOP
Interesting. I am tempted to submit a Pull Request to Snabb Switch that enables -DLUAJIT_NUMMODE=2
and see what impact that has on the performance tests that our CI runs. However, I am generally reluctant to apply optimizations that I don't understand reasonably well.
Tweak 2: FFI table
This time I will try a more straightforward change.
The problem I see is that we are doing a bunch of work to check array bounds and convert the table values from floats to ints. Let us try to avoid this by replacing the high-level Lua table with a low-level FFI array of integers.
local counter = require("core.counter")
local ffi = require("ffi")
local n = 1e9
local c = counter.open("test")
local t = ffi.new("int[2]", 1, 10) -- allocate table as FFI object
for i = 1,n do
counter.add(c, t[i%2])
end
This actually works pretty well:
9,022,173,328 instructions # 4.46 insns per cycle
2,022,321,563 cycles
Now we are down to two cycles per iteration (for 9 instructions).
Here is the IR:
0032 ------------ LOOP ------------
0033 r15 int BAND 0030 +1
0034 r15 i64 CONV 0033 i64.int sext
0035 i64 BSHL 0034 +2
0036 p64 ADD 0035 0012
0037 p64 ADD 0036 +8
0038 int XLOAD 0037
0039 r15 u64 CONV 0038 u64.int sext
0040 rbp + u64 ADD 0039 0027
0041 u64 XSTORE 0023 0040
0042 rbx + int ADD 0030 +1
0043 > int LE 0042 0001
0044 rbx int PHI 0030 0042
0045 rbp u64 PHI 0027 0040
Here is the mcode:
->LOOP:
0bcafaa0 mov r15d, ebx
0bcafaa3 and r15d, +0x01
0bcafaa7 movsxd r15, r15d
0bcafaaa movsxd r15, dword [rdx+r15*4+0x8]
0bcafaaf add rbp, r15
0bcafab2 mov [rcx+0x8], rbp
0bcafab6 add ebx, +0x01
0bcafab9 cmp ebx, eax
0bcafabb jle 0x0bcafaa0 ->LOOP
0bcafabd jmp 0x0bca001c ->3
This experiment feels more satisfying. I was able to identify redundant code, eliminate it in a sensible way, and verify that performance improved.
The end
Morals of this story:
- Trace compilers and CPUs are still fun and interesting.
- Getting the right information is really important: what is actually running (mcode) and why (IR). The process of piecing this together is quite interesting and may lead to automated shortcuts in the future.
- Relatively naive optimization techniques can be effective. In this case we have 7.5x performance by simply looking at what happens and asking ourselves "can we massage this to run inside one trace?", "can we get rid of the Array Bounds Check (ABC)?", "can we get rid of the float to integer conversion (cvttsd2si)?".