
ctable performance issue related to pointer arithmetic #1207

Open
alexandergall opened this issue Aug 11, 2017 · 3 comments

@alexandergall (Contributor)

I started to use ctable indirectly through the IPv6 fragmentation code from apps.lwaftr. I was surprised to see much lower performance than expected on the reassembly side. The profiler showed substantial amounts of interpreted code and GC occurring in the fast path.

A trace dump revealed many instances of trace aborts like this:

...
0014  . . . . TGETS    4   0   3  ; "remove_ptr"
0015  . . . . MOV      6   3
0016  . . . . CALL     4   1   3
0000  . . . . . FUNCF    7          ; ctable.lua:392
0001  . . . . . TGETS    2   0   0  ; "scale"
0002  . . . . . TGETS    3   0   1  ; "entries"
0003  . . . . . SUBVV    3   1   3
0000  . . . . . . . FUNCC               ; ffi.meta.__sub
---- TRACE 125 abort ctable.lua:394 -- bad argument type

This led the compiler to blacklist crucial parts of the reassembly code.

After a bit of googling, I found a report describing this symptom (https://experilous.com/1/blog/post/lua-optimizations-first-pass):

"This one was nasty, and the trace abort message looked nasty too: “bad argument type”. It turns out that LuaJIT really strongly dislikes taking the difference between two pointers that point to some type whose size is not a power of 2."

The remove_ptr method of ctable uses precisely this kind of pointer arithmetic:

function CTable:remove_ptr(entry)
   local scale = self.scale
   local index = entry - self.entries
...

Here, entry is an element of the hash table constructed by the make_entry_type() function in lib.ctable.

In the case of the fragmenter, the size of such an element is not a power of two. I then applied a simple patch that pads the element to the next power of two:

local function make_entry_type(key_type, value_type)
   local cache = entry_types[key_type]
   if cache then
      cache = cache[value_type]
      if cache then return cache end
   else
      entry_types[key_type] = {}
   end
   local raw_size = ffi.sizeof(key_type) + ffi.sizeof(value_type) + 4
   local padding = 2^math.ceil(math.log(raw_size)/math.log(2)) - raw_size
   local ret = ffi.typeof([[struct {
         uint32_t hash;
         $ key;
         $ value;
         uint8_t padding[$];
      } __attribute__((packed))]],
      key_type,
      value_type,
      padding)
   entry_types[key_type][value_type] = ret
   return ret
end

This immediately got rid of the trace aborts and blacklisting, with a huge performance boost. So it seems LuaJIT does behave as described in the article, though I wasn't able to find the code in LuaJIT that actually causes this.

I guess this is something that should go into our "performance hacks" notebook.

Instead of padding the data structure, one could also replace the built-in subtraction metamethod with a performance-safe version (as described in the article).

While the fragmentation code runs much better now, it still suffers quite a bit from GC in my use case. Still work to do :)

@wingo (Contributor) commented Aug 11, 2017

Wow, that's nuts, good catch! I guess our main uses have been power-of-2 sized. Another possibility would be to hack LuaJIT to turn the division into a multiplication, perhaps (see http://www.hackersdelight.org/divcMore.pdf).

Regarding reassembly, incidentally I had a properly extracted version of the IPv4 reassembler that I made on Monday, moments before my laptop was stolen :P Oh well.

@lukego (Member) commented Aug 11, 2017

I created raptorjit/raptorjit#85 to track the JIT issue. I think I have found the line of code responsible, at least.

@alexandergall (Contributor, Author)

Any plans to fix this in LuaJIT or are we waiting for the transition to RaptorJIT?

dpino pushed a commit to dpino/snabb that referenced this issue Apr 3, 2019
Re-enable NUMA memory binding on newer kernels