ctable performance issue related to pointer arithmetic #1207
Comments
Wow that's nuts, good catch! I guess our main uses have been power-of-2 sized. Another possibility would be to hack LuaJIT to turn the division into a multiplication, perhaps: http://www.hackersdelight.org/divcMore.pdf. Regarding reassembly, incidentally I had a properly extracted version of the IPv4 reassembler that I made on Monday, moments before my laptop was stolen :P Oh well.
I created raptorjit/raptorjit#85 to track the JIT issue. I think that I found the line of code responsible, at least so far.
Any plans to fix this in LuaJIT or are we waiting for the transition to RaptorJIT?
I started to use `ctable` indirectly through the IPv6 fragmentation code from `apps.lwaftr`. I was surprised to see much lower performance than expected on the reassembly side. The profiler showed substantial amounts of interpreted code and GC occurring in the fast path.
A trace dump revealed a large number of trace aborts.
This led to blacklisting of crucial parts of the reassembly code by the compiler.
After a bit of googling, I found a report that describes this symptom (https://experilous.com/1/blog/post/lua-optimizations-first-pass):
"This one was nasty, and the trace abort message looked nasty too: “bad argument type”. It turns out that LuaJIT really strongly dislikes taking the difference between two pointers that point to some type whose size is not a power of 2."
The `remove_ptr` method of `ctable` uses precisely this kind of pointer arithmetic.
Here, `entry` is an element of the hash table constructed by the `make_entry_type()` function in `lib.ctable`. In the case of the fragmenter, the size of such an element is not a power of 2. I then applied a simple patch that pads the element to the next power of two.
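The patch itself is not reproduced here. As a rough sketch of the padding idea only, assuming a packed entry made of a 32-bit hash followed by the key and the value, something along these lines could be used. The function name `make_padded_entry_type` and the field layout are illustrative assumptions, not the actual code in `lib.ctable`.

```lua
local ffi = require("ffi")

-- Smallest power of two that is >= n, for a positive integer n.
local function next_power_of_two(n)
   local p = 1
   while p < n do p = p * 2 end
   return p
end

-- Hypothetical entry-type constructor (an assumption for illustration):
-- hash, key and value are packed together, then padded so that the total
-- size of the struct is a power of two.
local function make_padded_entry_type(key_type, value_type)
   local raw_size = 4 + ffi.sizeof(key_type) + ffi.sizeof(value_type)
   local padding = next_power_of_two(raw_size) - raw_size
   if padding == 0 then
      -- Already a power of two: no padding field needed.
      return ffi.typeof([[struct {
            uint32_t hash;
            $ key;
            $ value;
         } __attribute__((packed))]], key_type, value_type)
   end
   return ffi.typeof([[struct {
         uint32_t hash;
         $ key;
         $ value;
         uint8_t padding[$];
      } __attribute__((packed))]], key_type, value_type, padding)
end

-- Example: a 20-byte key and a 16-byte value give 4 + 20 + 16 = 40 bytes,
-- which is padded up to 64.
local entry_t = make_padded_entry_type(ffi.typeof("uint8_t[20]"),
                                       ffi.typeof("uint8_t[16]"))
print(ffi.sizeof(entry_t))
```

The cost of this approach is memory: in the example above, 24 of every 64 bytes are padding.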
This immediately got rid of the trace aborts and the blacklisting, with a huge performance boost. So it seems that LuaJIT does behave as described in the article, though I wasn't able to find the code in LuaJIT that actually does this.
I guess this is something that should go into our "performance hacks" notebook.
Instead of padding the data structure, one could also use a performance-safe replacement of the built-in subtraction meta-method (as described in the article).
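One possible shape for such a replacement, as a sketch only: cast both pointers to byte pointers, so that the built-in pointer difference operates on 1-byte elements, then divide by the element size as an ordinary Lua number. The helper name `element_index` and the 40-byte element type are assumptions for illustration. This is not the code from the article, and whether it actually avoids the trace abort would have to be confirmed with a trace dump.

```lua
local ffi = require("ffi")

-- Index of the element that `ptr` points to within the array starting at
-- `base`. Both pointers are viewed as byte pointers first, so the pointer
-- difference is taken over 1-byte elements; the division by the element
-- size is then plain Lua number arithmetic.
local function element_index(ptr, base, element_size)
   local byte_ptr = ffi.cast("uint8_t *", ptr)
   local byte_base = ffi.cast("uint8_t *", base)
   return (byte_ptr - byte_base) / element_size
end

-- Example with a hypothetical 40-byte element type (not a power of two).
local elem_t = ffi.typeof("struct { uint8_t data[40]; }")
local array = ffi.new(ffi.typeof("$[8]", elem_t))
local base = ffi.cast(ffi.typeof("$ *", elem_t), array)
local ptr = base + 3
print(element_index(ptr, base, ffi.sizeof(elem_t)))
```

Unlike padding, this leaves the memory layout of the table untouched, at the price of an explicit division at each call site.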
While the fragmentation code runs much better now, it still suffers quite a bit from GC in my use case. Still work to do :)