GH-113464: Speed up JIT builds #122839
Conversation
My only question is how, if at all, memory usage is impacted by this change, given the independent compilation.
Otherwise, this is just a tiny comment about adding a comment.
```diff
-for opname in opnames:
-    coro = self._compile(opname, TOOLS_JIT_TEMPLATE_C, work)
+template = TOOLS_JIT_TEMPLATE_C.read_text()
+for case, opname in cases_and_opnames:
```
Would it be good to add a comment here about why each of these is compiled independently (e.g., because of the performance benefit)? I'm not sure that would be abundantly clear if I just stumbled upon this code.
Good idea, I'll add a comment.
The good news is that memory usage should go down, though temporary disk usage will go up. We were always compiling each opcode independently, but all of the other C code for all of the opcodes not being compiled was part of the same file. Now we are writing out files to disk containing just the C code for each opcode.

As an example, before, we would compile this three times and extract the machine code of each opcode afterwards:

```c
_Py_CODEUNIT *
_JIT_ENTRY(...)
{
    // ...
    switch (FOO) {
        case FOO: {
            // Code for FOO...
        }
        case BAR: {
            // Code for BAR (unused here)...
        }
        case BAZ: {
            // Code for BAZ (unused here)...
        }
    }
    // ...
}
```

```c
_Py_CODEUNIT *
_JIT_ENTRY(...)
{
    // ...
    switch (BAR) {
        case FOO: {
            // Code for FOO (unused here)...
        }
        case BAR: {
            // Code for BAR...
        }
        case BAZ: {
            // Code for BAZ (unused here)...
        }
    }
    // ...
}
```

```c
_Py_CODEUNIT *
_JIT_ENTRY(...)
{
    // ...
    switch (BAZ) {
        case FOO: {
            // Code for FOO (unused here)...
        }
        case BAR: {
            // Code for BAR (unused here)...
        }
        case BAZ: {
            // Code for BAZ...
        }
    }
    // ...
}
```

Now, we only include the necessary case each time, so the C compiler doesn't waste time parsing a bunch of dead C code for the other cases:

```c
_Py_CODEUNIT *
_JIT_ENTRY(...)
{
    // ...
    switch (FOO) {
        case FOO: {
            // Code for FOO...
        }
    }
    // ...
}
```

```c
_Py_CODEUNIT *
_JIT_ENTRY(...)
{
    // ...
    switch (BAR) {
        case BAR: {
            // Code for BAR...
        }
    }
    // ...
}
```

```c
_Py_CODEUNIT *
_JIT_ENTRY(...)
{
    // ...
    switch (BAZ) {
        case BAZ: {
            // Code for BAZ...
        }
    }
    // ...
}
```

Hopefully that helps explain what's going on here? It's a bit weird.
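For illustration, the per-opcode splitting described above could be sketched in Python along these lines. This is a hypothetical `split_cases` helper operating on a deliberately simplified template, not the actual Tools/jit code, which does considerably more work:

```python
import re

# Hypothetical simplified template; the real Tools/jit template is more involved.
TEMPLATE = """\
_Py_CODEUNIT *
_JIT_ENTRY(void)
{
    // ...
    switch (OPCODE) {
        case FOO: {
            // Code for FOO...
            break;
        }
        case BAR: {
            // Code for BAR...
            break;
        }
    }
    // ...
}
"""

# Matches one whole "case NAME: { ... }" block at the switch's indent level.
CASE = re.compile(r"(?ms)^        case (\w+): \{\n.*?^        \}\n")

def split_cases(template: str) -> dict[str, str]:
    """Return one copy of the template per opcode, keeping only its own case."""
    opnames = [m.group(1) for m in CASE.finditer(template)]
    return {
        opname: CASE.sub(
            # Keep this opcode's case block; drop every other case block.
            lambda m, keep=opname: m.group(0) if m.group(1) == keep else "",
            template,
        )
        for opname in opnames
    }
```

Each resulting string can then be written to its own `.c` file and compiled independently, which is where the temporary disk usage comes from.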
The current JIT build includes the code for every instruction when compiling each stencil, even though only one of the cases is used.
This extracts the desired cases and compiles them each in isolation, which makes JIT builds almost twice as fast (except on Windows, where I suspect we're still bound by subprocess creation time).
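Compiling the extracted cases in isolation might look roughly like the following sketch. The function name, flags, and structure here are assumptions for illustration; the real build drives the compiler with many more options and then parses the resulting object files into stencils:

```python
import asyncio
import pathlib

async def compile_stencils(
    stencils: dict[str, str],
    outdir: pathlib.Path,
    cc: tuple[str, ...] = ("clang", "-c", "-O3"),  # assumed compiler invocation
) -> dict[str, int]:
    """Write one C file per opcode and compile them all concurrently.

    Returns a mapping of opcode name to compiler exit code.
    """
    async def compile_one(opname: str, source: str) -> int:
        src = outdir / f"{opname}.c"
        src.write_text(source)
        # Each opcode gets its own independent compiler subprocess.
        proc = await asyncio.create_subprocess_exec(
            *cc, str(src), "-o", str(outdir / f"{opname}.o")
        )
        return await proc.wait()

    results = await asyncio.gather(
        *(compile_one(n, s) for n, s in stencils.items())
    )
    return dict(zip(stencils, results))
```

Because each subprocess only ever sees one opcode's code, the compiler no longer re-parses the dead cases for every stencil; the caveat noted above is that on platforms with expensive process creation (like Windows), subprocess startup can still dominate.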