Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for switch in ceval.c when using MSVC #91719

Closed
gvanrossum opened this issue Apr 20, 2022 · 6 comments
Closed

Improve performance for switch in ceval.c when using MSVC #91719

gvanrossum opened this issue Apr 20, 2022 · 6 comments
Labels
performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@gvanrossum
Copy link
Member

We've received reports of MSVC not generating optimal code, e.g. gh-89279.

One possible improvement would be to get the big switch statement in ceval.c to generate better code. It's been rumored that MSVC will generate essentially a computed goto if all cases are filled.

See my investigations at faster-cpython/ideas#321 (comment)

@gvanrossum gvanrossum added the type-bug An unexpected behavior, bug, or error label Apr 20, 2022
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Apr 20, 2022
gvanrossum added a commit that referenced this issue Apr 21, 2022
Apparently a switch on an 8-bit quantity where all cases are
present generates a more efficient jump (doing only one indexed
memory load instead of two).

So we make opcode and use_tracing uint8_t, and generate a macro
full of extra `case NNN:` lines for all unused opcodes.

See faster-cpython/ideas#321 (comment)
@gvanrossum
Copy link
Member Author

Fixed by #91718.

vstinner added a commit that referenced this issue Apr 25, 2022
Move the following API from Include/opcode.h (public C API) to a new
Include/internal/pycore_opcode.h header file (internal C API):

* EXTRA_CASES
* _PyOpcode_Caches
* _PyOpcode_Deopt
* _PyOpcode_Jump
* _PyOpcode_OpName
* _PyOpcode_RelativeJump
@neonene
Copy link
Contributor

neonene commented May 22, 2022

[3.12] REPORT: Efficient switch(opcode) seems to require 182 or more effective cases. Starting from e48ac9c with 179 opcodes, the switch loads indexed memory twice, not once. I expect next alpha to have a sufficient number of opcodes.

Details of the switch in e48ac9c (release build)
lea rcx,qword ptr ds:[7FEF2F20000]     | ceval.c:1827
movzx eax,r13b                         |
mov rdi,r14                            |
mov dword ptr ss:[rbp-70],eax          |
mov rdx,r14                            |
mov qword ptr ss:[rbp-60],r12          |
mov r9,r14                             |
mov qword ptr ss:[rsp+78],r14          |
mov r11,r14                            |
movzx eax,byte ptr ds:[rcx+rax+22C7BC] | <<<<<
mov r8,r14                             |
mov qword ptr ss:[rsp+70],r12          |
mov qword ptr ss:[rsp+60],r12          |
mov qword ptr ss:[rbp-78],r12          |
mov ecx,dword ptr ds:[rcx+rax*4+22C4F0 | <<<<<
lea rax,qword ptr ds:[7FEF2F20000]     |
add rcx,rax                            |
mov qword ptr ss:[rsp+68],r12          |
movsxd rsi,r15d                        |
jmp rcx                                |

Example changes of EXTRA_CASES in pycore_opcode.h:

#define EXTRA_CASES \
    case 253: Py_FatalError("op:253"); \
    case 254: Py_FatalError("op:254"); \
    case 178: \
    case 179: \
    case 180: \
    case 181: \
    case 182: \
    case 183: \
    ...
    case 251: \
    case 252: \
        ;


lea rcx,qword ptr ds:[7FEF2F20000]     | ceval.c:1827
movzx eax,r13b                         |
mov rdi,r14                            |
mov dword ptr ss:[rbp-70],eax          |
mov r9,r14                             |
mov qword ptr ss:[rbp-60],r12          |
mov r8,r14                             |
mov qword ptr ss:[rsp+78],r14          |
mov r11,r14                            |
mov ecx,dword ptr ds:[rcx+rax*4+22C4E8 | <<<<<
mov rdx,r14                            |
lea rax,qword ptr ds:[7FEF2F20000]     |
mov qword ptr ss:[rsp+70],r12          |
add rcx,rax                            |
mov qword ptr ss:[rsp+60],r12          |
mov qword ptr ss:[rbp-78],r12          |
mov qword ptr ss:[rsp+68],r12          |
movsxd rsi,r15d                        |
jmp rcx                                |
#define EXTRA_CASES \
    case 252: Py_FatalError("op:252"); \
    case 253: Py_FatalError("op:253"); \
    case 254: Py_FatalError("op:254"); \
    default:


mov qword ptr ss:[rsp+48],r14          | ceval.c:1827
mov r9,qword ptr ss:[rsp+30]           |
lea rcx,qword ptr ds:[7FEF2F20000]     |
movzx edi,r13b                         |
mov rdx,r12                            |
mov qword ptr ss:[rsp+58],rdx          |
mov r8,r14                             |
mov qword ptr ss:[rbp-80],r14          |
cmp edi,FF                             |
ja python312.7FEF314BFE3               | unknown opcode error
mov ecx,dword ptr ds:[rcx+rdi*4+22C2E0 | <<<<<
lea rax,qword ptr ds:[7FEF2F20000]     |
add rcx,rax                            |
movsxd rsi,r15d                        |
jmp rcx                                |
#define EXTRA_CASES \
    case 178: Py_FatalError("op:178"); \
    case 179: Py_FatalError("op:179"); \
    case 180: Py_FatalError("op:180"); \
    default:


mov qword ptr ss:[rsp+48],r14          | ceval.c:1827
mov r9,qword ptr ss:[rsp+30]           |
lea rcx,qword ptr ds:[7FEF30B0000]     |
movzx edi,r13b                         |
mov rdx,r12                            |
mov qword ptr ss:[rsp+58],rdx          |
mov r8,r14                             |
mov qword ptr ss:[rbp-80],r14          |
cmp edi,FF                             |
ja python312.7FEF32DBFE3               | unknown opcode error
mov ecx,dword ptr ds:[rcx+rdi*4+22C2E0 | <<<<<
lea rax,qword ptr ds:[7FEF30B0000]     |
add rcx,rax                            |
movsxd rsi,r15d                        |
jmp rcx                                |

As for performance advantage over loading memory twice, the examples above has little for me. Each of the following was faster on PGO:

  • Avoid falling through in EXTRA_CASES, giving each an exclusive body without default.
  • Use something better than Py_FatalError().
  • Make default case unreachable.

@gvanrossum
Copy link
Member Author

Thanks for the detailed and clear report. It sounds like in the long term we would be okay without depending on the EXTRA_CASES macro, though we might want to add a check somewhere that we have at least 182 cases. (It would be better if we could directly test for the extra memory indirection but that seems beyond our current tooling. :-)

As for performance advantage over loading memory twice, the examples above has little for me. Each of the following was faster on PGO:

  • Avoid falling through in EXTRA_CASES, giving each an exclusive body without default.
  • Use something better than Py_FatalError().
  • Make default case unreachable.

I'm not sure how come any of those would make a difference given that we never execute the default case. Or am I not following you?

@neonene
Copy link
Contributor

neonene commented May 27, 2022

I'm not sure why either(EDIT: improved by #94364), so I tried making a small reproducer. After trimming the cases in eval loop, INSTRUCTION_START() macro remained loading memory twice, which is not implemented in 3.10. I'd like to look into after new opcodes are introduced into 3.12.

#include <stdint.h>
typedef uint16_t _Py_CODEUNIT;

int
main(int argc, char *argv[])
{
    uint8_t opcode = 0;
    _Py_CODEUNIT tmp[10];
    _Py_CODEUNIT *next_instr = tmp;

    switch (opcode) {
    case 255 : next_instr++; break;
    case   0 : next_instr++; break;
    ...
    }
}

Current jump behaviors using case 255:

Equivalent to if-else with 6 cases
    switch (opcode) {
    case 255 : next_instr++; break;
    case   0 : next_instr++; break;
    case   1 : next_instr++; break;
    case   2 : next_instr++; break;
    case   3 : next_instr++; break;
    case   4 : next_instr++; break;
    }


mov qword ptr ss:[rsp+10],rdx
mov dword ptr ss:[rsp+8],ecx
sub rsp,38
mov rax,qword ptr ds:[<__security_cook
xor rax,rsp
mov qword ptr ss:[rsp+28],rax
mov byte ptr ss:[rsp],0
lea rax,qword ptr ss:[rsp+10]
mov qword ptr ss:[rsp+8],rax
movzx eax,byte ptr ss:[rsp]
mov dword ptr ss:[rsp+4],eax
cmp dword ptr ss:[rsp+4],3
jg repro.13FE36EC7
cmp dword ptr ss:[rsp+4],3
je repro.13FE36F1A
cmp dword ptr ss:[rsp+4],0
je repro.13FE36EEA
cmp dword ptr ss:[rsp+4],1
je repro.13FE36EFA
cmp dword ptr ss:[rsp+4],2
je repro.13FE36F0A
jmp repro.13FE36F38
cmp dword ptr ss:[rsp+4],4
je repro.13FE36F2A
cmp dword ptr ss:[rsp+4],FF
je repro.13FE36EDA
jmp repro.13FE36F38
mov rax,qword ptr ss:[rsp+8]
add rax,2
mov qword ptr ss:[rsp+8],rax
jmp repro.13FE36F38
mov rax,qword ptr ss:[rsp+8]
add rax,2
mov qword ptr ss:[rsp+8],rax
jmp repro.13FE36F38
mov rax,qword ptr ss:[rsp+8]
add rax,2
mov qword ptr ss:[rsp+8],rax
jmp repro.13FE36F38
mov rax,qword ptr ss:[rsp+8]
add rax,2
mov qword ptr ss:[rsp+8],rax
jmp repro.13FE36F38
mov rax,qword ptr ss:[rsp+8]
add rax,2
mov qword ptr ss:[rsp+8],rax
jmp repro.13FE36F38
mov rax,qword ptr ss:[rsp+8]
add rax,2
mov qword ptr ss:[rsp+8],rax
xor eax,eax
mov rcx,qword ptr ss:[rsp+28]
xor rcx,rsp
call repro.13FE32513
add rsp,38
ret
Single memory load with 7 to 63 cases
    switch (opcode) {
    case 255 : next_instr++; break;
    case   0 : next_instr++; break;
    ...
    case   5 : next_instr++; break;  // from here
    case   6 : next_instr++; break;
    case   7 : next_instr++; break;
    ...
    case  61 : next_instr++; break;
    }


mov qword ptr ss:[rsp+10],rdx          |
mov dword ptr ss:[rsp+8],ecx           |
sub rsp,48                             |
mov rax,qword ptr ds:[<__security_cook |
xor rax,rsp                            |
mov qword ptr ss:[rsp+30],rax          |
mov byte ptr ss:[rsp],0                |
lea rax,qword ptr ss:[rsp+18]          |
mov qword ptr ss:[rsp+8],rax           |
movzx eax,byte ptr ss:[rsp]            |
mov dword ptr ss:[rsp+10],eax          |
cmp dword ptr ss:[rsp+10],FF           |
jg repro.13F3E6ED9                     |
cmp dword ptr ss:[rsp+10],FF           |
je repro.13F3E6EDB                     |
cmp dword ptr ss:[rsp+10],5            |
ja repro.13F3E6F49                     |
movsxd rax,dword ptr ss:[rsp+10]       |
lea rcx,qword ptr ds:[13F3E0000]       |
mov eax,dword ptr ds:[rcx+rax*4+6F60]  | <<<<<
add rax,rcx                            |
jmp rax                                |
Double memory load with 64 to 181 cases
    switch (opcode) {
    case 255 : next_instr++; break;
    case   0 : next_instr++; break;
    ...
    case  62 : next_instr++; break;  // from here
    case  63 : next_instr++; break;
    case  64 : next_instr++; break;
    ...
    case 179 : next_instr++; break;
    }


mov qword ptr ss:[rsp+10],rdx          |
mov dword ptr ss:[rsp+8],ecx           |
sub rsp,38                             |
mov rax,qword ptr ds:[<__security_cook |
xor rax,rsp                            |
mov qword ptr ss:[rsp+28],rax          |
mov byte ptr ss:[rsp+8],0              |
lea rax,qword ptr ss:[rsp+10]          |
mov qword ptr ss:[rsp],rax             |
movzx eax,byte ptr ss:[rsp+8]          |
mov dword ptr ss:[rsp+C],eax           |
cmp dword ptr ss:[rsp+C],FF            |
ja repro.13F4A72F1                     |
movsxd rax,dword ptr ss:[rsp+C]        |
lea rcx,qword ptr ds:[13F4A0000]       |
movzx eax,byte ptr ds:[rcx+rax+740C]   | <<<<<
mov eax,dword ptr ds:[rcx+rax*4+7308]  | <<<<<
add rax,rcx                            |
jmp rax                                |
Single memory load with 182 or more cases
    switch (opcode) {
    case 255 : next_instr++; break;
    case   0 : next_instr++; break;
    ...
    case 180 : next_instr++; break;  // from here
    case 181 : next_instr++; break;
    case 182 : next_instr++; break;
    ...
    case 254 : next_instr++; break;
    }


mov qword ptr ss:[rsp+10],rdx          |
mov dword ptr ss:[rsp+8],ecx           |
sub rsp,38                             |
mov rax,qword ptr ds:[<__security_cook |
xor rax,rsp                            |
mov qword ptr ss:[rsp+28],rax          |
mov byte ptr ss:[rsp+8],0              |
lea rax,qword ptr ss:[rsp+10]          |
mov qword ptr ss:[rsp],rax             |
movzx eax,byte ptr ss:[rsp+8]          |
mov dword ptr ss:[rsp+C],eax           |
cmp dword ptr ss:[rsp+C],FF            |
ja repro.13F847ABF                     |
movsxd rax,dword ptr ss:[rsp+C]        |
lea rcx,qword ptr ds:[13F840000]       |
mov eax,dword ptr ds:[rcx+rax*4+7AD4]  | <<<<<
add rax,rcx                            |
jmp rax                                |

Another behavior without case 255

MSVC seems to check whether existing cases are continuous or not, rather than their count. The following shows an extra memory load:

    // They are discontinuous and their maximum opcode is greater than 20
    switch (opcode) {
    case   0 : next_instr++; break;
    case   5 : next_instr++; break;
    case  18 : next_instr++; break;
    case  19 : next_instr++; break;
    case  20 : next_instr++; break;
    case  21 : next_instr++; break;
    }


mov qword ptr ss:[rsp+10],rdx          |
mov dword ptr ss:[rsp+8],ecx           |
sub rsp,48                             |
mov rax,qword ptr ds:[<__security_cook |
xor rax,rsp                            |
mov qword ptr ss:[rsp+30],rax          |
mov byte ptr ss:[rsp],0                |
lea rax,qword ptr ss:[rsp+18]          |
mov qword ptr ss:[rsp+8],rax           |
movzx eax,byte ptr ss:[rsp]            |
mov dword ptr ss:[rsp+10],eax          |
cmp dword ptr ss:[rsp+10],15           |
ja repro.13FE56F27                     |
movsxd rax,dword ptr ss:[rsp+10]       |
lea rcx,qword ptr ds:[13FE50000]       |
movzx eax,byte ptr ds:[rcx+rax+6F58]   | <<<<<
mov eax,dword ptr ds:[rcx+rax*4+6F3C]  | <<<<<
add rax,rcx                            |
jmp rax                                |
mov rax,qword ptr ss:[rsp+8]           |
add rax,2                              |
mov qword ptr ss:[rsp+8],rax           |
jmp repro.13FE56F27                    |
mov rax,qword ptr ss:[rsp+8]           |
add rax,2                              |
mov qword ptr ss:[rsp+8],rax           |
jmp repro.13FE56F27                    |
mov rax,qword ptr ss:[rsp+8]           |
add rax,2                              |
mov qword ptr ss:[rsp+8],rax           |
jmp repro.13FE56F27                    |
mov rax,qword ptr ss:[rsp+8]           |
add rax,2                              |
mov qword ptr ss:[rsp+8],rax           |
jmp repro.13FE56F27                    |
mov rax,qword ptr ss:[rsp+8]           |
add rax,2                              |
mov qword ptr ss:[rsp+8],rax           |
jmp repro.13FE56F27                    |
mov rax,qword ptr ss:[rsp+8]           |
add rax,2                              |
mov qword ptr ss:[rsp+8],rax           |
xor eax,eax                            |
mov rcx,qword ptr ss:[rsp+30]          |
xor rcx,rsp                            |
call repro.13FE52513                   |
add rsp,48                             |
ret                                    |

@erlend-aasland
Copy link
Contributor

Do you want to reopen this, or create a new issue?

@neonene
Copy link
Contributor

neonene commented May 27, 2022

The faster-cpython site would be better for more information.

gvanrossum pushed a commit that referenced this issue Jun 30, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 30, 2022
… the dispatching in ceval.c (pythonGH-94364)

(cherry picked from commit ea39b77)

Co-authored-by: neonene <53406459+neonene@users.noreply.github.com>
gvanrossum pushed a commit that referenced this issue Jun 30, 2022
…ispatching in ceval.c (GH-94364) (#94453)

(cherry picked from commit ea39b77)

Co-authored-by: neonene <53406459+neonene@users.noreply.github.com>
gvanrossum pushed a commit to gvanrossum/cpython that referenced this issue Jun 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants