Newsvuvnviv taint api speedup #22662
base: blead
Related to "SvUV() macro 100% of time calls Perl_sv_2uv_flags" Perl#22653.

Until Perl#22653 is solved, clean up newSVuv() and remove the branch to "newSViv()" that is unexplained by git blame, BUT keep the original intent and behaviour of the "newSViv()" branch for now. Add asserts to guard against 0x8000,0000 == SVf_IVisUV changing. The value of SVf_IVisUV can change in the future, and there might be (I didn't git blame) logic that assumes the sign flag and SVf_IVisUV are equal. But these changes depend on SVf_IVisUV being 0x8000,0000 and must be updated if SVf_IVisUV changes.

Change SvXXXV_set() to be an explicit bodyless SV head optimization. MSVC 2022 -O1 combined SET_SVANY_FOR_BODYLESS_IV() and SvIV_set(). But instead of hopes and prayers on "UB" or "ISB/IDB" of CCs that could change at random in any previous or future build number of a CC, do it explicitly. The bodyless SV head API is defined by P5P, not CC vendors.

Fix deoptimized Perl_newSViv(), from commit 9155444 (3/20/2022 3:05:10 PM, "Perl_newSViv: simplify by using (inline) newSV_type"). That commit forgot about Perl_newSVuv(). newSV_type() is an inline fn, and "inline" is a CC domain UB optimization. newSV_type() is far more complex than the CPP macro new_SV(), depends on 100% perfection from the CC's LTO engine and ".o" disk format, and possibly depends on the CC breaking the ISO C spec with -O3 or -O4, which turn on extreme SEGV inducing C variable aliasing rules that few C code bases tolerate.

Quick examples: a reddit comment (not credible) claims "uint8_t *" and "char *" cannot be cast, since the CC or CPU has 9 bit bytes or a 9 wire data bus, and the ECC parity wire is 9 of 9 for "uint8_t" and 8 of 9 for "signed char", and wire 9 for "char" is the ECC parity wire. The platform's libc's fwritef() secretly converts 9 bit bytes to standard 8 bit bytes, making the CC "ISO C compliant". My more realistic scenario is inside newSV_type(): how can the CC know whether Perl_more_sv() or Perl_more_bodies() calls mprotect(), modifies "static const struct body_details bodies_by_type [];", calls mprotect() again, and returns execution to newSV_type()?

Just switch to new_SV(). It's a CPP macro, not subject to CC UB inlining, and new_SV() only has 1 fn call and is super light weight. Old P5P commits/ML/CPAN dev talk about this area of code being crucial to (CPAN XS) deserializing perf in perl, so perf considerations, with proper asserts, have priority over readability. Links to old core commits in Perl#22653 briefly discuss deserializing perf as the rationale, so this patch also follows that design idea.

Perl_vnewSVpvf(): "malloc(1ch);", which in reality is "malloc(16ch)", makes no sense, since there is almost zero chance that fmtstr+args+\0 <= 16, and perl malloc() round up is semi-UB/a build flag default-on anyways. Using a guesstimate malloc(pat_len) makes it far more likely that a realloc() inside sv_vcatpvfn_flags() (the OS realloc()) will realloc() in place, not changing the ptr, esp assuming the OS malloc() uses a bucket-of-power-of-2 allocator algo. Assume a 40ch malloc() for the fmt string, bucketed to 64ch by the OS malloc(); throw in a 32b %u, which is max +10ch-2ch for "%u". So the output is 48ch. realloc(48ch) is in place, therefore it is a win.
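The 40ch/64ch/48ch arithmetic in the last paragraph can be checked with a toy model. This is only a sketch under the stated assumption that the allocator rounds every request up to a power of 2 with a 16 byte minimum; `bucket_size()` and `grows_in_place()` are hypothetical helper names, not perl or libc API, and real allocators differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a power-of-2 bucket allocator: round a request
 * up to the next power of two, with a 16 byte minimum bucket.
 * This is an assumption for illustration, not how any given OS works. */
static size_t bucket_size(size_t request)
{
    size_t bucket = 16;
    /* Guard against overflow of the shift below. */
    assert(request > 0 && request <= ((size_t)1 << (sizeof(size_t) * 8 - 1)));
    while (bucket < request)
        bucket <<= 1;
    return bucket;
}

/* True if growing from old_len to new_len stays inside the bucket that
 * was handed out for old_len, i.e. a realloc() in this model would not
 * have to move the block. */
static int grows_in_place(size_t old_len, size_t new_len)
{
    return bucket_size(new_len) <= bucket_size(old_len);
}
```

In this model `grows_in_place(40, 48)` holds (both land in the 64 byte bucket), while `grows_in_place(1, 48)` does not (16 byte bucket vs 64 byte bucket), which is the case the patch is trying to avoid.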
The intention was always that calls to |
Ah, I've seen there's some discussion on #p5p. I'll try to catch up on that tonight.
c7aaa28 to 2050add
repushed branch, fixed -DNO_TAINT_SUPPORT build failure
This p.r. has repeatedly failed to build on one of our CI setups. Please see:
-design and rationale are in src comments. This patch forces MSVC 2022 x64 to use 64b integer math/CPU ops (regs RAX/RDX/RCX), vs 2 sequences/pairs of EAX/EBX/ECX register ops, removing a couple of CPU instructions in filling out the SV HEAD. This optimization will translate to all OSes. It is broken out into a separate commit for git bisect reasons, since it touches the alignment topic. As with part 1, some members of the community care about rapidly creating massive amounts of SVIVs/SVUVs/SVNVs when deserializing wire/protocol/disk formats, or in big data sci num crunching.
This reverts commit aae9cea. Author note: hand editing was required to revert, since the commit was from 2005 and it is 2024.

Part 1 of ? to optimize and reduce the overhead of the SvTAINT() macro inside all SV * allocator fncs. Using the taint feature is rare, and "push(sv), push(my_perl), call()" is a lot smaller machine code at the many call sites than "push(0), push(0), push(116), push(0), push(sv), push(my_perl), call()", and using taint at runtime means the user decided perf is irrelevant vs security. newSViv()/newSVuv()/newSVnv() are malloc()-free, but Perl_sv_magicext() contains "sv_upgrade(SVt_PVMG); calloc(1,0x30);" and not for taint-feat, but also a 2nd "malloc(0x****)". Factor out all those sv_magic() calls into a wrapper for the unlikely branch. SvTAINT() has many call sites in the hottest parts of perl.
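A minimal sketch of the "wrapper for the unlikely branch" idea, using toy stand-ins rather than perl's real types: `toy_sv`, `toy_sv_magic()`, `toy_sv_taint()` and the two globals are all hypothetical names for illustration. The constant 116 from the push sequence above is ASCII 't', the value of PERL_MAGIC_taint. The point is that the hot call site passes one argument, and the constant arguments are filled in once, inside the cold wrapper.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an SV head; not perl's real layout. */
typedef struct toy_sv { unsigned flags; } toy_sv;

#define TOY_MAGIC_TAINT  't'   /* ASCII 116 */
#define TOY_TAINTED_FLAG 0x1u

/* Stand-ins for PL_tainting / PL_tainted. */
static int toy_tainting;
static int toy_tainted;

/* Heavy, general purpose routine with many arguments (models sv_magic()). */
static void toy_sv_magic(toy_sv *sv, void *obj, int how,
                         const char *name, int namlen)
{
    (void)obj; (void)name; (void)namlen;
    if (how == TOY_MAGIC_TAINT)
        sv->flags |= TOY_TAINTED_FLAG;
}

/* Cold wrapper: callers pass only the SV, the constants live here. */
static void toy_sv_taint(toy_sv *sv)
{
    toy_sv_magic(sv, NULL, TOY_MAGIC_TAINT, NULL, 0);
}

/* Hot allocator-like function: the rare branch costs one short call. */
static toy_sv *toy_new_sv(toy_sv *slot)
{
    slot->flags = 0;
    if (toy_tainting && toy_tainted)
        toy_sv_taint(slot);
    return slot;
}
```

With both globals zero, `toy_new_sv()` never reaches the heavy routine; with both set, the new SV comes back flagged as tainted.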
-make Perl_sv_taint() return the SV *, useful for a future optimization; previously it was void.

This is part 2. Along with part 1, it shows improvement. Delta after parts 1 & 2, miniperl.exe Win64 .text section, VC 2022 -O1:
previous 0x12440C bytes long
after 0x1240AC bytes long
864 bytes of machine code were removed. A bin analysis tool shows Perl_sv_taint() has 62 callers in miniperl.exe.
"SvTAINT();" contains "if(PL_tainting && PL_tainted) sv_taint(sv);", which is 2 one-byte reads and 2 branches. Collapse the 2 bool chars into a U16, so it is exactly 1 read and 1 branch. This strips complexity from the very bottom of the very hot newSVnv/newSViv/newSVuv, and other callers. sv_taint(sv) has 62 callers; not sure how many of them do the 2 reads, 2 branches SvTAINT(sv);, but the change decreased the size of miniperl.exe and therefore perl541.dll, and branches were removed from the newSVnv/newSViv/newSVuv trio.

Delta in machine code bytes between part 2 & 3 (this commit), miniperl.exe Win64 .text section, VC 2022 -O1:
previous 0x1240AC bytes long
after 0x12408C bytes long
That is 0x20 (32) bytes removed.
Perl_newSVnv/Perl_newSViv/Perl_newSVuv currently have to save the fresh SV *, either on the C stack or in non-volatile registers, around the possible Perl_sv_taint() fn call inside SvTAINT(). If Perl_sv_taint() returns its SV * argument, and the result is assigned back to the same C var, then these 3 performance critical SV allocator functions, after plucking the SV head from the arena, never have to store the fresh SV * back to the C stack for any reason during their execution. This optimization removes the push/pop pairs the C compiler emits to save non-volatile registers at function entry and restore them at exit, since after the SvTAINTTC() change, NO variables AT ALL have to be saved around any function calls in Perl_newSVnv/Perl_newSViv/Perl_newSVuv.

Also the SV head *, after being delinked/removed from an arena, can now be kept through the whole function in the x86 EAX/x64 RAX register and passed through to the caller, without a final (non vol) reg to (vol retval reg) mov/copy cpu op. Remember, the eax/rax/retval registers are always wiped after each fn call, but the refactoring of SvTAINTTC() conveniently returns the SV * back to us in the ABI return register, and we let the fresh SV * glide through on the "heavy" Perl_sv_taint() branch, from Perl_sv_taint() to Perl_newSViv()'s caller, without touching it, 0 machine code ops.

Few code sites were changed from SvTAINT() to SvTAINTTC(), to keep this patch smaller, and the Perl_sv_set*vXXX() category of functions all have void return types and can't be chained. Also the Perl_sv_taint() branch can now be tail called or converted to a JMP instead of a CALL, if the CC/OS/ABI wants to. This is the final part of speeding up Perl_newSVnv/Perl_newSViv/Perl_newSVuv; there is nothing else to remove or optimize.
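A rough sketch of why returning the argument helps, again with toy stand-ins: `toy_sv`, `toy_sv_taint()`, `toy_new_sviv()` and the tiny arena are hypothetical, and the definition of SvTAINTTC() itself is not reproduced here. Because the taint function hands the pointer back, the allocator can end with `return toy_sv_taint(sv);`, which puts the call in tail position and leaves nothing live across it. Whether a compiler actually emits a JMP depends on the compiler, flags and ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an SV head; not perl's real layout. */
typedef struct toy_sv { unsigned flags; long iv; } toy_sv;

#define TOY_IOK     0x1u
#define TOY_TAINTED 0x2u

static int toy_taint_on;          /* stand-in for the collapsed taint check */
static toy_sv toy_arena[4];       /* stand-in for the SV arena */
static size_t toy_next;

/* Returns its argument, so callers can chain or tail call it. */
static toy_sv *toy_sv_taint(toy_sv *sv)
{
    sv->flags |= TOY_TAINTED;
    return sv;
}

static toy_sv *toy_new_sviv(long i)
{
    toy_sv *sv = &toy_arena[toy_next++ % 4];
    sv->flags = TOY_IOK;
    sv->iv = i;
    if (toy_taint_on)
        return toy_sv_taint(sv);  /* tail position: sv need not be saved */
    return sv;
}
```

With a void `toy_sv_taint()`, the allocator would have to keep `sv` somewhere that survives the call in order to return it afterwards; that is the save/restore the commit message describes removing.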
2050add to c77e852
fixed asserts for 32b ptr builds
macOS (Monterey) 12 (-Uusethreads) passed. I made no changes except for moving a static assert that failed on i386. macOS (Monterey) 12 (-Uusethreads) now
Is this a flip flop timing test that fails regularly?
IIRC the Win NT kernel API refuses to update access time any faster than once per full second.
related bug tickets I found for
Thinking about how to move this PR on: B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:
Ancient history P5P posts (Sarathy era/early JDB) say MSVC i386 -O1 is faster; 2-3 devs benchmarked the interp on private code. In the 2010s/2020s, I would leave modern/currently supported MSVC on -O1, since MSVC will inline and expand the worst possible code blocks in -O2, like unrolling all Perl_croak()s to Perl_vcroak(), or inlining the Perl_sv_magicext() loop into every XSUB inside libperl.dll (LTO visibility).

MSVC -O2 also writes, x86 machine code wise, all "mov dest_reg, 1 byte (8 bits) constants" aka "imm8s", as 4 byte constants, with 3 useless null bytes. MSVC in -O1 correctly writes 1 byte operands. The optimization logic there is questionable IMO.

Perl is memory starved or branch miss starved; perl isn't FP/algebra/math/video codec starved, and the interp never sits in a 100K iteration loop on top of a fixed RW 256-1024 byte chunk of RAM. Perl's not an MPEG decoder. It doesn't need x86 conditional jumps to be aligned to cache line multiples, with 25%-40% of all bytes inside the function being NOP CPU instructions. Sounds to me like MSVC had max R&D done targeting the Pentium 4 era, which is when they last redesigned/forklift-ed the -O2 subsystem. Pentium 4s, with very long pipelines/high latency, are long obsolete, and from Intel Core 1/2 to today, smaller is better.

I do plan to add -Oi (memFOO()/strFOO()) intrinsics to the -O1 MSVC perl in the near future. That feature is amazing, since "unaligned" libc P5P has the macros in the source code already to do selective!!!!! (DONT YALL GET IDEAS!!!!) "emergency" inlining on the MSVC platform with the current -O1 mode. I DO NOT WANT to see croak() unrolled/inlined all over the code base. So I'm strongly against -O2.

Sitting down with an IDE and some of my other tools, and single stepping (or any other P5P person doing so), and finding individual places to unroll/expand: with a Perl dev, his brain, and background knowledge of what is run loop code and what is "arctic" panic/assert/overflow/"Bizarre copy of" code, rational decisions can be made function by function on what is hot, what can be changed, and whether to add an ultra inline decl tag or not. It has to be done by humans. MSVC's algorithms are too generic, and were designed for video codec drivers that are very tiny in disk file space, or gaming engines with 99% FP SIMD math workloads, not Perl's mundane ETL usage, which is almost nothing but compare/jump/move branching all day long.
I'll rebase them and PR split them. I suspect I'll be facing a wall of rebase conflicts if they go in separately, but I'll take that path anyways (3 more PRs). I'm thinking of replacing some of the C switch trees (binary search) with U32 constants/imm32s of "logic" that used to be in that global table, with code like this. O(1), not the generic, typical CC way of 10 x cmp_()/jump_above()/jump_below() to do it as a C switch.
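The code that comment pointed at is not preserved in this page. As a generic illustration of the technique only (not the author's code), a per-type yes/no answer can be packed into one 32 bit constant and read with a shift and a mask. The enum, the mask and both function names below are hypothetical, and which types have the property is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical small type enum; the members are placeholders. */
enum toy_type { T_NULL, T_IV, T_NV, T_PV, T_PVIV, T_PVNV, T_PVMG, T_COUNT };

/* Switch version: a compiler may lower this to a compare/jump tree. */
static int toy_type_flag_switch(unsigned t)
{
    switch (t) {
    case T_IV:
    case T_NV:
        return 1;
    default:
        return 0;
    }
}

/* Same truth table packed into one imm32 constant, bit N = type N. */
#define TOY_FLAG_MASK ((UINT32_C(1) << T_IV) | (UINT32_C(1) << T_NV))

/* Mask version: one shift and one AND, no branches. */
static int toy_type_flag_mask(unsigned t)
{
    assert(t < 32);  /* shifting a 32 bit value by >= 32 is undefined */
    return (int)((TOY_FLAG_MASK >> t) & 1u);
}
```

Both functions return the same answer for every member of the toy enum; the mask version only works while the type count stays below 32, which would be worth a static assert in real code.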
Inspired by looking at bug #22653, and remarks in the past over these 3 fns being super
important, esp for enterprise serialization/deserial/wire format decoding.
They are also sort of related to a very bad failed optimization (the MSVC compiler went to "-O0" and added 100s of KBs of redundant code in perl541.dll and some KBs more in XS DLLs), done 2-3 years ago in perl core. But I'm still working on a fix/diag/analysis/solution for that. This branch of commits covers more about serial/deserial performance, and taking unique advantage of the fact that IVs, NVs and UVs are no-malloc SVs and that they are bodyless.
Plus in one spot, my additional "bodyless" optimization from some years ago disappeared through code churn at
9155444. I put it back, since bodyless SVs
are very light weight, and newSV_type() is very heavy with many, many branches inside.