Newsvuvnviv taint api speedup #22662

bulk88 · 2024-10-13T13:22:26Z

Inspired by looking at bug #22653, and by past remarks that these 3 fns are super important, esp. for enterprise serialization/deserialization/wire format decoding.

They are also sort of related to a very bad failed optimization (MSVC compiler went to "-O0" and added 100s of KBs of redundant code in perl541.dll and some KBs more in XS DLLs), done 2-3 years ago in perl core. But I'm still working on a fix/diag/analysis/solution for that. This branch of commits is more about serialization/deserialization performance, and takes advantage of the fact that IV/NV/UV SVs need no malloc() because they are bodyless.

Plus in one spot, my additional "bodyless" optimization from some years ago disappeared through code churn at 9155444. I put it back, since bodyless SVs are very lightweight, while newSV_type() is very heavy, with many branches inside.
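To make "bodyless" concrete, here is a minimal sketch of the idea as I understand it: the head's body pointer is aimed at an address computed so that the body's IV field lands on a slot inside the head itself, so no body is allocated. The toy_* structs and functions are hypothetical stand-ins, not perl's real SV layout or macros, and the accessors go through a plain intptr_t pointer to keep the toy within ISO C aliasing rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT perl's real definitions. */
typedef struct toy_body {
    void    *stash;   /* fields a full body would carry before the IV */
    void    *magic;
    intptr_t iv;      /* the only field a bodyless IV needs */
} toy_body;

typedef struct toy_sv {
    toy_body *any;    /* models the head's body pointer */
    uint32_t  refcnt;
    uint32_t  flags;
    intptr_t  u_iv;   /* storage slot inside the head itself */
} toy_sv;

/* Aim `any` so that the body's `iv` field would sit exactly on head->u_iv. */
static void toy_set_any_for_bodyless_iv(toy_sv *sv)
{
    sv->any = (toy_body *)((uintptr_t)&sv->u_iv - offsetof(toy_body, iv));
}

/* Address of the IV as seen "through the body pointer". */
static intptr_t *toy_iv_slot(const toy_sv *sv)
{
    return (intptr_t *)((uintptr_t)sv->any + offsetof(toy_body, iv));
}

static void toy_set_iv(toy_sv *sv, intptr_t v)
{
    *toy_iv_slot(sv) = v;
}

static intptr_t toy_get_iv(const toy_sv *sv)
{
    return *toy_iv_slot(sv);
}
```

A store through the body pointer ends up in the head, which is why such an SV can be created without any malloc() for a body.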

Related to "SvUV() macro 100% of time calls Perl_sv_2uv_flags" Perl#22653. Until Perl#22653 is solved, clean up newSVuv() and remove the branch to "newSViv()" that is unexplained by git blame, BUT keep the original intent and behaviour of the "newSViv()" branch for now. Add asserts to guard against 0x8000,0000 == SVf_IVisUV changing. The value of SVf_IVisUV can change in the future, and there might be (I didn't git blame) logic that relies on the sign flag and SVf_IVisUV being equal. But these changes depend on SVf_IVisUV being 0x8000,0000 and must be updated if SVf_IVisUV changes.

Change SvXXXV_set() to be an explicit bodyless SV head optimization. MSVC 2022 -O1 combined SET_SVANY_FOR_BODYLESS_IV() and SvIV_set(). But instead of hopes and prayers on "UB" or "ISB/IDB" of CCs that could change at random in any previous or future build number of a CC, do it explicitly. The bodyless SV head API is defined by P5P, not CC vendors.

9155444 3/20/2022 3:05:10 PM "Perl_newSViv: simplify by using (inline) newSV_type". Fix the deoptimized Perl_newSViv(). That commit forgot about Perl_newSVuv(). newSV_type() is an inline fn, and "inline" is a CC domain UB optimization. newSV_type() is far more complex than the CPP macro new_SV(), and newSV_type() depends on 100% perfection from the CC's LTO engine and ".o" disk format, and possibly depends on the CC breaking the ISO C spec with -O3 or -O4, which turn on extreme SEGV-inducing C variable aliasing rules that few C code bases tolerate.

Quick example: a reddit comment (not credible) claims "uint8_t *" and "char *" can not be cast to each other since the CC or CPU has 9 bit bytes or a 9 wire data bus, and the ECC parity wire is 9 of 9 for "uint8_t" and 8 of 9 for "signed char", and wire 9 for "char" is the ECC parity wire. The platform's libc's fwritef() secretly converts 9 bit bytes to standard 8 bit bytes, making the CC "ISO C compliant".

My more realistic scenario, inside newSV_type(): how can the CC know whether Perl_more_sv() or Perl_more_bodies() calls mprotect(), modifies "static const struct body_details bodies_by_type [];", calls mprotect() again, and returns execution to newSV_type()? Just switch to new_SV(). It's a CPP macro, not subject to CC UB inlining, and new_SV() only has 1 fn call and is super lightweight. Old P5P commits/ML/CPAN dev talk describe this area of code as crucial to (CPAN XS) deserializing perf in perl, so perf considerations, with proper asserts, have priority over readability. Links to old core commits in Perl#22653 briefly discuss deserializing perf as the rationale, so this patch also follows that design idea.

Perl_vnewSVpvf(): "malloc(1ch);", which in reality is "malloc(16ch)", makes no sense, since there is almost zero chance that fmtstr+args+\0 <= 16, and perl malloc() round up is semi-UB/a build flag default on anyways. Using a guesstimate malloc(pat_len) makes it far more likely that a realloc() inside sv_vcatpvfn_flags() (OS realloc()) will realloc() in place, not changing the ptr, esp. assuming the OS malloc() uses a power-of-2 bucket allocator algo. Assume a 40ch fmt string malloc(), bucketed to 64ch by the OS malloc(); throw in a 32b %u, which is max +10ch-2ch for "%u". So the output is 48ch. realloc(48ch) is in place, therefore it is a win.
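The Perl_vnewSVpvf() sizing argument can be checked with a small model. This is only a sketch under the stated assumption that the allocator rounds each request up to the next power of 2; bucket_size() and grows_in_place() are hypothetical helpers for illustration, not anything in perl or in a real malloc().

```c
#include <assert.h>
#include <stddef.h>

/* Assumed allocator model: every request is rounded up to a power of 2. */
static size_t bucket_size(size_t request)
{
    size_t b = 1;
    while (b < request)
        b <<= 1;
    return b;
}

/* In this model, growing a block from `initial` to `final` bytes can stay
 * in place only if `final` still fits the bucket chosen for `initial`. */
static int grows_in_place(size_t initial, size_t final)
{
    return final <= bucket_size(initial);
}

/* Worst-case output length when one 2-char "%u" expands to 10 digits. */
static size_t worst_case_with_one_u32(size_t pat_len)
{
    assert(pat_len >= 2);        /* the pattern contains at least "%u" */
    return pat_len - 2 + 10;
}
```

With a 40ch pattern the first allocation lands in a 64ch bucket and the 48ch worst case still fits, while a 16ch first allocation would not hold it.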

richardleach · 2024-10-14T09:41:05Z

newSV_type() is very heavy

The intention was always that calls to newSV_type() with a static type argument would be inlined. (Hence, your bodiless optimisation would not have disappeared.) Are you finding that this is not the case? Because of -O0?
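A minimal sketch of the effect being asked about, using hypothetical toy_* names rather than perl's real newSV_type(): when an inline function that switches on a type is called with a compile-time constant, an optimizing compiler that inlines the call can drop the other arms. Whether a given compiler and optimization level actually does so is exactly the open question here.

```c
#include <assert.h>

enum toy_type { TOY_IV, TOY_PV, TOY_PVMG };

typedef struct {
    enum toy_type type;
    int           has_body;
} toy_sv;

/* Generic constructor: branches on `type` when it is only known at run time. */
static inline toy_sv toy_new_sv_type(enum toy_type type)
{
    toy_sv sv;
    sv.type = type;
    switch (type) {
    case TOY_IV:
        sv.has_body = 0;   /* bodyless fast path */
        break;
    default:
        sv.has_body = 1;   /* a real implementation would allocate a body */
        break;
    }
    return sv;
}

/* Constant argument: after inlining, only the TOY_IV arm needs to remain. */
static toy_sv toy_new_iv(void)
{
    return toy_new_sv_type(TOY_IV);
}
```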

richardleach · 2024-10-14T10:09:06Z

Ah, I've seen there's some discussion on #p5p. I'll try to catch up on that tonight.

bulk88 · 2024-10-14T23:13:17Z

repushed branch, fixed -DNO_TAINT_SUPPORT build failure

jkeenan · 2024-10-16T11:39:39Z

This p.r. has repeatedly failed to build on one of our CI setups. Please see:
https://github.com/Perl/perl5/actions/runs/11336336622/job/31611461445?pr=22662

-design and rationale in src comments. This patch forces MSVC 2022 x64 to use 64b integer math/CPU ops (regs RAX/RDX/RCX), vs 2 sequences/pairs of EAX/EBX/ECX register ops, removing a couple of CPU instructions in filling out the SV HEAD. This optimization will translate to all OSes. It is broken out into a separate commit for git bisect reasons since it touches the alignment topic. As with part 1, some members of the community care about rapidly creating massive amounts of SVIVs/SVUVs/SVNVs when deserializing wire/protocol/disk formats, or in big data sci number crunching.

This reverts commit aae9cea. Author note: hand editing was required to revert, since the commit was from 2005 and it is 2024. Part 1 of ? to optimize and reduce overhead of the SvTAINT() macro inside all SV * allocator fncs. Using the taint feature is rare, and "push(sv), push(my_perl), call()" is a lot smaller machine code at the many call sites than "push(0), push(0), push(116), push(0), push(sv), push(my_perl), call()", and using taint at runtime means the user decided perf is irrelevant vs security. newSViv()/newSVuv()/newSVnv() are malloc()-free, but Perl_sv_magicext() contains "sv_upgrade(SVt_PVMG); calloc(1,0x30);" and, though not for the taint feature, also a 2nd "malloc(0x****)". Factor out all those sv_magic() calls into a wrapper for the unlikely branch. SvTAINT() has many call sites in the hottest parts of perl.
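The call-site argument can be sketched as follows. The many-argument magic call is hidden behind a one-argument wrapper, so each hot call site only passes the SV. All toy_* names are hypothetical stand-ins, not perl's real functions; 't' is ASCII 116, matching the push(116) quoted above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_sv { int tainted; } toy_sv;

static int toy_tainting = 0;   /* models "taint mode is switched on" */

/* Models the general-purpose magic call with its long argument list. */
static void toy_sv_magic(toy_sv *sv, void *obj, int how,
                         const char *name, long namlen)
{
    (void)obj; (void)name; (void)namlen;
    if (how == 't')            /* 't' == 116 */
        sv->tainted = 1;
}

/* Cold wrapper: the constant arguments are materialized once, here,
 * instead of at every call site. */
static void toy_sv_taint(toy_sv *sv)
{
    toy_sv_magic(sv, NULL, 't', NULL, 0);
}

/* Hot path: one test, and on the rare branch a one-argument call. */
static void toy_finish_new_sv(toy_sv *sv)
{
    if (toy_tainting)
        toy_sv_taint(sv);
}
```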

-make Perl_sv_taint() return the SV *, useful for a future optimization; previously it was void. This is part 2, along with part 1. Shows improvement. Delta after 1 & 2, miniperl.exe Win64 .text section, VC 2022 -O1: previously 0x12440C bytes long, after 0x1240AC bytes long; 864 bytes of machine code were removed. A bin analysis tool shows Perl_sv_taint() has 62 callers in miniperl.exe.

"SvTAINT();" contains "if(PL_tainting && PL_tainted) sv_taint(sv);", that is 2 one-byte reads and 2 branches. Collapse the 2 bool chars to a U16, so it is exactly 1 read and 1 branch. This strips complexity from the very bottom of the very hot newSVnv/newSViv/newSVuv, and other callers. sv_taint(sv) has 62 callers; not sure how many do the 2 reads, 2 branches SvTAINT(sv);, but the change decreased the size of miniperl.exe and therefore perl541.dll, and branches were removed from the newSVnv/newSViv/newSVuv trio. Delta in machine code bytes between part 2 & 3 (this commit), miniperl.exe Win64 .text section, VC 2022 -O1: previously 0x1240AC bytes long, after 0x12408C bytes long (0x20 = 32 bytes removed).

Perl_newSVnv/Perl_newSViv/Perl_newSVuv currently have to save the fresh SV *, either on the C stack or in non-volatile registers, around the possible Perl_sv_taint() fn call inside SvTAINT(). If Perl_sv_taint() returns its SV * argument, and it is assigned back to the same C var, then these 3 performance critical SV allocator functions, after plucking the SV head from the arena, never have to store the fresh SV * back to the C stack for any reason during their execution. This optimization removes the push/pop pairs with which the C compiler saves and restores non-volatile registers at function entry and exit, since after the SvTAINTTC() change NO variables AT ALL have to be saved around any function calls in Perl_newSVnv/Perl_newSViv/Perl_newSVuv. Also the SV head *, after being delinked/removed from an arena, can now be kept through the whole function in the x86 EAX/x64 RAX register, and pass through to the caller without a final (non-vol) reg to (vol retval reg) mov/copy CPU op. Remember the eax/rax/retval registers are always wiped after each fn call, but the refactoring of SvTAINTTC() conveniently returns the SV * back to us in the ABI return register, and we let the fresh SV * glide through on the "heavy" Perl_sv_taint() branch, from Perl_sv_taint() to Perl_newSViv()'s caller, without touching it, 0 machine code ops. Few code sites were changed from SvTAINT() to SvTAINTTC(), to keep this patch smaller, and the Perl_sv_set*vXXX() category of functions all have void return types and can't be chained. Also the Perl_sv_taint() branch can now be tail called, or converted to a JMP instead of a CALL, if the CC/OS/ABI wants to. This is the final part of speeding up Perl_newSVnv/Perl_newSViv/Perl_newSVuv; there is nothing else to remove or optimize.
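The shape described above, sketched with hypothetical toy_* names (the arena, struct and functions are illustration only): because the cold taint function hands its argument back, the allocator can end with "return toy_sv_taint(sv)", which leaves nothing live across the call and puts it in tail position. Whether the compiler turns it into a jump is up to the compiler and ABI.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_sv { long iv; int tainted; } toy_sv;

static toy_sv toy_arena[4];          /* stand-in for an SV arena */
static size_t toy_next = 0;
static int    toy_taint_active = 0;

/* Cold path: returns its argument so callers can chain or tail-call it. */
static toy_sv *toy_sv_taint(toy_sv *sv)
{
    sv->tainted = 1;
    return sv;
}

static toy_sv *toy_new_iv(long v)
{
    toy_sv *sv = &toy_arena[toy_next++ % 4];
    sv->iv = v;
    sv->tainted = 0;
    if (toy_taint_active)
        return toy_sv_taint(sv);     /* tail position, sv not needed after */
    return sv;
}
```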

bulk88 · 2024-10-17T08:33:44Z

fixed asserts for 32b ptr builds
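For context, the commit message describes asserts guarding the assumption that SVf_IVisUV is 0x80000000. Below is a minimal sketch of that kind of compile-time guard, plus one way a flag at that bit position could be derived without a branch from the top bit of a 64-bit value. TOY_IVisUV, TOY_IOK and toy_uv_flags() are hypothetical names, the 64-bit UV is an assumption, and this is not the code from the PR.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_IVisUV UINT32_C(0x80000000)  /* value quoted in the commit message */
#define TOY_IOK    UINT32_C(0x00000100)  /* hypothetical "holds an integer" flag */

/* Fails to compile if the flag ever stops being the 32-bit sign bit. */
_Static_assert(TOY_IVisUV == (UINT32_C(1) << 31),
               "flag must equal the 32-bit sign bit");

/* Bit 63 of `u` is set exactly when `u` does not fit in a signed 64-bit
 * integer. Shifting right by 32 moves bit 63 down to bit 31, where it
 * lines up with TOY_IVisUV, so no comparison or branch is needed. */
static uint32_t toy_uv_flags(uint64_t u)
{
    return TOY_IOK | ((uint32_t)(u >> 32) & TOY_IVisUV);
}
```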

bulk88 · 2024-10-17T16:12:15Z

macOS (Monterey) 12 (-Uusethreads) passed.

I made no changes except for moving a static assert that failed on i386.

macOS (Monterey) 12 (-Uusethreads) now:

#   Failed test 'write: stat and lstat returned same values'
#   at t/stat.t line 44.
#     Structures begin differing at:
#          $got->[8] = '1729155292.99179'
#     $expected->[8] = '1729155292.76969'
# Looks like you failed 1 test of 43.
../dist/Time-HiRes/t/stat.t .......................................... 
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/43 subtests

is this a flip flop timing test that fails regularly?

 8 atime    last access time in seconds since the epoch

IIRC Win NT kernel API refuses to update access time any faster than 1 full second.

bulk88 · 2024-10-17T17:04:55Z

Related bug tickets I found: khwilliamson@68c1b38 and #19321, for

#     Structures begin differing at:
#          $got->[8] = '1729155292.99179'
#     $expected->[8] = '1729155292.76969'

richardleach · 2024-12-14T14:04:01Z

Thinking about how to move this PR on:
A) I'm unclear as to whether MSVC uses -O1 rather than -O2 because it's faster for compiling the interpreter, or the run-time performance of the interpreter once built. Please could you clarify?

B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:

Restoring the bodyless code as-was for the benefit of MSVC (newSV_type is inlined by gcc/clang)
Changes to Perl_vnewSVpvf
The changes to taint

bulk88 · 2024-12-15T18:37:48Z

Thinking about how to move this PR on: A) I'm unclear as to whether MSVC uses -O1 rather than -O2 because it's faster for compiling the interpreter, or the run-time performance of the interpreter once built. Please could you clarify?

Ancient history P5P posts (Sarathy era/early JDB) say MSVC i386 -O1 is faster; 2-3 devs benchmarked the interp on private code. In the 2010s/2020s, I would leave modern/currently supported MSVC on -O1, since MSVC will inline and expand the worst possible code blocks in -O2. Like unrolling all Perl_croak()s to Perl_vcroak(), or inlining the Perl_sv_magicext() loop into every XSUB inside libperl.dll (LTO visibility). MSVC -O2 also writes, x86 machine code wise, all "mov dest_reg, 1 byte (8 bits) constant" operands, aka "imm8s", as 4 byte constants, with 3 useless null bytes. MSVC in -O1 correctly writes 1 byte operands; -O2 expands all constants to 4 byte operands.

The optimization logic there is questionable IMO. Perl is memory starved or branch miss starved; perl isn't FP/algebra/math/video codec starved. And the interp never sits in a 100K iteration loop on top of a fixed RW 256-1024 byte chunk of RAM. Perl's not an MPEG decoder. It doesn't need x86 conditional jumps to be aligned to cache line multiples, with 25%-40% of all bytes inside the function being NOP CPU instructions. Sounds to me like MSVC had max R&D done targeting the Pentium 4 era, which is when they last redesigned/forklifted the -O2 subsystem. Pentium 4s, with very long pipelines/high latency, are long obsolete, and from Intel Core 1/2 to today, smaller is better.

I do plan to add -Oi (memFOO()/strFOO() intrinsics) to the -O1 MSVC perl in the near future. That feature is amazing, since "unaligned" libc memcmp() or memcpy() or memset() calls with const/CC frozen inputs of around 2-16 bytes, sometimes 24 bytes, will all optimize down to 1 CPU op with -Oi. I have had -Oi turned on ever since I got back into perl. It makes string parsing/sorting super fast, and super tiny in machine code.

I also plan in the near future to add -Gw (a >= VC2013 only feature; the linker will now, decades too late, remove unreferenced stuff after the CC link phase). Specifically, MSVC with -Gw will remove const or RO static structs/arrays if they are unreferenced by any other ISO C function or C data structure. I'm NOT talking about p = "when is your friend coming"; MSVC correctly, from day 1, dedupes or removes nameless/symbolless double quoted strings. But it DID NOT ever optimize away static const char warning[] = {"the house is burning"}; even if there are absolutely no references to the symbol, "the house is burning" is guaranteed to show up as bloat in the final binary. -Gw from 2013 finally fixed this.

I also need to experiment with ripping out 2KB of "profile guided optimization" metadata in my libperl.dll and XS DLLs. MS offers no way of removing that 2KB data structure except "do PGO with simulated workloads and recompile your binary with the results from the DB". That's the only official way to remove the PGO data. There are some undocumented command line switches floating around that I need to experiment with more, but I'm going off topic.

Summary: I think it should stay on -O1 but with a couple of rationally picked add-ons, maybe even -Os or -Ot modes, just NOT Florida Spring Break -O2 mode.
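As an illustration of the kind of call that intrinsic expansion targets, here is a minimal sketch: a memcmp() whose length and one operand are compile-time constants. is_true_keyword() is a hypothetical helper, not perl code, and whether the call collapses to a single load and compare depends on the compiler and its flags.

```c
#include <assert.h>
#include <string.h>

/* The length (4) and the right-hand operand are constants, so a compiler
 * with intrinsics enabled may replace the library call with one 32-bit
 * load and one compare. The caller must supply at least 4 readable bytes. */
static int is_true_keyword(const char *p)
{
    assert(p != NULL);
    return memcmp(p, "true", 4) == 0;
}
```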

P5P has the macros in the source code already to do selective!!!!! (DONT YALL GET IDEAS!!!!) "emergency" inlining on the MSVC platform with the current -O1 mode. I DO NOT WANT to see croak() unrolled/inlined all over the code base. So I'm strongly against -O2.

By sitting down with an IDE and some of my other tools and single stepping (me or any other P5P person doing it), and finding individual places to unroll/expand, with a Perl dev, his brain, and background knowledge of what is run loop code and what is "arctic" panic/assert/overflow/"Bizarre copy of" code, rational decisions can be made function by function on what is hot, what can be changed, and whether to add an ultra inline decl tag or not. It has to be done by humans. MSVC's algorithms are too generic, and were designed for very tiny (in disk file space) video codecs, drivers, or gaming engines with 99% FP SIMD math workloads. Not Perl's mundane ETL usage, which is almost nothing but compare/jump/move branching all day long.

B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:

I'll rebase them and split them into PRs. I suspect I'll be facing a wall of rebase conflicts if they go in separately, but I'll take that path anyway (3 more PRs).

I'm thinking of replacing some of the C switch trees (binary search) with U32 constants/imm32s of "logic" that used to be in that global table, with code like this:

if (MallocerBodyNoArenaFlag & (1 << SVt_PVLV)) { return foo(); }

O(1), not the generic typical CC way of 10 x cmp_()/jump_above()/jump_below() to do it as a C switch.
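A self-contained sketch of the bitmask idea next to the switch it would replace. The enum, the mask and the choice of which types are flagged are all hypothetical, picked only for illustration; the point is that a per-type boolean property becomes one shift and one AND against a 32-bit constant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type numbering; not perl's real svtype enum. */
enum toy_svtype {
    TOY_NULL, TOY_IV, TOY_NV, TOY_PV, TOY_PVLV, TOY_PVAV, TOY_TYPE_COUNT
};

_Static_assert(TOY_TYPE_COUNT <= 32, "one bit per type must fit in a U32");

/* One bit per type that has the (made-up) "no arena" property. */
#define TOY_NO_ARENA_MASK \
    ((UINT32_C(1) << TOY_PVLV) | (UINT32_C(1) << TOY_PVAV))

/* Switch form: a compiler may lower this to a chain of compares and jumps. */
static int toy_no_arena_switch(enum toy_svtype t)
{
    switch (t) {
    case TOY_PVLV:
    case TOY_PVAV:
        return 1;
    default:
        return 0;
    }
}

/* Mask form: constant mask, one shift, one AND. */
static int toy_no_arena_mask(enum toy_svtype t)
{
    return (int)((TOY_NO_ARENA_MASK >> t) & 1u);
}
```

The static assertion matters for this design: the mask form stops working silently once there are more types than bits, so the limit is checked at compile time.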

bulk88 force-pushed the newSVuv_nv_iv_taint_api_speedup branch from c7aaa28 to 2050add Compare October 14, 2024 23:11

bulk88 added 5 commits October 17, 2024 04:29

bulk88 force-pushed the newSVuv_nv_iv_taint_api_speedup branch from 2050add to c77e852 Compare October 17, 2024 08:32

github-actions bot added the hasConflicts label Nov 17, 2024

bulk88 mentioned this pull request Apr 11, 2025

Try to force inlining of newSV_type (i -> I in embed.fnc) #23190

Closed
