Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newsvuvnviv taint api speedup #22662

Open
wants to merge 6 commits into
base: blead
Choose a base branch
from

Conversation

bulk88
Copy link
Contributor

@bulk88 bulk88 commented Oct 13, 2024

inspire by looking at bug #22653, and remarks in the past over these 3 fns being super
important esp for enterprise serialization/deserial/wire format decoding.

They are also sort of related to a very bad failed optimization (MSVC compiler went to "-O0" and added 100s of KBs of redundant code in perl541.dll and some KBs more in XS DLLs), done 2-3 years ago in perl core. But im still working on a fix/diag/analysis/solution for that. This branch of commits covers more about serial/deserial performance, and taking unique advantage that IV NV UVs are no-malloc SV and that they are bodyless.

Plus in one spot, my additional "bodyless" optimization from some years ago disappeared through code churn at
9155444 I put it back, since bodyless SVs
are very light weight. And newSV_type() is very heavy with many many branches inside.

related to "SvUV() macro 100% of time calls Perl_sv_2uv_flags"
Perl#22653

Until Perl#22653 is solved, clean up newSVuv() and remove branch
to "newSViv()" that is unexplained by git blame. BUT, keep original
intent and behaviour of "newSViv()" branch for now.

Add asserts to guard against 0x8000,0000 == SVf_IVisUV changing.
Value of SVf_IVisUV can change in the future, and there might be
(I didn't git blame), logic that sign flag and SVf_IVisUV are equal.
But these changes depend on SVf_IVisUV being 0x8000,0000 and must
be updated if SVf_IVisUV changes.

Change SvXXXV_set() to be an explicity bodyless SV head optimization. MSVC
2022 -O1 combined SET_SVANY_FOR_BODYLESS_IV() and SvIV_set(). But instead
of hopes and prayers on "UB" or "ISB/IDB" of CCs that could change at
random in any previous or future build number of a CC, do it explictly.
Bodyless SV head API is defined by P5P, not CC vendors.

9155444 3/20/2022 3:05:10 PM
Perl_newSViv: simplify by using (inline) newSV_type

Fix deoptimized Perl_newSViv(). In that commit it forgot about
Perl_newSVuv(). Since newSV_type() is a inline fn, and "inline" is
CC domain UB optimization. And newSV_type() is far more complex than
CPP macro new_SV(), and newSV_type() depends on 100% perfection from
CC's LTO engine and ".o" disk format, and possibly depends on the CC
breaking ISO C spec with -O3 or -O4. Which turn on extreme SEGV inducing
C variable aliasing rules that few C code bases tolerate. Quick examples,
a reddit comment (not credible), claims "uint8_t *" and "char *" can not
be casted since the CC or CPU has 9 bit bytes or a 9 wire data bus, and
ECC parity wire is 9 of 9 for "uint8_t" and 8 of 9 for "signed char" and
wire 9 for "char" is the ECC parity wire. The platform's libc's fwritef(), hides the
secretly converts 9 bit bytes, to standard 8 bit bytes, making
the CC "ISO C compliant".

My more realistic scenario, inside newSV_type(). How can the CC know,
what if Perl_more_sv() or Perl_more_bodies(), calls mprotect(), modifies
"static const struct body_details bodies_by_type [];",
calls mprotect() again, and returns execution to newSV_type()?

Just switch to new_SV(). Its a CPP macro, not subject to CC UB inlining,
and new_SV() only has 1 fn call and is super light weight.

Old P5P commits/ML/CPAN dev talk about this area of code being crucial to
(CPAN XS) deserializing perf in perl, so perf considerations,
with proper asserts, has priority over readability.

Links to old core commits in Perl#22653
briefly discuss deserializing perf as rational, so this patch also follows
that design idea.

Perl_vnewSVpvf(), "malloc(1ch);" which in reality is "malloc(16ch)" makes
no sense, since almost zero chance fmtstr+args+\0 <= 16, and perl malloc()
round up, is semi-UB/a build flag default on anyways. Using guesstimate
malloc(pat_len), increases chances far higher, that a realloc() inside
sv_vcatpvfn_flags(), OS realloc(), will realloc() in place, not changing
the ptr, esp assuming OS malloc() does bucket of power of 2
allocator algo. Assume, 40ch malloc() fmt string, bucket to 64ch by
OS malloc(), throw in a %u 32b, that is max +10ch-2ch for "%u".
So output is 48ch. realloc(48ch) is inplace, therefore it is a win.
@richardleach
Copy link
Contributor

newSV_type() is very heavy

The intention was always that calls to newSV_type() with a static type argument would be inlined. (Hence, your bodiless optimisation would not have disappeared.) Are you finding that this is not the case? Because of -O0?

@richardleach
Copy link
Contributor

newSV_type() is very heavy

The intention was always that calls to newSV_type() with a static type argument would be inlined. (Hence, your bodiless optimisation would not have disappeared.) Are you finding that this is not the case? Because of -O0?

Ah, I've seen there's some discussion on #p5p. I'll try to catch up on that tonight.

@bulk88 bulk88 force-pushed the newSVuv_nv_iv_taint_api_speedup branch from c7aaa28 to 2050add Compare October 14, 2024 23:11
@bulk88
Copy link
Contributor Author

bulk88 commented Oct 14, 2024

repushed branch, fixed -DNO_TAINT_SUPPORT build failure

@jkeenan
Copy link
Contributor

jkeenan commented Oct 16, 2024

This p.r. has repeatedly failed to build on one of our CI setups. Please see:
https://github.com/Perl/perl5/actions/runs/11336336622/job/31611461445?pr=22662

-design and rational in src comments, this patch forces MSVC 2022 x64 to
 use 64b integer math/CPU ops (regs RAX/RDX/RCX), vs 2 sequences/pairs of
 EAX/EBX/ECX register ops removing a couple CPU instructions in filling
 out the SV HEAD. This optimization will translate to all OSes. It is
 broken out into a separate commit for git bisect reasons since it
 touches the alignment topic. As with part 1, some members of the
 community care about rapidly creating massive amounts of
 SVIVs/SVUVs/SVNVs in deserializing wire/protocol/disk formations, or
 big data sci num crunching.
This reverts commit aae9cea.

Author note, hand editing required to revert since commit was from 2005
and it is 2024. Part 1 of ? to optimize and reduce overhead of SvTAINT()
macro inside all SV * allocator fncs.

Using taint feature is rare, and "push(sv), push(my_perl), call()" is alot
smaller machine code at the many call sites, than
"push(0), push(0), push(116), push(0), push(sv), push(my_perl), call()"
and using taint at runtime, means the user decided perf is irrelavent vs
security.

newSViv()/newSVuv()/newSVnv() are malloc()-free but Perl_sv_magicext()
contains "sv_upgrade(SVt_PVMG); calloc(1,0x30);" and not for taint-feat,
but also a 2nd "malloc(0x****)". Factor out all those sv_magic() calls
into a wrapper for the unlikely branch. SvTAINT() has many call sites in
hottest parts of perl.
-make Perl_sv_taint() return the SV *, useful for a future optimization
 previous it was void

This part 2, along with part 1. Shows improvement. Delta, after 1 & 2.

previous miniperl.exe Win64 .text section, VC 2022 -O1
0x12440C bytes long

after
0x1240AC bytes long

864 bytes of machine code were removed. A bin analysis tool shows
has Perl_sv_taint() 62 callers in miniperl.exe
"SvTAINT();" contains "if(PL_tainting && PL_tainted) sv_taint(sv);"
that is 2 One Byte reads and 2 branches. Collapse the 2 bool chars, to a
U16, so it is exactly 1 read, and 1 branch. Strips complexity from the
very bottom of the very hot newSVuv/newSViv/newSVuv, and other callers.
sv_taint(sv) has 62 callers, not sure how many do the 2 reads, 2 branches
SvTAINT(sv);, but the change decreased the size of miniperl.exe and
therefore perl541.dll, and branches were removed from the
newSVuv/newSViv/newSVuv trio.

Delta machine code bytes, between part 2 & 3 (this commit).

previous miniperl.exe Win64 .text section, VC 2022 -O1
0x1240AC bytes long

after
0x12408C bytes long
Perl_newSVnv/Perl_newSViv/Perl_newSVuv, currently have to save the fresh
SV *, either on C stack, or in non volatile registers, around the
possible Perl_sv_taint() fn call inside SvTAINT(). If Perl_sv_taint()
returns its SV * argument, and assigns it back to the same C var, now
these 3 performance critical SV allocator functions, after plucking
the SV head from the arena, these 3 function never ever have to
store the fresh SV * back to C stack for any reason during their execution.

This optimization removes pop/push pairs of the C compiler saving
non-volatile registers and restoring them at function entry and exit since
after SvTAINTTC() change, NO variables AT ALL, have to be saved around
any function calls in Perl_newSVnv/Perl_newSViv/Perl_newSVuv. Also the
SV head *, after being delinked/removed from an areana, can now be
stored through the whole function, in the x86 EAX/x64 RAX register, and
pass through to the caller, without a final (non vol) reg to
(vol retval reg) mov/copy cpu op.

Remember eax/rax/retval registers, are always wiped after each fn call,
but the refactoring of SvTAINTTC() conviently returns the
SV * back to us, in the ABI return register, and we let the fresh
SV * glide through on the "heavy" Perl_sv_taint() branch, from
Perl_sv_taint() to Perl_newSViv()'s caller, without touching it, 0
machine code ops.

Few code sites were changed from SvTAINT() to SvTAINTTC(), to keep this
patch smaller, and the Perl_sv_set*vXXX() category of functions, all have
void return types and can't be chained. Also the Perl_sv_taint() branch
can be tail called or converted to a JMP insted of CALL, if the CC/OS/ABI
wants to now.

This is the final part of speeding up
Perl_newSVnv/Perl_newSViv/Perl_newSVuv there is nothing else to remove or
optimze.
@bulk88 bulk88 force-pushed the newSVuv_nv_iv_taint_api_speedup branch from 2050add to c77e852 Compare October 17, 2024 08:32
@bulk88
Copy link
Contributor Author

bulk88 commented Oct 17, 2024

fixed asserts for 32b ptr builds

@bulk88
Copy link
Contributor Author

bulk88 commented Oct 17, 2024

macOS (Monterey) 12 (-Uusethreads) passed.

I did no changes except for moving a static assert that failed i386.

macOS (Monterey) 12 (-Uusethreads)

now

#   Failed test 'write: stat and lstat returned same values'
#   at t/stat.t line 44.
#     Structures begin differing at:
#          $got->[8] = '1729155292.99179'
#     $expected->[8] = '1729155292.76969'
# Looks like you failed 1 test of 43.
../dist/Time-HiRes/t/stat.t .......................................... 
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/43 subtests 

is this a flip flop timing test that fails regularly?

 8 atime    last access time in seconds since the epoch

IIRC Win NT kernel API refuses to update access time any faster than 1 full second.

@bulk88
Copy link
Contributor Author

bulk88 commented Oct 17, 2024

khwilliamson@68c1b38

#19321

related bug tickets I found

for

#     Structures begin differing at:
#          $got->[8] = '1729155292.99179'
#     $expected->[8] = '1729155292.76969'

@richardleach
Copy link
Contributor

Thinking about how to move this PR on:
A) I'm unclear as to whether MSVC uses -O1 rather than -O2 because it's faster for compiling the interpreter, or the run-time performance of the interpreter once built. Please could you clarify?

B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:

  • Restoring the bodyless code as-was for the benefit of MSVC (newSV_type is inlined by gcc/clang)
  • Changes to Perl_vnewSVpvf
  • The changes to taint

@bulk88
Copy link
Contributor Author

bulk88 commented Dec 15, 2024

Thinking about how to move this PR on: A) I'm unclear as to whether MSVC uses -O1 rather than -O2 because it's faster for compiling the interpreter, or the run-time performance of the interpreter once built. Please could you clarify?

Ancient history P5P posts (Sarathy era/early JDB) say MSVC i386 -O1 is faster, 2-3 devs benchmarked the interp on private code. In 2010s/2020s, I would leave modern/current supported MSVC on -O1, since MSVC will inline and expand the worst possible code blocks in -O2. Like unroll all Perl_croak()s to Perl_vcroak(), or inlining the Perl_sv_magicext() loop into every XSUB inside libperl.dll (LTO visibility). MSVC -O2 also writes, x86 machine code wise, all "mov dest_reg, 1 byte (8 bits) constants" aka "imm8s", as 4 byte constants, with 3 useless null bytes. MSVC in -O1 correctly writes 1 byte operands. -O2 expands all constants to 4 bytes operands.

The optimization logic there is questionable IMO. Perl is memory starved or branch miss starved, perl isn't FP/algebra/math/video codecs starved, And the interp never sits in a 100K interation loop, ontop of a fixed RW 256-1024 bytes chunk of ram. Perls not a MPEG decoder. It doesn't need x86 conditional jump's to be aligned to cache line multiples, and inside the function, 25%-40% of all bytes are NOP CPU instructions. Sounds to me MSVC had max R&D done targeting the Pentium 4 era, which is when they last redesigned/forklift-ed -O2 subsystem. Pentium 4s with very long pipelines/high latency, are long obsolete, and starting with Intel Core 1/2 to today, Smaller is better.

I do plan to add -Oi (memFOO()/strFOO()) intrinsics to the -O1 MSVC perl in near future. That feature is amazing, since "unaligned" libc memcmp() or memcpy() or memset() functions, with const/CC frozen inputs, around 2 bytes-16 bytes, somethings 24 bytes, all of these will optimize down to 1 cpu op with -Oi. I have -Oi turned on ever since I got back into perl Makes string parsing/sorting super fast,, and super tiny in machine code. I have plans in near future to add to add -Oi and -GW (>= VC2013 only feature, the linker will now (decades too late) remove unreferenced stuff after CC link phase, Specifically MSVC with -GW will remove const or RO static structs/arrays if they are unreferenced by any other ISO C function or C data structure. I'm NOT talking about ->>>> p = "when is your friend coming"; MSVC correctly from day 1 dedupes or removed nameless/symboless double quoted strings, but it DID NOT NOT NOT ever optimize away static const char warning[] = {"the house is burning"}; even if absolutely no references to the symbol. ="the house is burning"; will be guarenteed to show up as bloat in the final binary. -GW from 2013 finally fixed this. I also need to experiment with ripping out 2KB of "profile guided optimization" metadata in my libperl.dll and XS,DLLs. MS offers no in removing that 2KB data structure except by "do PGO with simulated workloads and recompile your binary with the results from the DB, Thats the only official way to remove the PGO data. There some undocumented command line switches floating I need to experiment with more, but I'm going off topic. Summary, I think it should stay on -O1 but with a couple rationally picked add-on, maybe even -Os or -Ot modes, just NOT Flordia Spring Break -O2 mode.

P5P has the macros in the source code already to do selective!!!!! (DONT YALL GET IDEAS!!!!) "emergency" inlining on MSVC platform with the current -O1 mode. I DO NOT WANT to see croak() unrolled/inlined all over the code base. So Im strongly against -O2.

Sitting down with an IDE and some of my other tools, and single stepping (or any other P5P person doing), and finding individual places to unroll/expand, including a Perl dev, and his brain, and background knowledge of what is run loop code, and what is "artic" panic/assert/overflow/"bizzare copy of" code, then rational decisions can be done function by function, on what is hot, what can be changed, add ultra inline decl tag or not? It had to be done by humans, MSVC's algorithms are too generic, and were design for very tiny in disk file space video codecs drivers or gaming engines with 99% FP SMID math workloads. Not Perl's mundane ETL usage which is almost nothing but compare/jump/move branching all day long.

B) These changes would likely get through review much faster if the PR was split up into 3 separate PRs:

Ill rebase them and PR split them. I suspect Ill be facing a wall of rebase conflicts if they go in separate, but do that path anyways (3 more PRs)

I', thinking of replacing some of the C switch trees (binary search), with U32es constants/imm32 of "logic" that used to be in that global table.

if(MallocerBodyNoArenaFlag &( 1 << SVtPVLV)) { return foo(); }

with code like this. O(1), not generic typica; CC way of 10 x cmp_()/jump_above()/jump_below() to do it as a C switch,.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants