spu: LLVM arm64 + macOS port #12338
Restricting to ARMv8.4a sounds like a bad idea.
I have an issue after building where the app will not open because it cannot find libSDL2.dylib. After dropping libSDL2.dylib into the app bundle and re-signing, the app opens fine. Here is the terminal log for your reference:
Edit: Reporting a couple of tests I did:
Initial setup for the XMB worked fine:
Excellent work!
Edit: From:
To:
Apparently this warning had nothing to do with the libSDL2.dylib issue, as I still had to manually copy it over...
You should be able to set m_use_fma to true for arm targets, since everything aarch64 should have FMA. Should provide a speedup for the float performance part of the spurs test.
@Nekotekina Armv8.4a guarantees 16 byte atomic load/stores using I don't know of any Arm CPU that would be capable of running RPCS3 but does not have v8.4a support. All Apple CPUs have 8.4a support, Cortex X2 is v9, and Graviton3/Neoverse V1 is v8.6. Having 128 bit single instruction atomics is surely worth the cost of excluding some machines that would never be capable of running rpcs3 in the first place?
@kd-11 Is this using gcc? What machine are you on?
@Whatcookie Nice, setting
Yes, gcc. The hardware is still an Apple M1, just running Linux instead of macOS.
@sguo35 Just implement release() using stp or a SIMD store, it was marked as TODO anyway. This way it will work on Linux and will inline without linking to atomic library routines.
I've been running the arm64 port of RPCS3 under qemu just fine. Just checked, qemu doesn't support atomic STP/LDP. Probably the same goes for many other features. Forcing newer features will definitely break this possibility. It's always better to have support for running under qemu.
#ifdef FEATURE123
    do_123(); // optimized
#else
    do_0(); // legacy
#endif
I don't know what macro for arm64 features could be tested, it's just an example.
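One way the optimized/legacy split above could be decided at runtime instead of with a macro is sketched below. It is an assumption about how detection might be done, not what the PR implements: `has_lse2`, `store_path` and `pick_store_path` are hypothetical names. On Linux the sketch reads the `HWCAP_USCAT` bit, which the kernel uses to report the Armv8.4 unaligned/single-copy atomicity feature (FEAT_LSE2); on macOS it queries the `hw.optional.arm.FEAT_LSE2` sysctl. Any other target reports the feature as absent.

```cpp
#include <cstddef>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#elif defined(__aarch64__) && defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper: true when the host reports FEAT_LSE2 (Armv8.4 atomicity rules).
bool has_lse2()
{
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_USCAT)
    // The Linux kernel exposes FEAT_LSE2 through the USCAT hwcap bit.
    return (getauxval(AT_HWCAP) & HWCAP_USCAT) != 0;
#elif defined(__aarch64__) && defined(__APPLE__)
    int value = 0;
    size_t size = sizeof(value);
    if (sysctlbyname("hw.optional.arm.FEAT_LSE2", &value, &size, nullptr, 0) != 0)
        return false; // key missing: treat as unsupported
    return value != 0;
#else
    return false; // not arm64, or no known way to ask
#endif
}

// Hypothetical labels matching the "optimized" / "legacy" branches in the comment.
enum class store_path { optimized, legacy };

store_path pick_store_path()
{
    // Evaluated once; later calls reuse the cached answer.
    static const store_path path = has_lse2() ? store_path::optimized : store_path::legacy;
    return path;
}
```

A runtime check like this keeps a single binary usable on hosts that do not report the feature, at the cost of a branch (or a function pointer) on the hot path.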
Okay, I changed to check explicitly for Arm version which is (hopefully) portable also to MSVC. Also caught a bug where the rip patchpoint wasn't actually 16 byte aligned... which explains why ldaxp was segfaulting originally.
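For reference, the alignment requirement mentioned above can be checked with a couple of small helpers. This is a generic sketch with hypothetical names (`is_aligned`, `align_up`), not code from the PR; it assumes the alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// True when addr is a multiple of alignment (alignment must be a power of two).
constexpr bool is_aligned(std::uintptr_t addr, std::size_t alignment)
{
    return (addr & (static_cast<std::uintptr_t>(alignment) - 1)) == 0;
}

// Round addr up to the next multiple of alignment (power of two).
constexpr std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment)
{
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (addr + mask) & ~mask;
}
```

An address such as 0x1008 is 8-byte aligned but not 16-byte aligned, so a 16-byte exclusive pair access there would fault; `align_up(0x1008, 16)` moves it to 0x1010.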
LDP/STP may need explicit memory barrier instructions, need to check.
16B ldp/stp are atomic on v8.4a+. See Arm Architecture Reference Manual, "Changes to single-copy atomicity in Armv8.4". Add load/release atomic impls for this instruction and add detection for 8.4a+ capability.
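A minimal sketch of what such load/release helpers might look like follows. It is not the code from this commit: `u128`, `load16_acquire` and `store16_release` are hypothetical names, and the barrier placement (`dmb ishld` after the load, `dmb ish` before the store) is an assumption chosen to give acquire/release ordering. The `ldp`/`stp` path is only single-copy atomic on a CPU that implements the Armv8.4 rules and only for 16-byte aligned addresses. On other targets the sketch falls back to a global spinlock purely so that it compiles and runs everywhere.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 128-bit value; alignas(16) is needed for the atomicity guarantee.
struct alignas(16) u128
{
    std::uint64_t lo;
    std::uint64_t hi;
};

#if !(defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__)))
// Placeholder for non-arm64 builds: one global spinlock, not a lock-free path.
static std::atomic_flag g_fallback_lock = ATOMIC_FLAG_INIT;
#endif

inline u128 load16_acquire(const u128* p)
{
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    u128 r;
    __asm__ volatile(
        "ldp %0, %1, [%2]\n\t" // one 16-byte load
        "dmb ishld"            // keep later accesses after the load (acquire)
        : "=&r"(r.lo), "=&r"(r.hi)
        : "r"(p)
        : "memory");
    return r;
#else
    while (g_fallback_lock.test_and_set(std::memory_order_acquire)) {}
    const u128 r = *p;
    g_fallback_lock.clear(std::memory_order_release);
    return r;
#endif
}

inline void store16_release(u128* p, u128 v)
{
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile(
        "dmb ish\n\t"          // keep earlier accesses before the store (release)
        "stp %1, %2, [%0]"     // one 16-byte store
        :
        : "r"(p), "r"(v.lo), "r"(v.hi)
        : "memory");
#else
    while (g_fallback_lock.test_and_set(std::memory_order_acquire)) {}
    *p = v;
    g_fallback_lock.clear(std::memory_order_release);
#endif
}
```

Because the fast path is plain inline assembly, it inlines at the call site and does not pull in out-of-line atomic library routines.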
Mac/Arm64 pages should be R/W by default due to 16k page incompatibility. Without this there will be segfaults due to invalid permissions.
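The mismatch can be illustrated with a little arithmetic. The sketch below assumes the emulator tracks permissions in 4 KiB units while the host (macOS on arm64) uses 16 KiB pages; `host_page_size`, `page_range` and `host_pages_covering` are hypothetical names. It shows that changing protection for one 4 KiB guest page necessarily touches the whole 16 KiB host page around it, which is why per-4 KiB permissions cannot be enforced there.

```cpp
#include <cstdint>
#include <unistd.h>

// Host page size as reported by the OS (16384 on macOS/arm64, 4096 on most x86-64 hosts).
std::uintptr_t host_page_size()
{
    return static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
}

struct page_range
{
    std::uintptr_t begin;
    std::uintptr_t end; // one past the last byte
};

// Smallest run of whole pages covering [addr, addr + size).
// page must be a power of two.
page_range host_pages_covering(std::uintptr_t addr, std::uintptr_t size, std::uintptr_t page)
{
    const std::uintptr_t mask = page - 1;
    return { addr & ~mask, (addr + size + mask) & ~mask };
}
```

With a 16 KiB page, the 4 KiB region at 0x5000 expands to 0x4000..0x8000, so three neighbouring 4 KiB regions would change permissions along with it.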
Mark external function calls as non-tail, since they aren't tail calls and assuming they are will cause returns to fail in Arm64 GHC CC.
ASMJIT can silently fail and drop instructions when invalid operations are performed (e.g. loading/storing sp). Explicitly move sp to a gp register before doing loads/stores to fix this.
Since there is not yet an arm64 version of the assembly (fast) version.
rotqby C++ implementation is broken, since replacing it with the intrinsic version reliably fixes spurs test. A conditional branch immediately after a rotqby instruction will fail using the C++ version but succeed using the intrinsic.
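For comparing the two paths, a plain reference model of the instruction can help. The sketch below follows the architectural description of rotqby (rotate the quadword left by the byte count in the low 4 bits of the count operand), with bytes stored in SPU big-endian order so that index 0 is the most significant byte. `qword` and `rotqby_ref` are hypothetical names; this is not RPCS3's exec_rotqby and it ignores how the emulator lays out registers on a little-endian host.

```cpp
#include <array>
#include <cstdint>

// 16 bytes in SPU order: index 0 is the most significant byte.
using qword = std::array<std::uint8_t, 16>;

// Reference model: rotate left by (count & 15) bytes.
qword rotqby_ref(const qword& a, std::uint32_t count)
{
    const std::uint32_t n = count & 15; // only the low 4 bits are used
    qword r{};
    for (std::uint32_t i = 0; i < 16; i++)
    {
        // Byte i of the result comes from n positions further right, wrapping around.
        r[i] = a[(i + n) & 15];
    }
    return r;
}
```

Feeding the same inputs to a model like this and to both implementations would show which one diverges, and for which byte counts.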
Implement the ubertrampoline generator for arm64. It generally follows the x86 version, but uses asmjit to generate code instead of writing raw opcodes to memory, trading memory usage for readability. Currently the trampoline implementation is fairly inefficient in terms of instruction size and is substantially larger than the x86 version.
Need to fix leaking (and also slower) build_function_asm incantations.
I was going to fix these (and optimize code size) after basic games are working.
spurs test is working now, but games I tested (Minecraft, DS2) are crashing on a PPU invalid jump somewhere.
Build instructions: see #12115
Note: you need to merge RPCS3/asmjit#1 before anything will compile
Changes:
- bumped Arm minimum version to v8.4a for native write-free 128-bit atomics
- sp was not properly restored before/after ppu gateway (asmjit silently fails under some invalid register usages)
- exec_rotqby as it's broken. Replacing the intrinsic version with the native version reliably causes spurs test to fail at a specific conditional branch point right after the rotqby instruction, so I'm 100% confident it's the root cause
Performance numbers on an M1 Pro:
Definitely some issues going on with the spinlock perf and high variance in general for performance.