Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spu: LLVM arm64 + macOS port #12338

Merged
merged 12 commits into from
Jul 15, 2022
Merged

spu: LLVM arm64 + macOS port #12338

merged 12 commits into from
Jul 15, 2022

Conversation

sguo35
Copy link
Contributor

@sguo35 sguo35 commented Jul 10, 2022

spurs test is working now, but games I tested (Minecraft, DS2) are crashing on a PPU invalid jump somewhere.

Build instructions: see #12115

Note: you need to merge RPCS3/asmjit#1 before anything will compile

Changes:

  • external function calls are marked non-tail as they aren't tail calls (potentially breaking)
  • bumped Arm minimum version to v8.4a for native write-free 128-bit atomics
  • fixed a bug where sp was not properly restored before/after ppu gateway (asmjit silently fails under some invalid register usages)
  • implemented Aarch64 versions of every SPU assembly function. Uses the slow LLVM recompiler rather than the fast x86 assembly one for now.
  • removed the C++ version of exec_rotqby as it's broken. Replacing the intrinsic version with the native version reliably causes spurs test to fail at a specific conditional branch point right after rotqby instruction so I'm 100% confident it's the root cause
  • default to R/W permissions on Apple+arm64 mmap calls

Performance numbers on an M1 Pro:

SPU Test v1.1.2 by GalCiv
SPU Task Avalanche completed in 12377 ms (PS3: 2740 ms)
PPU/SPU Ping-Pong completed in 4648 ms (PS3: 3045 ms)
SPU MFC 64 Bits War completed in 3768 ms (PS3: 3370 ms)
PPU/SPU MFC 64 Bits War completed in 4708 ms (PS3: 4443 ms)
SPU Integer Perf completed in 2862 ms (PS3: 8666 ms)
SPU Float Perf completed in 1213 ms (PS3: 2379 ms)
SPU SpinLock completed in 153185 ms (PS3: 4409 ms)
PUTLLUC Perf completed in 5622 ms (PS3: 3853 ms)
PUTLLC Perf completed in 4976 ms (PS3: 3364 ms)
PUT Perf completed in 3609 ms (PS3: 3984 ms)
Large PUT Perf completed in 4618 ms (PS3: 4454 ms)
--Completed 11 tests in 201590 ms--

Definitely some issues going on with the spinlock perf and high variance in general for performance.

@sguo35 sguo35 marked this pull request as draft July 10, 2022 05:44
@sguo35 sguo35 marked this pull request as ready for review July 10, 2022 06:52
@Nekotekina
Copy link
Member

Restricting to ARMv8.4a sounds like a bad idea.

@shinra-electric
Copy link
Contributor

shinra-electric commented Jul 10, 2022

I have an issue after building where the app will not open because it cannot find libSDL2.dylib.
It is looking within the app bundle, and also in /usr/local.
On Arm Macs, homebrew installs to /opt/homebrew, not /usr/local.

After dropping in libSDL2.dylib into the app bundle and resigning, the app will open fine.

Here is the terminal log for your reference:
Terminal Saved Output.txt

Edit: Reporting a couple of tests I did:
Cube test worked:
Screenshot 2022-07-10 at 12 15 21

Initial setup for the XMB worked fine:
Screenshot 2022-07-10 at 12 33 18
Screenshot 2022-07-10 at 12 33 28
However it did not display the waveform correctly, and soon froze on the XMB.

Excellent work!

Edit:
I fixed the warning: directory not found for option '-L/opt/local/lib' warning by editing the CMakeLists.txt

From:

if(APPLE)
    include_directories(/opt/local/include)
    link_directories(/opt/local/lib)
endif()

To:

if(APPLE)
    include_directories(/opt/homebrew/include)
    link_directories(/opt/homebrew/lib)
endif()

Apparently this warning had nothing to do with the libSDL2.dylib issue, as I still had to manually copy it over...
The log still shows it looking in /usr/local:
'/usr/local/lib/libSDL2.dylib' (no such file)

@kd-11
Copy link
Contributor

kd-11 commented Jul 10, 2022

Removal of atomics breaks build on linux arm64
image

@Whatcookie
Copy link
Member

You should be able to set m_use_fma to true for arm targets, since everything aarch64 should have FMA. Should provide a speedup for the float performance part of the spurs test.

@sguo35
Copy link
Contributor Author

sguo35 commented Jul 11, 2022

Maybe it works on Apple, outside of it 16-byte atomics are PITA.

Restricting to ARMv8.4a sounds like a bad idea.

@Nekotekina Armv8.4a guarantees 16 byte atomic load/stores using ldp/stp if they're 16 byte aligned. I verified and using Clang to compile, atomic_storage<u128>::release will compile to a single store instruction.

I don't know of any Arm CPU that would be capable of running RPCS3 but does not have v8.4a support. All Apple CPUs have 8.4a support, Cortex X2 is v9, and Graviton3/Neoverse V1 is v8.6. Having 128 bit single instruction atomics is surely worth the cost of excluding some machines that would never be capable of running rpcs3 in the first place?

Removal of atomics breaks build on linux arm64

@kd-11 Is this using gcc? What machine are you on?

m_use_fma

@Whatcookie Nice, setting m_use_fma and m_use_avx (doesn't use x86 intrinsics...) brought float runtime to 639ms.

@kd-11
Copy link
Contributor

kd-11 commented Jul 11, 2022

@kd-11 Is this using gcc? What machine are you on?

Yes, gcc. The hardware is still apple M1, just running linux instead of macOS

@Nekotekina
Copy link
Member

@sguo35 Just implement release() using stp or SIMD store, it was marked as TODO anyway. This way it will work on Linux and will inline without linking to atomic library routines.

@sguo35 sguo35 force-pushed the arm64 branch 2 times, most recently from fc29545 to 85db13f Compare July 11, 2022 21:45
@sguo35 sguo35 requested review from Nekotekina and Megamouse July 11, 2022 22:44
@Megamouse Megamouse removed their request for review July 12, 2022 14:15
@Nekotekina
Copy link
Member

I've been running arm64 port of RPCS3 under qemu just fine. Just checked, qemu doesn't support atomic STP/LDP. Probably same goes for many other features. Forcing newer features will definitely break this possibility. It's always better to have support for running under qemu.
We have USE_NATIVE_INSTRUCTIONS switch which sets -march=native if enabled.
Portable code for atomics will look like:

#ifdef FEATURE123
do_123(); // optimized
#else
do_0(); // legacy
#endif

I don't know what macro for arm64 features could be tested, it's just an example.

@sguo35
Copy link
Contributor Author

sguo35 commented Jul 13, 2022

Okay, I changed to check explicitly for Arm version which is (hopefully) portable also to MSVC. Also caught a bug where the rip patchpoint wasn't actually 16 byte aligned... which explains why ldaxp was segfaulting originally.

@Nekotekina
Copy link
Member

LDP/STP may be needing explicit memory barrier instructions, need to check.

@Nekotekina
Copy link
Member

Confirmed:

https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:48,endLineNumber:18,positionColumn:48,positionLineNumber:18,selectionStartColumn:48,selectionStartLineNumber:18,startColumn:48,startLineNumber:18),source:'void+_store(__uint128_t*+dst,+__uint128_t+value)%0A%7B%0A++++__atomic_store(dst,+%26value,+__ATOMIC_SEQ_CST)%3B%0A%7D%0A%0Avoid+_release(__uint128_t*+dst,+__uint128_t+value)%0A%7B%0A++++__atomic_store(dst,+%26value,+__ATOMIC_RELEASE)%3B%0A%7D%0A%0Avoid+_load(__uint128_t*+dst,+__uint128_t*+result)%0A%7B%0A++++__atomic_load(dst,+result,+__ATOMIC_SEQ_CST)%3B%0A%7D%0A%0Avoid+_observe(__uint128_t*+dst,+__uint128_t*+result)%0A%7B%0A++++__atomic_load(dst,+result,+__ATOMIC_RELAXED)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:50,l:'4',m:51.37096774193548,n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:armv8-full-clang-trunk,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-std%3Dc%2B%2B20+-O3+-Wall+-Wextra+-pedantic',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:'1'),l:'5',n:'0',o:'armv8-a+clang+(trunk,+all+architectural+features)+(C%2B%2B,+Editor+%231,+Compiler+%231)',t:'0')),header:(),k:50,l:'4',m:74.03225806451613,n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compilerName:'x86-64+gcc+12.1',editorid:1,fontScale:14,fontUsePx:'0',j:1,wrap:'1'),l:'5',n:'0',o:'Output+of+armv8-a+clang+(trunk,+all+architectural+features)+(Compiler+%231)',t:'0')),header:(),l:'4',m:25.96774193548387,n:'0',o:'',s:0,t:'0')),k:50,l:'3',n:'0',o:'',t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4

@sguo35 sguo35 force-pushed the arm64 branch 2 times, most recently from e010d4f to 5a121ef Compare July 14, 2022 18:31
sguo35 added 3 commits July 14, 2022 11:34
16B ldp/stp are atomic on v8.4a+. See Arm Architecture Reference Manual,
"Changes to single-copy atomicity in Armv8.4". Add load/release atomic
impls for this instruction and add detection for 8.4a+ capability.
Mac/Arm64 pages should be R/W by default due to 16k page
incompatibility. Without this there will be segfaults due to invalid
permissions.
Mark external function calls as non-tail, since they aren't tail calls
and assuming they are will cause returns to fail in Arm64 GHC CC.
sguo35 added 8 commits July 14, 2022 11:34
ASMJIT can silently fail and drop instructions when invalid operations
are performed (e.g. loading/storing sp). Explicitly move sp to a gp
register before doing loads/stores to fix this.
Since there is not yet an arm64 version of the assembly (fast) version.
rotqby C++ implementation is broken, since replacing it with the
intrinsic version reliably fixes spurs test. A conditional branch
immediately after a rotqby instruction will fail using the C++ version
but succeed using the intrinsic.
Implement the ubertrampoline generator for arm64. It generally follows
the x86 version, but uses asmjit to generate code instead of writing raw
opcodes to memory, trading memory usage for readability. Currently the
trampoline implementation is fairly inefficient in terms of instruction
size and is substantially larger than the x86 version.
@Nekotekina Nekotekina merged commit 73ed657 into RPCS3:master Jul 15, 2022
@Nekotekina
Copy link
Member

Need to fix leaking (and also slower) build_function_asm incantations.

@sguo35
Copy link
Contributor Author

sguo35 commented Jul 15, 2022

Need to fix leaking (and also slower) build_function_asm incantations.

I was going to fix these (and optimize code size) after basic games are working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants