Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some ffi/function_callbacks_many tests failing after changing Class::RareType() #52843

Closed
sstrickl opened this issue Jul 4, 2023 · 9 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. gardening

Comments

@sstrickl
Copy link
Contributor

sstrickl commented Jul 4, 2023

There are new test failures on [vm] Make Class::RareType() the instantiated to bounds type..

The tests

ffi/function_callbacks_many_test/0 Crash (expected Pass)
ffi/function_callbacks_many_test/2 Crash (expected Pass)

are failing on configurations

vm-linux-release-riscv64-qemu

Logs:

Doesn't reproduce locally with either release or debug, but given the value of pc in an example stack trace:

stderr:
===== CRASH =====
si_signo=Segmentation fault(11), si_code=1, si_addr=0x7f6897d80000
version=3.1.0-edge.7cc005ea1a4ca2b7d245023d15f3e5055e6ae314 (be) (Mon Jul 3 14:52:19 2023 +0000) on "linux_riscv64"
pid=233432, thread=233449, isolate_group=main(0x264d5b0), isolate=main(0x2650670)
os=linux, arch=riscv64, comp=no, sim=no
isolate_instructions=281fa0, vm_instructions=281fc0
fp=402cfd8158, sp=780, pc=2
Stack dump aborted because GetAndValidateThreadStackBounds failed.
  pc 0x0000000000000002 fp 0x000000402cfd8158 Unknown symbol

--- Re-run this test:
python3 tools/test.py -n vm-linux-release-riscv64-qemu ffi/function_callbacks_many_test/2

I'm guessing a Smi got loaded in there, and since Class::RareType() is used in IL/assembly generation for Smi checking, wouldn't be surprised if it's something involved in that. However, I'd have expected that to cause more widespread failures than tests on a single emulated architecture that I can't even reproduce locally (and the broken FileIOSink benchmarks on golem), so whatever it is must be quite subtle.

@sstrickl sstrickl added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. gardening labels Jul 4, 2023
@sstrickl sstrickl self-assigned this Jul 4, 2023
@sstrickl
Copy link
Contributor Author

sstrickl commented Jul 4, 2023

/cc @dcharkes since he may have ideas of what code path is exercised in those two versions of the test but not the others. The different options at the top:

// VMOptions=--stacktrace-every=100
// VMOptions=--write-protect-code --no-dual-map-code
// VMOptions=--write-protect-code --no-dual-map-code --stacktrace-every=100
// VMOptions=--use-slow-path
// VMOptions=--use-slow-path --stacktrace-every=100
// VMOptions=--use-slow-path --write-protect-code --no-dual-map-code
// VMOptions=--use-slow-path --write-protect-code --no-dual-map-code --stacktrace-every=100
// SharedObjects=ffi_test_functions

So it's not failing with --use-slow-path, and only when called with --stacktrace-every=100.

@sstrickl
Copy link
Contributor Author

sstrickl commented Jul 4, 2023

Okay, checking the generated assembly before and after reveal a single change: the type testing stub for Pointer<NativeType>. Originally, it actually checked the instance type arguments of the instance:

Code for stub 'TypeTestingStub_dart_ffi_Pointer__dart_ffi_NativeType' (type = Type: Pointer<NativeType>): {
0x0    03b00993               li s3, 59
0x4    00157713               andi tmp2, a0, 1
0x8        c709               beqz tmp2, +10
0xa    fff56983               lwu s3, -1(a0)
0xe    00c9d993               srli s3, s3, 0xc
        ;; Checking for cid 71 (Pointer)
0x12    04700713               li tmp2, 71
0x16    08e99863               bne s3, tmp2, +144
0x1a    00f53903               ld s2, 15(a0)
0x1e    09a90463               beq s2, null, +136
        ;; Generating check for type argument 0: NativeType
0x22    02793a03               ld s4, 39(s2)
0x26    fffa6983               lwu s3, -1(s4)
0x2a    00c9d993               srli s3, s3, 0xc
0x2e    03000713               li tmp2, 48
0x32    06e99a63               bne s3, tmp2, +116
        ;; Checks for Type
0x36    00fa6983               lwu s3, 15(s4)
0x3a    0039f993               andi s3, s3, 3
0x3e    06098463               beqz s3, +104
0x42    00fa6983               lwu s3, 15(s4)
0x46    0049d993               srli s3, s3, 0x4
0x4a    0af00713               li tmp2, 175
0x4e    04e98b63               beq s3, tmp2, +86

But after the change to RareType() to instantiate to bounds, it was simplified to just a cid check, as NativeType is the bound on type parameter for Pointer and thus all Pointer instances should have a subtype of NativeType as their argument:

Code for stub 'TypeTestingStub_dart_ffi_Pointer__dart_ffi_NativeType' (type = Type: Pointer<NativeType>): {
0x0    00157713               andi tmp2, a0, 1                   
0x4    00070b63               beqz tmp2, +22                     
0x8    fff56983               lwu s3, -1(a0)
0xc    00c9d993               srli s3, s3, 0xc
        ;; Checking for cid 71 (Pointer)
0x10    04700713               li tmp2, 71
0x14    00e99363               bne s3, tmp2, +6
0x18        8082               ret
0x1a    2604b703               ld tmp2, 608(thr)                 
0x1e        8702               jr tmp2
}
ObjectPool len:0 {
}

That's the only difference I see, and that difference should be fine. This should only fail (in that the check would succeed though it shouldn't) if an invalid instantiation of Pointer (either to dynamic, thus using the null type arguments vector, or a type arguments vector that contains some non-NativeType) happens somewhere.

@sstrickl
Copy link
Contributor Author

sstrickl commented Jul 4, 2023

Actually, I just noticed another change in the generated code: the first version used LoadClassIdMayBeSmi and then compared against kPointerCid, but the second version actually checks for Smi and fails immediately before doing LoadClassId. But that isn't a change in semantics (since kSmiCid != kPointerCid) and shouldn't cause an issue either.

I wonder if somehow an unboxed ffi function pointer with an odd address is somehow getting sent to the TTS... but that shouldn't be the case, as RISCV instructions should be aligned either on 32-bit or 16-bit (for some extensions) boundaries. Hmm.

@sstrickl
Copy link
Contributor Author

sstrickl commented Jul 4, 2023

Oh, checking the results page, this change did actually have some positive improvements for the same arch: ffi/function_callbacks_many_test/4 and ffi/function_callbacks_many_test/6 both went from Crash -> Pass. So apparently whatever was failing on the slow path when --stacktrace-every=100 was used started working, and whatever was working on the fast path started failing 🙃

@dcharkes
Copy link
Contributor

dcharkes commented Jul 4, 2023

The tests succeed on all other archs: https://dart-current-results.web.app/#/filter=ffi/function_callbacks_many_test

So apparently whatever was failing on the slow path when --stacktrace-every=100 was used started working, and whatever was working on the fast path started failing 🙃

@rmacnak-google implemented the FFI on RISC-V.

We don't support everything in RISC-V yet (#48645), but this test does not exercise those signatures.

Another thing that's different on RISC-V is that TMP and TMP2 conflict with argument registers.

But this might be a completely different failure.

@sstrickl
Copy link
Contributor Author

sstrickl commented Jul 4, 2023

Yeah, I'm just concerned about it if there's some really subtle bug that somehow is only getting exercised in very rare conditions :/

But I've been looking at it off and on today while doing other stuff (especially because ReleaseXRISCV64 takes a good while to build) and I just can't see anything obviously wrong here, especially how it succeeds consistently locally but fails on the bot consistently...?

@sstrickl
Copy link
Contributor Author

sstrickl commented Jul 4, 2023

At least the FileIOSink benchmarks started working again randomly since, so it may not have been this CL's fault and we're just down to the RISCV64 weirdness.

@dcharkes
Copy link
Contributor

dcharkes commented Jul 5, 2023

Cross linking:

ffi/function_callbacks_many_test/4 Crash (expected Pass)
ffi/function_callbacks_many_test/6 Crash (expected Pass)

copybara-service bot pushed a commit that referenced this issue Jul 5, 2023
Previously, the FFI transformer could produce is checks where the
type to check against was Pointer<dynamic>. However, given that
the Pointer class is defined as:

abstract class Pointer<X extends NativeType> ...

the instantiated to bounds version of its type is Pointer<NativeType>.
Pointer<dynamic> is not a subtype of Pointer<NativeType>, and thus is an
invalid instantiation, but the only place this type could occur was as
the right hand side of an is check.

Before 7cc005e, Class::RareType() returned the class instantiated with
the null (all-dynamic) type arguments vector. Among other things, this
"rare" type was compared to the right-hand side of is checks and, if it
matched, performed a simple (cid-only) check of the instance type
arguments in unoptimized code.

Afterwards, Class::RareType() returns the class instantiated with a type
arguments vector where each type parameter is instantiated to bounds, so
now the "rare" type check fails and it falls back to the full check of
the instance type arguments, which causes a ~25% regression in some
unoptimized benchmarks.

This CL fixes the generation of those is checks in the FFI transformer
to use the instantiated to bounds version of the Pointer type instead.

TEST=pkg/front_end/test

Issue: #52843
Issue: #52848
Cq-Include-Trybots: luci.dart.try:vm-ffi-qemu-linux-release-riscv64-try,vm-ffi-qemu-linux-release-arm-try,vm-aot-linux-debug-x64-try,vm-linux-debug-x64-try
Change-Id: Ic9ac6d75ba2743e233065444fad13ab098094349
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/312400
Reviewed-by: Daco Harkes <dacoharkes@google.com>
Reviewed-by: Erik Ernst <eernst@google.com>
Auto-Submit: Tess Strickland <sstrickl@google.com>
Commit-Queue: Tess Strickland <sstrickl@google.com>
@rmacnak-google
Copy link
Contributor

Passing at top-of-tree. Probably fixed incidentally during other RISC-V fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. gardening
Projects
None yet
Development

No branches or pull requests

3 participants