System library build of Julia master fails on ARM macOS #44435

Closed
fxcoudert opened this issue Mar 3, 2022 · 47 comments
Labels
system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips

Comments

@fxcoudert
Contributor

We're trying to build Julia 1.7.2 in Homebrew (Homebrew/homebrew-core#96194) but it has several failures. The first one is this:

System library symlink failure: Unable to locate libgcc_s.1.dylib on your system!

On ARM macOS, using the GCC branch from Iain Sandoe (https://github.com/iains/gcc-darwin-arm64), the name for the libgcc_s library should be libgcc_s.1.1.dylib: version is 1.1, not 1, and not 2, as is in master:

julia/Make.inc

Line 1477 in fb4118a

LIBGCC_NAME := libgcc_s.2.$(SHLIB_EXT)

@fxcoudert
Contributor Author

fxcoudert commented Mar 3, 2022

(I'm documenting my own attempts at understanding the build logic here, to help fix the issue…)

I am missing something in the logic. Why is Make.inc defining LIBGCC_NAME and trying to find the right suffix, when base/Makefile hardcodes it?

$(eval $(call symlink_system_library,CSL,libgcc_s,1))

And given that libquadmath is not present on all targets, how is this expected to work?

$(eval $(call symlink_system_library,CSL,libquadmath,0))

@Keno
Member

Keno commented Mar 3, 2022

We're trying to build Julia 1.7.2

macOS ARM is not stable on the 1.7 branch. Even if you get it building, we would not recommend shipping it to users.

@Keno
Member

Keno commented Mar 3, 2022

On ARM macOS, using the GCC branch from Iain Sandoe (https://github.com/iains/gcc-darwin-arm64), the name for the libgcc_s library should be libgcc_s.1.1.dylib: version is 1.1, not 1, and not 2, as is in master:

@staticfloat - Are we using .2 for our CSL? Maybe we have an older version of the GCC branch?

@ViralBShah ViralBShah added the system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips label Mar 3, 2022
@giordano
Contributor

giordano commented Mar 4, 2022

@fxcoudert
Contributor Author

The commit iains/gcc-darwin-arm64@ccc57f4 is from November 2020.

That explains it. It's an old, experimental version of the toolchain, and the dylib version was changed in later development.

That still does not explain why Make.inc defines LIBGCC_NAME and tries to find the right suffix, when base/Makefile hardcodes it to 1.

@giordano
Contributor

giordano commented Mar 4, 2022

That explains it. It's an old, experimental version of the toolchain, and the dylib version was changed in later development.

If we wanted to update our toolchain, which branch should we use? The README seems to suggest using master-wip-arm-prep, but that's almost a year old now.

@fxcoudert
Contributor Author

If we wanted to update our toolchain, which branch should we use?

The current branch is master-wip-apple-si, and it is relatively stable nowadays. Every one or two weeks, it gets rebased onto the latest GCC master, and the previous rebase is put into a master-wip-apple-si-on-xxxx branch.

Right now the latest rebase is at
https://github.com/iains/gcc-darwin-arm64/tree/master-wip-apple-si-on-5a9ba3f27f3

@fxcoudert
Contributor Author

fxcoudert commented Mar 4, 2022

Even if we bypass the libgcc shared version number issue, we get another error, even using the master branch:

  error during bootstrap:
  LoadError(at "compiler/compiler.jl" line 3: LoadError(at "compiler/bootstrap.jl" line 8: BoundsError(a=Module, i=4294967296)))
  jl_bounds_error at /private/tmp/julia-20220303-14329-16dvxx4/julia-1.7.2/usr/lib/libjulia-internal.1.7.dylib (unknown line)
  get_fieldtype at /private/tmp/julia-20220303-14329-16dvxx4/julia-1.7.2/usr/lib/libjulia-internal.1.7.dylib (unknown line)

Do you have any idea what could cause this? I'm not seeing this in any of the reports here…

@giordano
Contributor

giordano commented Mar 4, 2022

As mentioned above, Julia on aarch64 macOS has a known serious bug (namely #41440) which causes very frequent segmentation faults. That bug has been fixed as part of the upgrade to LLVM 13 and it'll be included in the next minor version v1.8.0, but it can't be backported to the v1.7 series. This means that even if you manage to build Julia v1.7 for that platform you'll have many other runtime issues.

That said, if you're building against system libraries whose versions differ from those tested in the official binaries, you're quite likely to run into new bugs that no one else has faced before.

@DilumAluthge DilumAluthge changed the title Compiling Julia on ARM macOS fails System library build of Julia 1.7.2 fails on ARM macOS Mar 4, 2022
@fxcoudert
Contributor Author

@DilumAluthge the bug reported in the title still exists on master

@giordano I'm sorry, I should have added to the bootstrap error that we also see that one on latest master sources. How can we debug it?

@DilumAluthge DilumAluthge changed the title System library build of Julia 1.7.2 fails on ARM macOS System library build of Julia master fails on ARM macOS Mar 5, 2022
@giordano
Contributor

giordano commented Mar 6, 2022

I'm sorry, I should have added to the bootstrap error that we also see that one on latest master sources. How can we debug it?

I don't know how to debug the error during bootstrap, but I'm wondering which version of LLVM you are using?

@fxcoudert
Contributor Author

I'm wondering which version of LLVM you are using

LLVM 13.0.1

@fxcoudert
Contributor Author

fxcoudert commented Mar 6, 2022

OK, redoing the build on today's master with PR #44484 applied, I now get:

 cd /private/tmp/julia-20220306-4216-xj58a5/base && /private/tmp/julia-20220306-4216-xj58a5/usr/bin/julia -C "apple-a12" --output-ji /private/tmp/julia-20220306-4216-xj58a5/usr/lib/julia/corecompiler.ji.tmp --startup-file=no --warn-overwrite=yes -g0 -O0 compiler/compiler.jl
error during bootstrap:
LoadError(at "compiler/compiler.jl" line 3: LoadError(at "compiler/bootstrap.jl" line 10: InexactError(func=:trunc, T=Int32, val=4294967295)))

make[2]: *** [/private/tmp/julia-20220306-4216-xj58a5/usr/lib/julia/corecompiler.ji] Error 1
make[1]: *** [julia-sysimg-ji] Error 2
make: *** [/private/tmp/julia-20220306-4216-xj58a5/doc/_build/html/en/index.html] Error 2

The command-line is:

make VERBOSE=1 USE_BINARYBUILDER=0 prefix=/opt/homebrew/Cellar/julia/HEAD-487757b sysconfdir=/opt/homebrew/etc USE_SYSTEM_CSL=1 USE_SYSTEM_LLVM=1 USE_SYSTEM_LIBUNWIND=1 USE_SYSTEM_PCRE=1 USE_SYSTEM_OPENLIBM=1 USE_SYSTEM_BLAS=1 USE_SYSTEM_LAPACK=1 USE_SYSTEM_GMP=1 USE_SYSTEM_MPFR=1 USE_SYSTEM_LIBSUITESPARSE=1 USE_SYSTEM_UTF8PROC=1 USE_SYSTEM_MBEDTLS=1 USE_SYSTEM_LIBSSH2=1 USE_SYSTEM_NGHTTP2=1 USE_SYSTEM_CURL=1 USE_SYSTEM_LIBGIT2=1 USE_SYSTEM_PATCHELF=1 USE_SYSTEM_ZLIB=1 USE_SYSTEM_P7ZIP=1 LIBBLAS=-lopenblas LIBBLASNAME=libopenblas LIBLAPACK=-lopenblas LIBLAPACKNAME=libopenblas USE_BLAS64=0 PYTHON=python3 MACOSX_VERSION_MIN=12 install

I'll keep the build intact. Can someone please let me know how to obtain a backtrace, or otherwise debug this, so I can provide more information?

@giordano
Contributor

giordano commented Mar 6, 2022

The soversion of libgcc_s will be fixed by #44487.

@fxcoudert
Contributor Author

Thanks @giordano
As far as I can tell, the dylib version of libgcc_s needs to be changed in three places:

@giordano
Contributor

giordano commented Mar 6, 2022

Good catch, thanks! However, in base/Makefile I don't think we use the soversion, do we? At least, there wasn't a 2 before 🤔

@fxcoudert
Contributor Author

fxcoudert commented Mar 6, 2022

In $(eval $(call symlink_system_library,CSL,libgcc_s,1)) I think 1 is the version.

It's hard for me to have a clear global view of the build system, but we have found in Homebrew builds that it is important to have 1.1 there, otherwise it fails and says it couldn't find libgcc_s.1.dylib.

@giordano
Contributor

giordano commented Mar 6, 2022

Ok, I was confused because there wasn't a 2 before, but that's only used when not using libraries from BinaryBuilder, so it's likely an untested codepath. I now see that the third argument of symlink_system_library goes into the second argument of

julia/Make.inc

Lines 579 to 581 in 487757b

define versioned_libname
$$(if $(2),$(1).$(2).$(SHLIB_EXT),$(1).$(SHLIB_EXT))
endef

Ok, I can't fix it right now; I'll have a look later. BTW, we can continue this discussion in the PR directly.
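
For readers less familiar with Make, the expansion of the versioned_libname macro quoted above can be sketched outside the build system. This is only a minimal Python mirror of that macro, not code from Julia's build; the Python function name and the "dylib" default for SHLIB_EXT are assumptions made for illustration.

```python
def versioned_libname(name: str, version: str = "", shlib_ext: str = "dylib") -> str:
    """Mirror of the Make macro: name.version.ext when a version is given, else name.ext."""
    # Make's $(if ...) treats any non-empty string as true, and so does Python here
    # (note that the string "0" is non-empty, so it counts as a version).
    if version:
        return f"{name}.{version}.{shlib_ext}"
    return f"{name}.{shlib_ext}"


# Third argument as hardcoded in base/Makefile, versus the soversion reported
# for recent GCC builds on ARM macOS in this thread.
print(versioned_libname("libgcc_s", "1"))     # libgcc_s.1.dylib
print(versioned_libname("libgcc_s", "1.1"))   # libgcc_s.1.1.dylib
print(versioned_libname("libquadmath", "0"))  # libquadmath.0.dylib
```

With "1" as the third argument the build looks for libgcc_s.1.dylib, which matches the file name in the symlink failure reported at the top of this issue.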

@giordano
Contributor

giordano commented Mar 6, 2022

@fxcoudert for the bootstrap issue, this is the makefile rule that is failing:

julia/sysimage.mk

Lines 60 to 64 in 3bcab39

$(build_private_libdir)/corecompiler.ji: $(COMPILER_SRCS)
	@$(call PRINT_JULIA, cd $(JULIAHOME)/base && \
	$(call spawn,$(JULIA_EXECUTABLE)) -C "$(JULIA_CPU_TARGET)" --output-ji $(call cygpath_w,$@).tmp \
		--startup-file=no --warn-overwrite=yes -g0 -O0 compiler/compiler.jl)
	@mv $@.tmp $@

You may want to plug gdb/lldb in there. I believe the command being executed is basically (assuming the working directory is base/)

../usr/bin/julia -C "native" --output-ji corecompiler.ji.tmp --startup-file=no --warn-overwrite=yes -g0 -O0 compiler/compiler.jl

In particular, the offending line in the bootstrap process is

time() = ccall(:jl_clock_now, Float64, ())

But it's weird: your error message mentions Int32, and I don't see an Int32 there.

@fxcoudert
Contributor Author

@giordano yes, I've run that julia command under lldb and set breakpoints on jl_error and jl_throw, but they don't get triggered. I'm not sure how to get a backtrace.

In particular, the offending line in the bootstrap process is

I don't think it's that one, because I can comment it out (and all calls to time below) and still get the same error.

@giordano
Contributor

giordano commented Mar 6, 2022

Uhm, that line was mentioned in the error you posted at #44435 (comment).

@fxcoudert
Contributor Author

I don't know Julia, but it mentioned line 10, not 8.

@giordano
Contributor

giordano commented Mar 6, 2022

I read

LoadError(at "compiler/bootstrap.jl" line 10: InexactError(func=:trunc, T=Int32, val=4294967295))

@giordano
Contributor

giordano commented Mar 6, 2022

Wait, I think there is a mix up here. What branch are you looking at? master or v1.7.2? I see compiler/bootstrap.jl is a bit different between the two.

@fxcoudert
Contributor Author

I'm running master, and the error message is about line 10, which contains only the keyword let.

@vtjnash
Member

vtjnash commented Mar 7, 2022

In lldb/gdb the function name is ijl_throw

@fxcoudert
Contributor Author

Sadly, a lot of frames have no debug info 😢

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x0000000101036c24 libjulia-internal.1.9.dylib`ijl_throw
    frame #1: 0x0000000137e74118
    frame #2: 0x000000010177409c
    frame #3: 0x0000000137ecc0cc
    frame #4: 0x0000000107d100f8
    frame #5: 0x00000001309e8324
    frame #6: 0x000000013093c2d8
    frame #7: 0x00000001308f9ce4
    frame #8: 0x00000001308b9488
    frame #9: 0x000000013086e5e4
    frame #10: 0x0000000130818f7c
    frame #11: 0x000000013080c074
    frame #12: 0x000000010e5a212c
    frame #13: 0x000000010e55a864
    frame #14: 0x000000010e5480e0
    frame #15: 0x000000010e53c214
    frame #16: 0x000000010e48c1d0
    frame #17: 0x0000000130b3ccf4
    frame #18: 0x0000000130adee18
    frame #19: 0x00000001308bd88c
    frame #20: 0x000000013086e5e4
    frame #21: 0x0000000130818f7c
    frame #22: 0x000000013080c074
    frame #23: 0x000000010e5a212c
    frame #24: 0x000000010e55a864
    frame #25: 0x000000010e5480e0
    frame #26: 0x000000010e53c214
    frame #27: 0x000000010e48c1d0
    frame #28: 0x0000000130b3ccf4
    frame #29: 0x0000000130adee18
    frame #30: 0x00000001308bd88c
    frame #31: 0x000000013086e5e4
    frame #32: 0x0000000130818f7c
    frame #33: 0x000000013080c074
    frame #34: 0x000000010e5a212c
    frame #35: 0x000000010e55a864
    frame #36: 0x000000010e5480e0
    frame #37: 0x000000010e53c214
    frame #38: 0x000000010e48c1d0
    frame #39: 0x0000000130b3ccf4
    frame #40: 0x0000000130adee18
    frame #41: 0x00000001308bd88c
    frame #42: 0x000000013086e5e4
    frame #43: 0x0000000130818f7c
    frame #44: 0x000000013080c074
    frame #45: 0x000000010e5a212c
    frame #46: 0x000000010e559764
    frame #47: 0x000000010e5480e0
    frame #48: 0x000000010e53c214
    frame #49: 0x000000010e48c1d0
    frame #50: 0x000000010e1d41f4
    frame #51: 0x00000001080ac8d8
    frame #52: 0x0000000107f8dc3c
    frame #53: 0x000000010104af08 libjulia-internal.1.9.dylib`jl_parse_eval_all + 512
    frame #54: 0x000000010104aca8 libjulia-internal.1.9.dylib`ijl_load_ + 220
    frame #55: 0x0000000100670044
    frame #56: 0x0000000100580064
    frame #57: 0x00000001010334d0 libjulia-internal.1.9.dylib`do_call + 188
    frame #58: 0x0000000101031e8c libjulia-internal.1.9.dylib`eval_body + 1468
    frame #59: 0x00000001010324c0 libjulia-internal.1.9.dylib`jl_interpret_toplevel_thunk + 264
    frame #60: 0x0000000101049f0c libjulia-internal.1.9.dylib`jl_toplevel_eval_flex + 5040
    frame #61: 0x0000000101049a5c libjulia-internal.1.9.dylib`jl_toplevel_eval_flex + 3840
    frame #62: 0x000000010104aa98 libjulia-internal.1.9.dylib`ijl_toplevel_eval_in + 156
    frame #63: 0x0000000100418044
    frame #64: 0x00000001010334d0 libjulia-internal.1.9.dylib`do_call + 188
    frame #65: 0x0000000101031e8c libjulia-internal.1.9.dylib`eval_body + 1468
    frame #66: 0x00000001010324c0 libjulia-internal.1.9.dylib`jl_interpret_toplevel_thunk + 264
    frame #67: 0x0000000101049f0c libjulia-internal.1.9.dylib`jl_toplevel_eval_flex + 5040
    frame #68: 0x000000010104af08 libjulia-internal.1.9.dylib`jl_parse_eval_all + 512
    frame #69: 0x000000010104aca8 libjulia-internal.1.9.dylib`ijl_load_ + 220
    frame #70: 0x000000010104b08c libjulia-internal.1.9.dylib`ijl_load + 104
    frame #71: 0x00000001010698ec libjulia-internal.1.9.dylib`exec_program + 156
    frame #72: 0x00000001010695d0 libjulia-internal.1.9.dylib`true_main + 256
    frame #73: 0x0000000101069484 libjulia-internal.1.9.dylib`jl_repl_entrypoint + 180
    frame #74: 0x0000000100003f9c julia`main + 12
    frame #75: 0x00000001000110f4 dyld`start + 520

@giordano
Contributor

giordano commented Mar 9, 2022

Sadly, a lot of frames have no debug info 😢

For a debug build you need to use JULIA_BUILD_MODE=debug.

Note that to persist this and other options on disk, you can create the file Make.user with the following content:

override JULIA_BUILD_MODE=debug

@fxcoudert
Contributor Author

fxcoudert commented Mar 9, 2022

New backtrace, with JULIA_BUILD_MODE=debug:

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000101039a0c libjulia-internal-debug.1.9.dylib`ijl_throw
    frame #1: 0x0000000147700118
    frame #2: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #3: 0x00000001025680ac
    frame #4: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #5: 0x00000001477580dc
    frame #6: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #7: 0x000000010ef28108
    frame #8: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #9: 0x0000000140a5c324
    frame #10: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #11: 0x0000000140a500ec
    frame #12: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #13: 0x00000001409a4314
    frame #14: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #15: 0x0000000140961d14
    frame #16: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #17: 0x00000001409214b8
    frame #18: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #19: 0x00000001408d2618
    frame #20: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #21: 0x0000000140878fac
    frame #22: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #23: 0x000000014086c084
    frame #24: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #25: 0x000000014080a168
    frame #26: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #27: 0x00000001407c288c
    frame #28: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #29: 0x00000001407b00f0
    frame #30: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #31: 0x00000001407a4228
    frame #32: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #33: 0x00000001406f4200
    frame #34: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #35: 0x0000000140bb0d24
    frame #36: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #37: 0x0000000140b52e54
    frame #38: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #39: 0x00000001409258bc
    frame #40: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #41: 0x00000001408d2618
    frame #42: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #43: 0x0000000140878fac
    frame #44: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #45: 0x000000014086c084
    frame #46: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #47: 0x000000014080a168
    frame #48: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #49: 0x00000001407c288c
    frame #50: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #51: 0x00000001407b00f0
    frame #52: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #53: 0x00000001407a4228
    frame #54: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #55: 0x00000001406f4200
    frame #56: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #57: 0x0000000140bb0d24
    frame #58: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #59: 0x0000000140b52e54
    frame #60: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #61: 0x00000001409258bc
    frame #62: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #63: 0x00000001408d2618
    frame #64: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #65: 0x0000000140878fac
    frame #66: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #67: 0x000000014086c084
    frame #68: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #69: 0x000000014080a168
    frame #70: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #71: 0x00000001407c288c
    frame #72: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #73: 0x00000001407b00f0
    frame #74: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #75: 0x00000001407a4228
    frame #76: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #77: 0x00000001406f4200
    frame #78: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #79: 0x0000000140bb0d24
    frame #80: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #81: 0x0000000140b52e54
    frame #82: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #83: 0x00000001409258bc
    frame #84: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #85: 0x00000001408d2618
    frame #86: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #87: 0x0000000140878fac
    frame #88: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #89: 0x000000014086c084
    frame #90: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #91: 0x000000014080a168
    frame #92: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #93: 0x00000001407c178c
    frame #94: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #95: 0x00000001407b00f0
    frame #96: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #97: 0x00000001407a4228
    frame #98: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #99: 0x00000001406f4200
    frame #100: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #101: 0x0000000140438224
    frame #102: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #103: 0x00000001402c4908
    frame #104: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #105: 0x00000001401a5c50
    frame #106: 0x000000010104f054 libjulia-internal-debug.1.9.dylib`jl_toplevel_eval_flex + 5424
    frame #107: 0x000000010105027c libjulia-internal-debug.1.9.dylib`jl_parse_eval_all + 564
    frame #108: 0x000000010104ffe8 libjulia-internal-debug.1.9.dylib`ijl_load_ + 220
    frame #109: 0x0000000100670044
    frame #110: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #111: 0x0000000100580074
    frame #112: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #113: 0x0000000101036130 libjulia-internal-debug.1.9.dylib`do_call + 188
    frame #114: 0x0000000101034768 libjulia-internal-debug.1.9.dylib`eval_body + 1732
    frame #115: 0x0000000101034f20 libjulia-internal-debug.1.9.dylib`jl_interpret_toplevel_thunk + 296
    frame #116: 0x000000010104f0d4 libjulia-internal-debug.1.9.dylib`jl_toplevel_eval_flex + 5552
    frame #117: 0x000000010104eb74 libjulia-internal-debug.1.9.dylib`jl_toplevel_eval_flex + 4176
    frame #118: 0x000000010104fdd0 libjulia-internal-debug.1.9.dylib`ijl_toplevel_eval_in + 156
    frame #119: 0x0000000100418044
    frame #120: 0x000000010101d968 libjulia-internal-debug.1.9.dylib`ijl_apply_generic + 1136
    frame #121: 0x0000000101036130 libjulia-internal-debug.1.9.dylib`do_call + 188
    frame #122: 0x0000000101034768 libjulia-internal-debug.1.9.dylib`eval_body + 1732
    frame #123: 0x0000000101034f20 libjulia-internal-debug.1.9.dylib`jl_interpret_toplevel_thunk + 296
    frame #124: 0x000000010104f0d4 libjulia-internal-debug.1.9.dylib`jl_toplevel_eval_flex + 5552
    frame #125: 0x000000010105027c libjulia-internal-debug.1.9.dylib`jl_parse_eval_all + 564
    frame #126: 0x000000010104ffe8 libjulia-internal-debug.1.9.dylib`ijl_load_ + 220
    frame #127: 0x0000000101050418 libjulia-internal-debug.1.9.dylib`ijl_load + 104
    frame #128: 0x0000000101071d5c libjulia-internal-debug.1.9.dylib`exec_program + 156
    frame #129: 0x0000000101071a40 libjulia-internal-debug.1.9.dylib`true_main + 256
    frame #130: 0x00000001010718f4 libjulia-internal-debug.1.9.dylib`jl_repl_entrypoint + 180
    frame #131: 0x0000000100003f9c julia-debug`main + 12
    frame #132: 0x00000001000110f4 dyld`start + 520

@maleadt
Member

maleadt commented Mar 9, 2022

For a debug build you need to use JULIA_BUILD_MODE=debug.

That, or setting ENABLE_GDBLISTENER:

julia/src/codegen.cpp

Lines 8202 to 8212 in 6e8804b

    // Register GDB event listener
#if defined(JL_DEBUG_BUILD)
    jl_using_gdb_jitevents = 1;
# else
    const char *jit_gdb = getenv("ENABLE_GDBLISTENER");
    if (jit_gdb && atoi(jit_gdb)) {
        jl_using_gdb_jitevents = 1;
    }
#endif
    if (jl_using_gdb_jitevents)
        jl_ExecutionEngine->enableJITDebuggingSupport();

Sadly, this isn't available with JITLink on LLVM 13:

# warning "JIT debugging (GDB integration) not available on LLVM < 14.0 (for JITLink)"

So IIUC it's expected that the JIT frames in the backtrace above are not annotated.

@fxcoudert
Contributor Author

The error is:

LoadError(at "compiler/compiler.jl" line 3: LoadError(at "compiler/bootstrap.jl" line 10: InexactError(func=:trunc, T=Int32, val=4294967295)))

There is no trunc on line 10 of compiler/bootstrap.jl:

Is there no way to ask Julia to tell me what trunc function is the problem, where it's called from, etc? How is Julia code debugged??

@giordano
Contributor

giordano commented Mar 9, 2022

How is Julia code debugged??

For general usage there are different tools (see repositories in the JuliaDebug organisation, for example Debugger.jl and Infiltrator.jl), but I don't think you can use them that early in the bootstrap phase of julia itself. Unfortunately you're seeing the error at a very inconvenient point. And as Tim mentioned above, LLVM isn't cooperating either. Perhaps there are other ways to see what's going on here, but I haven't played much with julia bootstrapping.

@fxcoudert
Contributor Author

Perhaps there are other ways to see what's going on here, but I haven't played much with julia bootstrapping.

Do you know who would? I would literally run anything to understand what happens at this stage.

@staticfloat
Member

Inside of lldb, can you try running:

p (void)jl_gdblookup(<frame address>)

@fxcoudert
Contributor Author

fxcoudert commented Mar 9, 2022

@staticfloat thanks. I'm not seeing much more info:

(lldb) p (void)jl_gdblookup(0x0000000101039a0c)
ijl_throw at /private/tmp/julia-20220309-37202-13b9kqy/usr/lib/libjulia-internal-debug.1.9.dylib (unknown line)
(lldb) p (void)jl_gdblookup(0x0000000137f00118)
unknown function (ip: 0x137f00118)
(lldb) p (void)jl_gdblookup(0x000000010101d968)
ijl_apply_generic at /private/tmp/julia-20220309-37202-13b9kqy/usr/lib/libjulia-internal-debug.1.9.dylib (unknown line)
(lldb) p (void)jl_gdblookup(0x00000001020b40ac)
unknown function (ip: 0x1020b40ac)
(lldb) p (void)jl_gdblookup(0x0000000137f580dc)
unknown function (ip: 0x137f580dc)

… and all the way up to:

(lldb) p (void)jl_gdblookup(0x000000010104f054)
jl_toplevel_eval_flex at /private/tmp/julia-20220309-37202-13b9kqy/usr/lib/libjulia-internal-debug.1.9.dylib (unknown line)

@staticfloat
Member

staticfloat commented Mar 9, 2022

Ideas for further progress:

  • Build against a HEAD/prerelease version of LLVM, to hopefully get more complete debugging information
  • Use printf debugging in compiler/bootstrap.jl to narrow down where the trunc() error is happening. The value you're getting is 0xffffffff, so I imagine somewhere we have a UInt32 that really should be an Int32, and when we try to convert the UInt32 to an Int32 (which is what generates the trunc call) Julia freaks out because the UInt32 actually was a -1 from some C code somewhere. So we likely need to change some ccall() signature somewhere to use Int32 instead of UInt32 in the first place.

You can scatter println(stderr, "point 1") throughout bootstrap.jl, and that should show up during the bootstrap process.
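
The arithmetic behind the 0xffffffff observation can be checked independently of Julia. The following is a small Python sketch of the bit pattern only; it says nothing about where in Julia's code the value originates, and the variable names are made up for illustration.

```python
import struct

val = 4294967295  # the value reported in the InexactError

# The value is the largest 32-bit unsigned integer: all 32 bits set.
print(hex(val))          # 0xffffffff
print(val == 2**32 - 1)  # True

# Reinterpreting those same 32 bits as a signed integer gives -1.
as_int32 = struct.unpack("<i", struct.pack("<I", val))[0]
print(as_int32)          # -1

# A range-checked conversion to a signed 32-bit integer has to reject the value,
# because the largest representable Int32 is 2**31 - 1.
INT32_MAX = 2**31 - 1
print(val > INT32_MAX)   # True
```

So a -1 written by C code and read back as an unsigned 32-bit quantity would show up as exactly this number, which is consistent with the hypothesis above without proving it.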

@fxcoudert
Contributor Author

fxcoudert commented Mar 9, 2022

Should have thought of printf, it never fails!

It crashes on the first call of typeinf_type at

typeinf_type(interp, m.method, Tuple{typ...}, m.sparams)

I can print the arguments:

m.method: run_passes(Core.CodeInfo, Core.Compiler.OptimizationState, Core.Compiler.InferenceResult)
typ: Array{Any, (4,)}[
  typeof(Core.Compiler.run_passes),
  Core.CodeInfo,
  Core.Compiler.OptimizationState,
  Core.Compiler.InferenceResult]
m.sparams: svec()

I added print statements to the function typeinf_type, like this:

# compute (and cache) an inferred AST and return the inferred return type
function typeinf_type(interp::AbstractInterpreter, method::Method, @nospecialize(atype), sparams::SimpleVector)
    println(stderr, "point 1")
    if contains_is(unwrap_unionall(atype).parameters, Union{})
        return Union{} # don't ask: it does weird and unnecessary things, if it occurs during bootstrap
    end
    println(stderr, "point 2")
    mi = specialize_method(method, atype, sparams)::MethodInstance
    println(stderr, "point 3")
    for i = 1:2 # test-and-lock-and-test
        i == 2 && ccall(:jl_typeinf_begin, Cvoid, ())
        code = get(code_cache(interp), mi, nothing)
        if code isa CodeInstance
            # see if this rettype already exists in the cache
            i == 2 && ccall(:jl_typeinf_end, Cvoid, ())
            return code.rettype
        end
    end
    println(stderr, "point 4")
    result = InferenceResult(mi)
    println(stderr, "point 5")
    typeinf(interp, result, :global)
    println(stderr, "point 6")
    ccall(:jl_typeinf_end, Cvoid, ())
    println(stderr, "point 7")
    result.result isa InferenceState && return nothing
    println(stderr, "point 8")
    return widenconst(ignorelimited(result.result))
end

I reach 5, but not 6, so the issue is apparently in typeinf. I instrumented that function like this:

# Wrapper around _typeinf that optionally records the exclusive time for each invocation.
function typeinf(interp::AbstractInterpreter, frame::InferenceState)
    println(stderr, "point A")
    if __measure_typeinf__[]
        println(stderr, "point X")
        Timings.enter_new_timer(frame)
        v = _typeinf(interp, frame)
        Timings.exit_current_timer(frame)
        println(stderr, "point Y")
        return v
    else
        println(stderr, "point B")
        v = _typeinf(interp, frame)
        println(stderr, "point C")
        return v
    end
end

and there it gets weird, because:

  • this function is called multiple times, from a single call of typeinf_type
  • all paths go through A and B, but only some calls go through C?!

point 1
point 2
point 3
point 4
point 5
point A
point B
point A
point B
point A
point B
point A
point B
point C
point A
point B
point A
point B
[… many more lines …]
point A
point B
point C
error during bootstrap:
LoadError(at "compiler/compiler.jl" line 3: LoadError(at "compiler/bootstrap.jl" line 10: InexactError(func=:trunc, T=Int32, val=4294967295)))

So typeinf calls _typeinf, and that function recurses (although I can't see where that happens).
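
One way to read the log above is that the calls are nested: an inner call prints C only when it returns normally, and an exception raised in an outer frame skips that frame's C and every C above it. Below is a toy Python model of that pattern. It is not Julia's inference code; the call graph, the function names ending in _sketch, and the failing step are all invented for illustration.

```python
# Hypothetical call graph: f needs g inferred first, and g needs h.
CALLEES = {"f": ["g"], "g": ["h"], "h": []}
# Hypothetical: the work done for g fails after its callee h has been handled.
FAILING = {"g"}

trace = []


def typeinf_sketch(name: str) -> None:
    """Wrapper that logs A and B before the real work, and C only on normal return."""
    trace.append("A")
    trace.append("B")
    _typeinf_sketch(name)
    trace.append("C")


def _typeinf_sketch(name: str) -> None:
    """Recurse into callees first, then do this frame's own (possibly failing) work."""
    for callee in CALLEES[name]:
        typeinf_sketch(callee)
    if name in FAILING:
        raise OverflowError(f"conversion failed while processing {name}")


try:
    typeinf_sketch("f")
except OverflowError:
    trace.append("error")

print(" ".join(trace))  # A B A B A B C error
```

The innermost call completes and logs C, then the error is raised in its caller, so the two outer frames never reach C. That reproduces the shape of the tail of the log (A, B, C, then the error).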

@fxcoudert
Contributor Author

fxcoudert commented Mar 9, 2022

typeinf calls _typeinf, which calls typeinf_nocycle, which calls typeinf_local, which calls abstract_eval_statement, which calls abstract_call, which calls abstract_call_known, then abstract_call_gf_by_type, then find_matching_methods, which in turn calls findall here:

matches = findall(atype, method_table; limit = max_methods)

In this call, atype is Tuple{typeof(Core.Compiler.rem), Int64, Type{Int64}}

findall calls _findall, which calls _methods_by_ftype.

@staticfloat
Member

Yes, type inference will call typeinf() multiple times. Type inference, after all, can depend on the result of other type inference.

It's curious to me that we actually pass through Point C and then error before hitting Point 6. So there must be something after we return from typeinf() in the stack frame where we hit C that causes the problem.

Do you get anything useful out of show(stderr, "text/plain", stacktrace())? I'm not certain if this requires the debugging information that we seem to be lacking for you:

julia> show(stderr, "text/plain", stacktrace())
13-element Vector{Base.StackTraces.StackFrame}:
 top-level scope at REPL[18]:1
 eval at boot.jl:368 [inlined]
 eval_user_input(ast::Any, backend::REPL.REPLBackend) at REPL.jl:151
 repl_backend_loop(backend::REPL.REPLBackend) at REPL.jl:247
 start_repl_backend(backend::REPL.REPLBackend, consumer::Any) at REPL.jl:232
 run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool) at REPL.jl:369
 run_repl(repl::REPL.AbstractREPL, consumer::Any) at REPL.jl:356
 (::Base.var"#960#962"{Bool, Bool, Bool})(REPL::Module) at client.jl:419
 #invokelatest#2 at essentials.jl:729 [inlined]
 invokelatest at essentials.jl:727 [inlined]
 run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool) at client.jl:404
 exec_options(opts::Base.JLOptions) at client.jl:318
 _start() at client.jl:522

@fxcoudert
Contributor Author

fxcoudert commented Mar 9, 2022

I tried stacktrace, but got: UndefVarError(var=:stacktrace)

But after more than an hour of printing stuff, I've found the trace myself. It's:

function _methods_by_ftype(@nospecialize(t), mt::Union{Core.MethodTable, Nothing}, lim::Int, world::UInt, ambig::Bool, min::Ref{UInt}, max::Ref{UInt}, has_ambig::Ref{Int32})
    println(stderr, "### in ###")
    tmp = ccall(:jl_matching_methods, Any, (Any, Any, Cint, Cint, UInt, Ptr{UInt}, Ptr{UInt}, Ptr{Int32}), t, mt, lim, ambig, world, min, max, has_ambig)::Union{Array{Any,1}, Bool}
    println(stderr, "### out ###")
    return tmp
end

called from (see above): typeinf > _typeinf > typeinf_nocycle > typeinf_local > abstract_eval_statement > abstract_call > abstract_call_known > abstract_call_gf_by_type > find_matching_methods > findall > _methods_by_ftype

return ccall(:jl_matching_methods, Any, (Any, Any, Cint, Cint, UInt, Ptr{UInt}, Ptr{UInt}, Ptr{Int32}), t, mt, lim, ambig, world, min, max, has_ambig)::Union{Array{Any,1}, Bool}

I now see "### in ###" printed but not "### out ###", meaning the ccall is throwing.

The arguments to the call are:

t: Tuple{typeof(Core.Compiler.rem), Int64, Type{Int64}}
mt: nothing
lim: 4294967295
ambig: false
world: 0x000000000000130f
min: Core.Compiler.RefValue{UInt64}(x=0x0000000000000000)
max: Core.Compiler.RefValue{UInt64}(x=0xffffffffffffffff)
has_ambig: Core.Compiler.RefValue{Int32}(x=0)

@fxcoudert
Contributor Author

The wrong parameter here (I think) is lim. All previous calls to this function have lim=3. For example, the previous call in the bootstrap, for a related signature, has:

t: Tuple{typeof(Core.Compiler.rem), Int32, Type{Int64}}
mt: nothing
lim: 3
ambig: false
world: 0x000000000000130f
min: Core.Compiler.RefValue{UInt64}(x=0x0000000000000000)
max: Core.Compiler.RefValue{UInt64}(x=0xffffffffffffffff)
has_ambig: Core.Compiler.RefValue{Int32}(x=0)
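
A small sketch of the arithmetic behind the error, runnable in any ordinary Julia session. It rests on two facts: ccall converts each argument to its declared type before calling the C function, and lim is declared as Cint (Int32) in the ccall above. The value 4294967295 is 2^32 - 1, which does not fit in an Int32, so the checked conversion itself raises InexactError. That would match seeing "in" but not "out". The same 32 bits read as a signed integer are -1. Whether a -1 that was widened incorrectly is what happened here is only a guess, not something established in this thread.

```julia
lim = 4294967295            # the value reported for `lim` in the failing call

# 4294967295 is 2^32 - 1, i.e. the largest UInt32.
is_uint32_max = lim == typemax(UInt32)

# A checked conversion to Cint (Int32) cannot represent it and throws.
err = try
    convert(Cint, lim)
catch e
    e
end

# Truncating to 32 bits instead of checking gives the bit pattern of -1.
truncated = lim % Int32
roundtrip = reinterpret(UInt32, Int32(-1))
```

For comparison, `convert(Cint, 3)` succeeds, which is consistent with the earlier calls that used lim=3.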

@Keno
Member

Keno commented Mar 9, 2022

InexactError(func=:trunc, T=Int32, val=4294967295)

This kind of error often happens when there is an LLVM miscompilation. Do you have all our patches applied?

@fxcoudert
Contributor Author

fxcoudert commented Mar 9, 2022

@Keno We (Homebrew) are compiling against LLVM 13.0.1. We're not seeing this issue on Intel, only ARM, so if it's an LLVM issue, it's ARM-specific.

Where can I find the list of patches against LLVM 13.0.1 that you recommend? There's no LLVM-related patch in https://github.com/JuliaLang/julia/tree/master/deps/patches

@Keno
Member

Keno commented Mar 9, 2022

https://github.com/JuliaLang/llvm-project/commits/julia-release/13.x is the list of patches. If I had to guess, I think you may be seeing https://bugs.llvm.org/show_bug.cgi?id=49357 (which we disable in JuliaLang/llvm-project@7627d61). That said, do note that our patch set is not in general appropriate for inclusion in a general LLVM copy provided for other projects (until those patches are upstreamed) and we recommend vendoring a private copy of LLVM for use by Julia.

@fxcoudert
Contributor Author

we recommend vendoring a private copy of LLVM for use by Julia

I know, every project wants its own copy of all dependencies, but for a complete distro things get out of hand quickly. So far we have shipped Julia built against released LLVM, but I'll consider the options available…

@fxcoudert
Contributor Author

fxcoudert commented Mar 12, 2022

We're now trying to build the patched LLVM as part of Julia, but this leads to several new build failures… #44584 (1.8.0-beta1) and #44585 (master)

@fxcoudert
Contributor Author

Closing this: the various parts have been reported separately and fixed in different PRs.
