createdump fails with libunwind assert: Assertion `ip >= di->start_ip && ip < di->end_ip' failed. #64168

Closed
jkotas opened this issue Jan 23, 2022 · 8 comments · Fixed by #64220
Labels
area-PAL-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' untriaged New issue has not been triaged by the area owner

Comments

@jkotas
Member

jkotas commented Jan 23, 2022

createdump: /__w/1/s/src/coreclr/pal/src/libunwind/src/dwarf/Gfind_proc_info-lsb.c:929: int _Uaarch64_dwarf_search_unwind_table(unw_addr_space_t, unw_word_t, unw_dyn_info_t *, unw_proc_info_t *, int, void *): Assertion `ip >= di->start_ip && ip < di->end_ip' failed.
      /root/helix/work/workitem/e/JIT/Regression/JitBlue/GitHub_17777/GitHub_17777/GitHub_17777.sh: line 411:   526 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jan 23, 2022
@ghost

ghost commented Jan 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@jkotas
Member Author

jkotas commented Jan 23, 2022

More context:
#64162 (comment)
#63854 (comment)

@jkotas jkotas added area-PAL-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jan 23, 2022
@jkotas
Member Author

jkotas commented Jan 23, 2022

@janvorli Are you looking into this? We need to get this fixed with high priority since it prevents us from diagnosing CI crashes.

cc @agocke

@janvorli
Member

I was looking into it. The problem is not in libunwind: the runtime asks it to unwind with the instruction pointer in the context set to 0.
Based on the dump from the CI I was looking at, it seems this happened because the loop in LazyMachState::unwindLazyState iterated twice. In the first iteration it unwound to the managed frame, but ExecutionManager::IsManagedCode didn't recognize it as managed for some reason. So it did another unwind, but since it was a managed frame, there was no native unwind info, hence the 0 IP.

Although the OS in the CI has generated a dump for both the createdump and the crashed process that the createdump was attempting to dump, the dump from the crashed process didn't contain memory from the executables like libcoreclr.dylib and so SOS cannot work with it. It took me a while to figure out that it was the reason why SOS didn't work with the dump. I was hoping that SOS stack walking would repro the problem when opening the crashing process dump and I could debug it that way, but that was unfortunately not possible.

Looking at the crashing thread's stack in the dump again, though, I can see that the crash must have happened when only Main was on the stack, so maybe I can simulate it by making the GC always crash at the point where it crashed in the dump.

I am back on it.

@janvorli
Member

So, I was able to make SOS work with dumps that don't contain a copy of libcoreclr.so (dotnet/diagnostics#2827). Now I can repro the issue with the SOS clrstack command.
So far I can see that my theory was incorrect: the unwinding actually returns a zero PC for the first call to PAL_VirtualUnwindOutOfProc in LazyMachState::unwindLazyState in the DAC. The PC/SP in the input context are correct addresses in JIT_NewArr1, so it is not the case, as I had thought, that we unwind correctly to the managed frame and then fail to recognize it as managed.
The problem then seems to be either in libunwind or in the unwind info generated by the compiler.

@janvorli
Member

I believe this is the libunwind commit that broke us:
libunwind/libunwind@a4014f3

I have enabled debugging logs in the libunwind and got this:

 >_Uaarch64_init_mem_validate: using msync to validate memory
 >_Uaarch64_init_remote: (cursor=0x7fcc4e3fc8)
 >_Uaarch64_step: (cursor=0x7fcc4e3fc8, ip=0x0000007f78e64860, cfa=0x0000007fc500e620))
 >_ULaarch64_init_mem_validate: using msync to validate memory
      >_Uaarch64_dwarf_search_unwind_table: lookup IP 0xffffff8033e42694
               >_Uaarch64_dwarf_search_unwind_table: ip=0x7f78e64860, load_offset=0x7fcc4e3678, start_ip=0x325c4c
 >_Uaarch64_dwarf_search_unwind_table: e->fde_offset = a45e4, segbase = 7f78b3eb54, debug_frame_base = 0, fde_addr = 7f78be3138
            >_Uaarch64_dwarf_extract_proc_info_from_fde: FDE @ 0x7f78be3138
               >_Uaarch64_dwarf_extract_proc_info_from_fde: looking for CIE at address 7f78bcec70
               >parse_cie: CIE parsed OK, augmentation = "zPLR", handler=0x7f796222b0
               >_Uaarch64_dwarf_extract_proc_info_from_fde: FDE covers IP 0x7f78e647a0-0x7f78e64d3c, LSDA=0x7f78a46b0c
 >_Uaarch64_step: dwarf_step()=-10
              >is_plt_entry: ip=0x7f78e64860 => 0x913c5108d00024a8 0x92800008f90053e8, ret = 0
  >_Uaarch64_step: fallback
  >_Uaarch64_step: link register (x30) = 0x0000000000000000
 >_Uaarch64_init_remote: (cursor=0x7fcc4e3fc8)
 >_Uaarch64_step: (cursor=0x7fcc4e3fc8, ip=0x0000000000000000, cfa=0x0000007fc500e620))
 >_Uaarch64_step: Invalid address found in the call stack: 0x0
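
The tail of the log shows the step ending at address zero: dwarf_step() returns -10, the fallback takes the link register (x30), which is 0, and the next step rejects that address. As a rough illustration only, the sketch below restates the range condition from the assertion in the issue title, using the FDE range from the log as a stand-in for di->start_ip and di->end_ip (that substitution is an assumption made for illustration). ip_in_range is a hypothetical helper, not a libunwind function.

```python
def ip_in_range(ip, start_ip, end_ip):
    """Mirror of the asserted condition: ip >= start_ip && ip < end_ip."""
    return start_ip <= ip < end_ip

# "FDE covers IP 0x7f78e647a0-0x7f78e64d3c" from the log above.
FDE_START = 0x7F78E647A0
FDE_END = 0x7F78E64D3C

ORIGINAL_IP = 0x7F78E64860  # ip at the first _Uaarch64_step
FALLBACK_IP = 0x0           # ip after falling back to x30

print(ip_in_range(ORIGINAL_IP, FDE_START, FDE_END))  # True
print(ip_in_range(FALLBACK_IP, FDE_START, FDE_END))  # False
```

So the starting IP is inside the function's unwind range, and a zero IP cannot satisfy the condition for any non-empty range that starts above zero.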

Notice the IP in the _Uaarch64_dwarf_search_unwind_table: lookup IP 0xffffff8033e42694. It is a nonsense value. It gets logged here:

Debug (6, "lookup IP 0x%lx\n", (long) (ip - ip_base - di->load_offset));

The above-mentioned commit added the - di->load_offset part.

Dumping the values used in the expression:

(lldb) p/x ip
(unw_word_t) $2 = 0x0000007f78e64860
(lldb) p/x ip_base
(unw_word_t) $3 = 0x0000007f78b3eb54
(lldb) p/x di->load_offset
(unw_word_t) $4 = 0x0000007fe73bc068
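
To make the arithmetic concrete, here is a minimal Python sketch (not libunwind code) that evaluates the logged expression. It models unw_word_t as a 64-bit unsigned integer with a mask and uses the ip, load_offset and start_ip printed in the debug log above. Using the log's segbase as ip_base is an assumption, based on the lldb session showing the same value for ip_base. The name lookup_ip is hypothetical.

```python
MASK64 = (1 << 64) - 1  # model unw_word_t as a 64-bit unsigned integer

def lookup_ip(ip, ip_base, load_offset=0):
    """Evaluate (ip - ip_base - load_offset) with 64-bit wraparound."""
    return (ip - ip_base - load_offset) & MASK64

# Values taken from the libunwind debug log in this comment.
IP = 0x7F78E64860
IP_BASE = 0x7F78B3EB54      # segbase in the log; assumed equal to ip_base
LOAD_OFFSET = 0x7FCC4E3678  # load_offset printed in the log
START_IP = 0x325C4C         # start_ip printed in the log

with_offset = lookup_ip(IP, IP_BASE, LOAD_OFFSET)
without_offset = lookup_ip(IP, IP_BASE)

print(hex(with_offset))            # 0xffffff8033e42694, the value in the log
print(hex(without_offset))         # 0x325d0c
print(without_offset >= START_IP)  # True
```

With the load_offset from the log, the subtraction wraps around and reproduces the logged lookup IP exactly. Without it, the result is a small segbase-relative offset just past the logged start_ip. The lldb values above come with a different load_offset, so they wrap to a different but equally meaningless value.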

@janvorli
Member

The load_offset doesn't seem to match anything sane. Here is the modules list from the dump:

(lldb) target modules list
[  0] C0DFA06C-AC29-9AD8-AF97-C9BF65E78813-571A2E8A 0x000000556c80b000 /mnt/ext/issues/mikem/corerun
      /mnt/ext/issues/mikem/corerun.dbg
[  1] E4E434D2-54B6-3631-6C50-DC0BCBD490BE-18F4F196 0x0000007f79793000 [vdso] (0x0000007f79793000)
[  2] F6581A91 0x0000007f79793000 linux-vdso.so.1 (0x0000007f79793000)
[  3] 1D906D12-95D6-9258-753C-A7F04C686746-2D13AFCD 0x0000007f79752000 /lib/aarch64-linux-gnu/libdl.so.2
[  4] F807FA3D-4C61-2E1D-9671-388DDCC7DBC4-BACB973D 0x0000007f79726000 /lib/aarch64-linux-gnu/libpthread.so.0
[  5] 9D6320AF-BBAC-86C0-54B0-FB6C107DE294-CD405E85 0x0000007f79592000 /usr/lib/aarch64-linux-gnu/libstdc++.so.6
[  6] 7255B6FD-4BA4-34F0-71A1-07418F67B63F-C0651D23 0x0000007f794d9000 /lib/aarch64-linux-gnu/libm.so.6
[  7] 2E5C7930-BB7A-275E-F213-8E4F506BB060-97420024 0x0000007f794b5000 /lib/aarch64-linux-gnu/libgcc_s.so.1
[  8] EA28B57F-9A8F-3C27-ADA8-A651BF2ACC67-20DA97F7 0x0000007f7935c000 /lib/aarch64-linux-gnu/libc.so.6
[  9] 67CFC48F-A54F-3D86-72FD-33C9E2A108D9-7E51752E 0x0000007f78999000 /mnt/ext/issues/mikem/libcoreclr.so
      /mnt/ext/issues/mikem/libcoreclr.so.dbg
[ 10] DDC0634C-F3E0-6B7B-8771-4C2C3F53E8CC-3417E080 0x0000007f78982000 /lib/aarch64-linux-gnu/librt.so.1
[ 11] 78560972-A7EB-872D-E472-92A057B89919-54ACAE97 0x0000007efe8bb000 /mnt/ext/issues/mikem/libclrjit.so
      /mnt/ext/issues/mikem/libclrjit.so.dbg
[ 12] 2EDA2314-16BD-2ED5-276E-164BD51BF606-70C54495 0x0000007efdea5000 /usr/lib/aarch64-linux-gnu/libicuuc.so.60
[ 13] EF461BA4-00A9-989C-CEF8-72D1FF3865C5-226AD997 0x0000007efc4ec000 /usr/lib/aarch64-linux-gnu/libicudata.so.60
[ 14] 08D89F5D-7FFA-DD86-FBF1-43BD86DB1CD3-7C03D32E 0x0000007efc22e000 /usr/lib/aarch64-linux-gnu/libicui18n.so.60

I am trying to investigate where that offset came from. There are a number of places in libunwind that set the load_offset.
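
A quick way to see why the value looks wrong is to compare it against the module list. The sketch below is only an illustration: it copies a few base addresses from the lldb output above and the load_offset from the earlier lldb session, and it assumes 4 KiB pages. The dictionary and variable names are hypothetical.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages

# A subset of the base addresses from "target modules list".
MODULE_BASES = {
    "corerun": 0x556C80B000,
    "[vdso]": 0x7F79793000,
    "libc.so.6": 0x7F7935C000,
    "libcoreclr.so": 0x7F78999000,
    "libclrjit.so": 0x7EFE8BB000,
}

LOAD_OFFSET = 0x7FE73BC068  # di->load_offset from the lldb session

matches = [name for name, base in MODULE_BASES.items() if base == LOAD_OFFSET]
page_aligned = LOAD_OFFSET % PAGE_SIZE == 0
above_all = LOAD_OFFSET > max(MODULE_BASES.values())

print(matches)       # []
print(page_aligned)  # False
print(above_all)     # True
```

The value matches no module base, is not page-aligned, and lies above every listed base address, so it does not look like a load address of any of these modules.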

@janvorli
Member

I've found the culprit, preparing PR...

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 24, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 24, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Feb 24, 2022