shadow memory layout failure on Windows Server 2016 #2328

Closed
derekbruening opened this issue Nov 30, 2020 · 3 comments · Fixed by #2332
@derekbruening (Contributor)

I hit this earlier this year trying to move AppVeyor to VS2017 on Server 2016 in PR #2265 for #2250. Despite Server 2016 being essentially the same as Win10 1607, we never hit this issue on Win10 1607.

Pasting the notes from #2250 here in a new dedicated issue because I am hitting what I think is the same thing trying GitHub Actions on Server 2016.

AppVeyor notes

[00:07:58] ~~Dr.M~~ Running "C:/projects/drmemory/build/build_drmemory-dbg-64/tests/hello.exe"
[00:07:58] ~~Dr.M~~ WARNING: Failed to rename the symcache file.
[00:07:58] ~~Dr.M~~ unhandled application memory @0x00007ff69da307aa
[00:07:58] <Core dump file for application C:\projects\drmemory\build\build_drmemory-dbg-64\tests\hello.exe (4420) created at C:\projects\drmemory\build\build_drmemory-dbg-64\logs\dynamorio\hello.exe.4420.00000000.ldmp>
[00:07:58] ~~Dr.M~~ WARNING: application exited with abnormal code 0xffffffff

The addresses vary slightly but all end in 0x07aa.

It's a shadow memory issue.

new segment: app [0x0000000000000000, 0x0000030000000000), shadow [0x0000070000000000, 0x000007c000000000), reserve [0x000008c000000000, 0x000008f000000000)


get_shared_callstack: created pcs 0x00007ff5d7fbefc0
heap 0 0x000002e025f388c0-0x000002e025f388e0-0x000002e025f38980 0 0x0000002025f30000,0x000002e000000000 a0 0 0
heap 1 0x000002e025f38970-0x000002e025f3897a-0x000002e025f389c0 0 0x0000000a25f30000,0x000002e000000000 46 46 0
set range 0x000002e025f38970-0x000002e025f389b6 => 0x0
        set byte 0x000002e025f389b4
        set byte 0x000002e025f389b5
new pre-us alloc 0x000002e025f38970-0x000002e025f389b6-0x000002e025f389b6
get_shared_callstack: created pcs 0x00007ff5d7fbefc0
heap 1 0x000002e025f389c0-0x000002e025f389ca-0x000002e025f38aa0 0 0x0000000a25f30000,0x000002e000000000 d6 d6 0
set range 0x000002e025f389c0-0x000002e025f38a96 => 0x0
        set byte 0x000002e025f38a94
        set byte 0x000002e025f38a95
new pre-us alloc 0x000002e025f389c0-0x000002e025f38a96-0x000002e025f38a96
get_shared_callstack: created pcs 0x00007ff5d7fbefc0
heap 1 0x000002e025f38aa0-0x000002e025f38ab0-0x000002e025f38cc0 0 0x0000001025f30000,0x000002e000000000 210 210 0
set range 0x000002e025f38aa0-0x000002e025f38cb0 => 0x0
new pre-us alloc 0x000002e025f38aa0-0x000002e025f38cb0-0x000002e025f38cb0
get_shared_callstack: created pcs 0x00007ff5d7fbefc0
heap 0 0x000002e025f38cd0-0x000002e025f38cf0-0x000002e025f38fe0 0 0x0000002025f30000,0x000002e000000000 2f0 0 0
heap 1000 0x000002e025f39000-0x000002e025f39000-0x000002e02602f000 0 0x0000000025f30000,0x000002e000000000 f6000 0 0
walking heap 1 0x000002e025d70000
walking individual heap 0x000002e025d70000
adding heap region 0x000002e025d70000-0x000002e025d80000 arena
adding heap region 0x000002e025d70000-0x000002e025d80000
set heap region 0x000002e025d70000-0x000002e025d80000 Heap to 0x000002e025d70000
heap 2 0x000002e025d70000-0x000002e025d70000-0x000002e025d70720 0 0x000002e025d70740,0x000002e025d80000 720 0 0
heap 0 0x000002e025d70740-0x000002e025d70760-0x000002e025d70fe0 0 0x0000002025d70740,0x000002e000000000 880 0 0
heap 1000 0x000002e025d71000-0x000002e025d71000-0x000002e025d80000 0 0x0000000025d70000,0x000002e000000000 f000 0 0
walking heap 2 0x000002e025ed0000
skipping private heap 0x000002e025ed0000
app PEB is 0x000000fdf0ac9000-0x000000fdf0ac9388
set range 0x000000fdf0ac9000-0x000000fdf0ac9388 => 0x0
add new app segment for [0x000000fdf0ac0000, 0x000000fdf0b00000)
set range 0x000000fdf0ac9080-0x000000fdf0ac90c0 => 0x0
set range 0x000000fdf0ac9240-0x000000fdf0ac9640 => 0x0
set range 0x00007ff6578807aa-0x00007ff6578807ac => 0x0
unhandled application memory @0x00007ff6578807aa

Current shadow scheme:

 *   app1: [0x00000000'00000000, 0x00000300'00000000): exec, heap, data
 *   app2: [0x00007C00'00000000, 0x00008000'00000000): libs
 * 1B-to-1B mapping:
 *   SHDW(app) = (app & 0x00000FFF'FFFFFFFF) + 0x00000700'00000000)
 * and the result:
 *   shdw1 = SHDW(app1): [0x00000700'00000000, 0x00000a00'00000000)
 *   shdw2 = SHDW(app2): [0x00001300'00000000, 0x00001700'00000000)
 * and
 *   shdw1'= SHDW(shdw1): [0x00000e00'00000000, 0x00001100'00000000)
 *   shdw2'= SHDW(shdw2): [0x00000a00'00000000, 0x00000e00'00000000)
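The shadow scheme above can be sanity-checked numerically. This is a minimal sketch (not Umbra's implementation) using the mask and offset constants quoted in the comment; the derived range endpoints match the shdw1/shdw2 and shdw1'/shdw2' values listed:

```python
# 1B-to-1B shadow translation, constants taken from the scheme above.
SHADOW_MASK = 0x00000FFFFFFFFFFF   # keep the low 44 bits of the app address
SHADOW_OFFS = 0x0000070000000000   # shadow displacement

def shdw(app):
    """Translate an application address to its shadow address."""
    return (app & SHADOW_MASK) + SHADOW_OFFS

# app1 (exec/heap/data) and app2 (libs) map to the listed shadow ranges:
assert shdw(0x0000000000000000) == 0x0000070000000000   # shdw1 start
assert shdw(0x00007C0000000000) == 0x0000130000000000   # shdw2 start
# The shadow-of-shadow ranges land where the scheme says they do:
assert shdw(shdw(0x0000000000000000)) == 0x00000E0000000000   # shdw1' start
assert shdw(shdw(0x00007C0000000000)) == 0x00000A0000000000   # shdw2' start
print("shadow scheme ranges check out")
```

Note in particular that 0x00007ff6578807aa lies inside app2, so the scheme itself covers it; the question is why the lookup fails anyway.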

The app PEB at 0x000000fd'f0ac9000-0x000000fd'f0ac9388 falls in the 1st app region, so that is not the problem.

The weird part is the 0x00007ff6578807aa address: it looks like a normal library address. More info is needed: a dump of all the segments.

GitHub Actions notes

2020-11-29T02:54:28.8229930Z 3: ~~Dr.M~~ Dr. Memory version 2.3.18595
2020-11-29T02:54:28.8230734Z 3: ~~Dr.M~~ Running "D:/a/drmemory/drmemory/build_drmemory-dbg-64/tests/hello.exe"
2020-11-29T02:54:28.8231567Z 3: ~~Dr.M~~ WARNING: Failed to rename the symcache file.
2020-11-29T02:54:28.8232310Z 3: ~~Dr.M~~ unhandled application memory @0x00007ff61eda07aa
2020-11-29T02:54:28.8233692Z 3: <Core dump file for application D:\a\drmemory\drmemory\build_drmemory-dbg-64\tests\hello.exe (4448) created at D:\a\drmemory\drmemory\build_drmemory-dbg-64\logs\dynamorio\hello.exe.4448.00000000.ldmp>
2020-11-29T02:54:28.8235094Z 3: ~~Dr.M~~ WARNING: application exited with abnormal code 0xffffffff
@derekbruening (Contributor, Author)

It looks like the address space walk just stops when it hits the top of the 32-bit range:

2020-11-30T15:29:18.2090288Z 6: umbra_address_space_init: 0x0000000000000000-0x0000000015000000
2020-11-30T15:29:18.2091156Z 6: umbra_address_space_init: 0x0000000015000000-0x0000000015001000
2020-11-30T15:29:18.2092101Z 6: add new app segment for [0x0000000015000000, 0x0000000015001000)
<...>
2020-11-30T15:29:18.2284017Z 6: umbra_address_space_init: 0x000000007ffe1000-0x000000007ffee000
2020-11-30T15:29:18.2285192Z 6: add new app segment for [0x000000007ffe1000, 0x000000007ffee000)
2020-11-30T15:29:18.2285779Z 6: umbra_address_space_init: 0x000000007ffee000-0x000000007ffef000
2020-11-30T15:29:18.2287085Z 6: add new app segment for [0x000000007ffee000, 0x000000007ffef000)
2020-11-30T15:29:18.2287460Z 6: TLS shadow base: 0x15e0
2020-11-30T15:29:18.2287759Z 6: shadow_table_init
2020-11-30T15:29:18.2288212Z 6: new segment: app [0x0000000000000000, 0x0000030000000000), shadow [0x0000070000000000, 0x000007c000000000), reserve [0x000008c000000000, 0x000008f000000000)

That causes the high library segment to never be added, so shadow_set_byte() in Dr. Memory gets back UMBRA_SHADOW_MEMORY_TYPE_NOT_SHADOW from umbra_get_shadow_memory(): that is this failure.

So it does not seem to be a shadow mapping scheme problem.
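As a toy illustration of that failure mode (the names below are illustrative, not Umbra's real API): when the address-space walk stops at the 32-bit boundary, the segment table holds only the low app segment, and a lookup for a high library address falls through to a not-shadowed result:

```python
# Toy segment table: only the low segment from the "new segment" log line
# was registered, because the walk stopped at the top of the 32-bit range.
NOT_SHADOW = "UMBRA_SHADOW_MEMORY_TYPE_NOT_SHADOW"  # illustrative name

app_segments = [
    (0x0000000000000000, 0x0000030000000000),  # exec/heap/data
    # the high library segment [0x7C00'00000000, 0x8000'00000000)
    # is missing: it was never added
]

def get_shadow_type(app_addr):
    """Return the shadow type for an app address (toy version)."""
    for lo, hi in app_segments:
        if lo <= app_addr < hi:
            return "SHADOW"
    return NOT_SHADOW  # address is in no registered segment

# hello.exe's code lives up at 0x00007ff6578807aa, so the lookup fails:
assert get_shadow_type(0x00007ff6578807aa) == NOT_SHADOW
assert get_shadow_type(0x0000000000001000) == "SHADOW"
```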

@derekbruening (Contributor, Author)

With custom diagnostics, here is the problematic sequence:

my_query: 0x000000007ffe0000
	AllocBase=7ffe0000; Base=7ffe0000; Size=1000; Type=20000
my_query: 0x000000007ffe0000 => forward query 0x000000007ffe0000
	AllocBase=7ffe0000; Base=7ffe0000; Size=1000; Type=20000
umbra_address_space_init: 0x000000007ffe0000-0x000000007ffe1000
my_query: 0x000000007ffe1000
	AllocBase=7ffe0000; Base=7ffe1000; Size=b000; Type=20000
my_query: 0x000000007ffe1000 => forward query 0x000000007ffe0000
	AllocBase=7ffe0000; Base=7ffe0000; Size=1000; Type=20000
my_query: 0x000000007ffe1000 => forward query 0x000000007ffe1000
	AllocBase=7ffe0000; Base=7ffe1000; Size=b000; Type=20000
umbra_address_space_init: 0x000000007ffe1000-0x000000007ffec000
my_query: 0x000000007ffec000
	AllocBase=7ffec000; Base=7ffec000; Size=1000; Type=20000
my_query: 0x000000007ffec000 => forward query 0x000000007ffec000
	AllocBase=7ffec000; Base=7ffec000; Size=1000; Type=20000
umbra_address_space_init: 0x000000007ffec000-0x000000007ffed000
my_query: 0x000000007ffed000
	AllocBase=7ffe0000; Base=7ffed000; Size=3000; Type=20000
my_query: 0x000000007ffed000 => backward query 0x000000007ffec000
my_query: 0x000000007ffed000 => forward query 0x000000007ffe0000
	AllocBase=7ffe0000; Base=7ffe0000; Size=1000; Type=20000
my_query: 0x000000007ffed000 => forward query 0x000000007ffe1000
	AllocBase=7ffe0000; Base=7ffe1000; Size=b000; Type=20000
my_query: 0x000000007ffed000 => forward query 0x000000007ffec000
	AllocBase=7ffec000; Base=7ffec000; Size=1000; Type=20000
my_query: 0x000000007ffed000 => fail blocks=2
umbra_address_space_init: querying AGAIN 0x000000007ffed000
my_query: 0x000000007ffed000
	AllocBase=7ffe0000; Base=7ffed000; Size=3000; Type=20000
my_query: 0x000000007ffed000 => backward query 0x000000007ffec000
my_query: 0x000000007ffed000 => forward query 0x000000007ffe0000
	AllocBase=7ffe0000; Base=7ffe0000; Size=1000; Type=20000
my_query: 0x000000007ffed000 => forward query 0x000000007ffe1000
	AllocBase=7ffe0000; Base=7ffe1000; Size=b000; Type=20000
my_query: 0x000000007ffed000 => forward query 0x000000007ffec000
	AllocBase=7ffec000; Base=7ffec000; Size=1000; Type=20000
my_query: 0x000000007ffed000 => fail blocks=2
ERROR: umbra_address_space_init failed for 0x000000007ffed000
ASSERT FAILURE (thread 172): ..\drmemory\drmemory.c:1971: false (failed to initialize Umbra)

So AllocationBase is 7ffe0000 (for 0-1 and 1-c), then 7ffec000 (for c-d), but then 7ffed000 has its AllocationBase back at 7ffe0000 again!
The failure occurs because the forward loop bails if AllocationBase changes.

This seems like a kernel bug, or at least a wart: that AllocationBase sequence is bizarre.
It has to be fixed in DR's query routine.
Even if we special-case this Umbra walk, we later hit other problems with the DR query failing:

2020-12-06T04:33:29.9631892Z   ~~Dr.M~~ ASSERT FAILURE (thread 3696): ..\drmemory\leak.c:1062:
2020-12-06T04:33:29.9632582Z   IF_WINDOWS_ELSE(info.type == DR_MEMTYPE_ERROR_WINKERNEL, false)
2020-12-06T04:33:29.9633142Z   (dr_query_memory_ex failed)

derekbruening added a commit to DynamoRIO/dynamorio that referenced this issue Dec 6, 2020
Uses the empirically-determined syscall numbers even for OS versions
we have in our static table, for win10-1511+.  We have seen variants
that our current version logic is not aware of, and we're using the
from-wrapper numbers on all very-recent versions anyway.

Issue: #4587, DynamoRIO/drmemory#2328
Fixes #4587
derekbruening added a commit to DynamoRIO/dynamorio that referenced this issue Dec 6, 2020
Removes what was thought of as a sanity check for the allocation base
changing in DR's internal query loop, but it turns out there are cases
of anomalous bases for which failing the query has disastrous
consequences.  Just ignoring the anomaly and moving on is the
solution.

Issue: #4588, DynamoRIO/drmemory#2328
Fixes #4588
@derekbruening (Contributor, Author)

The DynamoRIO/dynamorio#4589 link was meant for #2329.

derekbruening added a commit that referenced this issue Dec 6, 2020
Updates DR to 312d24d3 to pull in two key fixes for Dr. Memory on
Github Actions Windows Server 2016:

+ DynamoRIO/dynamorio#4588: Handle anomalous alloc bases in Windows
  query loop (DynamoRIO/dynamorio#4590)

+ DynamoRIO/dynamorio#4587: Use from-wrapper syscall numbers for all
  win10 (DynamoRIO/dynamorio#4589)

Fixes #2328
Fixes #2329
derekbruening added a commit that referenced this issue Dec 12, 2020
Sets up 3 jobs on Windows Server 2016 on Github Actions:
+ 32-bit debug build and tests
+ 64-bit debug build and tests
+ 32-bit and 64-bit release build plus drheapstat debug build, no tests

Adds parameters to runsuite.cmake for the control split into these 3 jobs.
Each job is roughly under 15 minutes.
Each job downloads and installs ninja, doxygen, and WiX and uses VS2017.

Adapts runsuite_wrapper.pl to use native Windows perl rather than Cygwin perl.  No fork is available, so we tee the output to a file and read the file back in.
Fixes a missed final line in the suite results processing.

Adds default suppressions for invalid heap arguments (#2339) and leaks (#2340) in the VS2017 CRT.

Generalizes test output for cs2bug and the suppress* tests in an attempt to get them to pass.

Augments umbra_address_space_init() with a better error message to avoid cases like #2328 in the future.

Shrinks the timeout to 60s for the failing umbra_client_faulty_redzone to eliminate several minutes of testing time just waiting for that test to time out; #2341 covers fixing it.

Augments the test ignore list in order to get the jobs green.

Issue: #2328, #2334, #2339, #2340, #2341