Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[Local GC] Fix (another) ScanContext layout issue when building without FEATURE_REDHAWK #14777

Merged
merged 1 commit into from
Nov 1, 2017

Conversation

swgillespie
Copy link

Fundamentally this is the same issue as #14747 - FEATURE_REDHAWK is currently defined for a standalone GC and thus _unused2 is not included in the ScanContext. This presents much differently than the previous issue, though - an out-of-bounds write to a ScanContext on the GC's stack this time clobbers the stack slot that MSVC has previously used to save rdi, which results in gc_heap::relocate_phase trashing rdi when it shouldn't. The function that calls relocate_phase has allocated the constant 0 for rdi, so all uses of the constant 0 (namely, boolean false) become not-false in the function that calls relocate_phase and false literals magically become true.

The end result is that the GC crashes because it attempted to write a (literal) 0 to a global variable, but 0 (rdi) was actually the random value that gc_heap::relocate_phase trashed it with. This non-zero global variable (the mark stack TOS) eventually causes us to read garbage memory.

This PR fixes the issue by actually fixing the underlying layout problem:

0:000> ?? sizeof(clrgc!ScanContext)
unsigned int64 0x38
0:000> ?? sizeof(coreclr!ScanContext)
unsigned int64 0x38
0:000> dt clrgc!ScanContext
   +0x000 thread_under_crawl : Ptr64 Thread
   +0x008 thread_number    : Int4B
   +0x010 stack_limit      : Uint8B
   +0x018 promotion        : Bool
   +0x019 concurrent       : Bool
   +0x020 _unused1         : Ptr64 Void
   +0x028 _unused2         : Ptr64 Void
   +0x030 _unused3         : Int4B
0:000> dt coreclr!ScanContext
   +0x000 thread_under_crawl : Ptr64 Thread
   +0x008 thread_number    : Int4B
   +0x010 stack_limit      : Uint8B
   +0x018 promotion        : Bool
   +0x019 concurrent       : Bool
   +0x020 pCurrentDomain   : Ptr64 AppDomain
   +0x028 pMD              : Ptr64 MethodDesc
   +0x030 dwEtwRootKind    : EtwGCRootKind

After this fix, checked and release standalone GCs pass all functional tests. (This doesn't repro on Debug builds).

I think this one took some time off of my life... @sergiy-k PTAL?

Copy link

@sergiy-k sergiy-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
How can we detect/prevent this type of issues in the future? Would asserts help make sure that the size of the data structures on the EE side matches the size of the same structures on the GC side?

@swgillespie
Copy link
Author

@sergiy-k The ultimate solution to this problem would be to completely ban the use of #ifdefs in the GC interface headers that define the interface. Until then, we could potentially add static asserts on both sides of the ee/gc interface that assert that e.g. ScanContext has the expected size and field offsets. If we have no ifdefs, though, all this does is ensure that the compiler that is compiling us is sane (i.e. conforms to standard c++), which isn't particularly helpful I don't think.

I think I'd be more inclined to just kill off all of the non-#define preprocessor usage in the interface headers directly - it's already problematic and it's not something that we will want to ship with at all precisely because of issues like this one.

@swgillespie swgillespie merged commit 0b86feb into dotnet:master Nov 1, 2017
@swgillespie
Copy link
Author

thanks for the review!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants