[Mono]: Fix infrequent infinite loop on Mono EventPipe streaming thread. #72517

Merged (3 commits), Jul 21, 2022


@lateralusX lateralusX commented Jul 20, 2022

As observed in #59296, the EventPipe streaming thread could infrequently enter an infinite loop on Mono while cleaning up the stack hash map. The loop occurs in ep_rt_stack_hash_remove_all, called from ep_file_write_sequence_point, when flushing buffer memory into the file stream.

The issue only occurred on Release builds, has so far only been observed on OSX, and reproduced in about 1 of 100 runs of the test suite.

After debugging the assembly at the point of the hang, it turns out that one item in the hash map has a hash key that doesn't correspond to its hash bucket. This should not be possible, since items are placed into buckets based on the hash key value, which doesn't change for the lifetime of the item. This indicates that the key is being corrupted after it has been added to the hash map.

After some more instrumentation, it turns out that an insert into the hash map infrequently triggers a replace. The Mono hash table used in EventPipe is set up to insert without replace, meaning it keeps the old key but swaps in the new value and frees the old value. The stack hash map uses the same memory for its key and value, so freeing the old value also frees the key. Since the old key is kept, it points into freed memory, and future reuse of that memory region corrupts the hash table key.
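The insert-without-replace behavior described above can be illustrated with a minimal sketch. This is not Mono's hash table code: `entry_t`, `slot_t` and `slot_insert_keep_key` are hypothetical names, and a single slot stands in for a whole bucket. The only point is to show how keeping the old key while freeing the old value leaves a dangling key when both share one allocation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical entry: the same allocation serves as both key and value. */
typedef struct { int id; } entry_t;

/* Hypothetical single-slot "table", standing in for one hash bucket. */
typedef struct {
    entry_t *key;
    entry_t *value;
} slot_t;

/*
 * Insert-without-replace semantics as described in the text: when the
 * key compares equal to an existing one, keep the old key, free the old
 * value and store the new value. Returns true when the kept key aliased
 * the value that was just freed, i.e. the slot now holds a dangling key.
 */
static bool slot_insert_keep_key(slot_t *slot, entry_t *new_entry,
                                 bool keys_compare_equal)
{
    if (slot->key == NULL || !keys_compare_equal) {
        /* Plain insert: key and value point at the same allocation. */
        slot->key = new_entry;
        slot->value = new_entry;
        return false;
    }

    /* Record the aliasing before freeing, so no freed pointer is inspected. */
    bool key_aliased_old_value = (slot->key == slot->value);

    free(slot->value);        /* frees the memory the kept key points to */
    slot->value = new_entry;  /* slot->key is intentionally left unchanged */
    return key_aliased_old_value;
}
```

In this sketch the caller passes the result of the key comparison in directly, which makes it easy to see that a spurious "equal" result is all it takes to reach the freeing path.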

This scenario should not be possible, since EventPipe code only adds to the hash map if the item is not already in it. After some further investigation, it turns out that the call to ep_rt_stack_hash_lookup reports false, while the call to ep_rt_stack_hash_add for the same key hits the replace scenario in g_hash_table_insert_replace. g_hash_table_insert_replace finds the item in the hash map using callbacks for hashing and comparing hash keys. It turns out that the equal callback is defined to return gboolean, while the callback implementation used in EventPipe is defined to return bool. gboolean is typed as int32_t on Mono, and this is the root cause of the whole issue. On an optimized OSX build (and potentially on other platforms), the callback does a memcmp (updating the full eax register) and, when returning, only updates the first byte of eax to 0/1, keeping the upper bits. If memcmp returns a negative value, or a positive value that doesn't fit in the first byte, eax will contain garbage in bytes 2, 3 and 4. Since Mono's g_hash_table_insert_replace expects gboolean, it looks at the complete eax content, so if any of the bits in bytes 2, 3 or 4 are still set, the condition will be true even if byte 1 is 0 (representing false). This incorrectly triggers the replace logic, freeing the old value and key, and opens up for future corruption of the key, which now references freed memory.
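The register behavior described above can be modeled in portable C without actually calling through a mismatched function pointer (which would be undefined behavior). The sketch below is a simulation only: `eax_after_bool_return`, `read_as_bool` and `read_as_gboolean` are hypothetical helpers, and the assumption is that the callee writes just the low byte of a 32-bit register that still holds the memcmp result.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Models the 32-bit return register after a bool-returning callback:
 * memcmp has filled the whole register, then only the low byte is
 * overwritten with 0 or 1, leaving the upper three bytes untouched.
 */
static uint32_t eax_after_bool_return(int32_t memcmp_result, bool keys_equal)
{
    uint32_t eax = (uint32_t)memcmp_result;
    eax = (eax & 0xFFFFFF00u) | (keys_equal ? 1u : 0u);
    return eax;
}

/* A caller expecting bool only looks at the low byte. */
static bool read_as_bool(uint32_t eax)
{
    return (eax & 0xFFu) != 0;
}

/* A caller expecting a 32-bit gboolean tests the full register. */
static bool read_as_gboolean(uint32_t eax)
{
    return eax != 0;
}
```

With a negative memcmp result such as -1 and unequal keys, the modeled register is 0xFFFFFF00: `read_as_bool` sees false, while `read_as_gboolean` sees true. A small positive result such as 5 leaves the upper bytes clear, so both readers agree. That is consistent with the failure being infrequent.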

The fix is to make sure the callback signatures used with hash map callbacks match the signatures expected by the underlying container implementation. The fix also adds a checked build assert to the hash map’s add implementation on Mono, validating that the added key is not already contained in the hash map and enforcing that callers check for existence before calling add.
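The shape of such a fix might look like the sketch below. It is not the code from this PR: `gboolean_sketch`, `stack_key_sketch`, `table_sketch` and the `table_*` functions are hypothetical stand-ins, and a fixed-size array replaces the real hash table. It shows an equal callback whose return type matches the container's callback typedef, and an add routine that asserts the key is not already present.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for Mono's gboolean, described in the text as int32_t. */
typedef int32_t gboolean_sketch;

/* Callback type the hypothetical container expects for key equality. */
typedef gboolean_sketch (*equal_func_sketch)(const void *a, const void *b);

/* Hypothetical key layout, 20 bytes with no padding. */
typedef struct {
    uint32_t hash;
    uint8_t frames[16];
} stack_key_sketch;

/* Return type matches equal_func_sketch, so the full register is defined. */
static gboolean_sketch stack_key_equal(const void *a, const void *b)
{
    return memcmp(a, b, sizeof(stack_key_sketch)) == 0;
}

#define TABLE_CAPACITY 8

typedef struct {
    const stack_key_sketch *items[TABLE_CAPACITY];
    size_t count;
    equal_func_sketch equal;
} table_sketch;

static gboolean_sketch table_contains(const table_sketch *t,
                                      const stack_key_sketch *key)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->equal(t->items[i], key))
            return 1;
    }
    return 0;
}

/* Add asserts that callers checked for existence first. */
static void table_add(table_sketch *t, const stack_key_sketch *key)
{
    assert(!table_contains(t, key));
    assert(t->count < TABLE_CAPACITY);
    t->items[t->count++] = key;
}
```

Because the callback is assigned to a pointer of the matching type, a signature mismatch like the one described would show up as a compiler diagnostic at the assignment rather than as silent register garbage at run time.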

NOTE: CoreCLR is not affected, since the issue is in the Mono-specific EventPipe layer and the custom hash map callbacks are not even used by CoreCLR. Instead, it uses the underlying C++ hash map, with EventPipeCoreCLRStackHashTraits implementing the needed functionality.

Fixes #59296
Fixes #54801

@lambdageek

Great investigation!

@lateralusX lateralusX merged commit 8c2ef7f into dotnet:main Jul 21, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2022