Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh #59296

Closed
VincentBu opened this issue Sep 18, 2021 · 11 comments · Fixed by #72517
Closed
Assignees
Labels
arch-x64 area-Tracing-mono disabled-test The test is disabled in source code against the issue os-mac-os-x macOS aka OSX
Milestone

Comments

@VincentBu
Copy link
Contributor

VincentBu commented Sep 18, 2021

Run: runtime 20210917.69

Failed test:

Mono OSX x64 Release no_tiered_compilation @ OSX.1014.Amd64.Open

- tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh

Error message:

cmdLine:/private/tmp/helix/working/A9650978/w/9D2F08BB/e/tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 9/17/2021 4:28:20 PM, end: 9/17/2021 4:38:20 PM)

Return code:      -100
Raw output file:      /tmp/helix/working/A9650978/w/9D2F08BB/uploads/Reports/tracing.eventpipe/providervalidation/providervalidation/providervalidation.output.txt
Raw output:
BEGIN EXECUTION
/tmp/helix/working/A9650978/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false providervalidation.dll ''
0.0s: ==TEST STARTING==
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87918-1631914792-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-9161-1631920607-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-97399-1631917987-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-9405-1631920756-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-93688-1631916093-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-77966-1631909899-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-82885-1631912712-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87780-1631914677-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87286-1631914493-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-9771-1631920964-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87249-1631914489-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-93691-1631916093-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-78800-1631910942-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87347-1631914498-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87258-1631914490-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87829-1631914730-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-97728-1631918247-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-86874-1631914214-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87266-1631914491-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-98012-1631918835-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-97328-1631917929-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-80870-1631911637-socke


Stack trace
   at tracing_eventpipe._providervalidation_providervalidation_providervalidation_._providervalidation_providervalidation_providervalidation_sh()

Runfo Tracking Issue: tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh

Build Definition Kind Run Name Console Core Dump Test Results Run Client
1892535 runtime PR 72517 Mono Browser wasm Release @ Ubuntu.1804.Amd64.Open console.log runclient.py
1858712 runtime PR 71567 coreclr Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py
1857672 runtime PR 71567 coreclr Linux arm Checked no_tiered_compilation @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump runclient.py

Build Result Summary

Day Hit Count Week Hit Count Month Hit Count
1 1 3
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 18, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented Sep 18, 2021

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

Run: runtime 20210917.69

Failed test:

Mono OSX x64 Release no_tiered_compilation @ OSX.1014.Amd64.Open

- tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh

Error message:

cmdLine:/private/tmp/helix/working/A9650978/w/9D2F08BB/e/tracing/eventpipe/providervalidation/providervalidation/providervalidation.sh Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 9/17/2021 4:28:20 PM, end: 9/17/2021 4:38:20 PM)

Return code:      -100
Raw output file:      /tmp/helix/working/A9650978/w/9D2F08BB/uploads/Reports/tracing.eventpipe/providervalidation/providervalidation/providervalidation.output.txt
Raw output:
BEGIN EXECUTION
/tmp/helix/working/A9650978/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false providervalidation.dll ''
0.0s: ==TEST STARTING==
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87918-1631914792-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-9161-1631920607-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-97399-1631917987-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-9405-1631920756-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-93688-1631916093-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-77966-1631909899-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-82885-1631912712-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87780-1631914677-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87286-1631914493-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-9771-1631920964-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87249-1631914489-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-93691-1631916093-socket
0.2s: Deleted
0.2s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-78800-1631910942-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87347-1631914498-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87258-1631914490-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87829-1631914730-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-97728-1631918247-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-86874-1631914214-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-87266-1631914491-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-98012-1631918835-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-97328-1631917929-socket
0.3s: Deleted
0.3s: Attempting to delete the zombied pipe: /var/folders/1l/16vrmj3n79bf3v35y4tn2_740000gy/T/dotnet-diagnostic-80870-1631911637-socke


Stack trace
   at tracing_eventpipe._providervalidation_providervalidation_providervalidation_._providervalidation_providervalidation_providervalidation_sh()
Author: VincentBu
Assignees: -
Labels:

os-mac-os-x, arch-x64, area-Diagnostics-coreclr, untriaged

Milestone: -

@tommcdon tommcdon added this to the 7.0.0 milestone Sep 18, 2021
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Sep 18, 2021
@tommcdon
Copy link
Member

@josalem

@josalem
Copy link
Contributor

josalem commented Sep 20, 2021

@mikem8361 there is an interesting error in the logs from when it tried to grab a dump:

Attempting to collect crash dump: /cores/crashdump_10731.dmp
Invoking: sudo /tmp/helix/working/A9650978/p/createdump --name "/cores/crashdump_10731.dmp" 10731 --withheap
createdump stdout:
Gathering state for process 10731 

createdump stderr:
CLRDataCreateInstance(ICLRDataEnumMemoryRegions) FAILED 80131c4f

Any idea what happened?

There is not dump or artifact besides the logs indicating a timeout. I'll see if I can repro this locally and do a quick search in the Helix data to see if this a one-off failure.

@josalem josalem self-assigned this Sep 20, 2021
@mikem8361
Copy link
Member

I have no idea what happened here. This error is returned when the DAC initialization fails for various reasons usually because it can't find/get/initialize the global dac table in the runtime.

@ViktorHofer ViktorHofer added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Sep 30, 2021
@ViktorHofer
Copy link
Member

Converted to a runfo live issue, happened 6 times this week.

@lateralusX
Copy link
Member

lateralusX commented Sep 30, 2021

The 4 failures with core dumps is due to an assert where runtime tries to load an aot version of the assembly running in full aot mode, but since this is not a full aot test lane, there are no aot versions of assemblies:

Failed to load AOT module '/datadisks/disk1/work/A70408DA/w/BEDE09DE/e/baseservices/varargs/varargsupport_r/varargsupport_r.dll.so' ('/datadisks/disk1/work/A70408DA/w/BEDE09DE/e/baseservices/varargs/varargsupport_r/varargsupport_r.dll') in aot-only mode.

so this will fail several different kinds of tests, not just the test tracked by this issue, above is a failure in varargsupport_r.dll for example. For some reason the runtime gets executed in full aot mode, even if these are lanes not using that mode, so nothing have aot:ed the assemblies, causing the assert.

1388663 seems to be unrelated and fails a lot of different tests due failing to install mobile app, could be infra related.

1389780, 1390826 and 1391966 seems to be related to the same issue in rundownvalidation.sh.

I think we could disable the test and create issue, we would need to see if we could get local repro on this, since we don't get any info from Mono on hanged processes on CI, so currently no way to see where it hangs.

@krwq
Copy link
Member

krwq commented Oct 5, 2021

@josalem consider disabling test while investigation is pending

@lateralusX
Copy link
Member

I can do that during the day.

@lateralusX lateralusX added the disabled-test The test is disabled in source code against the issue label Oct 6, 2021
@josalem josalem removed their assignment Mar 29, 2022
@jakobbotsch jakobbotsch removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label May 3, 2022
@lateralusX lateralusX self-assigned this Jul 19, 2022
@lateralusX
Copy link
Member

lateralusX commented Jul 19, 2022

Investigating, having repro on OSX, only reproduce very infrequently (on release builds), been able to track it down to be related to the file streaming stack hash map and duplications of entries not correctly detected for this specific hash map (using custom hash key and compare functions). Hash map implementation is runtime specific, so likely a Mono specific issue.

lateralusX added a commit to lateralusX/runtime that referenced this issue Jul 20, 2022
As observed by dotnet#59296, EventPipe
streaming thread could infrequently cause an infinite loop on Mono
when cleaning up stack hash map, ep_rt_stack_hash_remove_all called
from ep_file_write_sequence_point, flushing buffer memory into file stream.

Issue only occurred on Release builds and so far, only observed on OSX,
and reproduced in 1 of around 100 runs of the test suite.

After debugging the assembler when hitting the hang, it turns out that
one item in the hash map has a hash key, that doesn't correspond
to its hash bucket, this scenario should not be possible
since items get placed into buckets based on hash key value that
doesn't change for the lifetime of the item. This indicates that
there is some sort of corruption happening to the key, after it
has been added to the hash map.

After some more instrumentation it turns out that insert into the
hash map infrequently triggers a replace, but Mono hash table used in
EventPipe is setup to insert without replace, meaning it will keep old
key but switch and free old value. Stack has map uses same memory
for its key and value, so freeing the old value will also free the key,
but since old key is kept, it will point into freed memory and future
reuse of that memory region will cause corruption of the hash table key.

This scenario should not be possible since EventPipe code will only add
to the hash map, if the item is not already in the hash map. After some
further investigation it turns out that the call to ep_rt_stack_hash_lookup
reports false, while call to ep_rt_stack_hash_add for the same key
will hit replace scenario in g_hash_table_insert_replace.
g_hash_table_insert_replace finds item in the hash map, using callbacks for
hash and equal of hash keys. It turns out that the equal callback is defined to return
gboolean, while the callback implementation used in EventPipe is defined to return
bool. gboolean is typed as int32_t on Mono and this is the root cause of the complete issue.
On optimized OSX build (potential on other platforms) the callback will do a memcmp
(updating full eax register) and when returning from callback, callback will only update
first byte of eax register to 0/1, keeping upper bits, so if memcmp returns negative value
or a positive value bigger than first byte, eax will contains garbage in byte 2, 3 and 4,
but since Mono's g_hash_table_insert_replace expects gboolean, it will
look at complete eax content meaning if any of the bits in byte 2, 3 or 4 are still set,
condition will still be true, even if byte 1 is 0, representing false, incorrectly trigger the
replace logic, freeing the old value and key opening up for future corruption of the key,
now reference freed memory.

Fix is to make sure the callback signatures used with hash map callbacks,
match expected signatures of underlying container implementation.

Fix also adds a checked build assert into hash map’s add implementation
on Mono validating that the added key is not already contained in the hash map
enforcing callers to check for existence before calling add on hash map.
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 20, 2022
lateralusX added a commit that referenced this issue Jul 21, 2022
…ad. (#72517)

* Fix infrequent infinite loop on Mono EventPipe streaming thread.

As observed by #59296, EventPipe
streaming thread could infrequently cause an infinite loop on Mono
when cleaning up stack hash map, ep_rt_stack_hash_remove_all called
from ep_file_write_sequence_point, flushing buffer memory into file stream.

Issue only occurred on Release builds and so far, only observed on OSX,
and reproduced in 1 of around 100 runs of the test suite.

After debugging the assembler when hitting the hang, it turns out that
one item in the hash map has a hash key, that doesn't correspond
to its hash bucket, this scenario should not be possible
since items get placed into buckets based on hash key value that
doesn't change for the lifetime of the item. This indicates that
there is some sort of corruption happening to the key, after it
has been added to the hash map.

After some more instrumentation it turns out that insert into the
hash map infrequently triggers a replace, but Mono hash table used in
EventPipe is setup to insert without replace, meaning it will keep old
key but switch and free old value. Stack has map uses same memory
for its key and value, so freeing the old value will also free the key,
but since old key is kept, it will point into freed memory and future
reuse of that memory region will cause corruption of the hash table key.

This scenario should not be possible since EventPipe code will only add
to the hash map, if the item is not already in the hash map. After some
further investigation it turns out that the call to ep_rt_stack_hash_lookup
reports false, while call to ep_rt_stack_hash_add for the same key
will hit replace scenario in g_hash_table_insert_replace.
g_hash_table_insert_replace finds item in the hash map, using callbacks for
hash and equal of hash keys. It turns out that the equal callback is defined to return
gboolean, while the callback implementation used in EventPipe is defined to return
bool. gboolean is typed as int32_t on Mono and this is the root cause of the complete issue.
On optimized OSX build (potential on other platforms) the callback will do a memcmp
(updating full eax register) and when returning from callback, callback will only update
first byte of eax register to 0/1, keeping upper bits, so if memcmp returns negative value
or a positive value bigger than first byte, eax will contains garbage in byte 2, 3 and 4,
but since Mono's g_hash_table_insert_replace expects gboolean, it will
look at complete eax content meaning if any of the bits in byte 2, 3 or 4 are still set,
condition will still be true, even if byte 1 is 0, representing false, incorrectly trigger the
replace logic, freeing the old value and key opening up for future corruption of the key,
now reference freed memory.

Fix is to make sure the callback signatures used with hash map callbacks,
match expected signatures of underlying container implementation.

Fix also adds a checked build assert into hash map’s add implementation
on Mono validating that the added key is not already contained in the hash map
enforcing callers to check for existence before calling add on hash map.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 21, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-Tracing-mono disabled-test The test is disabled in source code against the issue os-mac-os-x macOS aka OSX
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants