Tizen armel CI job frequently hangs #9569

Closed
tannergooding opened this issue Jan 18, 2018 · 12 comments
Labels
area-Infrastructure-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' bug
Comments

@tannergooding
Member

If you look at the failures for the armel_cross_checked_tizen_prtest job, you will see that they are frequently caused by timeouts.

In some cases, a job hangs after some tests have hit an 'Unsupported syscall'.

In other cases, a job successfully executes all tests and then stalls with "Perform an action if the job was performed on an Azure VM Agent. is waiting for a checkpoint on dotnet_coreclr » master » armel_cross_checked_tizen_prtest ####" (the job waits for another prtest job, from the same queue and for an unrelated PR, to complete before allowing itself to finish).

The timeout on all of these jobs is currently 4 hours, which can quickly cause the queue to back up or delay PRs for long periods of time.

@tannergooding
Member Author

I would like to see one or more of the following happen, as I believe they may help:

  • Remove whatever is causing the PR jobs to wait for a previous, unrelated PR job to complete
  • Reduce the timeout on these jobs to something like 2 hours, rather than 4 (timeouts start from the time the job actually starts executing, so this shouldn't be an issue)
  • Have the jobs pre-emptively abort after the first test failure
  • Have some monitor that will abort the run if a given test executes for more than some predetermined time
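
Two of the ideas above (aborting after the first failure, and enforcing a per-test time limit) could be sketched roughly as follows. This is a minimal Python sketch, not the actual CI script: `run_tests`, `per_test_timeout_s`, and `stop_on_first_failure` are hypothetical names, and each test is assumed to be a single command line.

```python
import subprocess
import sys


def run_tests(commands, per_test_timeout_s, stop_on_first_failure=True):
    """Run each command in order and return a list of (command, status).

    status is "passed", "failed", or "timed out". All names here are
    hypothetical; this is not how the real CI job is implemented.
    """
    results = []
    for cmd in commands:
        try:
            # subprocess.run kills the child process if the timeout expires.
            # Note: processes spawned by the child are not killed here.
            completed = subprocess.run(cmd, timeout=per_test_timeout_s)
            status = "passed" if completed.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            status = "timed out"
        results.append((cmd, status))
        if status != "passed" and stop_on_first_failure:
            break  # abort the rest of the run early
    return results


if __name__ == "__main__":
    ok = [sys.executable, "-c", "pass"]
    hang = [sys.executable, "-c", "import time; time.sleep(30)"]
    for cmd, status in run_tests([ok, hang, ok], per_test_timeout_s=1):
        print(status, cmd)
```

With a short per-test limit, a single hung test costs at most that limit instead of the full job timeout.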

@tannergooding
Member Author

Similar issues seem to impact the arm_cross_debug_ubuntu_prtest job.

@jkotas
Member

jkotas commented Jan 19, 2018

@RussKeldorph Have we looked into what can be done about this?

@RussKeldorph
Contributor

@hseok-oh @hqueue Is there someone who can look into the armel failures? We're going to have to remove these jobs from PR testing if they can't be made reliable more or less immediately.

@danmoseley
Member

cc @Anipik

@hseok-oh
Contributor

@RussKeldorph You can see that the failures began after dotnet/coreclr#15878 was merged.

Here is the CI's first failed job for pushes to master.
https://ci.dot.net/job/dotnet_coreclr/job/master/job/armel_cross_debug_tizen/2510/

Here is the CI's last successful job for pushes to master.
https://ci.dot.net/job/dotnet_coreclr/job/master/job/armel_cross_debug_tizen/2509/

@RussKeldorph
Contributor

@hseok-oh Thanks. @mikem8361 Can you look into this?

@benaadams
Member

Might be a coincidence, but "qemu: Unsupported syscall:" shows up on Tizen and then the CI jobs start to time out:

13:46:04 Skip preparing for GC stress test. Dependent package is not supported on this architecture.
13:46:05 The tests have been prepared
13:46:06 The tests have been prepared
13:46:06 FAILED   - JIT/Directed/coverage/importer/ldelemnullarr2/ldelemnullarr2.sh
13:46:06   BEGIN EXECUTION
13:46:06   /home/coreclr/bin/tests/Windows_NT.x64.Checked/Tests/coreoverlay/corerun ldelemnullarr2.exe
13:46:06   qemu: Unsupported syscall: 389
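
To check whether the unsupported syscalls line up with the hangs, the console logs could be scanned for that message. The following is a rough Python sketch under that assumption; `count_unsupported_syscalls` is a hypothetical helper, and the only log format it relies on is the "qemu: Unsupported syscall: N" text shown above.

```python
import re
from collections import Counter

# Matches the message as it appears in the console log quoted above.
SYSCALL_RE = re.compile(r"qemu: Unsupported syscall: (\d+)")


def count_unsupported_syscalls(log_lines):
    """Return a Counter mapping syscall number -> number of occurrences."""
    counts = Counter()
    for line in log_lines:
        match = SYSCALL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "13:46:06   BEGIN EXECUTION",
        "13:46:06   qemu: Unsupported syscall: 389",
    ]
    for number, hits in count_unsupported_syscalls(sample).items():
        print(f"syscall {number}: {hits} occurrence(s)")
```

Comparing these counts between jobs that timed out and jobs that finished would show whether the message is a cause or a coincidence.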

@mikem8361
Member

I know nothing about how to debug this on arm. Can we putty to the machine when the tests hang?

Since it is a checked build, can the coreclr logging be enabled (COMPlus_LogEnable env var, etc.)?
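
As an illustration of what enabling that variable for a test run might look like, here is a small Python sketch. It assumes that setting COMPlus_LogEnable to "1" turns logging on in a checked build; `logging_env` is a hypothetical helper, and any further logging settings would have to be supplied by the caller.

```python
import os


def logging_env(base_env=None, extra=None):
    """Return a copy of the environment with CoreCLR logging switched on.

    Assumption: the value "1" enables logging in a checked build.
    `extra` lets the caller add other settings without this sketch
    guessing at their names.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["COMPlus_LogEnable"] = "1"
    env.update(extra or {})
    return env


if __name__ == "__main__":
    env = logging_env()
    print("COMPlus_LogEnable =", env["COMPlus_LogEnable"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a test, so the setting applies only to that process.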

My changes really shouldn't have affected Linux/arm, but you never know.

@jkotas
Member

jkotas commented Jan 25, 2018

This should be fixed now

@jkotas jkotas closed this as completed Jan 25, 2018
@benaadams
Member

Looking at recent PRs, it's still jamming and/or failing with "qemu: Unsupported syscall: 389 - 819 Segmentation fault"

@jkotas
Member

jkotas commented Jan 26, 2018

The unsupported syscall is tracked by #8614

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 18, 2020

8 participants