Android emulators fail to boot before accepting Helix jobs #754
Comments
I've noticed that this issue reproduces during the first runs, but after some time all machines in my PRs have emulators up and running. It seems that the emulators there need more time to start, and I want to learn how long it takes.
@greenEkatherine could you please elaborate on this?
@greenEkatherine I have some questions as I don't understand the space much:
There are many parameters to set up: https://developer.android.com/studio/run/emulator-commandline
Looking at the code, we seem to be doing some "waiting for emulators to start" in the validation phase. However, we're not doing the same when actually launching the VMs in production (on first run or when starting the systemd service), since the validation scripts don't run then. Do you know if there's some difference or reason for that?
FWIW I do know that we had issues with the "wait for emulator" code in the past. I wonder if just waiting some hardcoded time on these problematic older API levels would be enough for now?
From the emulator artifact's code it seems that we're not waiting for the emulator at all when booting the VM. We only wait for it in the validation phase, which happens when we're building the image itself (and that's where we've seen the flaky behaviour). So it might be that we just start accepting Helix jobs too early.
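To make the "wait before accepting jobs" idea concrete, here is a minimal sketch of a boot wait that could run when the VM starts. It is not the code from the emulator artifact. It assumes `adb` is on the PATH and polls the `sys.boot_completed` property; the function names, the 30-minute default and the 10-second poll interval are all hypothetical choices for illustration.

```python
import subprocess
import time


def boot_completed(serial=None):
    """Return True if the emulator reports sys.boot_completed == 1.

    Assumes `adb` is on the PATH; `serial` is an optional device serial.
    """
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]
    cmd += ["shell", "getprop", "sys.boot_completed"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, OSError):
        # adb hung or is not installed: treat as "not booted yet".
        return False
    return result.returncode == 0 and result.stdout.strip() == "1"


def wait_for_boot(check=boot_completed, timeout_s=1800.0, poll_s=10.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns True or `timeout_s` elapses.

    The defaults are assumptions, not measured values. `clock` and `sleep`
    are injectable so the loop can be exercised without a real emulator.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A start-up script would call `wait_for_boot()` and only let the machine pick up work once it returns `True`; on `False` the machine could be flagged instead of taking jobs it cannot run.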
I guess a plan B? :)
I think we thought that it'd be handled by xharness' wait-for-device code, but the 5-minute timeout for that is probably too short.
I ran 100 jobs with the timeout raised to 30 minutes. Unfortunately, only 4 of the 23 machines took jobs, which is too few to estimate a timeout, but even there I see that it takes 10-15 minutes to run the first jobs on each machine. I would like to look at results from a larger number of machines. @premun does
AFAIK, you could technically scale up the queue manually in Azure to get more machines and more data, as long as you don't let them run needlessly long.
The second attempt shows that 19 out of 30 machines take 10+ minutes to start emulators for their first job. I raised a PR and will try to run tests against this queue: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/19540
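For picking a timeout from measurements like these, a rough sketch of the arithmetic could look like the following. The helper names are hypothetical, and the sample durations in the usage note are made up purely for illustration; they are not the numbers from the runs above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def slow_fraction(values, threshold):
    """Fraction of values at or above `threshold`."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if v >= threshold) / len(values)
```

With illustrative first-job boot times in minutes such as `[4, 12, 9, 15, 11]`, `slow_fraction(times, 10)` gives the share of machines needing 10+ minutes, and `percentile(times, 95)` gives a candidate lower bound for the boot timeout.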
@greenEkatherine @fanyang-mono reported a case ( All of the work items processed on the machine in the span of almost 15 minutes have failed on the
Oh, the emulator was actually frozen.
@premun looks like there's still an open PR for this work. Did we want to do anything with it? https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/19540
@missymessa yes, this is very much on my radar, but I haven't had time to look at it yet. It is a very important but also quite complex issue, and Katya has done quite a lot of investigation in this regard. Some of it is in the PR, so I am keeping it open until I have time to look. I expect to get to this in about 2 weeks.
Thanks for the detailed analysis, @premun. Having proved that the emulators are actually fixed by rebooting is good news, and the sudden crash on any operation is a strong clue.
Furthermore, I added some options in case the emulator start-up gets "stalled" (Android term). I will dig a bit more into the exact moment the emulator stops running before I close this.
The last piece of investigation was the overall SLA for the emulators. I have only found 4 work items in the last week that failed on their third attempt, and none of them for infra reasons. For the reasons above, I conclude that the Android emulators are booting and rebooting fine, and I am closing this issue.
This is a new issue to track two old ones. It mostly affects old emulators with API 21-24 (https://github.com/dotnet/core-eng/issues/13359) but is also seen on new emulators, less frequently (#689).
It seems that the emulator is hanging. I'm trying to pass different parameters on start to make sure that it has enough space and memory and uses hardware properly. If that doesn't help, the workaround will be to kill the VM in bad shape and rerun on a new one.
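As a rough sketch of how a "VM in bad shape" might be detected for that workaround: run a trivial command against the emulator with a hard timeout and treat a hang or a failure as unresponsive. The probe command `adb shell echo ok`, the helper names and the attempt count are assumptions for illustration only, not what the Helix machines actually do.

```python
import subprocess

# Hypothetical probe: any cheap command that needs a live emulator would do.
ADB_PROBE = ["adb", "shell", "echo", "ok"]


def is_responsive(cmd, timeout_s=30.0):
    """Return True if `cmd` exits with code 0 within `timeout_s` seconds."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # Hung command or missing binary both count as "not responsive".
        return False
    return result.returncode == 0


def pick_action(cmd, attempts=3, timeout_s=30.0):
    """Return "keep" if any probe succeeds, otherwise "recycle"."""
    for _ in range(attempts):
        if is_responsive(cmd, timeout_s):
            return "keep"
    return "recycle"
```

A caller could run `pick_action(ADB_PROBE)` before handing out a work item and, on `"recycle"`, take the VM out of rotation so the job is rerun on a fresh machine.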