Android emulators fail to boot before accepting Helix jobs #754

greenEkatherine · 2021-11-01T08:52:15Z

This is a new issue to track two old ones. It mostly affects old emulators with API 21-24 (https://github.com/dotnet/core-eng/issues/13359) but also seen on new emulators - less frequently (#689)

It seems that emulator is hanging, I'm trying to pass different parameters on start to be sure that it has enough space, memory and use hardware properly. If it doesn't help, work around will be to kill VM in bad shape and rerun on the new one.

greenEkatherine · 2021-11-01T09:35:36Z

I've noticed that this issue reproduced during first runs but after some time all machines in my PR's have emulators up and running. It seems that emulators there require more time to starts and want to learn how long it takes.

tkapin · 2021-11-01T10:20:02Z

It seems that emulator is hanging, I'm trying to pass different parameters on start to be sure that it has enough space, memory and use hardware properly.

@greenEkatherine could you please elaborate on this?

premun · 2021-11-01T13:07:17Z

@greenEkatherine I have some questions as I don't understand the space much:

Do we boot the simulator using some commands after the machine starts or do we create a VM with already running simulator and once the VM starts the emulator is ready?
What exactly do we do to validate the emulator is running in OSOB? How do we validate in OSOB pipeline that when we make changes to Android queues, the emulators are running?
Do you think it makes sense we should add some ValidateHelix step that would run something on the Android queues?

greenEkatherine · 2021-11-01T14:38:46Z

It seems that emulator is hanging, I'm trying to pass different parameters on start to be sure that it has enough space, memory and use hardware properly.

@greenEkatherine could you please elaborate on this?

There are many parameters to set up https://developer.android.com/studio/run/emulator-commandline
For example -use-system-libs is able to speed up emulator - based on other developers' experience.

greenEkatherine · 2021-11-01T14:42:45Z

@greenEkatherine I have some questions as I don't understand the space much:

Do we boot the simulator using some commands after the machine starts or do we create a VM with already running simulator and once the VM starts the emulator is ready?

What exactly do we do to validate the emulator is running in OSOB? How do we validate in OSOB pipeline that when we make changes to Android queues, the emulators are running?

Do you think it makes sense we should add some ValidateHelix step that would run something on the Android queues?

We run emulator after the VM starts.
In validation script we check that there are two emulators for x86 and x64 (except queue with API 31 where only x64 is availbale) and that their status is not "offline"
Do you mean to do it on every VM? because the issue seems to be flaky so our validation usually passed

premun · 2021-11-02T08:48:48Z

Looking at the code - we seem to be doing some "waiting for emulators to start" in the validation phase. However, we're not doing the same when actually launching the VMs in production (first-run or when starting the systemd service) since validation scripts don't run then. Do you know if there's some difference or reason for that?

akoeplinger · 2021-11-02T14:37:02Z

FWIW -use-system-libs is kinda scary, we shouldn't be using it if the issue is just that the emulator starts more slowly. We don't care if it takes 10mins for a VM+emulator to boot as long as it works :)

I do know that we had issues with the "wait for emulator" code in the past, I wonder if just waiting some hardcoded time on these problematic older API levels would be enough for now?

premun · 2021-11-02T14:48:33Z

I do know that we had issues with the "wait for emulator" code in the past

From the emulator artifact's code it seems that we're not waiting for emulator at all when booting the VM. We only wait for it in validation phase that happens when we're building the image itself (and that's where we've seen the flaky behaviour).

So it might be we just start accepting helix jobs too early.

I do know that we had issues with the "wait for emulator" code in the past, I wonder if just waiting some hardcoded time on these problematic older API levels would be enough for now?

I guess a plan B? :)

akoeplinger · 2021-11-02T14:57:33Z

I think we thought that it'd be handled by xharness' wait-for-device code, but the 5min timeout for that is probably too short:

xharness/src/Microsoft.DotNet.XHarness.Android/AdbRunner.cs

Line 158 in 63e5171

var result = RunAdbCommand("wait-for-device", TimeSpan.FromMinutes(5));

greenEkatherine · 2021-11-03T17:49:48Z

I run 100 jobs and set up timeout up to 30 minutes. Unfortunately there are only 4 machines took jobs out of 23 machines - too few to estimate timeout, but even there I see that it takes 10-15 minutes to run the first jobs on each machine.

I would like to look at the results from bigger number of machines. @premun does WorkItemWaitTime monitoring show it or it's about Helix Queueing?

premun · 2021-11-03T17:56:05Z

Afaik WorkItemWaitTime is Started - Enqueued - so how long it took to pull the work item from the queue after it was enqueued so it is probably not useful here.

You could technically scale up the queue manually in Azure to get more machines and more data - if you don't let them run needlessly long.
But not sure how exact we want to be.

greenEkatherine · 2021-11-04T19:31:26Z

Second attempt show that 19 out of 30 machines take 10+ minutes to run emulators for their first job.

I raised PR and will try to run tests against this queue https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/19540

premun · 2021-11-09T14:27:00Z

@greenEkatherine @fanyang-mono reported a case (0c79eace-2029-4e08-810e-0a2c30cdd3d3) where we see no emulator on the default API (ubuntu.1804.amd64.android.open.svc).

All of the work items processed on the machine in the span of almost 15 minutes have failed on the ADB_ENUMERATION_FAILURE. However, the last work item made several attempts until almost 30 minutes from the first one.

greenEkatherine · 2021-11-12T09:35:51Z

Current state in PR:
Based on the feedback, I moved waiting for device command with 15 minutes timeout into systemd, so it will be called not only during the first run but after reboot as well. I put it into ExecStartPost. If I understand correctly documentation, it will run right after starting emulator but we wait for result before calling helix.service.

I would not comment here if it works - it does not. I send 200 test jobs to check emulators' state and more than a half of them failed. The question is why validation always passed and why service is not indicating a failure?

premun · 2021-12-16T13:50:14Z

I was randomly trying to run XHarness locally and saw this:

It might be possible that the emulators actually start fine and XHarness doesn't detect this well?

premun · 2021-12-16T13:54:36Z

Oh, the emulator was actually frozen..

missymessa · 2022-01-28T20:19:32Z

@premun looks like there's still an open PR for this work. Did we want to do anything with it? https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/19540

premun · 2022-01-31T08:51:14Z

@missymessa yes, this is very much on my radar but I didn't have time to have a look at this yet. It is very important but also quite complex issue and Katya has done quite a lot of investigations in this regard. Some of them are in the PR, so I am keeping it until I have time to have a look. I expect to get to this in about 2 weeks.

premun · 2022-04-13T11:38:04Z

I found more proof that the device can reboot on its own in between android run commands:

Analyzing android/run@x86_64 (0)
2022-04-13T02:45:08.815Z	INFO   	xharness-event-processor.py	event(51)	send	Sending event type _MobileDeviceOperation
2022-04-13T02:45:08.900Z	INFO   	xharness-event-processor.py	saferequests(87)	request_with_retry	Response complete with status code '201'
Analyzing android/run@None (81)
    Encountered DEVICE_NOT_FOUND. This is typically not a failure of the work item. It will be run again. This machine will reboot to heal.
    If this occurs repeatedly, please check for architectural mismatch, e.g. sending arm64_v8a APKs to an x86_64 / x86 only queue.
2022-04-13T02:45:08.901Z	INFO   	xharness-event-processor.py	event(51)	send	Sending event type _MobileDeviceOperation
2022-04-13T02:45:08.936Z	INFO   	xharness-event-processor.py	saferequests(87)	request_with_retry	Response complete with status code '201'

One second it's there, XHarness starts it fine, the next second it's gone and it's not XHarness causing the reboot.

This query shows all first operations that started failing:

Metrics
| where EventType == "_MobileDeviceOperation" and MetricName == "ExitCode" and Timestamp > now()-14d
| extend Dimensions = parse_json(Dimensions)
| extend
    ExitCode = MetricValue,
    Command = tostring(Dimensions.command),
    Platform = tostring(Dimensions.platform),
    Target = tostring(Dimensions.target)
| where Platform == "android"
| join kind=inner WorkItems on WorkItemId
| where MachineName != "" and QueueName != "windows.10.amd64.android.open"
| order by JobId asc, WorkItemId asc, Timestamp asc
| extend PrevExitCode = prev(ExitCode, 1), PrevCommand = prev(Command, 1)
| where PrevExitCode == 0 and ExitCode == 81
| project
    JobId,
    QueueName,
    WorkItemFriendlyName,
    MachineName,
    Target,
    Command,
    PrevCommand,
    ConsoleUri

Some more findings:

There are 22 cases where an emulator went bad in the last 14 days.
I see it happen even between uninstall and install commands which means that the emulator can go bad even outside run commands.
I only see this with runtime tests and not BCL tests (100%).
I found proof in the data of emulator getting healthy after a reboot.
I didn't find a single VM that would have 100% failure rate - it always was working and stopped halfway through

Example of an emulator that went bad half-way through the work item, rebooted and a second attempt of the same workitem completed fine after the machine came back (the same machine picked it up):

I was able to find many examples of the above (sample of the 22 failures from the last 14 days showed this frequently)

tkapin · 2022-04-14T11:39:38Z

Thanks for the detailed analysis, @premun. Having proved the emulators are actually fixed by rebooting is a good news, also the sudden crash by any operation is a strong clue.

premun · 2022-04-14T11:43:01Z

Furthermore, I added some options in case the emulator start-up would get "stalled" (Android term):
https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/22296

I will dig a bit more into the exact moment the emulator stops running before I close this.

premun · 2022-04-19T13:52:11Z

Last piece of investigation was the overall SLA for the emulators. I have only found 4 work items in the last week that failed with on third attempts and none of them for infra reasons.

From the above reasons I conclude that the Android emulators are booting and rebooting fine and I am closing this issue.

greenEkatherine added android Android area infrastructure Connected to Helix infrastructure labels Nov 1, 2021

greenEkatherine mentioned this issue Nov 1, 2021

Failed to find a x86_64 device on ubuntu.1804.amd64.android.open.rt #689

Closed

premun assigned greenEkatherine Nov 3, 2021

greenEkatherine mentioned this issue Nov 3, 2021

Redundant retries on the same machines #749

Closed

premun unassigned greenEkatherine Nov 16, 2021

premun changed the title ~~Some emulators cannot start~~ Android emulators fail to boot before accepting Helix jobs Mar 31, 2022

premun self-assigned this Apr 12, 2022

premun closed this as completed Apr 19, 2022

This was referenced Apr 21, 2022

Android and Apple platform support for .NET 7 / MAUI dotnet/arcade#9136

Closed

Old Android version emulators fail to boot reliably dotnet/arcade#9149

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Android emulators fail to boot before accepting Helix jobs #754

Android emulators fail to boot before accepting Helix jobs #754

greenEkatherine commented Nov 1, 2021 •

edited

Loading

greenEkatherine commented Nov 1, 2021

tkapin commented Nov 1, 2021

premun commented Nov 1, 2021

greenEkatherine commented Nov 1, 2021

greenEkatherine commented Nov 1, 2021

premun commented Nov 2, 2021 •

edited

Loading

akoeplinger commented Nov 2, 2021 •

edited

Loading

premun commented Nov 2, 2021

akoeplinger commented Nov 2, 2021

greenEkatherine commented Nov 3, 2021

premun commented Nov 3, 2021

greenEkatherine commented Nov 4, 2021

premun commented Nov 9, 2021 •

edited

Loading

greenEkatherine commented Nov 12, 2021

premun commented Dec 16, 2021

premun commented Dec 16, 2021

missymessa commented Jan 28, 2022

premun commented Jan 31, 2022 •

edited

Loading

premun commented Apr 13, 2022 •

edited

Loading

tkapin commented Apr 14, 2022

premun commented Apr 14, 2022

premun commented Apr 19, 2022

Android emulators fail to boot before accepting Helix jobs #754

Android emulators fail to boot before accepting Helix jobs #754

Comments

greenEkatherine commented Nov 1, 2021 • edited Loading

greenEkatherine commented Nov 1, 2021

tkapin commented Nov 1, 2021

premun commented Nov 1, 2021

greenEkatherine commented Nov 1, 2021

greenEkatherine commented Nov 1, 2021

premun commented Nov 2, 2021 • edited Loading

akoeplinger commented Nov 2, 2021 • edited Loading

premun commented Nov 2, 2021

akoeplinger commented Nov 2, 2021

greenEkatherine commented Nov 3, 2021

premun commented Nov 3, 2021

greenEkatherine commented Nov 4, 2021

premun commented Nov 9, 2021 • edited Loading

greenEkatherine commented Nov 12, 2021

premun commented Dec 16, 2021

premun commented Dec 16, 2021

missymessa commented Jan 28, 2022

premun commented Jan 31, 2022 • edited Loading

premun commented Apr 13, 2022 • edited Loading

tkapin commented Apr 14, 2022

premun commented Apr 14, 2022

premun commented Apr 19, 2022

greenEkatherine commented Nov 1, 2021 •

edited

Loading

premun commented Nov 2, 2021 •

edited

Loading

akoeplinger commented Nov 2, 2021 •

edited

Loading

premun commented Nov 9, 2021 •

edited

Loading

premun commented Jan 31, 2022 •

edited

Loading

premun commented Apr 13, 2022 •

edited

Loading