Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Android emulators fail to boot before accepting Helix jobs #754

Closed
Tracked by #9136
greenEkatherine opened this issue Nov 1, 2021 · 22 comments
Closed
Tracked by #9136

Android emulators fail to boot before accepting Helix jobs #754

greenEkatherine opened this issue Nov 1, 2021 · 22 comments
Assignees
Labels
android Android area infrastructure Connected to Helix infrastructure

Comments

@greenEkatherine
Copy link
Contributor

greenEkatherine commented Nov 1, 2021

This is a new issue to track two old ones. It mostly affects old emulators with API 21-24 (https://github.com/dotnet/core-eng/issues/13359) but also seen on new emulators - less frequently (#689)

It seems that emulator is hanging, I'm trying to pass different parameters on start to be sure that it has enough space, memory and use hardware properly. If it doesn't help, work around will be to kill VM in bad shape and rerun on the new one.

@greenEkatherine
Copy link
Contributor Author

I've noticed that this issue reproduced during first runs but after some time all machines in my PR's have emulators up and running. It seems that emulators there require more time to starts and want to learn how long it takes.

@tkapin
Copy link
Member

tkapin commented Nov 1, 2021

It seems that emulator is hanging, I'm trying to pass different parameters on start to be sure that it has enough space, memory and use hardware properly.

@greenEkatherine could you please elaborate on this?

@premun
Copy link
Member

premun commented Nov 1, 2021

@greenEkatherine I have some questions as I don't understand the space much:

  1. Do we boot the simulator using some commands after the machine starts or do we create a VM with already running simulator and once the VM starts the emulator is ready?
  2. What exactly do we do to validate the emulator is running in OSOB? How do we validate in OSOB pipeline that when we make changes to Android queues, the emulators are running?
  3. Do you think it makes sense we should add some ValidateHelix step that would run something on the Android queues?

@greenEkatherine
Copy link
Contributor Author

It seems that emulator is hanging, I'm trying to pass different parameters on start to be sure that it has enough space, memory and use hardware properly.

@greenEkatherine could you please elaborate on this?

There are many parameters to set up https://developer.android.com/studio/run/emulator-commandline
For example -use-system-libs is able to speed up emulator - based on other developers' experience.

@greenEkatherine
Copy link
Contributor Author

@greenEkatherine I have some questions as I don't understand the space much:

  1. Do we boot the simulator using some commands after the machine starts or do we create a VM with already running simulator and once the VM starts the emulator is ready?
  2. What exactly do we do to validate the emulator is running in OSOB? How do we validate in OSOB pipeline that when we make changes to Android queues, the emulators are running?
  3. Do you think it makes sense we should add some ValidateHelix step that would run something on the Android queues?
  1. We run emulator after the VM starts.
  2. In validation script we check that there are two emulators for x86 and x64 (except queue with API 31 where only x64 is availbale) and that their status is not "offline"
  3. Do you mean to do it on every VM? because the issue seems to be flaky so our validation usually passed

@premun
Copy link
Member

premun commented Nov 2, 2021

Looking at the code - we seem to be doing some "waiting for emulators to start" in the validation phase. However, we're not doing the same when actually launching the VMs in production (first-run or when starting the systemd service) since validation scripts don't run then. Do you know if there's some difference or reason for that?

@akoeplinger
Copy link
Member

akoeplinger commented Nov 2, 2021

FWIW -use-system-libs is kinda scary, we shouldn't be using it if the issue is just that the emulator starts more slowly. We don't care if it takes 10mins for a VM+emulator to boot as long as it works :)

I do know that we had issues with the "wait for emulator" code in the past, I wonder if just waiting some hardcoded time on these problematic older API levels would be enough for now?

@premun
Copy link
Member

premun commented Nov 2, 2021

I do know that we had issues with the "wait for emulator" code in the past

From the emulator artifact's code it seems that we're not waiting for emulator at all when booting the VM. We only wait for it in validation phase that happens when we're building the image itself (and that's where we've seen the flaky behaviour).

So it might be we just start accepting helix jobs too early.

I do know that we had issues with the "wait for emulator" code in the past, I wonder if just waiting some hardcoded time on these problematic older API levels would be enough for now?

I guess a plan B? :)

@akoeplinger
Copy link
Member

I think we thought that it'd be handled by xharness' wait-for-device code, but the 5min timeout for that is probably too short:

var result = RunAdbCommand("wait-for-device", TimeSpan.FromMinutes(5));

@greenEkatherine
Copy link
Contributor Author

I run 100 jobs and set up timeout up to 30 minutes. Unfortunately there are only 4 machines took jobs out of 23 machines - too few to estimate timeout, but even there I see that it takes 10-15 minutes to run the first jobs on each machine.

I would like to look at the results from bigger number of machines. @premun does WorkItemWaitTime monitoring show it or it's about Helix Queueing?

@premun
Copy link
Member

premun commented Nov 3, 2021

Afaik WorkItemWaitTime is Started - Enqueued - so how long it took to pull the work item from the queue after it was enqueued so it is probably not useful here.

You could technically scale up the queue manually in Azure to get more machines and more data - if you don't let them run needlessly long.
But not sure how exact we want to be.

@greenEkatherine
Copy link
Contributor Author

Second attempt show that 19 out of 30 machines take 10+ minutes to run emulators for their first job.

I raised PR and will try to run tests against this queue https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/19540

@premun
Copy link
Member

premun commented Nov 9, 2021

@greenEkatherine @fanyang-mono reported a case (0c79eace-2029-4e08-810e-0a2c30cdd3d3) where we see no emulator on the default API (ubuntu.1804.amd64.android.open.svc).

All of the work items processed on the machine in the span of almost 15 minutes have failed on the ADB_ENUMERATION_FAILURE. However, the last work item made several attempts until almost 30 minutes from the first one.

@greenEkatherine
Copy link
Contributor Author

Current state in PR:
Based on the feedback, I moved waiting for device command with 15 minutes timeout into systemd, so it will be called not only during the first run but after reboot as well. I put it into ExecStartPost. If I understand correctly documentation, it will run right after starting emulator but we wait for result before calling helix.service.

I would not comment here if it works - it does not. I send 200 test jobs to check emulators' state and more than a half of them failed. The question is why validation always passed and why service is not indicating a failure?
image

@premun
Copy link
Member

premun commented Dec 16, 2021

I was randomly trying to run XHarness locally and saw this:
image

It might be possible that the emulators actually start fine and XHarness doesn't detect this well?

@premun
Copy link
Member

premun commented Dec 16, 2021

Oh, the emulator was actually frozen..

@missymessa
Copy link
Member

@premun looks like there's still an open PR for this work. Did we want to do anything with it? https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/19540

@premun
Copy link
Member

premun commented Jan 31, 2022

@missymessa yes, this is very much on my radar but I didn't have time to have a look at this yet. It is very important but also quite complex issue and Katya has done quite a lot of investigations in this regard. Some of them are in the PR, so I am keeping it until I have time to have a look. I expect to get to this in about 2 weeks.

@premun premun changed the title Some emulators cannot start Android emulators fail to boot before accepting Helix jobs Mar 31, 2022
@premun premun self-assigned this Apr 12, 2022
@premun
Copy link
Member

premun commented Apr 13, 2022

I found more proof that the device can reboot on its own in between android run commands:

Analyzing android/run@x86_64 (0)
2022-04-13T02:45:08.815Z	INFO   	xharness-event-processor.py	event(51)	send	Sending event type _MobileDeviceOperation
2022-04-13T02:45:08.900Z	INFO   	xharness-event-processor.py	saferequests(87)	request_with_retry	Response complete with status code '201'
Analyzing android/run@None (81)
    Encountered DEVICE_NOT_FOUND. This is typically not a failure of the work item. It will be run again. This machine will reboot to heal.
    If this occurs repeatedly, please check for architectural mismatch, e.g. sending arm64_v8a APKs to an x86_64 / x86 only queue.
2022-04-13T02:45:08.901Z	INFO   	xharness-event-processor.py	event(51)	send	Sending event type _MobileDeviceOperation
2022-04-13T02:45:08.936Z	INFO   	xharness-event-processor.py	saferequests(87)	request_with_retry	Response complete with status code '201'

One second it's there, XHarness starts it fine, the next second it's gone and it's not XHarness causing the reboot.

This query shows all first operations that started failing:

Metrics
| where EventType == "_MobileDeviceOperation" and MetricName == "ExitCode" and Timestamp > now()-14d
| extend Dimensions = parse_json(Dimensions)
| extend
    ExitCode = MetricValue,
    Command = tostring(Dimensions.command),
    Platform = tostring(Dimensions.platform),
    Target = tostring(Dimensions.target)
| where Platform == "android"
| join kind=inner WorkItems on WorkItemId
| where MachineName != "" and QueueName != "windows.10.amd64.android.open"
| order by JobId asc, WorkItemId asc, Timestamp asc
| extend PrevExitCode = prev(ExitCode, 1), PrevCommand = prev(Command, 1)
| where PrevExitCode == 0 and ExitCode == 81
| project
    JobId,
    QueueName,
    WorkItemFriendlyName,
    MachineName,
    Target,
    Command,
    PrevCommand,
    ConsoleUri

Some more findings:

  • There are 22 cases where an emulator went bad in the last 14 days.
  • I see it happen even between uninstall and install commands which means that the emulator can go bad even outside run commands.
  • I only see this with runtime tests and not BCL tests (100%).
  • I found proof in the data of emulator getting healthy after a reboot.
  • I didn't find a single VM that would have 100% failure rate - it always was working and stopped halfway through

Example of an emulator that went bad half-way through the work item, rebooted and a second attempt of the same workitem completed fine after the machine came back (the same machine picked it up):
image

I was able to find many examples of the above (sample of the 22 failures from the last 14 days showed this frequently)

@tkapin
Copy link
Member

tkapin commented Apr 14, 2022

Thanks for the detailed analysis, @premun. Having proved the emulators are actually fixed by rebooting is a good news, also the sudden crash by any operation is a strong clue.

@premun
Copy link
Member

premun commented Apr 14, 2022

Furthermore, I added some options in case the emulator start-up would get "stalled" (Android term):
https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/22296

I will dig a bit more into the exact moment the emulator stops running before I close this.

@premun
Copy link
Member

premun commented Apr 19, 2022

Last piece of investigation was the overall SLA for the emulators. I have only found 4 work items in the last week that failed with on third attempts and none of them for infra reasons.

From the above reasons I conclude that the Android emulators are booting and rebooting fine and I am closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
android Android area infrastructure Connected to Helix infrastructure
Projects
None yet
Development

No branches or pull requests

5 participants