Slow/inconsistent startup time for SQL Server LocalDB on windows-2022 image #8164

Closed
cremor opened this issue Aug 25, 2023 · 36 comments

@cremor

cremor commented Aug 25, 2023

Description

I’m running database integration tests in Azure DevOps Pipelines with the Microsoft-hosted Windows agent. The tests use MS SQL Server Express LocalDB. The time it takes to start the LocalDB instance is very inconsistent. Sometimes it takes just 20 seconds (which is still slow compared to my local dev machine), but sometimes it takes more than 5 minutes!

For that reason I can’t just use LocalDB in my tests and rely on the automatic startup of the DB like I do on my local dev machine, because then the tests time out.
As a workaround I now have an explicit step in my pipeline:

- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'

This works most of the time, but as mentioned, even this single command can take multiple minutes.
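
To see how long that start actually takes in the log, a timed variant of the step (just a sketch using PowerShell's Measure-Command) could look like this:

- task: PowerShell@2
  displayName: 'Start SQL Server LocalDB (timed)'
  inputs:
    targetType: 'inline'
    script: |
      # Time the LocalDB start so the duration shows up in the build log
      $elapsed = Measure-Command { sqllocaldb start MSSQLLocalDB }
      Write-Host "LocalDB start took $($elapsed.TotalSeconds) seconds"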

Rarely, the LocalDB startup even fails. It then prints the following to the log:

Start of LocalDB instance "MSSQLLocalDB" failed because of the following error:
Cannot create an automatic instance. See the Windows Application event log for error details.

(I can't check the Windows Application event log for details because this is a Microsoft-hosted agent. Enabling system diagnostics for the build also doesn't provide any more useful logs.)

Please fix this so that the LocalDB startup is consistently fast.
Maybe it would be possible to start it when the image is created, so that each new build run already gets a running LocalDB instance?

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • macOS 13
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Image: windows-2022
Version: 20230804.1.0

Is it regression?

unknown

Expected behavior

MS SQL Server Express LocalDB creation/startup time should be fast and consistent.
Ideally I could rely on the auto-start feature and not have to manually start the LocalDB with a script step.

Actual behavior

MS SQL Server Express LocalDB creation/startup time varies between a few seconds and multiple minutes.

Repro steps

  1. Create a pipeline which includes this step:

- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'

  2. Run the pipeline multiple times and observe the time it takes to execute the step.
@Alexey-Ayupov
Collaborator

Hello @cremor, we will take a look. However, we cannot promise to start the MSSQLLocalDB service after Microsoft-hosted runner allocation, since it can affect the performance of other customers.

@ilia-shipitsin
Contributor

@cremor, you can check the event log by running PowerShell, for example:

- task: PowerShell@2
  condition: always()
  inputs:
    targetType: 'inline'
    script: |
      Get-EventLog -LogName System  | fl
      Get-EventLog -LogName Application | fl 
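
If the full dump is too noisy, a narrower variant (a sketch; the exact event source names are an assumption) could filter for SQL/LocalDB-related entries:

- task: PowerShell@2
  condition: always()
  inputs:
    targetType: 'inline'
    script: |
      # Show only recent Application log entries whose source looks SQL/LocalDB related (source names assumed)
      Get-EventLog -LogName Application -Newest 200 |
        Where-Object { $_.Source -like '*LocalDB*' -or $_.Source -like '*SQL*' } |
        Format-List TimeGenerated, Source, EntryType, Message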
  

@cremor
Author

cremor commented Aug 25, 2023

we cannot promise to start the MSSQLLocalDB service after Microsoft-hosted runner allocation, since it can affect the performance of other customers.

I don't know how the agent images work, but can't you start the LocalDB instance before you create the image, so that it is already running when a new agent is started from that image?
If it doesn't work that way, then maybe the LocalDB instance could at least be created in the image? Then the agent would only have to start it (this is assuming that creating the instance is the problem and not starting it - I haven't verified this).

Of course, those are just some ideas. If there is a way to "simply" fix the creation/startup performance then none of this is needed.

@ilia-shipitsin
Contributor

@cremor, did you have a chance to collect the Windows Application event log?

@cremor
Author

cremor commented Aug 25, 2023

No, I've only seen that LocalDB creation error twice over a timespan of about a month. I'd have to add this to all my pipelines and hope that it happens again, which I don't really want to do.

@ilia-shipitsin
Contributor

OK, most probably it will be something like "the mentioned service was not started for .... seconds", but let's add it and see.

From my side, I'll try running your step sqllocaldb start MSSQLLocalDB in ADO and see how much time it takes.

@ilia-shipitsin
Contributor

@cremor, I created a sample workflow:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'

and ran it 50 times

14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=279
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=280
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=281
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=282
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=283
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=284
60s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=285
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=286
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=287
34s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=288
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=289
20s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=290
17s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=291
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=292
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=293
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=294
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=295
39s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=296
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=297
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=298
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=299
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=300
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=301
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=302
50s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=303
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=304
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=305
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=306
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=307
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=308
35s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=309
40s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=310
11s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=311
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=312
20s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=313
21s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=314
44s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=315
33s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=316
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=317
31s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=318
34s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=319
21s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=320
18s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=321
36s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=322
30s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=323
30s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=324
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=325
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=326
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=327
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=328

All 50 times LocalDB started; I was not able to reproduce your case where the service did not start.
Could it be a matter of system utilization? Do you start LocalDB at the very beginning?

@cremor
Author

cremor commented Aug 28, 2023

My actual pipeline calls the LocalDB start script after a .NET build step. But even in my test pipeline with that single LocalDB start step, the step alone sometimes took over a minute.

I didn't catch extremely long (multiple-minute) durations with the simple test pipeline, but that might just have been bad (good? 😁 ) luck. As I wrote in my initial report, the time is very inconsistent. Sometimes it's fast, sometimes it's slow.

The error has only happened twice for me as far as I know, so that is very rare. And I suspect that the error is just some timeout, because when it happened the step duration was around 5 minutes. So fixing the performance might also fix the error.

Btw, I'm in the Azure DevOps region "West Europe" in case the region affects performance.

@ilia-shipitsin
Contributor

I'd say the bottleneck is the VM size. Standard runners are Standard DS2 v2 (2 vCPUs, 7 GiB memory), so it is quite possible for a heavy build to use up the 7 GiB of RAM.

I'd suggest moving the LocalDB start to the very beginning.
I do not believe "West Europe" is the reason. Well, if you can collect, say, 100 builds in West Europe and 100 builds in another region, we can compare numbers. But it is unlikely to be the root cause.

@cremor
Author

cremor commented Aug 28, 2023

But if the same project (same build step) still results in LocalDB startup times that vary between 20s and 5m, then the build can't be the factor that slows the LocalDB startup time down, right?

Also, I've seen the 5 minute LocalDB startup time for very simple Web API projects (not much more than a single controller and a bit of EF Core infrastructure code).

But I'll try a few things and will get back to you.

@ilia-shipitsin
Contributor

ilia-shipitsin commented Aug 28, 2023

I do not observe 5m. I suspect that you have some other component which slows things down. On a clean build I observe startup times between 12s and 60s.

@ilia-shipitsin
Contributor

Anyway, please provide detailed repro steps for how to observe the 5-minute startup time, so we can investigate.

With the repro steps provided so far, I see 12s-60s.

@cremor
Author

cremor commented Aug 28, 2023

I just tested it again with this simple pipeline:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
  - script: 'sqllocaldb create MSSQLLocalDB'
    displayName: 'Create SQL Server LocalDB'

  - script: 'sqllocaldb start MSSQLLocalDB'
    displayName: 'Start SQL Server LocalDB'

I split the "create" and "start" steps here to figure out what exactly is slow. Turns out it can be both 😁 So maybe I'd better continue to use just a "start" step.

First try:
(screenshot of step durations)
(Missing display names for the steps. I added those later.)

Second try:
(screenshot of step durations)

@ilia-shipitsin
Contributor

@cremor,

I modified my workflow as follows:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb create MSSQLLocalDB'

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'

I really wonder whether THAT difference is because of the "script" <--> "task" semantics, but my numbers are ...

(screenshot of step durations)

@ilia-shipitsin
Contributor

ilia-shipitsin commented Aug 28, 2023

Well, I cannot reproduce your numbers... thus I cannot take measurements on my side.
I'd suggest adding WPT (Windows Performance Toolkit) tracing around "sqllocaldb create MSSQLLocalDB".

something like

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: CmdLine@2
  inputs:
    script: |
      choco install windows-performance-toolkit
      "C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -start -on LOADER+PROC_THREAD+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER+FILE_IO_INIT+REGISTRY

- task: CmdLine@2
  inputs:
    script: |
      sqllocaldb create MSSQLLocalDB

- task: CmdLine@2
  inputs:
    script: |
      sqllocaldb start MSSQLLocalDB

- task: CmdLine@2
  inputs:
    script: |
      "c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -d $(Build.ArtifactStagingDirectory)\tmp.etl

- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)\tmp.etl'
    ArtifactName: 'etl'
    publishLocation: 'Container'

This should collect WPT telemetry in tmp.etl and publish it as an artifact.

@cremor
Author

cremor commented Aug 29, 2023

I'll provide the WPT telemetry file once I manage to reproduce the slow startup in my test pipeline again. Right now the LocalDB startup always takes less than 1 minute. It's really annoying that this is so inconsistent (maybe dependent on outside factors?).

@ilia-shipitsin
Contributor

I'm not familiar with LocalDB. How does "Create SQL Server LocalDB" work? Does it look at the files around it? Maybe we have different filesets in our repos (I've no files in mine).

@cremor
Author

cremor commented Aug 29, 2023

sqllocaldb should be completely independent of any source files in the repo. See also https://learn.microsoft.com/en-us/sql/tools/sqllocaldb-utility
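
For reference, the instance can also be inspected directly with the utility, independent of any repo contents; sqllocaldb versions lists the installed LocalDB versions, sqllocaldb info lists the existing instances, and sqllocaldb info MSSQLLocalDB shows the state, version and owner of the automatic instance (a quick sketch):

sqllocaldb versions
sqllocaldb info
sqllocaldb info MSSQLLocalDB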

@ilia-shipitsin
Contributor

OK, I'll try to run 50 builds with the create step + telemetry.

@cremor
Author

cremor commented Aug 29, 2023

I got a 3m 7s start time. Here is the WPT telemetry file:
tmp.zip

@ilia-shipitsin
Contributor

Good, good. I made a small mistake in the YAML; glad you were able to fix it.

@ilia-shipitsin
Contributor

Interesting. Defender is relatively short.
What could RuntimeBroker be? Is it part of SQL?

I see that SQL itself takes all the time (and no other process)

(screenshot of WPA trace)

@ilia-shipitsin
Contributor

Well, in theory we can invoke sqllocaldb create MSSQLLocalDB during image preparation. What is MSSQLLocalDB, is it a database name? We cannot "warm up" just any database; we would have to name a specific one, right?

@ilia-shipitsin
Contributor

Well, it seems the winner is svchost.

(screenshot of WPA trace)

No idea what it is doing, but most probably "docker" is not related to the LocalDB activity (and we can do something about it after we figure out what it is).

(screenshot of WPA trace)

@cremor
Author

cremor commented Aug 29, 2023

in theory we can invoke sqllocaldb create MSSQLLocalDB during image preparation

That would be great. It seems like most of the time the "start" only takes a few seconds when "create" was executed before. Only rarely does "start" alone take long too. (If I catch it, I will provide separate trace files.)

what is MSSQLLocalDB, is it database name ? we cannot "warmup" just any database, but we should name certain database ?

MSSQLLocalDB is the default DB name for the automatic instance. See
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/sql-server-express-localdb?view=sql-server-ver16#automatic-and-named-instances
and
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/sql-server-express-localdb?view=sql-server-ver16#connect-to-the-automatic-instance
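
So a prepared instance would be reachable through the usual automatic-instance server name. A quick sanity-check step (a sketch; assumes sqlcmd is available on the image) could look roughly like:

- script: 'sqlcmd -S "(localdb)\MSSQLLocalDB" -Q "SELECT @@VERSION"'
  displayName: 'Verify LocalDB connection'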

@ilia-shipitsin
Contributor

ilia-shipitsin commented Aug 31, 2023

@cremor, I made some tests; unfortunately, "sqllocaldb create MSSQLLocalDB" will take a bit more effort from our side.

Good news: it can be run many times. If I add it to image generation, you can still run it from your pipeline and no error will be emitted.
Bad news: it stores the database in the user profile. The user used for image generation is deleted after image generation (the profile is deleted as well), so we would need to move the generated database to "C:\Users\Default" (which is used as a template for new users).

Also, a few thoughts:

  1. "sqllocaldb create MSSQLLocalDB" creates the database in the user profile, which is stored on the "C:" drive. Drive "C:" is relatively slow, only 500 IOPS for everything. We have a faster "D:" drive (it is exposed via environment variables as the Agent Temp directory); see the sketch after this list.
  2. In the collected WPT trace I see "Windows Cortana" activity. I suspect it used a large share of the 500 IOPS on the "C:" drive. Cortana is enabled on purpose; for real-life e2e tests it is invaluable to run in an environment very similar to real life. Unfortunately, we do not have a good solution for customers who do not need those extra services. Of course, a customer can disable it from the pipeline, but it takes a lot of effort to find out what to disable.
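
Regarding point 1, one possible workaround on the pipeline side (just a sketch; the database name MyTestDb and the assumption that the Agent temp directory sits on the faster D: drive are mine) would be to create the test database with its files on the temp drive explicitly:

- task: PowerShell@2
  displayName: 'Create test database on the agent temp drive'
  inputs:
    targetType: 'inline'
    script: |
      # Assumption: AGENT_TEMPDIRECTORY points at the faster D: drive on hosted agents
      $dataFile = Join-Path $env:AGENT_TEMPDIRECTORY 'MyTestDb.mdf'
      sqlcmd -S "(localdb)\MSSQLLocalDB" -Q "CREATE DATABASE MyTestDb ON (NAME = MyTestDb, FILENAME = '$dataFile')"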

@cremor
Author

cremor commented Sep 1, 2023

  1. "sqllocaldb create MSSQLLocalDB" creates database in user profile, which is stored on "C:" drive. drive "C:" is relatively slow, only 500 IOPS for everything. we have faster "D:" drive

Is this also true for other databases that are preinstalled on the images? Or do the data files of those other databases use the D: drive?

2. in collected WPT I see "windows cortana" activity. I suspect it utilized huge amount of 500 IOPS on "C:" drive.

If this (or some other background process) is indeed the problem, then maybe the correct solution would be to fix that resource usage instead of doing something special for LocalDB?
But I'm wondering why the effect of those background processes is so inconsistent. Shouldn't they always behave the same since the same image is used to start the agent?

@Vossekop

Vossekop commented Sep 7, 2023

Just wanted to let you know... thank god we are not the only ones with this issue. It has been bugging us on and off for a few months. But recently (as in, last week and the week before that) it has gotten significantly worse. We weren't using the workaround @cremor implemented, so our unit tests kept failing now and then. So at least that is something we can implement on our end to keep the builds green.

@cremor
Author

cremor commented Sep 8, 2023

Yeah, Microsoft-hosted agent performance is all over the place recently. I have seen steps (not LocalDB, something completely different) that usually take less than 5 seconds and now sometimes take nearly 2 minutes 😞

@ilia-shipitsin
Contributor

@cremor, sorry for the long delay.

While I'm still looking into how to properly "warm up" things, it seems that I've found the reason for the occasional high disk IO.

We haven't disabled the following scheduled task:

(screenshot of the scheduled task)

We'll disable it; that's the easy part.
Just curious, does your observation of slowness correlate with that schedule?

@cremor
Author

cremor commented Sep 13, 2023

I don't think so. Even though I don't know which time zone this would be, I've seen slow LocalDB startup at all times throughout a (work) day.

But maybe the trigger is executed as soon as the agent starts because its last scheduled time was missed? If that is the case, then it could indeed be the cause of all the performance problems of the agent.

@ilia-shipitsin
Contributor

@cremor, I've figured out that deleting the Defender tasks does not help; it recreates them silently.

We are going to add Set-MpPreference -ScanScheduleDay 8 to the images; it tells Defender to delete the scanning task.
The new image rollout will take a week.

If you have spare time, can you verify whether adding Set-MpPreference -ScanScheduleDay 8 to your pipeline increases stability (or not)?

@cremor
Author

cremor commented Sep 16, 2023

I've now run a few more tests:

  • 10 runs without the command:
    • Longest LocalDB start time: 1m 52s
      WPT telemetry file: Without command.zip
    • All other start times were around 40s or less
  • 10 runs with the command:

So I'd say I can't see an improvement with that command 😉

I ran the command as early as possible (even before checkout). Here is my full test pipeline:

pool:
  vmImage: 'windows-latest'

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'Set-MpPreference -ScanScheduleDay 8'

- checkout: self

- task: CmdLine@2
  inputs:
    script: |
      choco install windows-performance-toolkit
      "C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -start -on LOADER+PROC_THREAD+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER+FILE_IO_INIT+REGISTRY

- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'

- task: CmdLine@2
  inputs:
    script: |
      "c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -d $(Build.ArtifactStagingDirectory)\tmp.etl

- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)\tmp.etl'
    ArtifactName: 'etl'
    publishLocation: 'Container'

@ilia-shipitsin
Contributor

I'm back with some good news and some bad news.

Bad news: LocalDB wants the database to be placed in %USERPROFILE%; neither #8435 nor ilia-shipitsin@25d45e4 convinced LocalDB to use the moved database. It is created from scratch and placed in %USERPROFILE%.

Good news: during the investigation we improved several things.

Taking into account that we cannot warm up LocalDB due to its nature (it uses %USERPROFILE%), we hope that the other improvements will help to reduce disk IO (especially disabling StorSvc).

@cremor
Author

cremor commented Oct 10, 2023

Thanks for the information.
How can I find out which image version contains those fixes? I'd like to test again when the new image version is available on hosted agents.

@ilia-shipitsin
Contributor

My measurements show that StorSvc was the most significant disk IO consumer.
The fix for that is deployed already (PR: #8388).

The last change (related to the PowerShell first-run warmup) is not merged yet, but it is rather cosmetic; it warms up the first invocation. If you add a dummy first step with "something in PowerShell", that step will do the warmup for you.
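
For example, a trivial first step like this (just a sketch) would absorb that first-run cost:

- task: PowerShell@2
  displayName: 'Warm up PowerShell'
  inputs:
    targetType: 'inline'
    script: 'Write-Host "warming up PowerShell"'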
