Slow/inconsistent startup time for SQL Server LocalDB on windows-2022 image #8164

Closed
cremor opened this issue Aug 25, 2023 · 36 comments

@cremor

cremor commented Aug 25, 2023

Description

I’m running database integration tests in Azure DevOps Pipelines with the Microsoft-hosted Windows agent. The tests use MS SQL Server Express LocalDB. The time it takes to start the LocalDB instance is very inconsistent. Sometimes it takes just 20 seconds (which is still slow compared to my local dev machine), but sometimes it takes more than 5 minutes!

For that reason I can’t just use LocalDB in my tests and rely on the automatic startup of the DB like I do on my local dev machine, because then the tests time out.
As a workaround I now have an explicit step in my pipeline:

- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'

This works most of the time, but as mentioned, even this single command can take multiple minutes.
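
To see how long that start actually takes in the log, a timed variant of the step (just a sketch using PowerShell's Measure-Command) could look like this:

- task: PowerShell@2
  displayName: 'Start SQL Server LocalDB (timed)'
  inputs:
    targetType: 'inline'
    script: |
      # Time the LocalDB start so the duration shows up in the build log
      $elapsed = Measure-Command { sqllocaldb start MSSQLLocalDB }
      Write-Host "LocalDB start took $($elapsed.TotalSeconds) seconds"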

Rarely, the LocalDB startup even fails. It then prints the following to the log:

Start of LocalDB instance "MSSQLLocalDB" failed because of the following error:
Cannot create an automatic instance. See the Windows Application event log for error details.

(I can't check the Windows Application event log for details because this is a Microsoft-hosted agent. Enabling system diagnostics for the build also doesn't provide any more useful logs.)

Please fix this so that the LocalDB startup is consistently fast.
Maybe it would be possible to start it when the image is created, so that each new build run already gets a running LocalDB instance?

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • macOS 13
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Image: windows-2022
Version: 20230804.1.0

Is it regression?

unknown

Expected behavior

MS SQL Server Express LocalDB creation/startup time should be fast and consistent.
Ideally I could rely on the auto-start feature and not have to manually start the LocalDB with a script step.

Actual behavior

MS SQL Server Express LocalDB creation/startup time varies between a few seconds and multiple minutes.

Repro steps

  1. Create a pipeline which includes this step:

- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'

  2. Run the pipeline multiple times and observe the time it takes to execute the step.
@Alexey-Ayupov
Collaborator

Hello @cremor, we will take a look. However, we cannot promise to start the MSSQLLocalDB service after Microsoft-hosted runner allocation, since it can affect the performance of other customers.

@ilia-shipitsin
Contributor

@cremor, you can check the event log by running PowerShell, for example:

- task: PowerShell@2
  condition: always()
  inputs:
    targetType: 'inline'
    script: |
      Get-EventLog -LogName System  | fl
      Get-EventLog -LogName Application | fl 
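
If the full dump is too noisy, a narrower variant (a sketch; the exact event source names are an assumption) could filter for SQL/LocalDB-related entries:

- task: PowerShell@2
  condition: always()
  inputs:
    targetType: 'inline'
    script: |
      # Show only recent Application log entries whose source looks SQL/LocalDB related (source names assumed)
      Get-EventLog -LogName Application -Newest 200 |
        Where-Object { $_.Source -like '*LocalDB*' -or $_.Source -like '*SQL*' } |
        Format-List TimeGenerated, Source, EntryType, Message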
  

@cremor
Author

cremor commented Aug 25, 2023

we cannot promise to start the MSSQLLocalDB service after Microsoft-hosted runner allocation, since it can affect the performance of other customers.

I don't know how the agent images work, but can't you start the LocalDB instance before you create the image, so that it is already running when a new agent is started from that image?
If it doesn't work that way, then maybe the LocalDB instance could at least be created in the image? Then the agent would only have to start it (this is assuming that creating the instance is the problem and not starting it - I haven't verified this).

Of course, those are just some ideas. If there is a way to "simply" fix the creation/startup performance then none of this is needed.

@ilia-shipitsin
Contributor

@cremor, did you have a chance to collect the Windows Application event log?

@cremor
Author

cremor commented Aug 25, 2023

No, I've only seen that LocalDB creation error twice over a timespan of about a month. I'd have to add this to all my pipelines and hope that it happens again, which I don't really want to do.

@ilia-shipitsin
Contributor

OK, most probably it will be something like "the mentioned service was not started for .... seconds", but let's add it and see.

From my side, I'll try running your step sqllocaldb start MSSQLLocalDB in ADO and see how much time it takes.

@ilia-shipitsin
Contributor

@cremor, I created a sample workflow:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'

and ran it 50 times

14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=279
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=280
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=281
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=282
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=283
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=284
60s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=285
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=286
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=287
34s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=288
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=289
20s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=290
17s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=291
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=292
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=293
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=294
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=295
39s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=296
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=297
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=298
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=299
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=300
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=301
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=302
50s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=303
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=304
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=305
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=306
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=307
13s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=308
35s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=309
40s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=310
11s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=311
12s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=312
20s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=313
21s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=314
44s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=315
33s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=316
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=317
31s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=318
34s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=319
21s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=320
18s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=321
36s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=322
30s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=323
30s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=324
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=325
14s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=326
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=327
23s https://dev.azure.com/v-ishipitsin/uno/_build/results?buildId=328

All 50 times LocalDB started; I was not able to reproduce your case where the service did not start.
Could it be a matter of system utilization? Do you start LocalDB at the very beginning?

@cremor
Author

cremor commented Aug 28, 2023

My actual pipeline calls the LocalDB start script after a .NET build step. But even in my test pipeline with that single LocalDB start step, the step alone sometimes took over a minute.

I didn't catch extremely long (multiple-minute) durations with the simple test pipeline, but that might just have been bad (good? 😁 ) luck. As I wrote in my initial report, the time is very inconsistent. Sometimes it's fast, sometimes it's slow.

The error has only happened twice for me as far as I know, so that is very rare. And I suspect that the error is just some timeout, because when it happened the step duration was around 5 minutes. So fixing the performance might also fix the error.

Btw, I'm in the Azure DevOps region "West Europe" in case the region affects performance.

@ilia-shipitsin
Contributor

I'd say the bottleneck is the VM size. Standard runners are Standard DS2 v2 (2 vCPUs, 7 GiB memory), so it is quite possible for a heavy build to use up the 7 GiB of RAM.

I'd suggest moving the LocalDB start to the very beginning.
I do not believe "West Europe" is the reason. Well, if you can collect, say, 100 builds in West Europe and 100 builds in another region, we can compare numbers. But it is unlikely to be the root cause.

@cremor
Author

cremor commented Aug 28, 2023

But if the same project (same build step) still results in LocalDB startup times that vary between 20s and 5m, then the build can't be the factor that slows the LocalDB startup time down, right?

Also, I've seen the 5 minute LocalDB startup time for very simple Web API projects (not much more than a single controller and a bit of EF Core infrastructure code).

But I'll try a few things and will get back to you.

@ilia-shipitsin
Contributor

ilia-shipitsin commented Aug 28, 2023

I do not observe 5m. I suspect that you have some other component which slows things down. On a clean build I observe startup times between 12s and 60s.

@ilia-shipitsin
Contributor

Anyway, please provide detailed repro steps for how to observe the 5-minute startup time, so we can investigate.

With the repro steps provided so far, I see 12s-60s.

@cremor
Author

cremor commented Aug 28, 2023

I just tested it again with this simple pipeline:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
  - script: 'sqllocaldb create MSSQLLocalDB'
    displayName: 'Create SQL Server LocalDB'

  - script: 'sqllocaldb start MSSQLLocalDB'
    displayName: 'Start SQL Server LocalDB'

I split the "create" and "start" steps here to figure out what exactly is slow. Turns out it can be both 😁 So maybe I'd better continue to use just a "start" step.

First try:
(screenshot of step durations)
(Missing display names for the steps. I added those later.)

Second try:
(screenshot of step durations)

@ilia-shipitsin
Contributor

@cremor,

I modified my workflow as follows:

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb create MSSQLLocalDB'

- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'sqllocaldb start MSSQLLocalDB'

I really wonder whether THAT difference is because of the "script" <--> "task" semantics, but my numbers are ...

(screenshot of step durations)

@ilia-shipitsin
Contributor

ilia-shipitsin commented Aug 28, 2023

Well, I cannot reproduce your numbers... thus I cannot take measurements on my side.
I'd suggest adding WPT (Windows Performance Toolkit) tracing around "sqllocaldb create MSSQLLocalDB".

something like

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: CmdLine@2
  inputs:
    script: |
      choco install windows-performance-toolkit
      "C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -start -on LOADER+PROC_THREAD+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER+FILE_IO_INIT+REGISTRY

- task: CmdLine@2
  inputs:
    script: |
      sqllocaldb create MSSQLLocalDB

- task: CmdLine@2
  inputs:
    script: |
      sqllocaldb start MSSQLLocalDB

- task: CmdLine@2
  inputs:
    script: |
      "c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -d $(Build.ArtifactStagingDirectory)\tmp.etl

- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)\tmp.etl'
    ArtifactName: 'etl'
    publishLocation: 'Container'

This should collect WPT telemetry in tmp.etl and publish it as an artifact.

@cremor
Author

cremor commented Aug 29, 2023

I'll provide the WPT telemetry file once I manage to reproduce the slow startup in my test pipeline again. Right now the LocalDB startup always takes less than 1 minute. It's really annoying that this is so inconsistent (maybe dependent on outside factors?).

@ilia-shipitsin
Contributor

I'm not familiar with LocalDB. How does "Create SQL Server LocalDB" work? Does it look at the files around it? Maybe we have different filesets in our repos (I've no files in mine).

@cremor
Author

cremor commented Aug 29, 2023

sqllocaldb should be completely independent of any source files in the repo. See also https://learn.microsoft.com/en-us/sql/tools/sqllocaldb-utility
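
For reference, the instance can also be inspected directly with the utility, independent of any repo contents; sqllocaldb versions lists the installed LocalDB versions, sqllocaldb info lists the existing instances, and sqllocaldb info MSSQLLocalDB shows the state, version and owner of the automatic instance (a quick sketch):

sqllocaldb versions
sqllocaldb info
sqllocaldb info MSSQLLocalDB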

@ilia-shipitsin
Contributor

OK, I'll try to run 50 builds with the create step + telemetry.

@cremor
Author

cremor commented Aug 29, 2023

I got a 3m 7s start time. Here is the WPT telemetry file:
tmp.zip

@ilia-shipitsin
Contributor

Good, good. I made a small mistake in the YAML; glad you were able to fix it.

@ilia-shipitsin
Contributor

Interesting. Defender is relatively short.
What could RuntimeBroker be? Is it part of SQL?

I see that SQL itself takes all the time (and no other process)

(screenshot of WPA trace)

@ilia-shipitsin
Contributor

Well, in theory we can invoke sqllocaldb create MSSQLLocalDB during image preparation. What is MSSQLLocalDB, is it a database name? We cannot "warm up" just any database; we would have to name a specific one, right?

@ilia-shipitsin
Contributor

Well, it seems the winner is svchost.

(screenshot of WPA trace)

No idea what it is doing, but most probably "docker" is not related to the LocalDB activity (and we can do something about it after we figure out what it is).

(screenshot of WPA trace)

@cremor
Author

cremor commented Aug 29, 2023

in theory we can invoke sqllocaldb create MSSQLLocalDB during image preparation

That would be great. It seems like most of the time the "start" only takes a few seconds when "create" was executed before. Only rarely does "start" alone take long too. (If I catch it, I will provide separate trace files.)

what is MSSQLLocalDB, is it database name ? we cannot "warmup" just any database, but we should name certain database ?

MSSQLLocalDB is the default DB name for the automatic instance. See
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/sql-server-express-localdb?view=sql-server-ver16#automatic-and-named-instances
and
https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/sql-server-express-localdb?view=sql-server-ver16#connect-to-the-automatic-instance
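
So a prepared instance would be reachable through the usual automatic-instance server name. A quick sanity-check step (a sketch; assumes sqlcmd is available on the image) could look roughly like:

- script: 'sqlcmd -S "(localdb)\MSSQLLocalDB" -Q "SELECT @@VERSION"'
  displayName: 'Verify LocalDB connection'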

@ilia-shipitsin
Contributor

ilia-shipitsin commented Aug 31, 2023

@cremor, I made some tests; unfortunately, "sqllocaldb create MSSQLLocalDB" will take a bit more effort from our side.

Good news: it can be run many times. If I add it to image generation, you can still run it from your pipeline and no error will be emitted.
Bad news: it stores the database in the user profile. The user used for image generation is deleted after image generation (the profile is deleted as well), so we would need to move the generated database to "C:\Users\Default" (which is used as a template for new users).

Also, a few thoughts:

  1. "sqllocaldb create MSSQLLocalDB" creates the database in the user profile, which is stored on the "C:" drive. Drive "C:" is relatively slow, only 500 IOPS for everything. We have a faster "D:" drive (it is exposed via environment variables as the Agent Temp directory); see the sketch after this list.
  2. In the collected WPT trace I see "Windows Cortana" activity. I suspect it used a large share of the 500 IOPS on the "C:" drive. Cortana is enabled on purpose; for real-life e2e tests it is invaluable to run in an environment very similar to real life. Unfortunately, we do not have a good solution for customers who do not need those extra services. Of course, a customer can disable it from the pipeline, but it takes a lot of effort to find out what to disable.
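
Regarding point 1, one possible workaround on the pipeline side (just a sketch; the database name MyTestDb and the assumption that the Agent temp directory sits on the faster D: drive are mine) would be to create the test database with its files on the temp drive explicitly:

- task: PowerShell@2
  displayName: 'Create test database on the agent temp drive'
  inputs:
    targetType: 'inline'
    script: |
      # Assumption: AGENT_TEMPDIRECTORY points at the faster D: drive on hosted agents
      $dataFile = Join-Path $env:AGENT_TEMPDIRECTORY 'MyTestDb.mdf'
      sqlcmd -S "(localdb)\MSSQLLocalDB" -Q "CREATE DATABASE MyTestDb ON (NAME = MyTestDb, FILENAME = '$dataFile')"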

@cremor
Author

cremor commented Sep 1, 2023

  1. "sqllocaldb create MSSQLLocalDB" creates database in user profile, which is stored on "C:" drive. drive "C:" is relatively slow, only 500 IOPS for everything. we have faster "D:" drive

Is this also true for other databases that are preinstalled on the images? Or do the data files of those other databases use the D: drive?

2. in collected WPT I see "windows cortana" activity. I suspect it utilized huge amount of 500 IOPS on "C:" drive.

If this (or some other background process) is indeed the problem, then maybe the correct solution would be to fix that resource usage instead of doing something special for LocalDB?
But I'm wondering why the effect of those background processes is so inconsistent. Shouldn't they always behave the same since the same image is used to start the agent?

@Vossekop

Vossekop commented Sep 7, 2023

Just wanted to let you know... thank god we are not the only ones with this issue. It has been bugging us on and off for a few months. But recently (as in, last week and the week before that) it has gotten significantly worse. We weren't using the workaround @cremor implemented, so our unit tests kept failing now and then. So at least that is something we can implement on our end to keep the builds green.

@cremor
Author

cremor commented Sep 8, 2023

Yeah, Microsoft-hosted agent performance is all over the place recently. I have seen steps (not LocalDB, something completely different) that usually take less than 5 seconds and now sometimes take nearly 2 minutes 😞

@ilia-shipitsin
Contributor

@cremor, sorry for the long delay.

While I'm still looking into how to properly "warm up" things, it seems that I've found the reason for the occasional high disk IO.

We haven't disabled the following scheduled task:

(screenshot of the scheduled task)

We'll disable it; that's the easy part.
Just curious, does your observation of slowness correlate with that schedule?

@cremor
Author

cremor commented Sep 13, 2023

I don't think so. Even though I don't know which time zone this would be, I've seen slow LocalDB startup at all times throughout a (work) day.

But maybe the trigger is executed as soon as the agent starts because its last scheduled time was missed? If that is the case, then it could indeed be the cause of all the performance problems of the agent.

@ilia-shipitsin
Contributor

@cremor, I've figured out that deleting the Defender tasks does not help; it recreates them silently.

We are going to add Set-MpPreference -ScanScheduleDay 8 to the images; it tells Defender to delete the scanning task.
The new image rollout will take a week.

If you have spare time, can you verify whether adding Set-MpPreference -ScanScheduleDay 8 to your pipeline increases stability (or not)?

@cremor
Author

cremor commented Sep 16, 2023

I've now run a few more tests:

  • 10 runs without the command:
    • Longest LocalDB start time: 1m 52s
      WPT telemetry file: Without command.zip
    • All other start times were around 40s or less
  • 10 runs with the command:

So I'd say I can't see an improvement with that command 😉

I ran the command as early as possible (even before checkout). Here is my full test pipeline:

pool:
  vmImage: 'windows-latest'

steps:
- task: PowerShell@2
  inputs:
    targetType: 'inline'
    script: 'Set-MpPreference -ScanScheduleDay 8'

- checkout: self

- task: CmdLine@2
  inputs:
    script: |
      choco install windows-performance-toolkit
      "C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -start -on LOADER+PROC_THREAD+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER+FILE_IO_INIT+REGISTRY

- script: 'sqllocaldb start MSSQLLocalDB'
  displayName: 'Start SQL Server LocalDB'

- task: CmdLine@2
  inputs:
    script: |
      "c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\xperf.exe" -d $(Build.ArtifactStagingDirectory)\tmp.etl

- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)\tmp.etl'
    ArtifactName: 'etl'
    publishLocation: 'Container'

@ilia-shipitsin
Contributor

I'm back with some good news and some bad news.

Bad news: LocalDB wants the database to be placed in %USERPROFILE%; neither #8435 nor ilia-shipitsin@25d45e4 convinced LocalDB to use the moved database. It is created from scratch and placed in %USERPROFILE%.

Good news: during the investigation we improved several things.

Taking into account that we cannot warm up LocalDB due to its nature (it uses %USERPROFILE%), we hope that the other improvements will help to reduce disk IO (especially disabling StorSvc).

@cremor
Author

cremor commented Oct 10, 2023

Thanks for the information.
How can I find out which image version contains those fixes? I'd like to test again when the new image version is available on hosted agents.

@ilia-shipitsin
Contributor

My measurements show that StorSvc was the most significant disk IO consumer.
The fix for that is deployed already (PR: #8388).

The last change (related to the PowerShell first-run warmup) is not merged yet, but it is rather cosmetic; it warms up the first invocation. If you add a dummy first step with "something in PowerShell", that step will do the warmup for you.
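
For example, a trivial first step like this (just a sketch) would absorb that first-run cost:

- task: PowerShell@2
  displayName: 'Warm up PowerShell'
  inputs:
    targetType: 'inline'
    script: 'Write-Host "warming up PowerShell"'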
