Build Worker sometimes fails to install a new SQL Server instance #2804

Closed
johlju opened this issue Jan 7, 2019 · 10 comments

Comments

@johlju

johlju commented Jan 7, 2019

I have seen the build worker sometimes not restarting correctly. I'm curious whether this is a bug, a known problem, or something we are doing wrong.

For this PR dsccommunity/SqlServerDsc#1246 it happened twice for commit 95bf616 and commit 6502e54.

See example here.
https://ci.appveyor.com/project/PowerShell/sqlserverdsc/builds/21423039

And the YAML:
https://github.com/PowerShell/SqlServerDsc/blob/dev/appveyor.yml

@IlyaFinkelshteyn
Contributor

It would be great if you could ping us as soon as you see it stuck, so we can investigate while it is happening, before the build fails. Alternatively, if you increase the sleep before the restart to, say, 10 seconds, the chance of it getting stuck will be much lower. But I would prefer to wait for the next occurrence and investigate.
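
For illustration, a longer pause before the restart might look roughly like the fragment below. This is only a sketch: it assumes the restart is done from `ps:` steps in the `install` phase, which may not match the project's actual appveyor.yml, and the 5-second pause after the reboot is an assumed value.

```yaml
install:
  # Pause before rebooting (the 10 seconds suggested above).
  - ps: Start-Sleep -Seconds 10
  # Reboot the build worker.
  - ps: Restart-Computer -Force
  # Assumed short pause after the reboot, before the next step runs.
  - ps: Start-Sleep -Seconds 5
```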

Out of curiosity -- why do you need to restart the VM? I do not see anything happening beforehand that requires a restart...

@johlju
Author

johlju commented Jan 7, 2019

@IlyaFinkelshteyn I will report back if I see it happening again. Where would it be best to ping you? In this issue, Twitter or somewhere else?

I would rather not restart it, and I haven't needed to before. But it seems that a SQL instance may have been installed on the image without the image being rebooted afterwards, so the worker sometimes fails to install a new instance (we install a default instance as part of our integration tests, which test the SqlSetup DSC resource).
The integration tests only sometimes worked (let's say 3 out of 10). It almost felt as if it sometimes used another image (of the Visual Studio 2017 Build Worker) that had the problem, but I guess there is only one image, so it can't be that. Since I added this restart to the Build Worker I haven't been able to reproduce it (yet).
So a solution could be to reboot the image one more time, and then I might not need this restart. But I'm far from sure. I thought the added restart could act as a fail-safe too. 😄

@johlju
Author

johlju commented Jan 7, 2019

@IlyaFinkelshteyn For reference: dsccommunity/SqlServerDsc#1260 is the issue for which I added the restart workaround, and this is the error that happened (it no longer happens now that the restart workaround has been added): https://ci.appveyor.com/project/johlju/sqlserverdsc/builds/21302743?fullLog=true#L2641 (see line 2660 for the SQL Server setup error message).

@IlyaFinkelshteyn
Contributor

@johlju I would rather root-cause the SQL issue than stay with the reboot workaround. Can you send links to a number of randomly failed and successful builds? We will try to find some commonalities.

Also, can you create a simplified, fast repro that fails after some number of repetitions? It is not a problem for us to run a lot of repetitions, as we can do that on an internal account with a lot of concurrent jobs, but it would be great if the repro itself is fast and simple.

Regarding the reboot issue, you can email team@appveyor.com with high importance and reference this issue. Most of us work in PST, though often after normal working hours too. But again, I believe we can root-cause the problem.

@johlju
Author

johlju commented Jan 9, 2019

Yesterday I tested whether the SQL issue still existed by running all the tests without the restart workaround; 1 of 4 test runs failed. I'm now trying to reproduce the SQL issue with a simplified branch (with most other tests removed). It's probably going to take a day or so until I see whether this fails as well.

@johlju
Author

johlju commented Jan 10, 2019

I saw yesterday that a contributor got the same SQL issue with the restart workaround applied, so the restart workaround only mitigates the SQL issue. I am renaming this issue to focus on the SQL issue.

johlju changed the title from "Build Worker hanging on restart" to "Build Worker sometimes fails to install a new SQL Server instance" on Jan 10, 2019
@johlju
Author

johlju commented Jan 10, 2019

@IlyaFinkelshteyn Yesterday I created a simplified branch with a minimum of tests to see if I could reproduce the issue, but after running the tests 16 times, none failed. This leads me to believe that this might be a memory problem when running with all tests 🤔 I have seen the build worker adding more total memory as it goes, so maybe the VM does not get more memory fast enough. 🤔
While the SQL Server instance is running there are one or two Docker containers running as well (depending on how long the tests take to run in the Docker containers). Another reason I suspect memory is that I have seen (only once so far) one of the Docker containers fail to start due to insufficient page memory.
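
One way to check the memory suspicion would be to log the worker's memory just before the heavy steps and compare passing and failing runs. The fragment below is only a sketch; placing it in `before_test` is an assumption, and the values reported by `Win32_OperatingSystem` are in kilobytes.

```yaml
before_test:
  # Assumed diagnostic step: print total and free physical memory (in KB)
  # so that failed and successful builds can be compared in the build log.
  - ps: Get-CimInstance Win32_OperatingSystem | Select-Object TotalVisibleMemorySize, FreePhysicalMemory | Format-List
```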

Maybe we have outgrown the (free) AppVeyor build worker?

@IlyaFinkelshteyn
Contributor

It is an interesting coincidence that starting from Jan 8th we instantiate VMs with 5000 MB of memory and allow Hyper-V dynamic memory to grow up to 6000 MB. It was 1400 MB - 4000 MB before. We have not announced it yet because we are still monitoring how it goes and seeing if we have to make some adjustments.

If your tests are indeed that memory hungry, you should see an improvement during the last 2 days.

Another option, which I would recommend, is to use parallel testing (which is actually a special case of a build matrix) to segment tests into smaller groups that run as separate build jobs against the same commit.
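
As a rough sketch of that idea, an environment matrix can define one build job per test group. Everything specific here is an assumption for illustration: the `TEST_GROUP` variable name, the `Tests\Unit` and `Tests\Integration` folder layout, and the use of Pester 4 style `Invoke-Pester` to run the tests.

```yaml
environment:
  matrix:
    # Each entry becomes its own build job against the same commit.
    - TEST_GROUP: Unit
    - TEST_GROUP: Integration

test_script:
  # Hypothetical layout: run only the tests in the folder for this job's group.
  - ps: Invoke-Pester -Script ".\Tests\$env:TEST_GROUP" -EnableExit
```

With a single concurrent job available, the two jobs would run one after the other rather than side by side, so the benefit in that case is lower load per worker rather than a shorter total build time.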

@johlju
Author

johlju commented Jan 11, 2019

That is great news that you have raised the memory! I will re-test with the full test suite and see if that memory increase helped my case.
Yesterday I had the same thought to look into parallel build workers (or sequentially for the free account) as another method of reducing load.
I will do some more testing and report back.

@johlju
Author

johlju commented Jan 20, 2019

@IlyaFinkelshteyn In my testing I could not find the actual reason the tests fail. I evaluated whether it could be a problem with the downloaded media, but the media has the same hash both when a test run passes and when a test run fails.
I can only reproduce the problem when running the two containers and installing the SQL instance at the same time, and from that I conclude that there is some sort of memory problem connected to a timing issue.

I have switched from using the containers to running two parallel (sequential on the free account) build workers, with unit tests in one and integration tests in the other, and it seems stable. No errors so far, but the time it takes to run the full test suite doubled. I need to rewrite our test framework to get the containers working in this new parallel scenario, which would speed up the testing again.

I'm closing this issue for now, as I think the test suite, the way it was run previously, overwhelmed the build worker.

/cc @PlagueHO (FYI)

johlju closed this as completed on Jan 20, 2019