
[Documentation] Zero/minimal downtime deployment best practices documentation request #862

SimonLuckenuik opened this issue Jun 26, 2018 · 20 comments


@SimonLuckenuik

SimonLuckenuik commented Jun 26, 2018

Hi,

I am looking for best practices on handling deployment/upgrade scenarios to minimize downtime. I am also looking for Function App code organization recommendations that help minimize the impact of those deployment scenarios (e.g., one Function App per version, separating APIs from async trigger-processing code such as queues).

While I found information here and there, scattered around GitHub repos and docs.com, I was not able to find a nice summary or reference guide on the subject that provides best practices.

Some questions to understand what I am looking for:

  1. What is the proposed approach to deployment/upgrade scenarios while a function is currently executing in Azure Functions? (an app with a lot of activity)

  2. What is the status of deployment slots, and should we use them? They have been in preview for a while and still have issues (you cannot prevent code from executing, see reference, meaning that after a swap the old code will still compete with the newer code).

  3. Should we separate API triggers into a different Function App to minimize the user impact of upgrades?

  4. What is the impact on my Function App when the Functions Runtime is updated?

  5. From a coding perspective, what are the recommended approaches to minimizing the impact of such upgrades (how to detect upgrades)?

Discussion initially started here while brainstorming on a different topic. Quote from @jeffhollan from that thread to get the discussion started:

In order to have "zero down time" deployments with Azure Functions you'd need to do some green/blue deployment with something like traffic manager or proxies if HTTP, or potential phased / competing consumer if non-HTTP.

@jeffhollan
Contributor

Thanks for creating this. I checked yesterday and didn't see any issues that were a perfect fit for discussion/tracking, so this one works well.

@jeffhollan
Contributor

A few thoughts on this for now, just to start the conversation as I chew on this more:

Slots - still in preview. Mostly valid for HTTP scenarios (more challenging to get working with things like queue triggers). One of the caveats with slots today is around scaling, which would impact high-activity functions on the consumption plan. Specifically, if you had a function on consumption with 30 active scaled-out instances, the "staging" slot may have no activity and therefore no scale. If you did a full swap right away, you'd now have all the HTTP traffic hitting no scaled instances. This may be acceptable - usually it only takes a few seconds - but could have an impact.

For scale-down, that's a good question, and I believe the cancellation token is the current way you could handle that gracefully. I'm also not sure what scenarios may cause us to scale down while requests are in flight. @tohling @cgillum @paulbatum may be able to chime in. I don't know if the answer is different depending on the trigger.
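For illustration, a minimal sketch of that cancellation-token pattern in a C# queue-triggered function; the queue name and the per-step work breakdown are hypothetical, not from this thread:

```csharp
using System.Threading;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class WorkItemProcessor
{
    // The host passes a CancellationToken that is signaled when this
    // instance is shutting down (scale-in, deployment, host restart).
    [FunctionName("WorkItemProcessor")]
    public static void Run(
        [QueueTrigger("work-items")] string message, // hypothetical queue
        CancellationToken token,
        ILogger log)
    {
        foreach (var step in message.Split(';')) // hypothetical unit-of-work split
        {
            // Abort at a safe boundary: throwing fails the invocation, so the
            // queue message becomes visible again and is retried later, which
            // is why each step needs to be idempotent.
            token.ThrowIfCancellationRequested();

            log.LogInformation("Processing step {Step}", step);
        }
    }
}
```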

I think the scope of the question may be a bit too big for a single issue, though. Zero-downtime deployment feels very different from "how to handle a function running longer than the 10-minute max," for example. Scale-up and scale-down are slightly related, as are updates to the runtime (is there any downtime when a runtime version is updated, and how do those updates roll across instances?). Hopefully some others will chime in as well and we can start gathering better guidance for the docs.

@SimonLuckenuik
Author

SimonLuckenuik commented Jun 27, 2018

@jeffhollan, I initially grouped "zero/minimal downtime deployment" and "graceful shutdown" together, considering that, for me, zero downtime and graceful shutdown are tightly coupled (deployment usually involves stopping the currently executing version, which implies some downtime/restart, unless we have multiple versions deployed / blue-green deployment). Let me split them so that we can focus on the proper topic in each issue.

@SimonLuckenuik SimonLuckenuik changed the title [Documentation] Zero/minimal downtime deployment and graceful shutdown best practices documentation request [Documentation] Zero/minimal downtime deployment best practices documentation request Jun 27, 2018
@jeffhollan
Contributor

Just an update: I am still looking into this. I also spoke some with @fabiocav this week about graceful shutdowns. Some of it may be possible today, and more is planned. The comment above is still valid, but conversations are healthy on our side and I'm hoping to have better updates in the next few weeks.

@ryanspletzer

Hi all, I just stumbled across this issue while following a similar thought process to the original poster. Just curious whether there have been any additional discussions / thoughts around this area?

@jeffhollan
Contributor

Definitely - nothing to share yet, but there is some work being discussed right now to help with the zero-downtime story. Tagging @nimakms as an FYI, who is going to be helping mature some of the slots stories. Hopeful to have more updates here beyond what's already been stated in a few weeks. For now it's also worth noting the importance of the cancellation token, which can notify your code of a cancellation event and allow you to clean up (though I believe the host will wait 30 seconds for in-flight executions to complete before updating them with the latest bits).
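As a minimal sketch of that cleanup hook, assuming a C# timer-triggered function; the schedule, loop, and log messages are illustrative only:

```csharp
using System;
using System.Threading;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class NightlyCleanup
{
    [FunctionName("NightlyCleanup")]
    public static void Run(
        [TimerTrigger("0 0 2 * * *")] TimerInfo timer, // schedule is illustrative
        CancellationToken token,
        ILogger log)
    {
        // Register a callback that fires when the host signals shutdown; it
        // should complete well within the ~30-second grace period noted above.
        using (token.Register(() => log.LogWarning("Shutdown signaled; flushing partial state")))
        {
            // Long-running work should also observe the token directly.
            for (var i = 0; i < 100 && !token.IsCancellationRequested; i++)
            {
                Thread.Sleep(TimeSpan.FromSeconds(1)); // stand-in for one unit of work
            }
        }
    }
}
```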

@jeffhollan jeffhollan removed their assignment Oct 23, 2018
@VeronicaWasson

@jeffhollan Is the cancellation token approach only available in C#? Or is there an equivalent for nodejs/Java? thanks!

@sadgit

sadgit commented Nov 13, 2018

I expected the swapped-out slot to upgrade after the fact. Failing that, the slot which is taken offline should go into disabled mode, as its code is now invalid.
I suggest two modes: Update After Swap - XOR - Offline After Swap.
I am thinking in the context of trigger functions, not HTTP functions.
We are also using 'Zip Deploy', deployed via a pipeline with Azure App Service Deploy (4.* Preview).

@ggirard07

@jeffhollan any update on those discussions? Deployment slots for Functions are now GA, but I don't see anything new related to this issue. Were the discussions about the Docker support?

The documentation about slots does not even mention the scaling issue with slotting that you described in your first comment.

@ColbyTresness

@ahmedelnably

@ahmedelnably

@ggirard07 The scaling behavior did change for consumption: if your production slot scales up, the staging slot will match the scale. So if you have a high-activity function on consumption, you shouldn't see any downtime, as the slot would already be scaled out.

@paulbatum
Member

Not sure if it was mentioned above, but slots for Windows consumption are no longer in preview. This documentation has been updated: https://docs.microsoft.com/en-us/azure/azure-functions/functions-deployment-slots

@oatsoda

oatsoda commented Jan 20, 2020

Is there any consolidated documentation on this yet?

My main concern is around queue and timer functions - what should we do? If I disable/stop them to avoid staging and production competing for locks, how do I handle the shutdown? I believe they get 30 seconds of grace before processes are killed, but if they run longer, does this mean I have to ensure they are fully transactional or "undoable" using the CancellationToken workflow?

See Azure/azure-functions-host#2412 (comment) - this also stems from the issue that we don't want to have to add an app setting for every function just to provide "turn off if staging" functionality.
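For reference, the WebJobs SDK does offer a Disable attribute bound to an app setting, which is one way to express that per-function "turn off if staging" switch. A minimal sketch, assuming an in-process C# function; the setting and queue names here are hypothetical:

```csharp
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class OrderProcessor
{
    // When the app setting named below is "true" (or "1"), the host skips
    // this trigger entirely. Making it a slot-sticky setting that is only
    // set on the staging slot keeps staging from competing for queue locks,
    // without a code change per deployment.
    [Disable("DisableQueueProcessing")] // setting name is hypothetical
    [FunctionName("ProcessOrders")]
    public static void Run(
        [QueueTrigger("orders")] string message, // queue name is hypothetical
        ILogger log)
    {
        log.LogInformation("Processing order message: {Message}", message);
    }
}
```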

@kmadof

kmadof commented Aug 11, 2020

Is it possible to achieve zero-downtime deployment using slots and swap with preview? I tried it, but after the swap completed I noticed this:

[image: telemetry screenshot]

It caused downtime, of course, as for a moment there weren't any running instances.

@paulbatum
Member

@kmadof So you can't treat the sampled telemetry as authoritative, as some entries might be missing (better to look at the full trace logs in App Insights). However, it looks like I was able to reproduce your issue - I did a swap with some timer-triggered functions and was able to observe approximately 30 seconds of downtime. We're investigating.

@kmadof

kmadof commented Aug 13, 2020

@paulbatum Thanks for the advice. I found this article with a suggestion to use the setting WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG set to 1. I tried it, and it does help with the downtime. However, I wonder why it is mentioned only on the deployment slots page for Web Apps and not on the Functions one.

@paulbatum
Member

@kmadof Thank you for this additional detail. It looks like this setting is designed to preserve backwards compatibility, and I'm not sure whether that is necessary for Functions, so there might be an improvement we can make here. We're taking a look.

@qcnguyen

What about this issue? We expect no downtime when swapping, but in practice there is some.

@qcnguyen

The article https://medium.com/@yapaxinl/azure-deployment-slots-how-not-to-make-deployment-worse-23c5819d1a17 works: it provides zero downtime. However, is it safe to use, team?

@glennt

glennt commented Dec 10, 2020

We've been struggling to get zero-downtime deployment with Functions as well.

This is the process we use:

  • Update settings in staging slot
  • Deploy to staging slot
  • Ping the staging function until we get a 200 for a period of time (a sketch of such a health-check endpoint is below)
  • Do the swap to production

Settings for both slots:
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG=1
WEBSITE_RUN_FROM_PACKAGE=1

We are using the consumption plan.
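For illustration, a minimal sketch of the kind of health-check endpoint that the "ping until 200" step above could poll; the function name and route are hypothetical, not from this thread:

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class HealthCheck
{
    // Returns 200 once the host is up and running the new bits; a deployment
    // pipeline can poll this on the staging slot before starting the swap.
    [FunctionName("HealthCheck")]
    public static IActionResult Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "health")] HttpRequest req)
    {
        return new OkResult();
    }
}
```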

This works most of the time, but occasionally the deploy to the staging slot fails and ends up taking down the production slot with it. Below is the error:

{'id': 'c17a7ec08c4e4258bd866890b6925e1b', 'status': 3, 'status_text': '', 'author_email': 'N/A', 'author': 'N/A', 'deployer': 'ZipDeploy', 'message': 'Created via a push deployment', 'progress': '', 'received_time': '2020-12-10T19:08:18.6211406Z', 'start_time': '2020-12-10T19:08:19.5121218Z', 'end_time': '2020-12-10T19:08:42.1614686Z', 'last_success_end_time': None, 'complete': True, 'active': False, 'is_temp': False, 'is_readonly': True, 'url': 'https://gt-caas-publications-us-dev-staging.scm.azurewebsites.net/api/deployments/latest', 'log_url': 'https://gt-caas-publications-us-dev-staging.scm.azurewebsites.net/api/deployments/latest/log', 'site_name': 'gt-caas-publications-us-dev', 'provisioningState': None}

The logs in the URLs provided by the message don't have any useful information in them.

Has anyone been able to achieve zero-downtime deployment without these intermittent issues?

Is there maybe a setting we're missing that triggers these deployment failures? Does running deployments back to back cause some locking issues?
