feat: Add support for targeting Kubernetes Service Kind #372

mumoshu · 2019-11-18T09:44:58Z

Resolves #371

This adds the support for corev1.Service as the targetRef.kind, so that we can use Flagger just for canary analysis and traffic-shifting on existing and pre-created services. Flagger doesn't touch deployments and HPAs in this mode.

This is useful for keeping your full-control on the resources backing the service to be canary-released, including pods(behind a ClusterIP service) and external services(behind an ExternalName service).

Major use-case in my mind are:

Canary-release a K8s cluster. You create two clusters and a master cluster. In the master cluster, you create two ExternalName services pointing to (the hostname of the loadbalancer of the targeted app instance in) each cluster. Flagger runs on the master cluster and helps safely rolling-out a new K8s cluster by doing a canary release on the ExternalName service.
You want annotations and labels added to the service for integrating with things like external lbs(without extending Flagger to support customizing any aspect of the K8s service it manages

Design:

A canary release on a K8s service is almost the same as one on a K8s deployment. The only fundamental difference is that it operates only on a set of K8s services.

For example, one may start by creating two Helm releases for podinfo-blue and podinfo-green, and a K8s service podinfo. The podinfo service should initially have the same Spec as that of podinfo-blue.

On a new release, you update podinfo-green, then trigger Flagger by updating the K8s service podinfo so that it points to pods or externalName as declared in podinfo-green. Flagger does the rest. The end result is the traffic to podinfo is gradually and safely shifted from podinfo-blue to podinfo-green.

How it works:

Under the hood, Flagger maintains two K8s services, podinfo-primary and podinfo-canary. Compared to canaries on K8s deployments, it doesn't create the service named podinfo, as it is already provided by YOU.

Once Flagger detects the change in the podinfo service, it updates the podinfo-canary service and the routes, then analyzes the canary. On successful analysis, it promotes the canary service to the podinfo-primary service. You expose the podinfo service via any L7 ingress solution or a service mesh so that the traffic is managed by Flagger for safe deployments.

Giving it a try:

To give it a try, create a Canary as usual, but its targetRef pointed to a K8s service:

apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
spec:
  provider: kubernetes
  targetRef:
    apiVersion: core/v1
    kind: Service
    name: podinfo
  service:
    port: 9898
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 10s
    # max number of failed checks before rollback
    threshold: 2
    # number of checks to run before rollback
    iterations: 2
    # Prometheus checks based on
    # http_request_duration_seconds histogram
    metrics: []

Create a K8s service named podinfo, and update it. Now watch for the services podinfo, podinfo-primary, podinfo-canary.

Flagger tracks podinfo service for changes. Upon any change, it reconciles podinfo-primary and podinfo-canary services. podinfo-canary always replicate the latest podinfo. In contract, podinfo-primary replicates the latest successful podinfo-canary.

Notes:

For the canary cluster use-case, we would need to write a K8s operator to, e.g. for App Mesh, sync ExternalName services to AppMesh VirtualNodes. But that's another story!

TODOs:

Add E2E test (e39f45b)
Add integration tests (e5f4bfc)
Write docs (Where to add? Do we need to include it in this PR, or another one?
Please add anything you think needed

stefanprodan · 2019-11-18T22:19:42Z

@mumoshu this looks really good. I'm at KubeCon so I'll do a proper review after I'm getting back from US. As for e2e testing, this should be implemented similar to the blue/green one see https://github.com/weaveworks/flagger/blob/master/test/e2e-kubernetes.sh and https://github.com/weaveworks/flagger/blob/master/test/e2e-kubernetes-tests.sh

mumoshu · 2019-11-19T16:52:30Z

@stefanprodan Thanks!

JFYI I've added an E2E test suite for this feature in e39f45b.

After you're back from KubeCon, I'd appreciate any suggestion or request on where to add docs and unit/integration tests, too.

Re tests, does an another version of TestScheduler_Promotion but with mocks for the service canary helps?

Uses fluxcd/flagger#372 to let Flagger manages traffic between two Kubernetes services each is managed by a Helm release.

stefanprodan · 2019-11-24T10:21:12Z

@mumoshu I have a couple of questions/suggestions:

I think it would be best to create an interface for Deployer and have an implementation for kind deployment and another one for kind service, this way you don't have to do if isTargetedService(cd) {return "", nil, nil} to skip the logic that makes reading the code now very hard. Also for the router package, I think is best to create a separate router for service canaries instead of mixing the two use-cases.
I have a hard time understanding how does the metrics analysis works with services. Flagger has two built-in checks: success rate and latency, these metrics are implemented here. The metric checks look for the targetRef name. All mesh providers record these metrics using the deployment name but with your implementation the targetRef doesn't match any metrics as far as I can tell. With a ClusterIP as target, how would Flagger find the canary destination and query the metrics?
Have you tested this with Istio, Linkerd or App Mesh if the rollback still works? Does users need to specify custom metrics, if so how can it be done since the canary target changes with each deployment. My understanding is that one would switch from blue to green on first canary and them from green to blue and the 2nd making it impossible to maintain the same prom queries.
How would this work in a GitOps pipeline? If the traffic shift changes from blue to green on the first canary, one 2nd one does the user need to update the blue chart in git?

mumoshu · 2019-11-25T00:22:59Z

I think it would be best to create an interface for Deployer...
...
Also for the router package, I think is best to create a separate router for service canaries instead of mixing the two use-cases.

That sounds good! I'll address those in the coming days.

With a ClusterIP as target, how would Flagger find the canary destination and query the metrics?

Thanks for all the explanations! Honestly speaking I only had a vague idea on how analysis on service canary work in practice, but reading your explanation made it clearer... So, I'd say the only way would be to use custom queries.

You use an external operator or some umbrella helm chart or helmfile to compose custom queries (or their templates).

This service canary feature is pretty low-level compared to that for K8s deployments. That means various assumptions we've made so far on Flagger - like mapping between targetRef.name to query labels - doesn't make sense on service canaries.

Would it justify the requirement for users to write their own custom queries?

Have you tested this with Istio, Linkerd or App Mesh if the rollback still works?

Honestly, it won't work as you might have expected.

I did confirm that a rollback happens (while testing Flagger with my own crossover via the smi provider) when one of webhook tests is failed.

As you've pointed out earlier, it won't work with the out-of-box request success rate and duration queries, as there's no way to match custom queries to the metrics for the canary service.

My understanding is that one would switch from blue to green on first canary and them from green to blue and the 2nd

Correct.

making it impossible to maintain the same prom queries.

Absolutely. I was thinking that I would change the custom queries part of a canary object before changing the targeted service, so that Flagger knows the exact custom queries on metrics from the next/canary service.

If it doesn't work, we need an additional source of metrics labels to query per canary. It's kind of like annotating Service with app.flagger.service/metrics-labels: kubernetes_cluster=blue,kubernets_deployment=podinfo for multi-cluster deployment or app.flagger/service/metrics-labels: kubernetes_deployment=podinfo-blue for in-cluster canary deployment.

And use those from within a query template that would look like:

canaryAnalysis:
    metrics:
    - name: "500s percentage"
      threshold: 3
      queryTemplate: |
        100 - sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="{{.MetricsLabels | index "kubernetes_deployment" }}",
                    status_code!="500",
                    direction="inbound"
                }[1m]
            )
        )
        /
        sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="{{.MetricsLabels | index "kubernetes_deployment" }}",
                    direction="inbound"
                }[1m]
            )
        )
        * 100

How would this work in a GitOps pipeline? If the traffic shift changes from blue to green on the first canary, one 2nd one does the user need to update the blue chart in git?

Exactly. To avoid that, I believe we may need another k8s operator, or a complex CI pipeline that generates manifests accordingly.

stefanprodan · 2019-11-25T15:33:58Z

@mumoshu I've extracted the Kubernetes operations to an interface and created a factory for it. Once merged you have to pull the changes into this PR and add a ServiceControlller implementation. They you can make the factory return a ServiceControlller when the kind is Service.

PS: this is merged in master, please see the controller interface.

mumoshu · 2019-11-25T19:44:08Z

@stefanprodan Thanks! That was close, I've already done similar things on my end. I'll try to adapt to yours soon.

A few points that I wanted to confirm:

Regading the router, I've split it into two, kubernetes_service and kubernetes_deployment. Does it look good to you?
To avoid recreating the kind of Controller that needed in each place, I've introduced a facade called MultiController. Does it make sense to you?
I've extracted utility functions shared between DeploymentController and ServiceController, like syncCanaryStatus, setStatusFailedChecks, setStatusWeight, etc.

Can I move DeploymentController specific methods like this one to deployment_controller.go?
```
func (c *DeploymentController) SetStatusWeight(cd *flaggerv1.Canary, val int) error {
	return setStatusWeight(c.FlaggerClient, cd, val)
}
```
They're now mostly wrappers around the utility funcs and not fit to the dedicated file like status.go. To be clear, the utility funcs will remain where they are, like syncCanaryStatus remains in status.go.

stefanprodan · 2019-11-25T19:56:35Z

@mumoshu can you please join the Flagger slack so we can chat there? Thanks

stefanprodan · 2019-11-26T16:40:34Z

@mumoshu sorry I had to fix a bug and it conflicts with your branch. Can you please fix the conflict, the change are very lite https://github.com/weaveworks/flagger/pull/380/files

Resolves fluxcd#371 --- This adds the support for `corev1.Service` as the `targetRef.kind`, so that we can use Flagger just for canary analysis and traffic-shifting on existing and pre-created services. Flagger doesn't touch deployments and HPAs in this mode. This is useful for keeping your full-control on the resources backing the service to be canary-released, including pods(behind a ClusterIP service) and external services(behind an ExternalName service). Major use-case in my mind are: - Canary-release a K8s cluster. You create two clusters and a master cluster. In the master cluster, you create two `ExternalName` services pointing to (the hostname of the loadbalancer of the targeted app instance in) each cluster. Flagger runs on the master cluster and helps safely rolling-out a new K8s cluster by doing a canary release on the `ExternalName` service. - You want annotations and labels added to the service for integrating with things like external lbs(without extending Flagger to support customizing any aspect of the K8s service it manages **Design**: A canary release on a K8s service is almost the same as one on a K8s deployment. The only fundamental difference is that it operates only on a set of K8s services. For example, one may start by creating two Helm releases for `podinfo-blue` and `podinfo-green`, and a K8s service `podinfo`. The `podinfo` service should initially have the same `Spec` as that of `podinfo-blue`. On a new release, you update `podinfo-green`, then trigger Flagger by updating the K8s service `podinfo` so that it points to pods or `externalName` as declared in `podinfo-green`. Flagger does the rest. The end result is the traffic to `podinfo` is gradually and safely shifted from `podinfo-blue` to `podinfo-green`. **How it works**: Under the hood, Flagger maintains two K8s services, `podinfo-primary` and `podinfo-canary`. Compared to canaries on K8s deployments, it doesn't create the service named `podinfo`, as it is already provided by YOU. Once Flagger detects the change in the `podinfo` service, it updates the `podinfo-canary` service and the routes, then analyzes the canary. On successful analysis, it promotes the canary service to the `podinfo-primary` service. You expose the `podinfo` service via any L7 ingress solution or a service mesh so that the traffic is managed by Flagger for safe deployments. **Giving it a try**: To give it a try, create a `Canary` as usual, but its `targetRef` pointed to a K8s service: ``` apiVersion: flagger.app/v1alpha3 kind: Canary metadata: name: podinfo spec: provider: kubernetes targetRef: apiVersion: core/v1 kind: Service name: podinfo service: port: 9898 canaryAnalysis: # schedule interval (default 60s) interval: 10s # max number of failed checks before rollback threshold: 2 # number of checks to run before rollback iterations: 2 # Prometheus checks based on # http_request_duration_seconds histogram metrics: [] ``` Create a K8s service named `podinfo`, and update it. Now watch for the services `podinfo`, `podinfo-primary`, `podinfo-canary`. Flagger tracks `podinfo` service for changes. Upon any change, it reconciles `podinfo-primary` and `podinfo-canary` services. `podinfo-canary` always replicate the latest `podinfo`. In contract, `podinfo-primary` replicates the latest successful `podinfo-canary`. **Notes**: - For the canary cluster use-case, we would need to write a K8s operator to, e.g. for App Mesh, sync `ExternalName` services to AppMesh `VirtualNode`s. But that's another story!

mumoshu · 2019-11-27T00:08:55Z

@stefanprodan Done 👍

mumoshu · 2019-11-27T13:13:34Z

Update: I had some conversation with @stefanprodan today and we'll be moving everything from the newly added KubernetesServiceRouter to ServiceController. ServiceController's Initialize is currently no-op but that's where the primary/canary service creation logics should go.

The current design is that everything related to managing the targeted resource should go into the respective implementation of `canary.Controller`. In the service-canary use-case our target is Service so rather than splitting and scattering the logics over Controller and Router, everything should naturally go to `ServiceController`. Maybe at the time of writing the first implementation, I was confusing the target service vs the router.

stefanprodan

LGTM

Thanks @mumoshu I think we should create an issue for adding this feature to docs including a tutorial with crossover as ingress controller

mumoshu requested a review from stefanprodan as a code owner November 18, 2019 09:44

mumoshu force-pushed the svc-support branch from a24ca63 to e5f4bfc Compare November 19, 2019 23:50

mumoshu added a commit to mumoshu/crossover that referenced this pull request Nov 23, 2019

feat: Canary Release with Flagger

4ab0d8d

Uses fluxcd/flagger#372 to let Flagger manages traffic between two Kubernetes services each is managed by a Helm release.

mumoshu force-pushed the svc-support branch from 607fd71 to a2ea38f Compare November 25, 2019 19:39

mumoshu force-pushed the svc-support branch from f6e3c5c to 692402f Compare November 25, 2019 21:05

mumoshu force-pushed the svc-support branch from 692402f to 1ba595b Compare November 27, 2019 00:08

stefanprodan approved these changes Nov 27, 2019

View reviewed changes

stefanprodan merged commit 59d18de into fluxcd:master Nov 27, 2019

mumoshu mentioned this pull request Nov 30, 2019

Enable canary analysis on deployments and services behind Envoy #385

Closed

mumoshu deleted the svc-support branch November 30, 2019 03:29

stefanprodan changed the title ~~feat: Canary-release anything behind K8s service~~ feat: Add support for targeting Kubernetes Service Kind Jan 6, 2020

stefanprodan mentioned this pull request Jan 6, 2020

Release 0.21.0 #400

Merged

mumoshu mentioned this pull request Aug 31, 2021

Daemonset support argoproj/argo-rollouts#1457

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add support for targeting Kubernetes Service Kind #372

feat: Add support for targeting Kubernetes Service Kind #372

mumoshu commented Nov 18, 2019 •

edited

Loading

stefanprodan commented Nov 18, 2019

mumoshu commented Nov 19, 2019

stefanprodan commented Nov 24, 2019 •

edited

Loading

mumoshu commented Nov 25, 2019 •

edited

Loading

stefanprodan commented Nov 25, 2019 •

edited

Loading

mumoshu commented Nov 25, 2019 •

edited

Loading

stefanprodan commented Nov 25, 2019

stefanprodan commented Nov 26, 2019

mumoshu commented Nov 27, 2019

mumoshu commented Nov 27, 2019

stefanprodan left a comment •

edited

Loading

feat: Add support for targeting Kubernetes Service Kind #372

feat: Add support for targeting Kubernetes Service Kind #372

Conversation

mumoshu commented Nov 18, 2019 • edited Loading

stefanprodan commented Nov 18, 2019

mumoshu commented Nov 19, 2019

stefanprodan commented Nov 24, 2019 • edited Loading

mumoshu commented Nov 25, 2019 • edited Loading

stefanprodan commented Nov 25, 2019 • edited Loading

mumoshu commented Nov 25, 2019 • edited Loading

stefanprodan commented Nov 25, 2019

stefanprodan commented Nov 26, 2019

mumoshu commented Nov 27, 2019

mumoshu commented Nov 27, 2019

stefanprodan left a comment • edited Loading

Choose a reason for hiding this comment

mumoshu commented Nov 18, 2019 •

edited

Loading

stefanprodan commented Nov 24, 2019 •

edited

Loading

mumoshu commented Nov 25, 2019 •

edited

Loading

stefanprodan commented Nov 25, 2019 •

edited

Loading

mumoshu commented Nov 25, 2019 •

edited

Loading

stefanprodan left a comment •

edited

Loading