feat: Add support for targeting Kubernetes Service Kind #372

Merged
2 commits merged into fluxcd:master on Nov 27, 2019

Conversation

@mumoshu (Contributor) commented Nov 18, 2019

Resolves #371


This adds support for corev1.Service as the targetRef.kind, so that we can use Flagger just for canary analysis and traffic-shifting on existing, pre-created services. Flagger doesn't touch Deployments and HPAs in this mode.

This is useful for keeping full control over the resources backing the service being canary-released, including pods (behind a ClusterIP service) and external services (behind an ExternalName service).

The major use-cases I have in mind are:

  • Canary-release a K8s cluster. You create two clusters and a master cluster. In the master cluster, you create two ExternalName services, each pointing to (the hostname of the load balancer of the targeted app instance in) one of the clusters. Flagger runs on the master cluster and helps safely roll out a new K8s cluster by doing a canary release on the ExternalName service (see the sketch after this list).
  • You want annotations and labels added to the service for integrating with things like external load balancers (without extending Flagger to support customizing every aspect of the K8s service it manages).
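
For the first use-case, a minimal sketch of the two ExternalName services in the master cluster might look like this (the hostnames are placeholders, and the blue/green naming mirrors the design described below):

```yaml
# Hypothetical illustration: one ExternalName service per workload cluster,
# each resolving to that cluster's load balancer hostname (placeholders here).
apiVersion: v1
kind: Service
metadata:
  name: podinfo-blue
spec:
  type: ExternalName
  externalName: podinfo.blue-cluster.example.com
---
apiVersion: v1
kind: Service
metadata:
  name: podinfo-green
spec:
  type: ExternalName
  externalName: podinfo.green-cluster.example.com
```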

Design:

A canary release on a K8s service is almost the same as one on a K8s deployment. The only fundamental difference is that it operates only on a set of K8s services.

For example, one may start by creating two Helm releases for podinfo-blue and podinfo-green, and a K8s service podinfo. The podinfo service should initially have the same Spec as that of podinfo-blue.

On a new release, you update podinfo-green, then trigger Flagger by updating the K8s service podinfo so that it points to the pods or externalName declared in podinfo-green. Flagger does the rest. The end result is that traffic to podinfo is gradually and safely shifted from podinfo-blue to podinfo-green.
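
As a minimal sketch (the selector label is an assumption about how the two Helm releases label their pods), the user-managed podinfo service might initially look like this:

```yaml
# User-managed service; initially mirrors the podinfo-blue release.
apiVersion: v1
kind: Service
metadata:
  name: podinfo
spec:
  selector:
    app: podinfo-blue   # assumed pod label set by the podinfo-blue Helm release
  ports:
  - name: http
    port: 9898
    targetPort: 9898
```

Changing `app: podinfo-blue` to `app: podinfo-green` (or updating `spec.externalName` in the ExternalName variant) is the edit that triggers Flagger.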

How it works:

Under the hood, Flagger maintains two K8s services, podinfo-primary and podinfo-canary. Compared to canaries on K8s deployments, it doesn't create the service named podinfo, as it is already provided by YOU.

Once Flagger detects the change in the podinfo service, it updates the podinfo-canary service and the routes, then analyzes the canary. On successful analysis, it promotes the canary service to the podinfo-primary service. You expose the podinfo service via any L7 ingress solution or a service mesh so that the traffic is managed by Flagger for safe deployments.

Giving it a try:

To give it a try, create a Canary as usual, but with its targetRef pointing to a K8s service:

apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
spec:
  provider: kubernetes
  targetRef:
    apiVersion: core/v1
    kind: Service
    name: podinfo
  service:
    port: 9898
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 10s
    # max number of failed checks before rollback
    threshold: 2
    # number of checks to run before rollback
    iterations: 2
    # Prometheus checks based on
    # http_request_duration_seconds histogram
    metrics: []

Create a K8s service named podinfo and update it, then watch the podinfo, podinfo-primary, and podinfo-canary services.

Flagger tracks the podinfo service for changes. Upon any change, it reconciles the podinfo-primary and podinfo-canary services. podinfo-canary always replicates the latest podinfo. In contrast, podinfo-primary replicates the latest successfully analyzed podinfo-canary.

Notes:

  • For the canary cluster use-case, we would need to write a K8s operator that, for App Mesh for example, syncs ExternalName services to App Mesh VirtualNodes. But that's another story!

TODOs:

  • Add E2E test (e39f45b)
  • Add integration tests (e5f4bfc)
  • Write docs (where should they go? Do we need to include them in this PR, or a follow-up one?)
  • Please add anything else you think is needed

@stefanprodan (Member) commented:

@mumoshu this looks really good. I'm at KubeCon, so I'll do a proper review after I get back from the US. As for e2e testing, this should be implemented similarly to the blue/green one; see https://github.com/weaveworks/flagger/blob/master/test/e2e-kubernetes.sh and https://github.com/weaveworks/flagger/blob/master/test/e2e-kubernetes-tests.sh

@mumoshu (Contributor, Author) commented Nov 19, 2019

@stefanprodan Thanks!

Just FYI, I've added an E2E test suite for this feature in e39f45b.

After you're back from KubeCon, I'd appreciate any suggestions or requests on where to add docs and unit/integration tests, too.

Re tests, would another version of TestScheduler_Promotion, but with mocks for the service canary, help?

mumoshu added a commit to mumoshu/crossover that referenced this pull request Nov 23, 2019
Uses fluxcd/flagger#372 to let Flagger manage traffic between two Kubernetes services, each managed by a Helm release.
@stefanprodan (Member) commented Nov 24, 2019

@mumoshu I have a couple of questions/suggestions:

  1. I think it would be best to create an interface for Deployer and have one implementation for kind Deployment and another for kind Service; this way you don't have to do if isTargetedService(cd) {return "", nil, nil} to skip the logic, which currently makes the code very hard to read. Also, for the router package, I think it is best to create a separate router for service canaries instead of mixing the two use-cases.

  2. I have a hard time understanding how the metrics analysis works with services. Flagger has two built-in checks, success rate and latency; these metrics are implemented here. The metric checks look for the targetRef name. All mesh providers record these metrics using the deployment name, but with your implementation the targetRef doesn't match any metrics as far as I can tell. With a ClusterIP as the target, how would Flagger find the canary destination and query the metrics?

  3. Have you tested with Istio, Linkerd or App Mesh whether the rollback still works? Do users need to specify custom metrics, and if so, how can that be done, since the canary target changes with each deployment? My understanding is that one would switch from blue to green on the first canary and then from green to blue on the 2nd, making it impossible to maintain the same prom queries.

  4. How would this work in a GitOps pipeline? If the traffic shifts from blue to green on the first canary, on the 2nd one does the user need to update the blue chart in Git?

@mumoshu (Contributor, Author) commented Nov 25, 2019

  1. I think it would be best to create an interface for Deployer...
    ...
    Also, for the router package, I think it is best to create a separate router for service canaries instead of mixing the two use-cases.

That sounds good! I'll address those in the coming days.

With a ClusterIP as the target, how would Flagger find the canary destination and query the metrics?

Thanks for all the explanations! Honestly speaking, I only had a vague idea of how analysis on a service canary works in practice, but reading your explanation made it clearer... So I'd say the only way would be to use custom queries.

You would use an external operator, an umbrella Helm chart, or helmfile to compose the custom queries (or their templates).

This service canary feature is pretty low-level compared to the one for K8s deployments. That means various assumptions we've made so far in Flagger - like the mapping from targetRef.name to query labels - don't hold for service canaries.

Would that justify requiring users to write their own custom queries?

Have you tested with Istio, Linkerd or App Mesh whether the rollback still works?

Honestly, it won't work as you might have expected.

I did confirm that a rollback happens (while testing Flagger with my own crossover via the smi provider) when one of the webhook tests fails.

As you've pointed out earlier, it won't work with the out-of-the-box request success rate and duration queries, as there's no way to match those queries to the metrics for the canary service.

My understanding is that one would switch from blue to green on the first canary and then from green to blue on the 2nd

Correct.

making it impossible to maintain the same prom queries.

Absolutely. I was thinking that I would change the custom-queries part of the Canary object before changing the targeted service, so that Flagger knows the exact custom queries for the metrics from the next/canary service.

If that doesn't work, we need an additional source of metrics labels to query per canary. It would be something like annotating the Service with app.flagger.service/metrics-labels: kubernetes_cluster=blue,kubernetes_deployment=podinfo for a multi-cluster deployment, or app.flagger.service/metrics-labels: kubernetes_deployment=podinfo-blue for an in-cluster canary deployment.
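
For illustration, the proposed annotation (only an idea in this thread, not an implemented Flagger feature; the label values are examples) could sit on the user-managed Service like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: podinfo
  annotations:
    # Proposed in this thread, not implemented: extra labels for metric queries.
    app.flagger.service/metrics-labels: "kubernetes_deployment=podinfo-blue"
spec:
  selector:
    app: podinfo-blue   # assumed pod label for the in-cluster case
  ports:
  - port: 9898
```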

Those labels would then be used from within a query template that might look like:

canaryAnalysis:
    metrics:
    - name: "500s percentage"
      threshold: 3
      queryTemplate: |
        100 - sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="{{.MetricsLabels | index "kubernetes_deployment" }}",
                    status_code!="500",
                    direction="inbound"
                }[1m]
            )
        )
        /
        sum(
            rate(
                response_total{
                    namespace="test",
                    deployment="{{.MetricsLabels | index "kubernetes_deployment" }}",
                    direction="inbound"
                }[1m]
            )
        )
        * 100

How would this work in a GitOps pipeline? If the traffic shifts from blue to green on the first canary, on the 2nd one does the user need to update the blue chart in Git?

Exactly. To avoid that, I believe we may need another k8s operator, or a complex CI pipeline that generates manifests accordingly.

@stefanprodan (Member) commented Nov 25, 2019

@mumoshu I've extracted the Kubernetes operations into an interface and created a factory for it. Once merged, you'll have to pull the changes into this PR and add a ServiceController implementation. Then you can make the factory return a ServiceController when the kind is Service.

PS: this is now merged in master; please see the controller interface.
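
As a rough sketch of the idea (the names and signatures below are assumptions for illustration, not the actual interface merged into master):

```go
// Illustrative sketch only: Flagger's real controller interface differs in
// names and signatures; this just shows the shape of the idea.
package canary

import "context"

// Controller abstracts the operations Flagger performs on the targeted
// resource (a Deployment today, a Service with this PR).
type Controller interface {
	// Initialize bootstraps the primary/canary copies of the target.
	Initialize(ctx context.Context, namespace, name string) error
	// Promote copies the canary spec over the primary.
	Promote(ctx context.Context, namespace, name string) error
	// HasTargetChanged reports whether the user-managed target was updated.
	HasTargetChanged(ctx context.Context, namespace, name string) (bool, error)
}

// Factory hands out the controller matching the targetRef kind, so the
// scheduler no longer needs kind-specific branches like `if isTargetedService(cd)`.
type Factory struct {
	deploymentCtrl Controller // implementation for kind: Deployment
	serviceCtrl    Controller // implementation for kind: Service
}

// Controller returns the implementation for the given targetRef kind.
func (f *Factory) Controller(kind string) Controller {
	if kind == "Service" {
		return f.serviceCtrl
	}
	return f.deploymentCtrl
}
```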

@mumoshu (Contributor, Author) commented Nov 25, 2019

@stefanprodan Thanks! That was close; I've already done similar things on my end. I'll try to adapt to yours soon.

A few points that I wanted to confirm:

  • Regarding the router, I've split it into two, kubernetes_service and kubernetes_deployment. Does that look good to you?

  • To avoid recreating the kind of Controller that's needed in each place, I've introduced a facade called MultiController. Does that make sense to you?

  • I've extracted utility functions shared between DeploymentController and ServiceController, like syncCanaryStatus, setStatusFailedChecks, setStatusWeight, etc.

    Can I move DeploymentController-specific methods like this one to deployment_controller.go?

    func (c *DeploymentController) SetStatusWeight(cd *flaggerv1.Canary, val int) error {
    	return setStatusWeight(c.FlaggerClient, cd, val)
    }
    

    They're now mostly wrappers around the utility funcs and don't fit in a dedicated file like status.go. To be clear, the utility funcs will remain where they are; e.g. syncCanaryStatus stays in status.go.

@stefanprodan (Member) commented:

@mumoshu can you please join the Flagger Slack so we can chat there? Thanks

@stefanprodan (Member) commented:

@mumoshu sorry, I had to fix a bug and it conflicts with your branch. Can you please fix the conflict? The changes are very light: https://github.com/weaveworks/flagger/pull/380/files

@mumoshu (Contributor, Author) commented Nov 27, 2019

@stefanprodan Done 👍

@mumoshu (Contributor, Author) commented Nov 27, 2019

Update: I had a conversation with @stefanprodan today, and we'll be moving everything from the newly added KubernetesServiceRouter into ServiceController. ServiceController's Initialize is currently a no-op, but that's where the primary/canary service creation logic should go.

The current design is that everything related to managing the targeted resource goes into the respective implementation of `canary.Controller`. In the service-canary use-case our target is a Service, so rather than splitting and scattering the logic across Controller and Router, everything naturally goes into `ServiceController`. At the time of writing the first implementation, I was probably conflating the target service with the router.
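
A minimal sketch of what that Initialize could eventually do, assuming a context-aware client-go (v0.18+); the names are illustrative, not the actual Flagger code:

```go
package canary

// Illustrative sketch only: what ServiceController.Initialize could do once
// the primary/canary service creation moves here. Not the actual Flagger code.

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type ServiceController struct {
	KubeClient kubernetes.Interface
}

// Initialize creates <name>-primary and <name>-canary as copies of the
// user-managed service <name>, if they don't already exist.
func (c *ServiceController) Initialize(ctx context.Context, namespace, name string) error {
	src, err := c.KubeClient.CoreV1().Services(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("target service %s/%s not found: %w", namespace, name, err)
	}
	for _, suffix := range []string{"-primary", "-canary"} {
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{
				Name:        name + suffix,
				Namespace:   namespace,
				Labels:      src.Labels,
				Annotations: src.Annotations,
			},
			Spec: *src.Spec.DeepCopy(),
		}
		// ClusterIPs and NodePorts are allocated per service; never copy them.
		svc.Spec.ClusterIP = ""
		for i := range svc.Spec.Ports {
			svc.Spec.Ports[i].NodePort = 0
		}
		_, err := c.KubeClient.CoreV1().Services(namespace).Create(ctx, svc, metav1.CreateOptions{})
		if err != nil && !apierrors.IsAlreadyExists(err) {
			return err
		}
	}
	return nil
}
```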
@stefanprodan (Member) left a review:

LGTM

Thanks @mumoshu! I think we should create an issue for adding this feature to the docs, including a tutorial with crossover as the ingress controller.

@stefanprodan merged commit 59d18de into fluxcd:master on Nov 27, 2019
@mumoshu deleted the svc-support branch on November 30, 2019
@stefanprodan changed the title from "feat: Canary-release anything behind K8s service" to "feat: Add support for targeting Kubernetes Service Kind" on Jan 6, 2020
@stefanprodan mentioned this pull request on Jan 6, 2020
Successfully merging this pull request may close these issues: Cross-Cluster Canary Release (#371)