
Dynamic Custom Resource Deployments #376

Merged · 47 commits · Jan 21, 2019
Conversation

@timothysmith0609 (Contributor) commented Nov 12, 2018:

Motivation and Goals

required for #229
see #128

This PR represents the kubernetes-deploy side of our custom resource status implementation. It aims to provide a backwards-compatible means of meaningfully monitoring the rollouts of custom resources, with the following goals:

  • Hardcoded custom resources (e.g. redis.rb, cloudsql.rb) are deployed using the custom logic that exists for them.
  • Custom resources with no hardcoded logic and no annotation telling kubernetes-deploy to treat the CR as a generic custom resource using our Pass/Fail convention are deployed as before: we warn the user that we don't know how to monitor the rollout of such a resource and assume it has succeeded.
  • New case: A custom resource with no hardcoded logic but with an annotation declaring it implements our Pass/Fail status convention will be deployed using the generic CR watcher. This watcher observes the following states:
    • deploy_succeeded? == true if the Ready condition on the CR status is true
    • deploy_failed? == true if the Failed condition on the CR status is true
    • The deploy is progressing if both Ready and Failed are false
    • Custom timeouts for CRs can be placed on the owning CRD or, for more granularity, on specific instances of a CR spec
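
As a rough Ruby sketch of the watcher states described above (the class and method names are illustrative, not the merged implementation):

```ruby
# Illustrative sketch of the generic CR watcher's state checks.
# Class and method names are hypothetical, not the merged implementation.
class GenericCustomResource
  def initialize(instance_data)
    @instance_data = instance_data
  end

  # Succeeded when the Ready condition on the CR status is "True"
  def deploy_succeeded?
    condition_status("Ready") == "True"
  end

  # Failed when the Failed condition on the CR status is "True"
  def deploy_failed?
    condition_status("Failed") == "True"
  end

  # Still progressing while both Ready and Failed are false
  def deploy_in_progress?
    !deploy_succeeded? && !deploy_failed?
  end

  private

  def condition_status(type)
    conditions = @instance_data.dig("status", "conditions") || []
    condition = conditions.find { |c| c["type"] == type }
    condition ? condition["status"] : "False"
  end
end
```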

Implementation details

(OUTDATED) Changes to PREDEPLOY_SEQUENCE

Since we cannot know, a priori, the types of custom resources in a cluster, we must discover them dynamically during the discovery phase. Additionally, I argue that the common case is that custom resources must be deployed before other Kubernetes objects: for example, we need a working cloudsql before we can think of running a db-migrate pod. As it stands, we hardcode this priority in the PREDEPLOY_SEQUENCE constant in deploy_task.rb. To maintain the rough ordering of PREDEPLOY_SEQUENCE while also handling dynamic custom resource discovery, I have split its creation into two phases.

  • In the first phase, we hardcode the core Kubernetes resources that we know about before the deploy and place them in BASE_PREDEPLOY_SEQUENCE.
  • During discovery, we find all the CRDs and union them with BASE_PREDEPLOY_SEQUENCE. The result of this union becomes the PREDEPLOY_SEQUENCE constant.
  • @stefanmb has proposed using a dependency graph to model deployment priority, but unless an explicit case can be made against the proposed implementation I think we can defer that issue for now.
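
A minimal sketch of this two-phase construction (the base list contents and the `predeploy_sequence` helper are illustrative; the real code assigns a constant during discovery):

```ruby
# Phase 1: core resources we know about before the deploy, in rough
# priority order (contents here are illustrative).
BASE_PREDEPLOY_SEQUENCE = %w(
  ResourceQuota
  NetworkPolicy
  ConfigMap
  Secret
  ServiceAccount
  Pod
).freeze

# Phase 2: union the CRD kinds discovered at deploy time with the base
# list. Array#| preserves the base ordering and drops duplicates.
def predeploy_sequence(discovered_crd_kinds)
  BASE_PREDEPLOY_SEQUENCE | discovered_crd_kinds
end
```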

(OUTDATED) Caching discovered CRDs

As an implementation detail, I have opted to cache the value of ResourceDiscovery.crds in an instance variable. Linking together CRs with their parent CRDs requires passing around the list of CRDs in a number of places, and it doesn't seem risky to avoid the extra work of rediscovering them for every call.

TODO

  • Tests
  • Annotation scheme for CRs -> e.g. annotation for declaring monitorability? Do we want additional annotations to declare which conditions map to ready/failed or should we enforce Ready and Failed?

cc @Shopify/cloudx

@karanthukral (Contributor) left a comment:

The method seems valid to me. I like having the generic CR class instead of attempting to dynamically define classes

@timothysmith0609 (Author) commented Nov 13, 2018:

Example rollout.

  • Bucket exposes the monitor-rollout annotation and is processed as a generic CR (e.g. waiting for Ready status)
  • Redis is hardcoded, so we wait for its deployment
  • Memcached neither exposes the monitor-rollout annotation nor has a hardcoded implementation (it is removed for this example), so it defaults to the "assume successful deploy" behaviour

[terminal recording of the example rollout]

@@ -10,7 +10,7 @@ def initialize(namespace:, context:, logger:, namespace_tags:)
   end

   def crds(sync_mediator)
-    sync_mediator.get_all(CustomResourceDefinition.kind).map do |r_def|
+    @crds ||= sync_mediator.get_all(CustomResourceDefinition.kind).map do |r_def|
Contributor:

What happens if you deploy a CRD and a CR at same time?

Author:

That's one shortcoming of the current approach, unfortunately. There are a few solutions available to us here:

  • Add discovered CRD specs to the top of the priority list. This seems risk-free since they should have no external dependencies.
  • In the future, think about using a dependency graph to produce the priority list (e.g. a CloudSQL has something that says iDependOn: cloudsqls.stable.shopify.io). I'd say this is out of scope for this PR.

It's unclear to me whether this issue is handled right now anyway. I'm not saying we shouldn't fix it here, just a note.

Contributor:

> That's one shortcoming of the current approach, unfortunately.

What is the exact behaviour though? It still falls back on a KubernetesResource with the "dunno what to do here" message, right?

I'm on the fence about introducing an annotation-driven dependency graph, but FWIW we did have someone request customized sequencing for another reason earlier this week. Regardless, I agree it doesn't need to be handled in this PR.

@@ -2,6 +2,8 @@
 module KubernetesDeploy
   class CustomResourceDefinition < KubernetesResource
     TIMEOUT = 2.minutes
+    CHILD_CR_TIMEOUT_ANNOTATION = "kubernetes-deploy.shopify.io/cr-timeout-override"
Contributor:

You'll need to update the README.md

Contributor:

Why CHILD_ ?

Author:

Open to suggestions. I'm mainly trying to avoid the perennial issue of confusing CRDs with CRs by being more explicit

Contributor:

Why isn't this part of the other child rollout annotation?

Author:

Removing this annotation for now as I'm more concerned with the status aspect of this PR. If necessary, the base timeout-override annotation can be used in the meantime.

@timothysmith0609 timothysmith0609 changed the title -WIP- Dynamic Custom Resource Deployments Dynamic Custom Resource Deployments Nov 16, 2018
@timothysmith0609 timothysmith0609 force-pushed the dynamic_cr_capturing branch 2 times, most recently from f47b1aa to 3454521 Compare November 20, 2018 23:09
@timothysmith0609 (Author) commented Nov 20, 2018:

Configurable success/failure conditions

I've decided to add configurable success/failure statuses as part of this PR. Users can supply a JSON string in the kubernetes.shopify.io/cr-rollout-params annotation, which takes:

  • A JSON array of success conditions (JsonPath/expected value pairs)
  • A JSON array of failure conditions
  • ... we can add whatever other configurable fields we desire

For convenience, default values (which conform to our buddies' Status implementation) are used when these fields are missing.

Limitations

Currently, some resources, such as CloudSQL, reference other Kubernetes objects to discern their readiness (in CloudSQL's case, its deployment and service). In the new implementation, deploying resources are only able to observe themselves.

@definition["kind"]
end

def rollout_params
@KnVerey (Contributor) commented Dec 11, 2018:

This concept of rollout_params and its structure isn't super clear to me in my first read of this class, and it seems like we always use it one piece at a time. Is there a better abstraction we can come up with here? Maybe CRD has a private ChildConfiguration object that it exposes, and that has methods like error_message_path and failure_status_path and such? (I dunno--don't take that specific suggestion too seriously)



@KnVerey (Contributor) left a comment:

I think there's room for improvement in the way we model the rollout configuration data, but the overall class CustomResource approach seems good to me. When you start the tests, please make sure to include one that proves the correct classes get instantiated (KubernetesResource vs CustomResource vs hardcoded CR class).

params if validate_params(params)

rescue JSON::ParserError
raise FatalDeploymentError, "custom rollout params are not valid JSON: '#{rollout_params_string}'"
Contributor:

Should we actually fail the whole deploy on this, or is there a more graceful fallback behaviour we could adopt?

Author:

This is a value judgement we'll need to make. My opinion is that, if users are opting in to this feature, we should consider it critical and fail fast if something goes wrong.

On the other hand, aborting a deploy because of some bad JSON might cause too much friction.

@timothysmith0609 (Author) commented:

Example of custom query parameters:

'{
  "success_queries": [
    {
      "path": "$.status.conditions[?(@.type == \"Ready\")].status",
      "value": "success_value"
    },
    {
      "path": "$.spec.test_field",
      "value": "success_value"
    }
  ],
  "failure_queries": [
    {
      "path": "$.status.condition",
      "value": "failure_value",
      "custom_error_msg": "test custom error message"
    },
    {
      "path": "$.spec.test_field",
      "value": "failure_value",
      "error_msg_path": "$.spec.error_msg"
    }
  ]
}'
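
One plausible reading of how multiple queries combine (any matching failure query fails the rollout; every success query must match to succeed) can be sketched as follows. The combination semantics are an assumption, and plain Hash#dig stands in for a real JsonPath library, so filter expressions like the conditions query above aren't handled:

```ruby
# Hypothetical sketch of evaluating cr-rollout-params queries.
# Only simple dotted paths like "$.spec.test_field" are resolved here;
# a real implementation would use a JsonPath library.
def query_matches?(instance, query)
  keys = query["path"].sub(/\A\$\./, "").split(".")
  instance.dig(*keys).to_s == query["value"]
end

# Assumed combination semantics: any matching failure query marks the
# rollout failed; succeeding requires every success query to match.
def rollout_status(instance, params)
  return :failed if (params["failure_queries"] || []).any? { |q| query_matches?(instance, q) }
  return :succeeded if (params["success_queries"] || []).all? { |q| query_matches?(instance, q) }
  :in_progress
end
```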

@KnVerey (Contributor) commented Jan 4, 2019:

Two questions about the example:

  • How do multiple queries combine? Is it the same for success queries and failure queries?
  • Is the inclusion of custom_error_msg based on a real use case? This is not always true of course, but in cases where the CRD is the company's own, the error messages are already their own too.

@KnVerey (Contributor) left a comment:

Excellent work. Thanks for sticking with this long-running but very impactful feature!

@timothysmith0609 timothysmith0609 merged commit a591e52 into master Jan 21, 2019