Wait cluster responsive #639

shaunc · 2019-12-16T14:47:09Z

PR o'clock

Description

I experience intermittent problems applying config_map as cluster is not immediately ready to respond to kubectl after create. This PR adds a null resource that sleeps until the cluster responds to kubectl.

Checklist

Change added to CHANGELOG.md. All changes must be added, and breaking changes must be highlighted
CI tests are passing
README.md has been updated after any changes to variables and outputs. See https://github.com/terraform-aws-modules/terraform-aws-eks/#doc-generation

max-rocket-internet · 2019-12-16T14:51:46Z

This solves #621

But, in this PR we are running kubectl, which kind of defeats the point of using the Kubernetes Terraform provider in the first place 😅

Let's see what others think about this approach or if there's a way we can do it more elegantly?

shaunc · 2019-12-16T14:59:36Z

@max-rocket-internet ... well ... it undermines it, but doesn't defeat it entirely: after this completes, the provider abstraction is perfectly usable. However, I see your point. I wonder if there is a way of health-checking with curl... hmm... https://success.docker.com/article/how-to-poll-kubernetes-health-with-curl ?... also isn't so pretty....

barryib · 2019-12-17T21:46:43Z

FWIW, you can get the server version with curl -k https://<k8s-api-endpoint>/version?timeout=30s. I don't know exactly whether this is a blocking HTTP call or not, but it is worth trying to get this endpoint with the http provider 🤞 .

I also noticed that the /healthz is readable without creds. We can also test this endpoint.
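A minimal sketch of the unauthenticated probe described above, assuming a POSIX shell with curl installed. The function name `probe_healthz` and the 10-second per-request cap are illustrative choices, not something taken from this PR; `-k` skips certificate verification, as in the curl example above.

```shell
# Hypothetical helper: succeed only if the API server answers /healthz
# without a connection error or an HTTP status of 400 or above.
# $1 is the cluster endpoint URL (no trailing slash).
probe_healthz() {
  endpoint="$1"
  # -k: skip TLS certificate verification
  # -s: silent mode, -f: exit non-zero on HTTP errors (status >= 400)
  # --max-time 10: give up on a single request after 10 seconds (assumed value)
  curl -k -s -f --max-time 10 "${endpoint}/healthz" >/dev/null
}
```

It could be called as `probe_healthz "https://<k8s-api-endpoint>" && echo ready`; the exit status is the only signal, so it slots into a retry loop.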

max-rocket-internet · 2019-12-19T15:44:59Z

Cool ideas @barryib.

@shaunc could you test that?

shaunc · 2019-12-19T16:24:11Z

I'll check that. I'm also thinking that, instead of a separate null_resource, it would be best to include this as a provisioner on the cluster itself, so anything that depends on the cluster doesn't have to separately specify "cluster" and "cluster is ready". Of course, in theory this could slow up create time for configuration that only depends on knowing cluster attributes (e.g. configuring assume role w/ the OIDC endpoint)... but only a minute or two compared with ~>10min for cluster create anyway.

shaunc · 2019-12-19T17:17:20Z

Health check w/o credentials seems to work -- thanks @barryib! As in comment above, got rid of separate null_resource in favor of provisioner for the cluster itself. I found that terraform would figure out it could start creation of other things that depended on the cluster (that needed to use it) prematurely, and this PR didn't expose the null_resource anyway. In theory we could have a separate output for "cluster_ready" but IMHO not worth the complexity cost for the benefit.

shaunc · 2019-12-19T17:18:40Z

(rebased)

barryib · 2019-12-20T07:51:52Z

cluster.tf

@@ -31,6 +31,11 @@ resource "aws_eks_cluster" "this" {
    aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
    aws_cloudwatch_log_group.this
  ]
+  provisioner "local-exec" {


Any chance of avoiding local-exec? Doesn't the http provider work? We wanted to be OS-agnostic. Using the kubernetes provider was a huge improvement in that respect, and we would like to avoid going back to local-exec with curl if possible.

Ah ... missed that part of your comment; sorry! Hmm... I'd worry about three things: (1) that the http provider would sometimes time out, and it doesn't provide any control over timeouts; (2) when not combined with the cluster resource itself, terraform could schedule other resources that depend on the cluster before it's ready, and outside the module, it's a bit opaque to know what to do (though it could be documented); (3) the doc says they verify the chain of trust for https, and the certificate is likely to be unchained.

I guess (2) could be possibly solved by having (say) cluster_id depend on an expression involving some output from http resource. (3) just testing will confirm/deny :)....

As for (1) -- I'm a terraform newbie ... do you know anything about http data source timeout behavior? It seems like a risk to me, but I'll go ahead and try it if you guys want to go that way.

UPDATE: Unfortunately, test gives:

Error: Error during making a request: https://....sk1.us-west-2.eks.amazonaws.com/healthz
  on ../../cluster.tf line 40, in data "http" "wait_for_cluster":
  40: data "http" "wait_for_cluster" {

Doesn't say if failed because not up yet, or if failed because insecure.

Hmm... could the local-exec provisioner branch by OS and appropriately install curl? Or perhaps it should use a tool ("curl" or alternative appropriate for the OS) passed in in a variable (including no-op)? Your call -- I can give it a whirl if not too complex.

Most general would be to have variable for command be a template required to embed (say) ${URL}. I guess the shell varies by platform, but I guess this type of variable substitution should be broadly compatible? I'm not sure if the "until" loop is available "everywhere"? In any case, AFAIK the options are to code up a custom provider (or patch the http provider, to accept options for retry on error, and for insecure https), or to try to make local-exec be as cross-platform as possible.
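One way to picture the "command as a variable" idea is sketched below. Note that it swaps the textual `${URL}` substitution for an environment variable, which avoids templating the string at all. It assumes a POSIX `sh` is available; the names `run_wait_cmd` and `ENDPOINT` are hypothetical and chosen only for illustration.

```shell
# Hypothetical wrapper: run a caller-supplied wait command with the cluster
# URL exposed as $ENDPOINT, so the command itself can be swapped per platform.
# $1: command string (e.g. a curl or wget loop, or "true" for a no-op)
# $2: cluster endpoint URL
run_wait_cmd() {
  wait_cmd="$1"
  endpoint="$2"
  # The assignment prefix scopes ENDPOINT to this single invocation.
  ENDPOINT="$endpoint" sh -c "$wait_cmd"
}
```

For example, `run_wait_cmd 'until curl -k -s -f "$ENDPOINT/healthz" >/dev/null; do sleep 4; done' "https://<k8s-api-endpoint>"` would mirror the loop in this PR, while passing `true` as the command would make the wait a no-op.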

UPDATE ... If you are going to patch/create providers, then perhaps changing the kubernetes provider to wait until endpoint is up would also be an option.

"...then perhaps changing the kubernetes provider to wait until endpoint is up would also be an option."

hashicorp/terraform-provider-kubernetes#96

I'm not sure what we should do now but open to opinions. Either merge with a new local-exec or think of something else 🙂

I would think easiest way to fix something without upstream changes would be to use local-exec, but make it configurable for cross-platform support. Do you have an idea how you'd like to configure it? Do you want me to propose something? (Adding variables for configuration -- what program to use for "curl", for instance, or possibly whether to install curl appropriate for platform.)?

Together with this you could open ticket(s) for upstream change(s) -- for instance if the kubernetes or the http provider had a "wait for cluster"/"wait for url" configuration.

From what I'm reading on Stack Overflow, it sounds like since PowerShell 3.0, wget and curl are aliases for Invoke-WebRequest (see http://support.moonpoint.com/os/windows/PowerShell/wget-curl.php). That means we can probably use curl on both OSes, but I'm not sure about the arguments.

Any windows guy around ?

I think we should forget about Windows. It has proved to be too much effort to support both platforms natively in the past. Windows users can use Docker, I think.

dpiddockcmp · 2020-01-02T17:03:11Z

versions.tf

+    null       = ">= 2.1"
+    template   = ">= 2.1"
+    random     = ">= 2.1"
+    kubernetes = "~> 1.10"


We should only use >= here to avoid conflicts with users and other module definitions. 2.0 might be compatible with our resource. We can't know yet.

Our minimum compatible version is 1.6.2, the first correctly tagged with TF 0.12 support.

Suggested change

-    kubernetes = "~> 1.10"
+    kubernetes = ">= 1.6.2"

dpiddockcmp

Why is the OIDC PR merged into here? Makes this PR a little confusing.

dpiddockcmp · 2020-01-02T20:39:09Z

aws_auth.tf

@@ -41,9 +41,13 @@ data "template_file" "worker_role_arns" {
    )
  }
 }
+locals {
+  kubeconfig_filename = concat(local_file.kubeconfig.*.filename, [""])[0]


Is this local needed for anything?

Nope ... will remove

dpiddockcmp · 2020-01-02T20:39:45Z

aws_auth.tf


 resource "kubernetes_config_map" "aws_auth" {
-  count = var.create_eks && var.manage_aws_auth ? 1 : 0
+  depends_on = [aws_eks_cluster.this]


Is this explicit depends_on required? The provider doesn't work before the cluster's creation.

It depends on how the kubernetes provider is set up(?) I wasn't sure whether the dependence on the cluster could end up opaque to terraform, depending on how the provider is set up, so I thought an explicit dependency would be useful to avoid a bug in an edge case.

Very true. It's not necessarily implicit from other variables.

max-rocket-internet

You need to rebase to remove the IRSA stuff

shaunc · 2020-01-06T14:52:37Z

(rebased)

max-rocket-internet · 2020-01-06T16:08:25Z

cluster.tf

@@ -31,6 +31,11 @@ resource "aws_eks_cluster" "this" {
    aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
    aws_cloudwatch_log_group.this
  ]
+  provisioner "local-exec" {
+    command = <<EOT
+    until curl -k ${aws_eks_cluster.this[0].endpoint}/healthz >/dev/null; do sleep 4; done


Is this OK with no limit?

(I thought the overall terraform timeout would, by default, be how the user controls this.)

max-rocket-internet · 2020-01-06T16:08:54Z

@dpiddockcmp @barryib shall we merge this then?

barryib · 2020-01-06T17:04:32Z

LGTM.

dpiddockcmp · 2020-01-06T17:15:56Z

LGTM. Create timeout is 30 minutes. Can always add fail-fast logic later if people complain.
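As a rough illustration of what such fail-fast logic might look like, here is a bounded variant of the `until ...; do sleep 4; done` loop. It is only a sketch under stated assumptions: a POSIX shell, and a hypothetical function name `wait_with_limit` with attempt count and delay chosen by the caller. None of this is part of the merged change.

```shell
# Hypothetical bounded retry loop.
# $1: maximum number of attempts
# $2: seconds to sleep between attempts
# remaining arguments: the probe command to run
wait_with_limit() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Stop as soon as the probe succeeds.
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Skip the final sleep once the attempts are exhausted.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "TIMEOUT: no success after ${max_attempts} attempts" >&2
  return 1
}
```

A call such as `wait_with_limit 60 4 curl -k -s -f -o /dev/null "https://<k8s-api-endpoint>/healthz"` would give up after 60 failed attempts and return a non-zero status, which a local-exec provisioner would report as a failure instead of looping until the overall timeout.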

max-rocket-internet

Cool. Thanks @shaunc and all the reviewers 🙂

github-actions · 2022-11-19T02:27:08Z

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

max-rocket-internet added the help wanted label Dec 16, 2019

barryib reviewed Dec 20, 2019

barryib mentioned this pull request Dec 26, 2019

Wait for kubernetes API to be ready during EKS cluster creation hashicorp/terraform-provider-aws#11426

Closed

max-rocket-internet mentioned this pull request Jan 2, 2020

Error: Post https://xxxx.eks.amazonaws.com/api/v1/namespaces/kube-system/configmaps: dial tcp xxx:443: i/o timeout #621

Closed

dpiddockcmp reviewed Jan 2, 2020

barryib mentioned this pull request Jan 2, 2020

Add server certificate validation with Root CA and insecure option hashicorp/terraform-provider-http#29

Closed

max-rocket-internet removed the help wanted label Jan 3, 2020

max-rocket-internet suggested changes Jan 3, 2020

barryib mentioned this pull request Jan 4, 2020

Use Kubernetes provider #547

Closed

4 tasks

shaunc added 7 commits January 6, 2020 09:38

wait for cluster to respond before creating auth config map

32ab0ec

adds changelog entry

c0748bc

fixup tf format

6d011d3

fixup kubernetes required version

ed6a51e

fixup missing local for kubeconfig_filename

1b07322

combine wait for cluster into provisioner on cluster; change status check to /healthz on endpoint

67afd55

fix: make kubernetes provider version more permissive

5f6f7a3

shaunc force-pushed the wait-cluster-responsive branch from 3866fff to 5f6f7a3 January 6, 2020 14:48

max-rocket-internet reviewed Jan 6, 2020

max-rocket-internet mentioned this pull request Jan 6, 2020

Release 8.0.0 #662

Merged

max-rocket-internet approved these changes Jan 7, 2020

max-rocket-internet merged commit d79c8ab into terraform-aws-modules:master Jan 7, 2020

shaunc deleted the wait-cluster-responsive branch January 7, 2020 12:26

barryib mentioned this pull request Jan 13, 2020

Not working on Windows #680

Closed

sanjeevgiri mentioned this pull request Jan 20, 2020

Configurable local exec command for waiting until cluster is healthy #701

Merged

3 tasks

shaunc restored the wait-cluster-responsive branch February 11, 2020 22:48

avoidik mentioned this pull request Mar 14, 2020

feat: Add interpreter option to wait_for_cluster_cmd #795

Merged

3 tasks

github-actions bot locked as resolved and limited conversation to collaborators Nov 19, 2022
