Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait cluster responsive #639

Conversation

shaunc
Copy link
Contributor

@shaunc shaunc commented Dec 16, 2019

PR o'clock

Description

I experience intermittent problems applying config_map as cluster is not immediately ready to respond to kubectl after create. This PR adds a null resource that sleeps until the cluster responds to kubectl.

Checklist

@max-rocket-internet
Copy link
Contributor

This solves #621

But, in this PR we are running kubectl, which kind of defeats the point of using the Kubernetes Terraform provider in the first place 😅

Let's see what others think about this approach or if there's a way we can do it more elegantly?

@shaunc
Copy link
Contributor Author

shaunc commented Dec 16, 2019

@max-rocket-internet ... well ... it undermines it, but doesn't defeat it entirely: after this completes, the provider abstraction is perfectly usable. However, I see your point. I wonder if there is a way of health-checking with curl... hmm... https://success.docker.com/article/how-to-poll-kubernetes-health-with-curl ?... also isn't so pretty....

@barryib
Copy link
Member

barryib commented Dec 17, 2019

FWIW, you can get the server version with curl -k https://<k8s-api-endpoint>/version?timeout=30s. I don't know exactly if this is a blocking http call or not, but it worth to try to get this endpoint with the http provider 🤞 .

I also noticed that the /healthz is readable without creds. We can also test this endpoint.

@max-rocket-internet
Copy link
Contributor

Cool ideas @barryib.

@shaunc could you test that?

@shaunc
Copy link
Contributor Author

shaunc commented Dec 19, 2019

I'll check that. I'm also thinking that, instead of a separate null_resource, it would be best to include this as a provisioner on the cluster itself, so anything that depends on the cluster doesn't have to separately specify "cluster" and "cluster is ready". Of course, in theory this could slow up create time for configuration that only depends on knowing cluster attributes (e.g. configuring assume role w/ the OIDC endpoint)... but only a minute or two compared with ~>10min for cluster create anyway.

@shaunc
Copy link
Contributor Author

shaunc commented Dec 19, 2019

Health check w/o credentials seems to work -- thanks @barryib! As in comment above, got rid of separate null_resource in favor of provisioner for the cluster itself. I found that terraform would figure out it could start creation of other things that depended on the cluster (that needed to use it) prematurely, and this PR didn't expose the null_resource anyway. In theory we could have a separate output for "cluster_ready" but IMHO not worth the complexity cost for the benefit.

@shaunc
Copy link
Contributor Author

shaunc commented Dec 19, 2019

(rebased)

@@ -31,6 +31,11 @@ resource "aws_eks_cluster" "this" {
aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
aws_cloudwatch_log_group.this
]
provisioner "local-exec" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance to avoid local-exec ? The http provider doesn't work ? We wanted to be agnostic from the OS. Using the kubernetes provider was a huge improvement in that way and would like to avoid going back to local-exec with curl if possible.

Copy link
Contributor Author

@shaunc shaunc Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ... missed that part of your comment; sorry! Hmm... I'd worry about three things: (1) That the http provider would sometimes timeout, and it doesn't provide any control over timeouts, (2) when not combined with the cluster resource itself, terraform could schedule other resources that depend on the cluster before its ready, and outside the module, its a bit opaque to know what to do (though it could be documented). (3) Doc says they verify chain of trust for https, and certificate is likely to be unchained.

I guess (2) could be possibly solved by having (say) cluster_id depend on an expression involving some output from http resource. (3) just testing will confirm/deny :)....

vs (1) -- I'm a terraform newbie ... do you know anything about http data source timeout behavior? Would seem to be a risk to me, but I'll go ahead and try it if you guys want to go that way.

UPDATE: Unfortunately, test gives:

Error: Error during making a request: https://....sk1.us-west-2.eks.amazonaws.com/healthz

  on ../../cluster.tf line 40, in data "http" "wait_for_cluster":
  40: data "http" "wait_for_cluster" {

Doesn't say if failed because not up yet, or if failed because insecure.

Hmm... could the local-exec provisioner branch by OS and appropriately install curl? Or perhaps it should use a tool ("curl" or alternative appropriate for the OS) passed in in a variable (including no-op)? Your call -- I can give it a whirl if not too complex.

Copy link
Contributor Author

@shaunc shaunc Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most general would be to have variable for command be a template required to embed (say) ${URL}. I guess the shell varies by platform, but I guess this type of variable substitution should be broadly compatible? I'm not sure if the "until" loop is available "everywhere"? In any case, AFAIK the options are to code up a custom provider (or patch the http provider, to accept options for retry on error, and for insecure https), or to try to make local-exec be as cross-platform as possible.

UPDATE ... If you are going to patch/create providers, then perhaps changing the kubernetes provider to wait until endpoint is up would also be an option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hen perhaps changing the kubernetes provider to wait until endpoint is up would also be an option.

hashicorp/terraform-provider-kubernetes#96

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what we should do now but open to opinions. Either merge with a new local-exec or think of something else 🙂

Copy link
Contributor Author

@shaunc shaunc Dec 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think easiest way to fix something without upstream changes would be to use local-exec, but make it configurable for cross-platform support. Do you have an idea how you'd like to configure it? Do you want me to propose something? (Adding variables for configuration -- what program to use for "curl", for instance, or possibly whether to install curl appropriate for platform.)?

Together with this you could open ticket(s) for upstream change(s) -- for instance if the kubernetes or the http provider had a "wait for cluster"/"wait for url" configuration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I'm reading from stackoverflow, it sounds like since powershell 3.O, wget and curl are aliases for Invoke-WebRequest http://support.moonpoint.com/os/windows/PowerShell/wget-curl.php. It means, we can probably use curl in both OS, but i'm not sure for arguments.

Any windows guy around ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we forget about Windows. It's proved to be too much effort to support both platforms natively in the past. Windows users can use Docker I think.

versions.tf Outdated
null = ">= 2.1"
template = ">= 2.1"
random = ">= 2.1"
kubernetes = "~> 1.10"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only use => here to avoid conflicts with users and other module definitions. 2.0 might be compatible with our resource. We can't know yet.

Our minimum compatible version is 1.6.2, the first correctly tagged with TF 0.12 support.

Suggested change
kubernetes = "~> 1.10"
kubernetes = "=> 1.6.2"

Copy link
Contributor

@dpiddockcmp dpiddockcmp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the OIDC PR merged into here? Makes this PR a little confusing.

aws_auth.tf Outdated
@@ -41,9 +41,13 @@ data "template_file" "worker_role_arns" {
)
}
}
locals {
kubeconfig_filename = concat(local_file.kubeconfig.*.filename, [""])[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this local needed for anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope ... will remove


resource "kubernetes_config_map" "aws_auth" {
count = var.create_eks && var.manage_aws_auth ? 1 : 0
depends_on = [aws_eks_cluster.this]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this explicit depends_on required? The provider doesn't work before the cluster's creation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on how the kubernetes provider is set up(?) I wasn't sure it wasn't impossible make dependence on cluster opaque to terraform, depending on how the provider is setup, so thought explicit dependency would be useful to avoid bug in edge case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very true. It's not necessarily implicit from other variables.

Copy link
Contributor

@max-rocket-internet max-rocket-internet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to rebase to remove the IRSA stuff

@barryib barryib mentioned this pull request Jan 4, 2020
4 tasks
@shaunc shaunc force-pushed the wait-cluster-responsive branch from 3866fff to 5f6f7a3 Compare January 6, 2020 14:48
@shaunc
Copy link
Contributor Author

shaunc commented Jan 6, 2020

(rebased)

@@ -31,6 +31,11 @@ resource "aws_eks_cluster" "this" {
aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy,
aws_cloudwatch_log_group.this
]
provisioner "local-exec" {
command = <<EOT
until curl -k ${aws_eks_cluster.this[0].endpoint}/healthz >/dev/null; do sleep 4; done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This with no limit is OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I thought that overall terraform timeout would be by default how user would control)

@max-rocket-internet
Copy link
Contributor

@dpiddockcmp @barryib shall we merge this then?

@max-rocket-internet max-rocket-internet mentioned this pull request Jan 6, 2020
@barryib
Copy link
Member

barryib commented Jan 6, 2020

LGTM.

@dpiddockcmp
Copy link
Contributor

LGTM. Create timeout is 30 minutes. Can always add fail-fast logic later if people complain.

Copy link
Contributor

@max-rocket-internet max-rocket-internet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Thanks @shaunc and all the reviewers 🙂

@max-rocket-internet max-rocket-internet merged commit d79c8ab into terraform-aws-modules:master Jan 7, 2020
@shaunc shaunc deleted the wait-cluster-responsive branch January 7, 2020 12:26
@barryib barryib mentioned this pull request Jan 13, 2020
@shaunc shaunc restored the wait-cluster-responsive branch February 11, 2020 22:48
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants