-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wait cluster responsive #639
Wait cluster responsive #639
Conversation
This solves #621 But, in this PR we are running Let's see what others think about this approach or if there's a way we can do it more elegantly? |
@max-rocket-internet ... well ... it undermines it, but doesn't defeat it entirely: after this completes, the provider abstraction is perfectly usable. However, I see your point. I wonder if there is a way of health-checking with curl... hmm... https://success.docker.com/article/how-to-poll-kubernetes-health-with-curl ?... also isn't so pretty.... |
FWIW, you can get the server version with I also noticed that the |
I'll check that. I'm also thinking that, instead of a separate null_resource, it would be best to include this as a provisioner on the cluster itself, so anything that depends on the cluster doesn't have to separately specify "cluster" and "cluster is ready". Of course, in theory this could slow up create time for configuration that only depends on knowing cluster attributes (e.g. configuring assume role w/ the OIDC endpoint)... but only a minute or two compared with ~>10min for cluster create anyway. |
Health check w/o credentials seems to work -- thanks @barryib! As in comment above, got rid of separate null_resource in favor of provisioner for the cluster itself. I found that terraform would figure out it could start creation of other things that depended on the cluster (that needed to use it) prematurely, and this PR didn't expose the null_resource anyway. In theory we could have a separate output for "cluster_ready" but IMHO not worth the complexity cost for the benefit. |
(rebased) |
@@ -31,6 +31,11 @@ resource "aws_eks_cluster" "this" { | |||
aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy, | |||
aws_cloudwatch_log_group.this | |||
] | |||
provisioner "local-exec" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance to avoid local-exec
? The http provider doesn't work ? We wanted to be agnostic from the OS. Using the kubernetes provider was a huge improvement in that way and would like to avoid going back to local-exec with curl if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ... missed that part of your comment; sorry! Hmm... I'd worry about three things: (1) That the http provider would sometimes timeout, and it doesn't provide any control over timeouts, (2) when not combined with the cluster resource itself, terraform could schedule other resources that depend on the cluster before its ready, and outside the module, its a bit opaque to know what to do (though it could be documented). (3) Doc says they verify chain of trust for https, and certificate is likely to be unchained.
I guess (2) could be possibly solved by having (say) cluster_id depend on an expression involving some output from http resource. (3) just testing will confirm/deny :)....
vs (1) -- I'm a terraform newbie ... do you know anything about http data source timeout behavior? Would seem to be a risk to me, but I'll go ahead and try it if you guys want to go that way.
UPDATE: Unfortunately, test gives:
Error: Error during making a request: https://....sk1.us-west-2.eks.amazonaws.com/healthz
on ../../cluster.tf line 40, in data "http" "wait_for_cluster":
40: data "http" "wait_for_cluster" {
Doesn't say if failed because not up yet, or if failed because insecure.
Hmm... could the local-exec provisioner branch by OS and appropriately install curl? Or perhaps it should use a tool ("curl" or alternative appropriate for the OS) passed in in a variable (including no-op)? Your call -- I can give it a whirl if not too complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most general would be to have variable for command be a template required to embed (say) ${URL}. I guess the shell varies by platform, but I guess this type of variable substitution should be broadly compatible? I'm not sure if the "until" loop is available "everywhere"? In any case, AFAIK the options are to code up a custom provider (or patch the http provider, to accept options for retry on error, and for insecure https), or to try to make local-exec be as cross-platform as possible.
UPDATE ... If you are going to patch/create providers, then perhaps changing the kubernetes provider to wait until endpoint is up would also be an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hen perhaps changing the kubernetes provider to wait until endpoint is up would also be an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what we should do now but open to opinions. Either merge with a new local-exec or think of something else 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think easiest way to fix something without upstream changes would be to use local-exec, but make it configurable for cross-platform support. Do you have an idea how you'd like to configure it? Do you want me to propose something? (Adding variables for configuration -- what program to use for "curl", for instance, or possibly whether to install curl appropriate for platform.)?
Together with this you could open ticket(s) for upstream change(s) -- for instance if the kubernetes or the http provider had a "wait for cluster"/"wait for url" configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I'm reading from stackoverflow, it sounds like since powershell 3.O, wget
and curl
are aliases for Invoke-WebRequest
http://support.moonpoint.com/os/windows/PowerShell/wget-curl.php. It means, we can probably use curl
in both OS, but i'm not sure for arguments.
Any windows guy around ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we forget about Windows. It's proved to be too much effort to support both platforms natively in the past. Windows users can use Docker I think.
versions.tf
Outdated
null = ">= 2.1" | ||
template = ">= 2.1" | ||
random = ">= 2.1" | ||
kubernetes = "~> 1.10" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only use =>
here to avoid conflicts with users and other module definitions. 2.0 might be compatible with our resource. We can't know yet.
Our minimum compatible version is 1.6.2, the first correctly tagged with TF 0.12 support.
kubernetes = "~> 1.10" | |
kubernetes = "=> 1.6.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the OIDC PR merged into here? Makes this PR a little confusing.
aws_auth.tf
Outdated
@@ -41,9 +41,13 @@ data "template_file" "worker_role_arns" { | |||
) | |||
} | |||
} | |||
locals { | |||
kubeconfig_filename = concat(local_file.kubeconfig.*.filename, [""])[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this local needed for anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope ... will remove
|
||
resource "kubernetes_config_map" "aws_auth" { | ||
count = var.create_eks && var.manage_aws_auth ? 1 : 0 | ||
depends_on = [aws_eks_cluster.this] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this explicit depends_on
required? The provider doesn't work before the cluster's creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on how the kubernetes provider is set up(?) I wasn't sure it wasn't impossible make dependence on cluster opaque to terraform, depending on how the provider is setup, so thought explicit dependency would be useful to avoid bug in edge case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very true. It's not necessarily implicit from other variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to rebase to remove the IRSA stuff
…heck to /healthz on endpoint
3866fff
to
5f6f7a3
Compare
(rebased) |
@@ -31,6 +31,11 @@ resource "aws_eks_cluster" "this" { | |||
aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy, | |||
aws_cloudwatch_log_group.this | |||
] | |||
provisioner "local-exec" { | |||
command = <<EOT | |||
until curl -k ${aws_eks_cluster.this[0].endpoint}/healthz >/dev/null; do sleep 4; done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This with no limit is OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I thought that overall terraform timeout would be by default how user would control)
@dpiddockcmp @barryib shall we merge this then? |
LGTM. |
LGTM. Create timeout is 30 minutes. Can always add fail-fast logic later if people complain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Thanks @shaunc and all the reviewers 🙂
I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
PR o'clock
Description
I experience intermittent problems applying config_map as cluster is not immediately ready to respond to kubectl after create. This PR adds a null resource that sleeps until the cluster responds to kubectl.
Checklist