Add instance-selector cmd to toolbox #9478

Merged: 11 commits merged into kubernetes:master on Aug 11, 2020


bwagner5
Contributor

@bwagner5 bwagner5 commented Jul 2, 2020

Issue:
#8804

This PR adds kops toolbox instance-selector, which creates kops instance groups based on resource criteria of AWS instance types. It has built-in best practices for generating heterogeneous spot autoscaling groups with the capacity-optimized allocation strategy. The instance-selector can also generate on-demand instance groups, which are still heterogeneous but use a lowest-price allocation strategy.

This command is implemented using the github.com/aws/amazon-ec2-instance-selector Go package.
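To make "resource criteria" concrete, here is a minimal, self-contained Go sketch of the filtering idea. It does not call the amazon-ec2-instance-selector package: the instanceType and criteria types, the selectInstanceTypes function, and the small in-memory catalog are all hypothetical stand-ins, and memory is assumed to be expressed in MiB.

```go
package main

import (
	"fmt"
	"sort"
)

// instanceType is a hypothetical, simplified record. The real selector
// obtains this kind of data from the EC2 API.
type instanceType struct {
	Name      string
	VCPUs     int
	MemoryMiB int
}

// criteria is a hypothetical stand-in for flags such as --vcpus,
// --memory-min and --memory-max.
type criteria struct {
	VCPUs        int // 0 means "any"
	MemoryMinMiB int
	MemoryMaxMiB int // 0 means "no upper bound"
}

// selectInstanceTypes returns the sorted names of all catalog entries
// that satisfy every criterion in c.
func selectInstanceTypes(catalog []instanceType, c criteria) []string {
	var names []string
	for _, it := range catalog {
		if c.VCPUs != 0 && it.VCPUs != c.VCPUs {
			continue
		}
		if it.MemoryMiB < c.MemoryMinMiB {
			continue
		}
		if c.MemoryMaxMiB != 0 && it.MemoryMiB > c.MemoryMaxMiB {
			continue
		}
		names = append(names, it.Name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Illustrative catalog only, not a complete list of instance types.
	catalog := []instanceType{
		{"c5.xlarge", 4, 8192},
		{"c4.xlarge", 4, 7680},
		{"m5.xlarge", 4, 16384},
		{"c5.large", 2, 4096},
	}
	c := criteria{VCPUs: 4, MemoryMinMiB: 6000, MemoryMaxMiB: 9000}
	fmt.Println(selectInstanceTypes(catalog, c))
}
```

Every entry that passes all the filters ends up in the instance group's list of instance types, which is why the generated groups are heterogeneous.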


Testing:

The cluster had already been created on AWS.

$ make kops
$ .build/local/kops toolbox instance-selector --flexible --usage-class spot --instance-group-name spot-test-group
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-07-02T14:32:38Z"
  labels:
    kops.k8s.io/cluster: guac.kops.sh
  name: spot-test-group
spec:
  cloudLabels:
    kops.k8s.io/instance-selector: "1"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200528
  machineType: c4.xlarge
  maxSize: 15
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - c4.xlarge
    - c5.xlarge
    - c5a.xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: spot-test-group
  role: Node
  subnets:
  - us-east-2a
  - us-east-2b
  - us-east-2c

$ .build/local/kops toolbox instance-selector --vcpus 4 --memory-min 6000 --memory-max 9000 --instance-group-name od-group
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-07-02T14:34:27Z"
  labels:
    kops.k8s.io/cluster: guac.kops.sh
  name: od-group
spec:
  cloudLabels:
    kops.k8s.io/instance-selector: "1"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200528
  machineType: c4.xlarge
  maxSize: 15
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - c4.xlarge
    - c5.xlarge
    - c5a.xlarge
    - c5d.xlarge
  nodeLabels:
    kops.k8s.io/instancegroup: od-group
  role: Node
  subnets:
  - us-east-2a
  - us-east-2b
  - us-east-2c
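The two outputs above differ mainly in the spot-specific policy fields. A minimal Go sketch of that difference follows, under stated assumptions: buildPolicy is a hypothetical helper, the struct mirrors only the fields visible in the YAML (it is not the kops API type), and "on-demand" as a usage-class value is an assumption, since only "spot" appears in the commands above.

```go
package main

import "fmt"

// mixedInstancesPolicy mirrors only the fields visible in the YAML
// output above. Pointers distinguish "unset" from an explicit zero.
type mixedInstancesPolicy struct {
	Instances              []string
	OnDemandBase           *int64
	OnDemandAboveBase      *int64
	SpotAllocationStrategy *string
}

// buildPolicy (hypothetical) fills in the spot-specific fields only
// when the usage class is "spot".
func buildPolicy(usageClass string, instances []string) (mixedInstancesPolicy, error) {
	p := mixedInstancesPolicy{Instances: instances}
	switch usageClass {
	case "spot":
		base, aboveBase := int64(0), int64(0)
		strategy := "capacity-optimized"
		p.OnDemandBase = &base
		p.OnDemandAboveBase = &aboveBase
		p.SpotAllocationStrategy = &strategy
	case "on-demand":
		// Spot-specific fields stay unset, as in the od-group output.
	default:
		return mixedInstancesPolicy{}, fmt.Errorf("unknown usage class %q", usageClass)
	}
	return p, nil
}

func main() {
	spot, err := buildPolicy("spot", []string{"c4.xlarge", "c5.xlarge", "c5a.xlarge"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(spot.Instances, *spot.SpotAllocationStrategy)

	od, err := buildPolicy("on-demand", []string{"c4.xlarge", "c5.xlarge"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(od.Instances, od.SpotAllocationStrategy == nil)
}
```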

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2020
@k8s-ci-robot
Contributor

Hi @bwagner5. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 2, 2020
@k8s-ci-robot k8s-ci-robot requested review from hakman and mikesplain July 2, 2020 14:35
@hakman
Member

hakman commented Jul 2, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2020
@hakman
Member

hakman commented Jul 2, 2020

@bwagner5 Is there any reason for not using aws-sdk-go v1.31.15?

@bwagner5
Contributor Author

bwagner5 commented Jul 2, 2020

@hakman just an oversight, updated.

@hakman
Member

hakman commented Jul 2, 2020

Thanks @bwagner5.

@bwagner5
Contributor Author

bwagner5 commented Jul 2, 2020

/assign @geojaz

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2020
@bwagner5 bwagner5 force-pushed the feat-instance-selector branch from 296eab3 to 308c5c7 Compare July 7, 2020 17:34
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2020
@bwagner5 bwagner5 force-pushed the feat-instance-selector branch from 308c5c7 to 8bc13bc Compare July 7, 2020 17:41
@hakman
Member

hakman commented Jul 7, 2020

@bwagner5 The current aws-sdk-go version is 1.32.13: 5107e1d

@bwagner5
Contributor Author

bwagner5 commented Jul 7, 2020

@hakman my bad, the commit message is just wrong. I'll update it.

@bwagner5 bwagner5 force-pushed the feat-instance-selector branch 3 times, most recently from cca320e to 832f9b3 Compare July 7, 2020 18:49
@hakman
Member

hakman commented Jul 7, 2020

I think there is still some weirdness in the vendor-related commits. For example, defaults.go: 832f9b3.

Maybe squashing the vendor and gomod commits into one would fix all this.

@bwagner5 bwagner5 force-pushed the feat-instance-selector branch from 832f9b3 to 98ba1d4 Compare July 7, 2020 20:18
@hakman
Member

hakman commented Jul 8, 2020

I don't think it worked. There are still some commits reverting changes from previous ones. It would probably be best to restage them.

@bwagner5 bwagner5 force-pushed the feat-instance-selector branch 2 times, most recently from 5cced34 to fa65aea Compare July 8, 2020 16:54
@bwagner5
Contributor Author

bwagner5 commented Jul 8, 2020

Apologies for all the noise :) I rebased to clean up the PR into 2 nicer commits. It seems there was also a transient failure in the 1 year cert issue test. All tests passed locally, and they passed after I reran the build on gh-actions.

@bwagner5 bwagner5 force-pushed the feat-instance-selector branch 4 times, most recently from 981dfc2 to 443ed67 Compare July 31, 2020 11:54
@bwagner5 bwagner5 requested a review from hakman July 31, 2020 17:57
@hakman
Member

hakman commented Aug 4, 2020

Maybe I am doing something wrong, but it's not working for me anymore:

% .build/local/kops toolbox instance-selector test --dry-run --memory-min "4 GiB" --memory-max "16 GiB"

Invalid input for --memory-max. A valid example is 16gb. Processing failed.

% .build/local/kops toolbox instance-selector test --dry-run --memory-min 4gb --memory-max 16gb

Invalid input for --memory-min. A valid example is 16gb. Processing failed.

% .build/local/kops toolbox instance-selector test --dry-run --memory-min 4gb --memory-max 16gb
panic: interface conversion: interface {} is *string, not *bytequantity.ByteQuantity

goroutine 1 [running]:
github.com/aws/amazon-ec2-instance-selector/v2/pkg/cli.(*CommandLineInterface).ByteQuantityMinMaxRangeFlagOnFlagSet.func1(0x0, 0x0, 0x4a815ae, 0x6)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/vendor/github.com/aws/amazon-ec2-instance-selector/v2/pkg/cli/flags.go:197 +0x585
github.com/aws/amazon-ec2-instance-selector/v2/pkg/cli.(*CommandLineInterface).ValidateFlags(0xc000178dc0, 0x0, 0x0)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/vendor/github.com/aws/amazon-ec2-instance-selector/v2/pkg/cli/cli.go:115 +0xfa
main.processAndValidateFlags(0xc000178dc0, 0x1, 0xc0005d76f0, 0x2)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/cmd/kops/toolbox_instance_selector.go:334 +0x74
main.RunToolboxInstanceSelector(0x4ff5f60, 0xc000054080, 0xc0005b7ae0, 0xc00010e960, 0x1, 0x6, 0x4f9d400, 0xc000010018, 0xc000178dc0, 0x0, ...)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/cmd/kops/toolbox_instance_selector.go:193 +0x85
main.NewCmdToolboxInstanceSelector.func2(0xc000306580, 0xc00010e960, 0x1, 0x6)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/cmd/kops/toolbox_instance_selector.go:125 +0x85
github.com/spf13/cobra.(*Command).execute(0xc000306580, 0xc00010e8a0, 0x6, 0x6, 0xc000306580, 0xc00010e8a0)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/vendor/github.com/spf13/cobra/command.go:842 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x6bb4a60, 0x6bff7f0, 0x0, 0x0)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/vendor/github.com/spf13/cobra/command.go:943 +0x336
github.com/spf13/cobra.(*Command).Execute(...)
	/Users/hakman/Documents/git/go/src/k8s.io/kops/vendor/github.com/spf13/cobra/command.go:883
main.Execute()
	/Users/hakman/Documents/git/go/src/k8s.io/kops/cmd/kops/root.go:96 +0x8f
main.main()
	/Users/hakman/Documents/git/go/src/k8s.io/kops/cmd/kops/main.go:25 +0x25

I am more of a fan of the "16gb" notation than of "16 GiB".
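For readers unfamiliar with this class of panic: a single-value type assertion on an interface{} panics when the stored value has a different dynamic type, which is what the "interface {} is *string, not *bytequantity.ByteQuantity" message reports. The Go sketch below shows one way such a mismatch can arise and how the comma-ok form turns it into an error. It is not the library's code: byteQuantity, flagAsByteQuantity, and the flags map are hypothetical stand-ins.

```go
package main

import "fmt"

// byteQuantity is a hypothetical stand-in for the library's
// bytequantity.ByteQuantity type.
type byteQuantity struct{ MiB uint64 }

// flagAsByteQuantity uses the comma-ok type assertion, so a value of
// the wrong dynamic type yields an error instead of a panic.
func flagAsByteQuantity(flags map[string]interface{}, name string) (*byteQuantity, error) {
	v, found := flags[name]
	if !found {
		return nil, fmt.Errorf("flag %q is not registered", name)
	}
	q, ok := v.(*byteQuantity)
	if !ok {
		return nil, fmt.Errorf("flag %q holds %T, want *byteQuantity", name, v)
	}
	return q, nil
}

func main() {
	raw := "16gb"
	flags := map[string]interface{}{
		"memory-max": &raw,                     // stored as *string: the mismatch
		"memory-min": &byteQuantity{MiB: 4096}, // stored with the expected type
	}

	if _, err := flagAsByteQuantity(flags, "memory-max"); err != nil {
		fmt.Println("error:", err)
	}
	if q, err := flagAsByteQuantity(flags, "memory-min"); err == nil {
		fmt.Println("memory-min MiB:", q.MiB)
	}
	// By contrast, the single-value form flags["memory-max"].(*byteQuantity)
	// would panic with an "interface conversion" message.
}
```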

@bwagner5 bwagner5 force-pushed the feat-instance-selector branch from 443ed67 to 7e2cb54 Compare August 4, 2020 14:18
@bwagner5
Contributor Author

bwagner5 commented Aug 4, 2020

@hakman Sorry, that was my bad... pushed a little too quickly before I went on vacation. It's fixed now:

➜  kops git:(feat-instance-selector) .build/local/kops toolbox instance-selector --memory-min=4gb --memory-max=16gb --dry-run hi --state s3://tti-kops --name tti.k8s.local
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: ""
  name: hi
spec:
  cloudLabels:
    kops.k8s.io/instance-selector: "1"
  image: kope.io/k8s-1.16-debian-stretch-amd64-hvm-ebs-2020-07-20
  machineType: c1.xlarge
  maxSize: 15
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - c1.xlarge
    - c3.2xlarge
    - c3.xlarge
    - c4.2xlarge
    - c4.xlarge
    - c5.2xlarge
    - c5.large
    - c5.xlarge
    - c5a.2xlarge
    - c5a.large
    - c5a.xlarge
    - c5d.2xlarge
    - c5d.large
    - c5d.xlarge
    - c5n.large
    - c5n.xlarge
    - g2.2xlarge
    - g4dn.xlarge
    - i3.large
    - i3en.large
  nodeLabels:
    kops.k8s.io/instancegroup: hi
  role: Node
  subnets:
  - us-east-1a

I agree, I like the "4gb" syntax better as well. I've updated the CLI examples in the usage text to reflect that. "4 GiB" will still be parsable.
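As an illustration of accepting both notations, here is a minimal Go sketch of a tolerant parser. It is not the library's parser: parseMiB is a hypothetical function, it handles whole numbers only, and it assumes that "gb" and "gib" both mean binary gibibytes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMiB (hypothetical) converts strings such as "4gb", "16 GiB" or
// "512mb" to a number of MiB. Assumption: decimal-looking units ("gb")
// are treated the same as binary ones ("gib").
func parseMiB(s string) (uint64, error) {
	// Normalize: trim, drop inner spaces, lowercase.
	t := strings.ToLower(strings.ReplaceAll(strings.TrimSpace(s), " ", ""))

	// Split into a leading run of digits and a unit suffix.
	i := 0
	for i < len(t) && t[i] >= '0' && t[i] <= '9' {
		i++
	}
	if i == 0 {
		return 0, fmt.Errorf("invalid byte quantity %q: missing number", s)
	}
	n, err := strconv.ParseUint(t[:i], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid byte quantity %q: %v", s, err)
	}

	switch t[i:] {
	case "mb", "mib":
		return n, nil
	case "gb", "gib":
		return n * 1024, nil
	case "tb", "tib":
		return n * 1024 * 1024, nil
	default:
		return 0, fmt.Errorf("invalid byte quantity %q: unknown unit %q", s, t[i:])
	}
}

func main() {
	for _, in := range []string{"4gb", "16 GiB", "512mb", "16"} {
		v, err := parseMiB(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q = %d MiB\n", in, v)
	}
}
```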

@hakman
Member

hakman commented Aug 4, 2020

Thanks for the update @bwagner5. I will take another look tomorrow. Enjoy your vacation 😄!

@bwagner5 bwagner5 force-pushed the feat-instance-selector branch from df964c1 to 2d6d7ec Compare August 10, 2020 22:13
@hakman
Member

hakman commented Aug 11, 2020

Hey @bwagner5, I finally got some time to finish the review. It looks very good, except for a few nits/questions:

  1. The defaults:

	clusterAutoscalerDefault := false
	nodeCountMinDefault := 2
	nodeCountMaxDefault := 15

     I would change them to:

	clusterAutoscalerDefault := true
	nodeCountMinDefault := 1
	nodeCountMaxDefault := 10

  2. gpu-memory-total should become gpu-memory, but keep the description as is to explain that it's the total.
  3. There is a bug with the cluster-autoscaler labels: the cluster name is not added:
$ kops toolbox instance-selector ondemand-ig --dry-run --cluster-autoscaler
Using cluster from kubectl context: instance-selector.test.com

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: ""
  name: ondemand-ig
spec:
  cloudLabels:
=>  k8s.io/cluster-autoscaler/: "1"
    k8s.io/cluster-autoscaler/enabled: "1"
    kops.k8s.io/instance-selector: "1"

What do you think?
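The malformed "k8s.io/cluster-autoscaler/" key above is what results from concatenating the label prefix with an empty cluster name. A minimal Go sketch of a guard against that, assuming a hypothetical autoscalerCloudLabels helper (not the function used in this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// autoscalerCloudLabels (hypothetical) builds the two cluster-autoscaler
// cloud labels and refuses to produce a dangling key when the cluster
// name is empty.
func autoscalerCloudLabels(clusterName string) (map[string]string, error) {
	if clusterName == "" {
		return nil, errors.New("cluster name is empty; cannot build cluster-autoscaler labels")
	}
	labels := make(map[string]string, 2)
	labels["k8s.io/cluster-autoscaler/enabled"] = "1"
	labels["k8s.io/cluster-autoscaler/"+clusterName] = "1"
	return labels, nil
}

func main() {
	labels, err := autoscalerCloudLabels("instance-selector.test.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(labels)

	// An empty name is rejected instead of yielding "k8s.io/cluster-autoscaler/".
	if _, err := autoscalerCloudLabels(""); err != nil {
		fmt.Println("error:", err)
	}
}
```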

@bwagner5
Contributor Author

Those defaults sound very reasonable; I've updated the PR.

The cluster-autoscaler labels should now be added properly, including the cluster name (tested with --name and with export KOPS_CLUSTER_NAME=tti.k8s.local):

.build/local/kops toolbox instance-selector --memory-min=4gb --memory-max=16gb --dry-run hi --state s3://tti-kops --name tti.k8s.local
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: tti.k8s.local
  name: hi
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "1"
    k8s.io/cluster-autoscaler/tti.k8s.local: "1"
    kops.k8s.io/instance-selector: "1"
  image: kope.io/k8s-1.16-debian-stretch-amd64-hvm-ebs-2020-07-20
  machineType: c1.xlarge
  maxSize: 10
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - c1.xlarge
    - c3.2xlarge
    - c3.xlarge
    - c4.2xlarge
    - c4.xlarge
    - c5.2xlarge
    - c5.large
    - c5.xlarge
    - c5a.2xlarge
    - c5a.large
    - c5a.xlarge
    - c5d.2xlarge
    - c5d.large
    - c5d.xlarge
    - c5n.large
    - c5n.xlarge
    - g2.2xlarge
    - g4dn.xlarge
    - i3.large
    - i3en.large
  nodeLabels:
    kops.k8s.io/instancegroup: hi
  role: Node
  subnets:
  - us-east-1a

@hakman
Member

hakman commented Aug 11, 2020

Nice work. Thanks!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bwagner5, hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 11, 2020
@k8s-ci-robot k8s-ci-robot merged commit b7871e2 into kubernetes:master Aug 11, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Aug 11, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

5 participants