Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow local filesystem state stores (to aid CI pull-request workflows) #6465

Merged
merged 2 commits into from
Jan 17, 2020
Merged

Allow local filesystem state stores (to aid CI pull-request workflows) #6465

merged 2 commits into from
Jan 17, 2020

Conversation

ari-becker
Copy link
Contributor

@ari-becker ari-becker commented Feb 13, 2019

Allows the usage of state storage based on the local filesystem, based on the file:// prefix/protocol, as documented.

Fixes #6463.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 13, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @ari-becker. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 13, 2019
@chrisz100
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 13, 2019
@ari-becker
Copy link
Contributor Author

Test failure is about the CI server not having credentials to create a cluster in ap-south-1? Seems unrelated?

/retest

@zetaab
Copy link
Member

zetaab commented Feb 14, 2019

how this is supposed to work? Currently when people create cluster, the s3 will contain all information what nodes is supposed to fetch. I mean that when instance is started - the cloudinit script will fetch the configuration from the state store. Now if this state store is located in your local machine, the instance cannot fetch that information?

@ari-becker
Copy link
Contributor Author

ari-becker commented Feb 14, 2019

how this is supposed to work? Currently when people create cluster, the s3 will contain all information what nodes is supposed to fetch. I mean that when instance is started - the cloudinit script will fetch the configuration from the state store. Now if this state store is located in your local machine, the instance cannot fetch that information?

@zetaab As I said in #6463, the use case is for running dry-runs on pull requests in a CI environment. The CI build for the pull request downloads the state to the local filesystem, runs kops replace --force on the local filesystem state (thus not affecting the shared S3 remote state), and then runs kops update cluster (but intentionally not kops update cluster --yes), so that the dry-run output can be reviewed manually. After the pull request is approved and merged, the master build runs kops replace --force and kops update cluster --yes on the normal shared S3 remote state. The fetchability of a local state store by instances is completely irrelevant since only a dry-run is being executed with local filesystem state stores.

@Smirl
Copy link
Contributor

Smirl commented Feb 14, 2019

+1 to the PR.

I was thinking of raise something similar myself. Perhaps to make sure people don't "accidentally" use this in prod. We could instead add a CLI flag such as --allow-local-state. This could override the IsClusterReadable, or be optionally used in https://github.com/kubernetes/kops/blob/master/cmd/kops/util/factory.go#L108:12

I also am working on a pipeline for kops and it changing the state store during a pull request is a bit of a pain. I am fortunately using the cloudformation output which means that instances do not need to pull much from the state store on start up as the cluster.spec is in the UserData script.

@ari-becker ari-becker changed the title Allow local filesystem state stores Allow local filesystem state stores (to aid CI pull-request workflows) Feb 14, 2019
@Smirl
Copy link
Contributor

Smirl commented Feb 26, 2019

Hi @zetaab @ari-becker is there any update on this?

I initially thought that I could get around not having this for our CI pipeline because when using cloudformation, like in my case, I assumed that nodeup would use the cluster.spec in the UserData. However, that cluster.spec is only used to invalidate the UserData, and all state is pulled from S3 on start up by nodeup.

Therefore, I ran into a situation, where changes from kops replace -f <new_cluster.config> (added a new hook) were picked up by a new worker node before I had updated the cloudformation. This is ok, but it means that "dryruns" during CI/CD are not possible at all.

I don't mind implementing the --allow-local-state or similar if you do not want to accept this PR as it is 😄

@ari-becker
Copy link
Contributor Author

@Smirl I ran into an issue about a week ago where, upon a dry-run of a fresh cluster creation with a file:// state VFS, Kops would try to create an IAM role to grant access for instances in an instance group to access the state bucket, and it failed since file:// doesn't correspond to an S3 bucket.

I worked around it by creating IAM profiles outside of Kops which are fixed to point to the relevant state bucket, and therefore not dependent on what's in configBase. Doing so allows Kops to skip the problematic code section entirely. Unfortunately it also made it clear that getting this to work isn't a one-line code change, more fundamental work needs to be done to fix Kops's tight coupling to S3.

@justinsb justinsb added this to the 1.12 milestone Mar 14, 2019
@chrisz100
Copy link
Contributor

I’d really appreciate some more documentation about the use case and it’s implications (like that it won’t work for real clusters and tests only). It will confuse our users otherwise.
Other then that it’s fine

@chrisz100
Copy link
Contributor

/retest

@justinsb
Copy link
Member

One thing we could do is put this behind a feature-flag ... (git grep featureflag for examples).

I'm going to move this out of 1.12, because as we discussed in office hours we really want to prioritize getting 1.12 out. But if it's behind a featureflag, I think it could still go in to 1.12 as it's a fairly small change.

/milestone 1.13

@justinsb justinsb modified the milestones: 1.12, 1.13 Mar 18, 2019
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 18, 2019
@ari-becker
Copy link
Contributor Author

I’d really appreciate some more documentation about the use case and it’s implications (like that it won’t work for real clusters and tests only). It will confuse our users otherwise.
Other then that it’s fine

@chrisz100 I've added some documentation.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 16, 2019
@ari-becker
Copy link
Contributor Author

/remove-lifecycle stale

We've worked around this issue by rethinking how we built our infrastructure pipelines, but I imagine that this is still a common usecase for many CI workflows, so it is not stale.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 16, 2019
@colinhoglund
Copy link
Contributor

@ari-becker Would you mind sharing your current workflow? I’ve also been looking into various ways to achieve similar functionality.

@ari-becker
Copy link
Contributor Author

@colinhoglund we have a pipeline that can re-create our infrastructure from scratch, so we made the pipeline a function of various inputs to change relevant values for a given infrastructure run (we use Dhall to make the configurations/scripts functional). Now, instead of a PR build telling us what to expect when we run the infrastructure scripts for real, we have a PR build create an entire environment using those same infrastructure scripts. Because infrastructure is versioned, we can be reasonably confident that a test that shows success doing an automatic, real migration from X->Y (where X is the current production infrastructure version) in a non-production environment means that an automatic, real migration from X->Y in production will be successful.

So far we haven't had any production outages yet from this approach (mostly thanks to the safety which Dhall provides), but I think most organizations would think of such an approach as being too aggressive and unsafe.

@rifelpet
Copy link
Member

It would be great if we could get this merged

/test pull-kops-e2e-kubernetes-aws

@rifelpet
Copy link
Member

/test pull-kops-e2e-kubernetes-aws

@omeid
Copy link

omeid commented Aug 17, 2019

@robinpercy @rdrgmnzs Could you guys please look into this? I have been waiting for this for a while now. It is also helpful for setting up terraform config before actually deploying it.

@rifelpet
Copy link
Member

/test pull-kops-verify-gomod

@elblivion
Copy link
Contributor

Hi, I've built kops 1.13.1 with this patch and worked it into our pipeline, using the --state flag after an aws s3 sync when planning changes:

Syncing S3 Kops state to local...                                                                                                                                          
kops --state file://.local-state replace -f cluster.yaml                                                                                              
kops --state file://.local-state replace -f instance_groups/nodes-a.yaml; kops --state file://.local-state replace -f instance_groups/nodes-b.yaml; kops --state file://.lo
cal-state replace -f instance_groups/nodes-c.yaml; kops --state file://.local-state replace -f instance_groups/master-a.yaml; kops --state file://.local-state replace -f i
nstance_groups/master-b.yaml; kops --state file://.local-state replace -f instance_groups/master-c.yaml;       
RO_KUBECONFIG=1 KOPS_BASE_URL=https://my-bucket/kops/1.13.1 kops --state file://.local-state update cluster --target=terraform --name my-cluster --out=.  

it prevents the config file from being modified but a bunch of other files are updated in S3 (listing only the files modified today):

2019-09-26 23:01:01       2580 my-cluster/addons/bootstrap-channel.yaml
2019-09-26 23:01:00         61 my-cluster/addons/core.addons.k8s.io/v1.4.0.yaml
2019-09-26 23:01:01       2030 my-cluster/addons/dns-controller.addons.k8s.io/k8s-1.12.yaml
2019-09-26 23:01:01       2156 my-cluster/addons/dns-controller.addons.k8s.io/k8s-1.6.yaml
2019-09-26 23:01:01       1098 my-cluster/addons/dns-controller.addons.k8s.io/pre-k8s-1.6.yaml
2019-09-26 23:01:01       6297 my-cluster/addons/kube-dns.addons.k8s.io/k8s-1.12.yaml
2019-09-26 23:01:01       6479 my-cluster/addons/kube-dns.addons.k8s.io/k8s-1.6.yaml
2019-09-26 23:01:01       5010 my-cluster/addons/kube-dns.addons.k8s.io/pre-k8s-1.6.yaml
2019-09-26 23:01:01        295 my-cluster/addons/kubelet-api.rbac.addons.k8s.io/k8s-1.9.yaml
2019-09-26 23:01:01        150 my-cluster/addons/limit-range.addons.k8s.io/v1.5.0.yaml
2019-09-26 23:01:01        403 my-cluster/addons/rbac.addons.k8s.io/k8s-1.8.yaml
2019-09-26 23:01:01        458 my-cluster/addons/storage-aws.addons.k8s.io/v1.6.0.yaml
2019-09-26 23:01:01        448 my-cluster/addons/storage-aws.addons.k8s.io/v1.7.0.yaml
2019-09-26 23:01:00      18947 my-cluster/cluster.spec
2019-09-26 10:17:55      14927 my-cluster/config
2019-09-26 23:01:00        380 my-cluster/instancegroup/master-a
2019-09-26 23:01:00        380 my-cluster/instancegroup/master-b
2019-09-26 23:01:00        380 my-cluster/instancegroup/master-c
2019-09-26 23:01:00        545 my-cluster/instancegroup/nodes-a
2019-09-26 23:01:00        545 my-cluster/instancegroup/nodes-b
2019-09-26 23:01:00        545 my-cluster/instancegroup/nodes-c
2019-09-26 23:01:01       1934 my-cluster/manifests/etcd/events.yaml
2019-09-26 23:01:01       1908 my-cluster/manifests/etcd/main.yaml

Am I doing something wrong or is this the expected behaviour?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 22, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 29, 2019
@johngmyers
Copy link
Member

/test pull-kops-e2e-kubernetes-aws

@kpucynski
Copy link
Member

bump. This is really useful in specific cases.

@rifelpet
Copy link
Member

rifelpet commented Jan 3, 2020

I just tested this out myself and it works as expected. kops replace -f cluster.yaml --state file://foo; kops update cluster $cluster --stage file://foo is able to report dryrun changes without touching any files in the real s3 state store bucket.

@justinsb @mikesplain do you think this needs to be behind a feature flag? It is new functionality: no one can currently use file:// state store paths, so I dont think anyone will accidentally find themselves using this feature. Perhaps we could have a warning printed but I don't think anything beyond that is necessary. Will wait for their feedback before approving.

/lgtm

and sorry for the delay!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 3, 2020
@rifelpet
Copy link
Member

/approve

I think we can cherry-pick this back to 1.17. I will try to add a warning message for this usage in a separate PR as well.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ari-becker, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 17, 2020
@k8s-ci-robot k8s-ci-robot merged commit b356bd4 into kubernetes:master Jan 17, 2020
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.13, v1.18 Jan 17, 2020
k8s-ci-robot added a commit that referenced this pull request Jan 18, 2020
#8368-upstream-release-1.17

Automated cherry pick of #6465: Allow local filesystem state stores #8368: Add a warning when using file:// state store
@ari-becker ari-becker deleted the bugfix/allow-local-filesystem-state-store branch January 18, 2020 08:44
@MPV
Copy link

MPV commented Mar 23, 2020

Just a short mention/reminder that the docs/continuous_integration.md still mentions this PR as open, and the lack of this functionality as a "current limitation".

Might it be worth updating to reflect what is now possible (and how one would use this)? 😍

@rifelpet
Copy link
Member

I haven't been using it yet but once we upgrade to Kops 1.17 and start using it I'll be sure to update that docs page, unless someone else gets to it first :)

yurrriq added a commit to yurrriq/kops that referenced this pull request Jul 26, 2022
Update wording to note that kubernetes#6465 was merged and add an AWS example of a dry run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to use a local filesystem state store