
Run storage upgrade pre and post master upgrade #4476

Merged (1 commit) on Jun 19, 2017

Conversation

@mtnbikenc (Member)

A storage upgrade (oc adm migrate storage) is performed prior to the master upgrade as well as after it. This PR also moves the task into a common playbook for all versions of upgrades, currently 3.3 through 3.6.

See: openshift/origin#14625 (comment)

@smarterclayton Are these CLI options still valid and does this need to be run for all versions of upgrades?

Trello: https://trello.com/c/x2fUWNbK
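
For reference, the migration step reduces to a single CLI invocation, run once against a master both before and after the control-plane upgrade (the command and flags below are the ones captured in the test log later in this thread):

    oc adm --config=/etc/origin/master/admin.kubeconfig \
        migrate storage --include=jobs --confirm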

@sdodson (Member)

sdodson commented Jun 16, 2017

Implicit here is that oc adm migrate storage does the right thing, assuming the version of the command being executed matches the version running in the environment. So on a 3.5 to 3.6 upgrade it will run the 3.5 version prior to upgrading to 3.6, and then it will run the 3.6 version after upgrading.

It seems to me that if we can't rely on that, this is a critical flaw in the migrate command.

@sdodson (Member) left a review comment

LGTM, need clayton to verify options for migrate command.

@mtnbikenc (Member, Author)

Since this is not covered under CI, I'm testing each version of the upgrade playbooks currently in the repo.

@sdodson (Member)

sdodson commented Jun 16, 2017

OK, I'd focus first on 3.5 to 3.6, then 3.4 to 3.5, but this change wouldn't land there unless we backport it anyway. I wouldn't bother going any further back than that.

@mtnbikenc (Member, Author)

aos-ci-test

@sdodson (Member)

sdodson commented Jun 16, 2017

@mfojtik you're the only other significant contributor to oadm migrate storage; can you validate we're doing the right thing here?

@mfojtik (Contributor)

mfojtik commented Jun 16, 2017 via email

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for b51d9e9 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for b51d9e9 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for b51d9e9 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for b51d9e9 (logs)

@mtnbikenc (Member, Author)

Tested 3.5->3.6 upgrade:

2017-06-16 14:40:18,457 p=17669 u=rteague |  PLAY [Pre master upgrade - Upgrade job storage] **********************************************************************************************
2017-06-16 14:40:18,520 p=17669 u=rteague |  TASK [Gathering Facts] ***********************************************************************************************************************
2017-06-16 14:40:19,524 p=17669 u=rteague |  ok: [ec2-52-204-32-41.compute-1.amazonaws.com]
2017-06-16 14:40:19,555 p=17669 u=rteague |  META: ran handlers
2017-06-16 14:40:19,561 p=17669 u=rteague |  TASK [Upgrade job storage] *******************************************************************************************************************
2017-06-16 14:40:19,561 p=17669 u=rteague |  task path: /home/rteague/dev/clusters/aws-c1/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_control_plane.yml:11
2017-06-16 14:40:21,761 p=17669 u=rteague |  changed: [ec2-52-204-32-41.compute-1.amazonaws.com] => {
    "changed": true,
    "cmd": [
        "oc",
        "adm",
        "--config=/etc/origin/master/admin.kubeconfig",
        "migrate",
        "storage",
        "--include=jobs",
        "--confirm"
    ],
    "delta": "0:00:01.198273",
    "end": "2017-06-16 14:40:21.725065",
    "rc": 0,
    "start": "2017-06-16 14:40:20.526792"
}

STDOUT:

summary: total=0 errors=0 ignored=0 unchanged=0 migrated=0

@smarterclayton (Contributor)

oadm migrate storage does nothing client-side. Instead, it talks to the server and does a blind update. So the options only apply to filtering things out, and it should be both reentrant and safe.

I'm not sure why you got 0 total.

@smarterclayton (Contributor)

--include=jobs does not do what you think it does. It only runs migrations on jobs.

@smarterclayton (Contributor)

smarterclayton commented Jun 16, 2017

You want to run oadm migrate storage --include=*,jobs
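
To make the distinction concrete (both invocations appear in this thread; reading * as the default resource set follows the comments above):

    # Runs the migration against jobs only:
    oc adm migrate storage --include=jobs --confirm

    # Runs the migration against the default resource set plus jobs:
    oc adm migrate storage --include='*,jobs' --confirm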

@enj (Contributor)

enj commented Jun 17, 2017

> if we can't rely on that this is a critical flaw in the migrate command

No, it is just a bug, like any other programmer mistake. There is no way for us to guarantee that we will know everything in advance, but I suppose we can always backport fixes (I assume upgrades would be run with the latest matching minor version).

@smarterclayton (Contributor)

There is a bug fix going into origin which corrects the case where you see resources already modified: the serialization order of protobuf wasn't stable, which was also a bug upstream.

@sdodson (Member)

sdodson commented Jun 18, 2017

I just meant that as a pattern, the playbooks will run the current version of the storage migration before upgrading and the current version after upgrading.
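
Rendered against the 3.5 to 3.6 example from earlier in the thread (the version annotations are assumptions about which client binary is installed at each step):

    # Before the control-plane upgrade: oc is still the 3.5 client, so
    # the 3.5 migration logic runs against the 3.5 cluster.
    oc adm migrate storage --include=jobs --confirm

    # ... masters are upgraded to 3.6 ...

    # After the control-plane upgrade: oc is the 3.6 client, so the 3.6
    # migration logic runs against the upgraded cluster.
    oc adm migrate storage --include=jobs --confirm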

@smarterclayton (Contributor)

smarterclayton commented Jun 19, 2017 via email

@sdodson (Member)

sdodson commented Jun 19, 2017

ACK, I'll update the PR and get it merged ASAP. Thanks.

@sdodson sdodson changed the title [WIP] Run storage upgrade pre and post master upgrade Run storage upgrade pre and post master upgrade Jun 19, 2017
@sdodson sdodson merged commit 9545204 into openshift:master Jun 19, 2017
@smarterclayton (Contributor)

Well the good news is that this caught a bug:

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2382/testReport/junit/(root)/Post%20master%20upgrade%20-%20Upgrade%20job%20storage/_localhost__Upgrade_job_storage/

The bad news is that it's now blocking the merge queue and the test queue. @deads2k needs to take a look at the symptom and help assess whether ClusterPolicy should be:

  1. excluded from migration (it's just broken and end users should do manual steps)
  2. automatically updated to not have this error (in practice a user manually has to do this, which is why we have migrations)
  3. left to the user to decide how to handle before upgrading

If I'm not mistaken, the error here will block any declarative flow an end user might have (i.e. if you apply your cluster policy, you're broken as well), and this generally fails our "we don't break old APIs on upgrade" rule.

In the meantime we should probably disable this task on those jobs temporarily. @stevekuznetsov

@smarterclayton (Contributor)

smarterclayton commented Jun 19, 2017

It's also possible the post-upgrade errors should be flagged for user attention but not be blocking, because the cluster is not necessarily broken at this point, just operating as best it can.

@enj (Contributor)

enj commented Jun 19, 2017

So the error is caused by the tighter validation I added to policy objects. It is not supposed to complain about attribute restrictions if the old and new objects are the same, but the specifics of the ratcheting are complex...

@smarterclayton (Contributor)

This is on upgrade from 3.5; were any fields defaulted? I don't know if this is before or after reconcile.

@sdodson (Member)

sdodson commented Jun 19, 2017

@smarterclayton do you mean disable the upgrade test or revert this PR temporarily?

@deads2k (Contributor)

deads2k commented Jun 19, 2017

> So the error is caused by the tighter validation I added to policy objects. It is not supposed to complain about attribute restrictions if the old and new objects are the same

Must be a bug in the logic somewhere. Maybe it only fails on the container Policy objects? Also, @smarterclayton I really hate virtual storage.

@sdodson (Member)

sdodson commented Jun 19, 2017

#4491 reverts to migrating only jobs.

@smarterclayton (Contributor)

I know you hate virtual storage :)
