-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pkg/daemon: add a current config on disk check #612
pkg/daemon: add a current config on disk check #612
Conversation
/hold Was just discussing this with @derekwaynecarr but thought to start experimenting if this is needed/working |
pkg/daemon/daemon.go
Outdated
if !os.IsNotExist(err) { | ||
return false, errors.Wrapf(err, "loading current config fingerprint") | ||
} | ||
return false, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wrong (note to self)
301a880
to
381e8c0
Compare
485a3b6
to
6f5f21a
Compare
pkg/daemon/daemon.go
Outdated
@@ -124,6 +125,9 @@ const ( | |||
pathDevNull = "/dev/null" | |||
// pathStateJSON is where we store temporary state across config changes | |||
pathStateJSON = "/etc/machine-config-daemon/state.json" | |||
// currentConfigFingerprint is where we store the current config on disk to validate | |||
// against annotations changes | |||
currentConfigFingerprint = "/etc/machine-config-daemon/current" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a fingerprint, it's the whole config right? I find this idea that we serialize all of the files back into JSON as a file in the filesystem kind of funny. But I guess it's not a huge amount of data...
oc get -o yaml machineconfigs/rendered-worker-74e29c89eb37d7aaa30c33c481fae174 | gzip | wc -c
10197
We can certainly assume 10k of data is free.
On the kind-of bikeshed topic, I think /var
is probably a better place for this. The ostree model is that /etc
is data humans might edit (e.g. "ssh to node and vi
").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed naming and moved under /var/machine-config-daemon
pkg/daemon/daemon.go
Outdated
return nil, nil, err | ||
} | ||
|
||
mcJSON, err := os.Open(currentConfigFingerprint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to do this before we load currentConfig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhm, we may, I can't see why though, would you explain (I feel dumb)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the current config isn't available in the apiserver, won't the Get
above fail before we can find our cached version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh gosh yeah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No wait, the scenario here is that after a restore we still have something in the apiserver, which is really what we want on the machine itself, if current isn't on the apiserver, there's not much we can do anyway (which is the BZ we got from David yesterday but this PR isn't tackling that).
This is something I wrote to understand this better:
- current=configA == desired=configA
upgrade
- current=configA != desired=configB (desired being generated by upgrade and it's being applied as new current)
roll out new configs and MCD syncs
- current=configB == desired=configB
...
trigger a backup restore pre-upgrade which isn't calling informers
- current=configA == desired=configA
but now what's on disk is still configB
we now need something which retriggers the MCD to sync to configA
configA
is what we need anyway, so it must be in the apiserver after a restore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can i ask: when is a backup restore triggered? i feel like i understand what this pr is supposed to do, but not fully why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's work still ongoing on others to spec out how/when/where an etcd backup is performed (and a subsequent restore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cgwalters can you check my answer above? I think this is working as intended but I'm waiting on a way to test this out
6f5f21a
to
686b0e5
Compare
This is good to review and get in - as it's treated as a bug. The testing goes later once the other teams/steps are complete |
aws route errors: |
I know that this is a bug, but I cant find a BZ/something else for it? Could we add Bug to title? |
there's no BZ for this, came out from a conversation with Derek on Slack :( not sure we need one tho |
@runcom cool! just wanted to make sure I didn't miss it somewhere. |
weird test didn't re-run /retest |
pkg/daemon/daemon.go
Outdated
return nil, nil, err | ||
} | ||
if fingerprintMC.GetName() != currentConfig.GetName() { | ||
return fingerprintMC, currentConfig, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I read your comment a few times...I am finding it really hard to capture the flow/state of things in my head right now.
Edit: ignore this
It feels like this would all be a lot simpler to reason about if basically we had /var/machine-config-daemon/config<hash>.json
and we kept those for every config version that was relevant to us (current, desired) and used it if it was present instead of talking to the api server.
This would also help address the other BZ about having things be deleted from the apiserver.
That said...after thinking about this more I think you're right, it will work. What I was worried about is "if we're saying currentConfig is desiredConfig, how do we then actually honor the real desiredConfig?". But if they are different then we'll do a config transition and then on the next boot we should transition to the real desiredConfig?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK here's another way to look at it - we're basically saying we can't trust our annotations to describe our current state, the real state is a file on disk. So why are we looking at the annotation at all, versus setting it to the right thing if it's different from what's on disk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So why are we looking at the annotation at all, versus setting it to the right thing if it's different from what's on disk?
RIght, I had the same thought indeed, assuming we're talking about the currentConfig (desiredConfig has to come from somewhere/apiserver otherwise we can't really progress right?). So I guess a natural follow up to this would be to avoid talking to the apiserver when querying the currentConfig and just use what it's reflected on disk right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said...after thinking about this more I think you're right, it will work. What I was worried about is "if we're saying currentConfig is desiredConfig, how do we then actually honor the real desiredConfig?". But if they are different then we'll do a config transition and then on the next boot we should transition to the real desiredConfig?
right, that's my understand of what will happen if there's a real drift between currentOnAPI vs currentOnDisk.
Basically, the first sync would reconcile currentOnAPI with currentOnDisk, after we're done with that, there will be another immediate sync to honor desiredConfig (which, in our disaster recovery story is gonna be almost always the same as currentOnAPI, unless you backup an etcd with current != desired at the time you actually backup, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like this would all be a lot simpler to reason about if basically we had
/var/machine-config-daemon/config<hash>.json
and we kept those for every config version that was relevant to us (current, desired) and used it if it was present instead of talking to the api server.
I'm not sure how could we keep desired on disk 🤔 that has always have to come from apiserver right? I feel I'm missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or are we saying that in this case the currentConfig is the desired from the fingerprint? would it go from fingerprint->current-> desired? or....?
is there any case where if we have fingerprint, current and desired we would not want to ultimately end up in the desiredConfig?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fingerprint->current-> desired? or....?
exactly that but in the DR case, fingerprint=configB,current=configA,desired=configA. So it's just one real sync+reboot
is there any case where if we have fingerprint, current and desired we would not want to ultimately end up in the desiredConfig?
nope, there shouldn't be any such case afaict. Desired is always evaluated at sync+1 if fingerprint and current drift
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing bugging me too is...why don't we
return fingerprintMC, desiredConfig, nil
?
Uhm, yeah, so using my flow from above comments, would that work in this case where I take a snapshot exactly in the middle of a sync where current != desired (cause yolo!):
current=configA != desired=configB
snapshot
upgrade
- current=configB != desired=configC
roll out new configs and MCD syncs
- current=configC == desired=configC
...
trigger a backup restore pre-upgrade which isn't calling informers
- current=configA != desired=configB
but now what's on disk is still configC
should we first go from configC on disk to configA on disk, and only then configB? Your code would go straight from configC on disk to configB (which avoids a sync, but should we anyway or do we risk missing something?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(which avoids a sync, but should we anyway or do we risk missing something?)
I am not sure what we'd miss. It sounds like you're trying to be more conservative here just on general principle? I don't object to that to be clear. But I don't understand how this would be different from any other config change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, it's really me being just conservative indeed, I guess it would be fine anyway to avoid a sync, so yeah, changing
Signed-off-by: Antonio Murdaca <runcom@linux.com>
686b0e5
to
070dedd
Compare
alrighty, let's see what tests say and then pull the trigger on this for the DR story |
/hold cancel |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kikisdeliveryservice, runcom The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
Mimick (and copy from) pkg/controller testing, hopefully one day I'll spend a week trying to abstract all the duplicate code in pkg/controller testings and now pkg/daemon...but not today. This patch is adding tests for openshift#612 which would greatly benefit from some testing. This can be used as a start to add more tests *hint hint*
Signed-off-by: Antonio Murdaca runcom@linux.com
- What I did
Add an on disk fingerprint about what our current config is on a node (avoiding resync if we changed it or restored from an etcd backup). We check what we have on disk with what we have in annotations before triggering a sync.
- How to verify it
- Description for the changelog