-
Notifications
You must be signed in to change notification settings - Fork 521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental Server Implementation #422
Conversation
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this up! Some comments to get you started
I think some comments/docs on the exact contract used by VersionMap would be useful given that it uses "" as a special value to indicate subscription. I think empty string is technically a valid version, so it might be good to use a more explicit sentinel value (e.g. nil with *string values?)
@snowp thanks for the review! I'll go through this today |
Signed-off-by: Alec Holmes <alecholmez@me.com>
@snowp just a heads up I pushed a commit that refactored quite a bit, I haven't addressed all the previous comments so I'll do that now. I just wanted to get your eyes on that code sooner rather than later. The tests are failing and there is an intermittent race condition but I'm looking into it. |
pkg/server/delta/v3/server.go
Outdated
for r, v := range req.InitialResourceVersions { | ||
state.ResourceVersions[r] = v | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we need to read this to be able to resume sessions, I'm just wondering if we should take care to only try to read this on the first request, as otherwise we'd be resetting resource versions to this value? Maybe error out if this is set for subsequent requests for a given type url, since this seems to be in violation with the spec
Signed-off-by: Alec Holmes <alecholmez@me.com>
@snowp just pushed another commit, It seems the tests are failing from the mock CreateDeltaWatch being run exactly 6 times? I have no clue why and I'm currently investigating. Thanks so far for all the help! I'm working to fix the versioning issue with blank I also reorganized a bit just so it's easier to track the tests and such so hopefully that helps. |
Signed-off-by: Alec Holmes <alecholmez@me.com>
Any progress on fixing the CI issue? Should I review this code as is? |
@snowp No luck yet, I’ll look again today to see if I can find anything. A review as is might be helpful! The tests are failing to return the correct number of open watches so somethings getting messed up |
Signed-off-by: Alec Holmes <alecholmez@me.com>
Just found a bug in the server code, working to fix through that now |
Signed-off-by: Alec Holmes <alecholmez@me.com>
@snowp just pushed some changes, I'm seeing some super weird behaviors with ADS, would you mind taking a look at the test? I was seeing this in my debugger:
|
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
Nevermind! I found the problem haha |
pkg/resource/v3/resource.go
Outdated
ExtensionConfigType = apiTypePrefix + "envoy.config.core.v3.TypedExtensionConfig" | ||
RuntimeType = apiTypePrefix + "envoy.service.runtime.v3.Runtime" | ||
ExtensionConfigType = apiTypePrefix + "envoy.config.core.v3.TypedExtensionConfig" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope not sure why that happened, my editor might've messed up there. Should I revert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well, no need to pollute the git history
pkg/server/v3/delta_test.go
Outdated
if want := map[string]int{ | ||
rsrc.EndpointType: 1, | ||
rsrc.ClusterType: 1, | ||
rsrc.RouteType: 1, | ||
rsrc.ListenerType: 1, | ||
rsrc.SecretType: 1, | ||
}; !reflect.DeepEqual(want, config.deltaCounts) { | ||
t.Errorf("watch counts => got %v, want %v", config.deltaCounts, want) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indention here makes this a bit tricky to read, maybe put want into a separate variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like?
want := map[string]int{rsrc.EndpointType: 1, rsrc.ClusterType: 1, rsrc.RouteType: 1, rsrc.ListenerType: 1, rsrc.SecretType: 1}
if !reflect.DeepEqual(want, config.deltaCounts) {
t.Errorf("watch counts => got %v, want %v", config.deltaCounts, want)
}
Or I can keep the long indentation but that does make it a bit hard to read
@snowp some of the comments you left didn't have the ability for me to respond too so I'll respond here:
That's an interesting thought. Do you think this could be somewhat satisfied by using a
I'm assuming you're referring to this: var nonce string
// The node information might only be set on the first incoming delta discovery request, so store it here so we can
// reset it on subsequent requests that omit it.
if req.Node != nil {
node = req.Node
nonce = req.GetResponseNonce()
} else {
req.Node = node
// If we have no nonce, i.e. this is the first request on a delta stream, set one
nonce = strconv.FormatInt(streamNonce, 10)
} I think it's a matter of how they initialized it? I could be wrong but I'm just as confused by this code as you are. I would've expected (regardless if a nonce := req.GetResponseNonce()
// The node information might only be set on the first incoming delta discovery request, so store it here so we can
// reset it on subsequent requests that omit it.
if req.Node != nil {
node = req.Node
} else {
req.Node = node
} Since on the first request that response nonce will be empty anyways. |
@alecholmez can you describe the clarification you're after here? The thread has a lot of context that is hard to follow :) |
@htuch I think we're confused on whether resource additions/removals should affect subscription state? Specifically with what @snowp brought up:
|
Signed-off-by: Alec Holmes <alecholmez@me.com>
Additions/removals by the client will generally affect the server state, since the server needs to track what the client is subscribing to in delta. |
Signed-off-by: Alec Holmes <alecholmez@me.com>
@snowp I just opened this PR: https://github.com/alecholmez/go-control-plane/pull/7/files I introduced the integration tests for delta and they brought to light some interesting failures. If you're available you might want to take a look. It seems the initial request functions perfectly but afterwards we start closing out streams improperly and failures occur. I'm going to start looking into this but it seems the unit tests haven't caught this. if you'd like I can eventually merge that integration test code into this PR and merge it all in, or wait, whatever works! I branched off this branch since it has the server implementation. |
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
@snowp I think we're actually able to do away with both aca74f9#diff-616f2ac71437a2f1daab08f6eb5ee33c3ee22cc227f5aca01e8a6a3a46ec60f5R38-R48 ^ right here I'm scared that could be resource intensive if the state becomes large. Unfortunately we have to do that though because go treats maps as reference types so we have to copy into the new snapshot. I made sure to only call it once so we don't do the copy often (only when we get a new request or create a new watch). I might've actually broken the wildcard since I'm going off the version map length but i'm not sure. It really only matters if we check the wildcard on the first request because then everything gets placed in the version tracking map after |
I think it would be fine to leave as a copy for now, though having thought about this some more we might be able to get by just passing the reference if we reorder the code so that we cancel the watch before we subscribe/unsubscribe. All the writes to ResourceVersions happens in sub/unsub/responding to a watch, so if we make sure to never write to ResourceVersion while there is an active watch I think we might get away with it? |
// a channel for receiving incoming delta requests | ||
reqCh := make(chan *discovery.DeltaDiscoveryRequest) | ||
|
||
// we need to concurrently handle incoming requests since we kick off processDelta as a return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do this so we can select
between requests and watch triggers, maybe rephrase in terms of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes i'll make this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@snowp I re-read this after I got a bit confused, wouldn't that only be the case if we were actually calling processDelta
inside of the select? I think the the point of the select is to see if we need to close down the request channel if the stream closes itself. Maybe I'm misinterpreting that logic though
@snowp ah cool I'll try that. I have an expiremental branch right now I'm messing around in. I'll merge the code into here when I get it stable or have a question and might need your eyes. I think you are right though if we just cancel the watch before we mutate we should be okay |
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
Exp state
@snowp So I think we knocked down the data race. I can't seem to make ADS happy though. I've tried quite a few things and even reverted back to using a wildcard map by type but that didn't work either. I just pushed my latest work, I was able to clean up some things. Thankfully we didn't break xDS just ADS. If you've got time would you mind taking a look? |
Seems like the nonce handling was incorrect, this seems to fix it
|
Signed-off-by: Alec Holmes <alecholmez@me.com>
hmm @snowp I made that change locally and ran the integration tests, it still seems like ADS is complaining. That nonce logic originally confused me so maybe its a SOTW problem too. I noticed this after that change:
Note this is for ADS only, xDS is perfectly okay. The unit tests pass no problem though so it's a tough problem to track, maybe some odd behavior in envoy? There seems to be a pattern, the initial request comes through fine the test handlers are able to make their calls appropriately, but as soon as we set a snapshot with a different version it blows up |
I was seeing seeing it go green locally earlier I'm pretty sure, but seeing these failures again locally so I guess I'll take another look. The logs from the integration tests are pretty useless for debugging this, so locally im adding a bunch more useful stuff that I can push upstream later on |
Seems like removing the nonce check altogether fixes thing locally for me. Seems like requests that are just changes to subscription state and not in response to a response (hah) doesn't set a nonce, but we still need to process these requests and respond with any new resources it might be interested in. Not sure why this is only a problem on ADS |
@snowp hm yup that seemed to do the trick. I also noticed the logs seem a lot more logical now in the integration tests. Thanks for the extra eyes! I'll push the change now |
Signed-off-by: Alec Holmes <alecholmez@me.com>
pkg/resource/v3/resource.go
Outdated
ExtensionConfigType = apiTypePrefix + "envoy.config.core.v3.TypedExtensionConfig" | ||
RuntimeType = apiTypePrefix + "envoy.service.runtime.v3.Runtime" | ||
ExtensionConfigType = apiTypePrefix + "envoy.config.core.v3.TypedExtensionConfig" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well, no need to pollute the git history
RE this (not sure why github didn't let me respond):
I actually agree, I'll combine the data into the |
…ting Signed-off-by: Alec Holmes <alecholmez@me.com>
pkg/server/delta/v3/server.go
Outdated
@@ -58,7 +58,7 @@ func (s *server) processDelta(str stream.DeltaStream, reqCh <-chan *discovery.De | |||
streamID := atomic.AddInt64(&s.streamCount, 1) | |||
|
|||
// streamNonce holds a unique nonce for req-resp pairs per xDS stream. The server | |||
// ignores stale nonces and nonce is only modified within send() function. | |||
// TODO: we no longer ignore stale nonces so we'll need to handle that case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes I'll make this a bit more descriptive. We might want to just remove this comment since nonces aren't particularly useful in the delta spec anyways?
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
…of request subscriptions Signed-off-by: Alec Holmes <alecholmez@me.com>
Signed-off-by: Alec Holmes <alecholmez@me.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I think this is in a good state to ship and iterate, thanks!
Signed-off-by: Alec Holmes <alecholmez@me.com>
@snowp here's the implementation for the server part of delta. I'll be fixing up the mux and linear cache PRs as well as the integration test PR asynchronously to this being reviewed. Thanks in advanced!