Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add v3 xDS support #140

Merged
merged 9 commits into from
Sep 21, 2020
Merged

Add v3 xDS support #140

merged 9 commits into from
Sep 21, 2020

Conversation

mpuncel
Copy link
Contributor

@mpuncel mpuncel commented Jul 27, 2020

This commit adds support for the v3 xDS transport as well as support for
storing and serving v3 resources out of the resource cache.

Note that resource versions and xDS transport version are decoupled to
provide a migration path to v3. End users can decide to generate v3
resources and serve them over v2 transport during migration, or continue
to generate v2 resources and serve them over v3 transport.

For implementation simplicity, it is not possible to generate a
combination of v2 and v3 resources in the control plane, which would
require either up or down conversion.

Signed-off-by: Michael Puncel mpuncel@squareup.com

* Returns all cluster items in the CDS payload.
*/
public abstract SnapshotResources<Cluster> clusters();
abstract String version(String typeUrl, List<String> resourceNames);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diff came out big, Snapshot basically got renamed to V2Snapshot, and a new V3Snapshot was created. they extend a common abstract class.


import static io.envoyproxy.controlplane.cache.Resources.V2_TYPE_URLS;

public class V2SimpleCache<T> extends SimpleCache<T, V2Snapshot> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes make it so callers don't have to worry about the new type variable on SimpleCache, and also passing type URLs when instantiating the cache

* consistent (i.e. all referenced resources must be included in the snapshot).
*
* @param group group identifier
* @param snapshot a versioned collection of node config data
*/
void setSnapshot(T group, Snapshot snapshot);
void setSnapshot(T group, U snapshot);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Callers will need to update to be specific about the type of snapshot they're generating, either V2Snapshot or V3Snapshot

* a v3 request.
*/
@AutoValue
public abstract class XdsRequest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is basically a oneof of v2 request or v3 request, and delegates methods to the right one.

@@ -0,0 +1,546 @@
package io.envoyproxy.controlplane.cache;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the tests, my approach was basically to rename the existing test to V2, and then copy/paste to a V3 class, updating the imports to use v3 message types, and using v3 type URLs. I also had to implement the new callbacks interfaces, and throw IllegalStateException in the test if the unexpected request version happens.

This means theres' a lot of copy/paste, but IMO that's less concerning for tests. I like tests to be clear rather than super DRY, whereas the implementation being reused as much as possible makes less surface for bugs.

I'm open to feedback on this though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do a second round and try to dry them up myself and see if that is better.

@@ -54,17 +55,42 @@ default void onStreamOpen(long streamId, String typeUrl) {
* will be returned to the client and the stream will be closed with error.
*/
default void onStreamRequest(long streamId, DiscoveryRequest request) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the old method name the same to minimize impact on end users, but maybe we should force a rename? failing to implement the v3 version and turning on v3 transport in the data plane would cause issues

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename it to be explicit. We're already forcing renames in other places.

* @throws RequestException optionally can throw {@link RequestException} with custom status. That status
* will be returned to the client and the stream will be closed with error.
*/
default void onV3StreamRequest(long streamId, io.envoyproxy.envoy.service.discovery.v3.DiscoveryRequest request) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I not default this for safety?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more what the consequences are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe could mean a user accidentally doesn't implement the new callback after doing an upgrade, and then they don't get the callbacks they expect.

@mpuncel mpuncel mentioned this pull request Jul 27, 2020
@mpuncel mpuncel force-pushed the mpuncel/v3-support branch 3 times, most recently from 374c0f9 to 2610f3f Compare July 27, 2020 14:48
@mpuncel
Copy link
Contributor Author

mpuncel commented Jul 27, 2020

I thought envoyproxy/envoy#10776 was done already, I think we have to wait for that

@mpuncel
Copy link
Contributor Author

mpuncel commented Jul 27, 2020

alternatively, I need to implement up/down conversion here

This commit adds support for the v3 xDS transport as well as support for
storing and serving v3 resources out of the resource cache.

Note that resource versions and xDS transport version are decoupled to
provide a migration path to v3. End users can decide to generate v3
resources and serve them over v2 transport during migration, or continue
to generate v2 resources and serve them over v3 transport.

For implementation simplicity, it is not possible to generate a
combination of v2 and v3 resources in the control plane, which would
require either up or down conversion.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
case V3:
return output.toBuilder()
.setTypeUrl(Resources.V2_TYPE_URLS_TO_V3.get(output.getTypeUrl()))
.build();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added conversion in this PR. the type URL of the output Any will be rewritten to match the request version if they're different.

This is covered in V2DiscoveryServerV3ResourcesXdsIT, V2DiscoveryServerV3ResourcesAdsIT, V3DiscoveryServerV2ResourcesXdsIT, and V3DiscoveryServerV2ResourcesAdsIT.

Those tests do not cover using resource_api_version at all, which I'd be open to adding if folks think there'd be value

@mpuncel mpuncel mentioned this pull request Jul 28, 2020
@slonka
Copy link
Member

slonka commented Jul 29, 2020

This is really big :) I'll probably need a couple of rounds of review to get through all.

@mpuncel
Copy link
Contributor Author

mpuncel commented Jul 29, 2020

I also want to note that while it passes integration tests, I haven't tried it in production yet. I put up the PR on the early side since there has been interest from others in the status of this work. I'll keep posted on this PR when we try it out

@slonka yep, definitely a big one. I do think most of the LOC added is in tests, which are mostly copy/paste. Let me know if there is anything I can do to make reviewing easier

slonka added a commit to allegro/envoy-control that referenced this pull request Aug 6, 2020
Copy link
Member

@slonka slonka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, small changes needed. I also did a dry run of migration in our control plane and everything seems ok: https://github.com/allegro/envoy-control/compare/v3-update-dry-run?expand=1 . I still need to understand migration paths (and their tests) more deeply. I'll do it in the second run of review.

}
} catch (InvalidProtocolBufferException e) {
LOGGER.error(
"Failed to convert HTTP connection manager config struct into protobuf message for listener {}",
Copy link
Member

@slonka slonka Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding info about it being V3?

Comment on lines +241 to +248
// For EDS clusters, use the cluster name or the service name override.
if (c.getType() == io.envoyproxy.envoy.config.cluster.v3.Cluster.DiscoveryType.EDS) {
if (!isNullOrEmpty(c.getEdsClusterConfig().getServiceName())) {
refs.add(c.getEdsClusterConfig().getServiceName());
} else {
refs.add(c.getName());
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: This can be extracted to a method and used in V2 version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it? c has a different type for each of them

@@ -13,7 +13,8 @@
long lastWatchRequestTime();

/**
* Returns the node grouping represented by this status, generated via {@link NodeGroup#hash(Node)}.
* Returns the node grouping represented by this status, generated via
* {@link NodeGroup#hashv3(Node)} or {@link NodeGroup#hashV2(io.envoyproxy.envoy.api.v2.core.Node)}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid link, should be V3

Suggested change
* {@link NodeGroup#hashv3(Node)} or {@link NodeGroup#hashV2(io.envoyproxy.envoy.api.v2.core.Node)}.
* {@link NodeGroup#hashV3(Node)} or {@link NodeGroup#hashV2(io.envoyproxy.envoy.api.v2.core.Node)}.

this.latestResponse = new ConcurrentHashMap<>(Resources.TYPE_URLS.size());
this.ackedResources = new ConcurrentHashMap<>(Resources.TYPE_URLS.size());

// NOTE: while V2 URLs are referenced here, the size of V2_TYPE_URLS and V3_TYPE_URLS is the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a comment maybe we can check the size here and throw an exception if it's not.

@@ -54,17 +55,42 @@ default void onStreamOpen(long streamId, String typeUrl) {
* will be returned to the client and the stream will be closed with error.
*/
default void onStreamRequest(long streamId, DiscoveryRequest request) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename it to be explicit. We're already forcing renames in other places.

@@ -0,0 +1,546 @@
package io.envoyproxy.controlplane.cache;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do a second round and try to dry them up myself and see if that is better.

Copy link

@pbetkier pbetkier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks nice, thanks for the work. I left a few comments.

Also, test copies aren't great in general, yet I don't have an idea of how to easily share them between v2 and v3. Maybe somebody else has an idea?

* resources. Snapshots should have distinct versions per node group.
*/
@AutoValue
public abstract class V3Snapshot extends Snapshot {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why version prefixes in class names like here? To me that's what packages are for, i.e. (...).v3.Snapshot. Otherwise, we'll eventually remove v2 support but stay with awkward V3Snapshot and the like forever.

@@ -14,5 +14,12 @@
*
* @param node identifier for the envoy instance that is requesting config
*/
T hash(Node node);
T hashV2(Node node);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use method overloading than version suffixes, so have hash(Node) and hash(v3.Node) instead of hashV2(Node) and hashV3(v3.Node).

+ ".transport_sockets.tls.v3.Secret";

private static final String V2_TYPE_URL_PREFIX = "type.googleapis.com/envoy.api.v2.";
public static final String CLUSTER_TYPE_URL = V2_TYPE_URL_PREFIX + "Cluster";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have CLUSTER_TYPE_URL and V3_ CLUSTER_TYPE_URL, not V2_CLUSTER_TYPE_URL and V3_ CLUSTER_TYPE_URL. Why is that?

Also, we could consider introducing a namespace for constants like:

public class Resources {
    // ...
    public static class V2 {
        public static final String CLUSTER_TYPE_URL = "...";
        // ...
    }

    public static class V3 {
        public static final String CLUSTER_TYPE_URL = "...";
        // ...
    }
}

…adoc link. Use method overloading instead of different names for v2 vs v3 node hash

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 8, 2020

Thanks for the feedback! I believe I addressed everything except the test duplication so far

@slonka
Copy link
Member

slonka commented Aug 18, 2020

@mpuncel we're watching ServiceMeshCon & KubeCon this week but we'll try to make a review ASAP. Please make sure the build is passing though.

@slonka
Copy link
Member

slonka commented Aug 20, 2020

@sschepens can you look at this PR as well?

@slonka
Copy link
Member

slonka commented Aug 21, 2020

I'm missing the overall picture of how the migration should occur.

Now:

  • all Envoys are using resource_api_version and transport_api_version v2 (by default)

Where we want to be:

  • all Envoys are using resource_api_version and transport_api_version v3

During the migration we would have some Envoys requesting resources with resource_api_version v3 on transport_api_version v2 protocol, which is fine (we can handle that). Once we're done we would switch from V2DiscoveryServer to V3DiscoveryServer but... Envoys will fail with:

StreamAggregatedResources gRPC config stream closed: 12, Method not found: envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources

since we did not switch them from transport_api_version v2 to v3. Shouldn't the V2DiscoveryServer also handle envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources (and vice versa)?

If I'm completely wrong, can you describe the migration steps on the Envoy side and the CP side?

@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 22, 2020

I was envisioning that a control plane implementation will set up both a V2DiscoveryServer and a V3DiscoveryServer implementation, and give them the same cache instance. This would let you handle a combination of v2 and v3 transport as the migration happens, but only need to generate one resource version.

It's my understanding that all combinations of transport and resource API version can be handled, but I don't see a reason why during migration you wouldn't keep transport_api_version and resource_api_version the same.

It looks like the integration tests might unintentionally always be using the v2 resource version.

we could:

  1. fix the tests using v3 transport_api_version to also use v3 resource_api_version
    OR
  2. add integration tests for all combinations of transport_api_version, resource_api_version, & cache version? that would result in an explosion of integration tests, but potentially worth it?

…on tests

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Note that there are not tests that cover a different transport version
from resource version in the interest of not doubling the number of
integration tests.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 23, 2020

I DRY'd up the tests somewhat. I think a lot of the challenge there is that a lot of the heavy lifting is in static field initializers, so I don't think I can use an inheritance based approach very effectively. Java is not my main language though, so feel free to point out suggestions.

I also fixed the integration tests so that the v3 clients are requesting v3 resources as well, it was an oversight that I wasn't setting resource_api_version: V3 before, so it was defaulting to V2. This had some trickle down effect where I had to change the test resources to not use deprecated V2 fields, because those can't be converted to v3.

Also added a migration guide explaining how I think migration will work. Hopefully that helps clear up how this PR should be used, and y'all can double check me, it's definitely a bit confusing.

@mpuncel
Copy link
Contributor Author

mpuncel commented Aug 23, 2020

Also let me know if it would help for me to break this into smaller PRs that can be reviewed piecemeal. I needed to do it all together I think in order to understand how it would work, and I think it's probably useful to see in its entirety before reviewing smaller chunks, but I know it's pretty tough to review in its current form

@rene-m-hernandez
Copy link
Contributor

rene-m-hernandez commented Sep 2, 2020

It would seem that remaining on v2 of xDS API incurs a performance penalty when using Envoy 14+: envoyproxy/envoy#12080

Any chance more progress has been made on v3 support in jcp? I've been hoping I would not need to fork this repo, but after discovering the above issue ....

@nezdolik
Copy link
Member

nezdolik commented Sep 8, 2020

@mpuncel great initiative! We at Spotify are currently migrating edge control plane to use xds api v3 and are interested in this PR. Let us know if we could help.

@slonka
Copy link
Member

slonka commented Sep 8, 2020

I was envisioning that a control plane implementation will set up both a V2DiscoveryServer and a V3DiscoveryServer implementation, and give them the same cache instance.

If you want to give them the same cache instance then I guess SimpleCache cannot be abstract right (because it must server both V3 and V2 Snapshots)?

Edit: I'm working on an integration test that has both V2 and V3 Envoy's connected. If that works, this will remove all doubt that the migration is possible.

Not sure it's completely right but here's the code: slonka@6ff0c80 I do not see other way than this to serve V2 and V3 snapshots.

Edit: Ok, I created a version that serves only V3 Snapshots and both V2 and V3 Envoy consume it. It works because of this #140 (comment) right? If this https://github.com/mpuncel/java-control-plane/compare/mpuncel/v3-support...slonka:v3-support-slonka?expand=1 test shows how you envisioned the migration then I'm convinced and we can move this PR forward.

@slonka
Copy link
Member

slonka commented Sep 9, 2020

I released update-v3-SNAPSHOT from: https://github.com/envoyproxy/java-control-plane/tree/v3-support-slonka and will be integrating it with our Control Plane and testing. I had to disable a couple of tests that passed locally but failed on CI, please work on them when you have time.

@nezdolik
Copy link
Member

nezdolik commented Sep 9, 2020

cc @rulex123

@nezdolik
Copy link
Member

@slonka @mpuncel could you try to retrigger build to check if this is an intermittent issue? i've tested this PR with increased timeout and all checks are passing: #142. It could be flakiness or too small timeout.

@slonka
Copy link
Member

slonka commented Sep 15, 2020

@nezdolik
Copy link
Member

@slonka the latest run is successful. This flakiness could appear in future builds. Did not dig into code much, one assumption is that echo container was not ready to serve request at the time the test was run (docker networking was not fully set up).

@slonka
Copy link
Member

slonka commented Sep 15, 2020

@nezdolik the tests from this branch failed once again: https://app.circleci.com/pipelines/github/envoyproxy/java-control-plane?branch=pull%2F140

image

I do not know why CircleCi reported your PR (142) in this PR as passing (or am I missing something?).

AFAIK our test setup in envoy-control (example) does not suffer from this sort of flakiness and I'm guessing RuleChain might be a problem ( https://github.com/envoyproxy/java-control-plane/pull/140/files#diff-1305245ead97a0cc1311ece8b78e6838R85 ).

I think we can open up an issue about flaky tests and move forward with this PR. Have you had a chance to test this version at Spotify?

@nezdolik
Copy link
Member

@slonka maybe because the git history in my PR is complete copy of this PR, if CI relies on comparing git hashes to distinguish builds...
Opening a follow up issue for fixing tests sgtm.
I can test snapshot version today for our setup and report back.

@nezdolik
Copy link
Member

@slonka @mpuncel while upgrading our custom plane to use code from this PR, i came across io.envoyproxy.controlplane.cache.NodeGroup interface, which has 2 abstract methods:

          @Override
          public String hash(io.envoyproxy.envoy.api.v2.core.Node node) {
          ...
          }

          @Override
          public String hash(io.envoyproxy.envoy.config.core.v3.Node node) {
         ...
          }

So clients of the library will have to override both of those methods which is not optimal. Wdyt?

@slonka
Copy link
Member

slonka commented Sep 15, 2020

@slonka @mpuncel while upgrading our custom plane to use code from this PR, i came across io.envoyproxy.controlplane.cache.NodeGroup interface, which has 2 abstract methods:

          @Override
          public String hash(io.envoyproxy.envoy.api.v2.core.Node node) {
          ...
          }

          @Override
          public String hash(io.envoyproxy.envoy.config.core.v3.Node node) {
         ...
          }

So clients of the library will have to override both of those methods which is not optimal. Wdyt?

I don't see any other way to distinguish between v2 and v3 node, if we want to still allow SimpleCache to serve them. If we split it I think we would have to provide a separate cache for v2 and v3. Can you share an example of how would you do it differently?

@nezdolik
Copy link
Member

nezdolik commented Sep 15, 2020

@slonka if we "force" clients of the library to implement both v2 and v3 methods, it means that users cannot do v3 migration in one go and will have to do extra step by end of Q4 after the library removes all usages of v2 and is released with breaking changes. If this is acceptable and the PR intends to introduce v3 support with some mixed versioning in api then i guess code is fine as it is.
Otherwise, we could either duplicate the code to support exclusively v3 version or benefit from Java 8 innovation and introduce default implementations of those methods in interface, eg:

interface NodeGroup {

          default public String hash(io.envoyproxy.envoy.api.v2.core.Node node) {
           throw new UnsupportedOperationException();
          }

          default public String hash(io.envoyproxy.envoy.config.core.v3.Node node) {
           throw new UnsupportedOperationException();
          }
}

Would be great to hear more opinions on this.

@slonka
Copy link
Member

slonka commented Sep 16, 2020

@mpuncel nad I decided to be explicit when it comes to the v2/v3 methods and removed "default" implementations (see comment and code). As an end user I'd rather know at the compilation step that if I do not implement this, turning on V3 will cause problems.

@pbetkier any thoughts on this?

@nezdolik
Copy link
Member

nezdolik commented Sep 16, 2020

@slonka got it, was just thinking that there are users (like us) who would rather fully move on v3 with no need to maintain v2 code in their control plane.
I've also tested this PR with our setup, everything works well so far.

@pbetkier
Copy link

I agree that it's better to fail at compile time than discover problems at startup. Default methods could cause the latter. On the other hand, I think removing the implementation for a method that no longer exists in the future interface will be rather straightforward to do by the clients – again, the compiler will show what's to change. That's how I see the tradeoff, I can be convinced though.

I didn't dig into the code enough to know if duplicate code approach is feasible. If it is then we could do it this way.

@slonka
Copy link
Member

slonka commented Sep 17, 2020

Otherwise, we could either duplicate the code to support exclusively v3 version

I just want to make sure I understand that completely. Ok, so during the migration period when some Envoy's are v2 and some are v3 we would translate the v2 node to v3 type? Or do you want to have different caches for v2 and v3 nodes?

@nezdolik
Copy link
Member

@slonka thinking about it once again, looks like is a good overall tradeoff (not to duplicate caches for library maintainers + removing one method for library clients some time before end of Q4).
I could have been too focused on our use case (we are going to upgrade entire fleet in one go, no mixed Envoy versions running).

@slonka
Copy link
Member

slonka commented Sep 17, 2020

Pinging @sschepens once more as the author of #139 if you want to take a look. If there are no objections I'm merging and releasing it on Friday afternoon / Monday morning (probably Monday morning)

@slonka slonka merged commit a75b971 into envoyproxy:master Sep 21, 2020
@slonka
Copy link
Member

slonka commented Sep 22, 2020

I was not able to release this version due to failing "WarmingClusters" integration tests. People who want to use this version now should use the snapshot version while I fix the tests in another PR.

@slonka
Copy link
Member

slonka commented Sep 24, 2020

Released under 0.1.24

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants