Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v2 -> v3 API postmortem #10943

Closed
htuch opened this issue Apr 24, 2020 · 13 comments
Closed

v2 -> v3 API postmortem #10943

htuch opened this issue Apr 24, 2020 · 13 comments
Assignees
Labels
api/v3 Major version release @ end of Q3 2019 stale stalebot believes this issue/PR has not been touched recently

Comments

@htuch
Copy link
Member

htuch commented Apr 24, 2020

We will write a postmortem on the v2 to v3 migration process for Envoy, control planes and other clients such as gRPC. Please record any pertinent information in this ticket as you go with your migration efforts and we can summarize in a document later on.

@htuch htuch self-assigned this Apr 24, 2020
@htuch htuch added api/v3 Major version release @ end of Q3 2019 no stalebot Disables stalebot from closing an issue labels Apr 24, 2020
@jyotimahapatra
Copy link
Contributor

jyotimahapatra commented Apr 27, 2020

My perspective on migration to v3:

  1. Since envoy will call different handlers for v2 vs v3, the control plane has to make sure it handles both handlers.
    Since we use the canonical go-control-plane, the auto generated proto code enables services/methods for v2 & v3. However, the separate services have to be registered with the grpc server.
    This is additional work for operators of control planes.
  2. since the go grpc boiler plate works with strong types, it creates Send/Recv separately for v2 and v3. This is where the complexity starts. Although most fields are compatible, the deprecated fields make it hard to make a generic solution in the go-control-plane.
    As a result the go-control-plane puts the onus of handling separate versions on the private control planes.
  3. At Lyft, we operate a private control plane.
    We need a deployment strategy to granularly migrate to v3 without changing too much code in one go. Like mentioned above, the onus of handling the separate versions is on the private control plane. One way to pursue the migration would be to make sure we serve the same response to v2/v3 requests and avoid creating a layer that understands the version semantics and changing typeurl manually. I think Support mixed v2/v3 xDS request/responses #10776 handles the use case. (We understand that for this to work, we we cannot have any deprecated fields in configs) . This approach helps in rolling back easily if anything goes wrong.
  4. If (3) goes well, we can change all code in control plane to v3. All v3 sidecars will be current. Any lagging envoys at this point can still operate as long as control plane doesn't send new v3 fields.
  5. When all envoys and their bootstrap configs are updated to use v3 resource version, we can call the migration done. Due to unavoidable circumstances, there are going to be envoy sidecars or configs in the fleet which are few weeks old.
  6. migration of rds/eds is easier since they can be easily rolled forward/backward by control plane code. Migration of cds/lds is more tricky since they live in bootstrap configs. And rollback will take time to deploy and cause disruptions. It would be great if there was a way to handle this use case.

@markdroth
Copy link
Contributor

One thought that has come up is that bumping the major version number for both the transport protocol and every individual resource type all in lock-step is probably a lot more heavy-weight than is really necessary. Ideally, we should be able to track and change the versions of the transport protocol and every resource type independently, and decide when to bump the version number of each piece only when there's enough technical debt in that piece to be worthwhile. This would probably require adding a type_url field to ConfigSource, but that seems like it would actually be a fairly logical extension of the API, and it would make it possible to make these kinds of changes in a more granular and therefore less painful manner.

It's worth noting that because the version number is part of the type URL, it doesn't really matter whether we change the version or change the type name itself. For example, once we agree on a new representation for routing (as per the discussions we've been having in the UDPA workgroup), we could replace the existing envoy.config.route.v3.RouteConfiguration to a new message, which could be called either envoy.config.route.v4.RouteConfiguration or udpa.config.route.v1.RouteConfiguration or udpa.v1.some.path.NewTypeNameThatReplacesRouteConfiguration or whatever.

@antoniovicente
Copy link
Contributor

Proto validation and conversion costs have been a problem specially for EDS. Envoy deployments that get used to discovery APIs being relatively inexpensive when using the right versions will face problems and be forced to rollback binaries to those with last-efficient xDS versions around the time of version bumps.

Refs: #10875 and #11362

@lita
Copy link

lita commented Jun 1, 2020

Another thing to note that go-control-plane currently sets the TypeUrl of the request and doesn't allow a passthrough if we want the response to be a different TypeUrl.

envoyproxy/go-control-plane#303

@htuch
Copy link
Member Author

htuch commented Sep 23, 2020

#10964 did not fully resolve #10815, since it didn't capture the following cases of v2 xDS use:

  • AUTO (default) transport version.
  • V2/AUTO (default) resource version.
  • v2 bootstrap
  • Extension/filter configs with v2 type URLs

@htuch
Copy link
Member Author

htuch commented Nov 30, 2020

A pain point around the v2 -> v3 translation is the removal of support for std::regex regexes, since this couples the major version transition with removal of previously but not longer supported functionality.

@htuch
Copy link
Member Author

htuch commented Nov 30, 2020

Similar problem to #10943 (comment) happened with fatal-by-default migration.

@mpuncel
Copy link
Contributor

mpuncel commented Dec 9, 2020

One thing to consider would be separate versioning for non-xds services. At Square we have 2 gRPC access log servers, 2 gRPC ext_authz servers, and a gRPC rate limit server that all had to be upgraded to expose v3 and v2 and then have the transport version switched over. I don't think there were meaningful changes in those particular APIs from v2 to v3, and a lot of these are owned by teams other than the ones responsible for maintaining the xDS servers.

It might be confusing to have multiple API versions throughout the Envoy APIs though. Also it is interesting to note that HTTP APIs were not affected because they aren't versioned the same way.

@htuch
Copy link
Member Author

htuch commented Dec 10, 2020

Thanks @mpuncel. We have similar feedback on coupling transport/resource version bumps from @markdroth. I'd add to your observation that there are third party libraries like OPA for ext_authz that had to upgrade, and in the OPA case they changed their configuration model, so this wouldn't be transparent projects/operators downstream.

@htuch
Copy link
Member Author

htuch commented Feb 7, 2021

Worth noting that git blame analysis is confused by the machine generated protos and can't accurately track origins of API features, pointed out by @asraa.

Also see some straggler v2 uses in 1.17 fixed by #14907

@auni53
Copy link
Contributor

auni53 commented Mar 10, 2021

I understand this may not have been feasible to fix, but one issue I ran into when dealing with our migration was hidden settings of config source proto, which were implicitly set to AUTO -> V2 and not printed by the default proto printing API. I think setting AUTO -> V3 and forcing everyone to explicitly set those to V2 during the deprecated/migration phase might have been helpful.

htuch added a commit to htuch/envoy that referenced this issue Mar 18, 2021
… on main thread.

During the v2 -> v3 migration, some uses of this were added inside filter factory lambdas and could
occur on worker threads, rejecting config on the data plane.

Relevant to envoyproxy#10943

Risk level: Low
Testing: New assertion in method caused existing tests to fail, fix addresses these. Also manually
  audited via grep.

Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added a commit that referenced this issue Mar 30, 2021
… on main thread. (#15548)

During the v2 -> v3 migration, some uses of this were added inside filter factory lambdas and could
occur on worker threads, rejecting config on the data plane.

Relevant to #10943
Fixes #15083

Risk level: Low
Testing: New assertion in method caused existing tests to fail, fix addresses these. Also manually
audited via grep.

Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added a commit to htuch/envoy that referenced this issue Oct 28, 2021
This should reduce the binary size, which is particularly important for Envoy Mobile. Looking at a
local opt build with debug symbols, I'm seeing a drop from ~400MB to ~380MB, so maybe 5% saving.

Related to envoyproxy#10943

Risk level: Low
Testing: bazel query deps to confirm no more v2 API deps.

Signed-off-by: Harvey Tuch <htuch@google.com>
htuch added a commit that referenced this issue Oct 29, 2021
This should reduce the binary size, which is particularly important for Envoy Mobile. Looking at a
local opt build with debug symbols, I'm seeing a drop from ~400MB to ~380MB, so maybe 5% saving. @Reflejo indicates that optimized Envoy Mobile without symbols is observing ~20% improvement.

Related to #10943

Risk level: Low
Testing: bazel query deps to confirm no more v2 API deps.

Signed-off-by: Harvey Tuch <htuch@google.com>
@alyssawilk alyssawilk removed the no stalebot Disables stalebot from closing an issue label Oct 8, 2024
Copy link

github-actions bot commented Nov 7, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 7, 2024
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api/v3 Major version release @ end of Q3 2019 stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

8 participants