Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please reconsider removal of GRPC_ENFORCE_ALPN_ENABLED in version 1.68.0 and beyond. #7769

Closed
DmitriyMV opened this issue Oct 22, 2024 · 12 comments
Labels
Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. Type: Feature New features or improvements in behavior

Comments

@DmitriyMV
Copy link

Hey everyone!

We would like to raise our concerns about the upcoming removal of the GRPC_ENFORCE_ALPN_ENABLED environment variable in version 1.68.0 of the gRPC library, which is currently slated to enforce the proper usage of ALPN permanently (without an opt-out).

In versions prior to 1.67.0, this feature was controlled via the GRPC_ENFORCE_ALPN_ENABLED environment variable, defaulting to false. In 1.67.0, the default was changed to true, and in 1.68.0, the plan is to remove the variable entirely and make the behavior mandatory.

We understand and support the need for proper configurations, but this change poses significant challenges for our usage. We develop Talos OS, where our clients upgrade their systems gradually, independently from us, not all at once, sometimes skipping several versions at one moment. If this change goes through as planned, it would prevent newer nodes from communicating with older nodes, potentially causing upgrade failures (such as workers not able to communicate with control plane nodes).

Additionally, we also develop another product, Omni, which clients can use in a self-hosted mode. In scenarios where older versions of Omni are running with gRPC <= 1.67.x (where GRPC_ENFORCE_ALPN_ENABLED is not set), newer versions of Talos OS using gRPC >= 1.68.0 will lose the ability to manage these clusters of those nodes.

This situation could create significant disruptions for our clients. We believe that we are not alone in facing this challenge, as many microservice-based architectures would similarly struggle to update all services simultaneously, leading to potential downtime and service interruptions.

We kindly request that you reconsider the removal of the GRPC_ENFORCE_ALPN_ENABLED variable in version 1.68.y beyond. It would also be great if you also provide a way to enforce old behavior using the Go code (calling a function in init perhaps?), rather than environment variables.

This change is very important to us and to others in the community who rely on gradual, seamless upgrades.

@frezbo
Copy link

frezbo commented Oct 22, 2024

It would also be great if you also provide a way to enforce old behavior using the Go code (calling a function in init
perhaps?), rather than environment variables.

Or as a better approach use a dial option, but add deprecation warning maybe 🤔

@arjan-bal
Copy link
Contributor

arjan-bal commented Oct 22, 2024

@DmitriyMV, we don't plan to remove the env variable in 1.68, but in some (still undecided) future version.

grpc-go has been automatically setting the correct NextProtos field for more than year (maybe even earlier, I checked up to v1.58). gRPC C and Java should also be configuring ALPN correctly.

  • How is TLS being configured in your clients/servers?
  • Is there a proxy in the middle which is causing issues?

@arjan-bal arjan-bal added Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. Type: Question Status: Requires Reporter Clarification labels Oct 22, 2024
@DmitriyMV
Copy link
Author

DmitriyMV commented Oct 22, 2024

How is TLS being configured in your clients/servers?

Using tls.Config.GetConfigForClient. Should be fixed by this PR but we can't update really old users for now. My best guess is that it will take at least two years for clients to get to the Talos OS with fixed grpc usage.

Is there a proxy in the middle which is causing issues?

No, but the old nodes (which will also act as a servers, for example, once VPN connection is established) need to continue working.

@DmitriyMV DmitriyMV removed their assignment Oct 22, 2024
@arjan-bal
Copy link
Contributor

arjan-bal commented Oct 22, 2024

A few more questions:

@smira
Copy link

smira commented Oct 23, 2024

  • How long are you requesting that grpc-go keeps the env variable to turn off ALPN enforcement?

I would rather prefer it to be a Dial option, as setting an environment variable works, but it is a bit hacky.

I would prefer to keep it throughout 1.x.

We fixed current Talos version, and backported the fix.

But as Talos is an operating system, it's hard to enforce upgrades, so we can expect older versions of Talos to be used at least for a year (and probably more), so implementing any kind of client for Talos which works across versions is tricky (e.g. a Terraform provider which has to talk to older versions of Talos).

Moreover, components of Talos OS are both gRPC clients and servers, so if one machine in a cluster is upgraded, it might block the communication between machines (if the ALPN workaround is removed from grpc-go).

@DmitriyMV DmitriyMV removed their assignment Oct 23, 2024
@arjan-bal
Copy link
Contributor

@smira having a Dial option request is orthogonal to the backward compatibility issue, right?

k8s has a version skew policy which mentions that different k8s components should be with 1-2 minor versions of each other. If Talos OS has a similar version skew compatibility policy, you could have a couple of versions of Talos OS which ship grpc-go >= 1.67 and then start enforcing ALPN in next versions. Is this possible?

@smira
Copy link

smira commented Oct 23, 2024

@smira having a Dial option request is orthogonal to the backward compatibility issue, right?

It just makes it more straightforward for the client to be compatible with ALPN-less servers. It can be enabled for a specific connection, and it is way easier to be set. E.g. if you boot a service, you have to pass the environment variable to it, while it could have been an dial option, which would be way more explicit.

If Talos OS has a similar version skew compatibility policy, you could have a couple of versions of Talos OS which ship grpc-go >= 1.67 and then start enforcing ALPN in next versions. Is this possible?

Yes, this is definitely possible, two releases means ~8 months for us.

But this problem, is yet another problem. Any provisioning tool (see my Terraform provider above) would be expected to work with some number of Talos releases, as one might have a cluster deployed one year ago, and never upgraded after that. Having a Terraform provider locked down to a Talos release makes little sense.

The ALPN issue itself is totally a no-op in our case - as the gRPC API runs on a dedicated endpoint, having or not having ALPN doesn't change from functional perspective.

If fully appreciate though the desire for the client to be compliant with the spec and enforce ALPN on client connections, but as previously it "worked" without ALPN, all I'm suggesting is to add a dial option (and mark it as deprecated) to allow connecting to ALPN-less gRPC servers, and provide it only for backwards compatibility with ALPN-less servers which are hard to upgrade for one or another reason.

For Talos OS, upgrading means calling gRPC Upgrade API, so even an upgrade requires a client which can talk successfully to Talos OS API service.

@arjan-bal
Copy link
Contributor

Thanks for the details, let me get the opinions of other maintainers and get back.

@arjan-bal
Copy link
Contributor

HI @smira, other gRPC language implementations like Java, C don't provide dial/server options (channel args in c) to disable ALPN. So we don't want to to provide such an option in Go.

The ALPN enforcement in Go, is part of the TransportCredentials API:

grpc-go/credentials/tls.go

Lines 92 to 177 in 8212cf0

func (c *tlsCreds) ClientHandshake(ctx context.Context, authority string, rawConn net.Conn) (_ net.Conn, _ AuthInfo, err error) {
// use local cfg to avoid clobbering ServerName if using multiple endpoints
cfg := credinternal.CloneTLSConfig(c.config)
if cfg.ServerName == "" {
serverName, _, err := net.SplitHostPort(authority)
if err != nil {
// If the authority had no host port or if the authority cannot be parsed, use it as-is.
serverName = authority
}
cfg.ServerName = serverName
}
conn := tls.Client(rawConn, cfg)
errChannel := make(chan error, 1)
go func() {
errChannel <- conn.Handshake()
close(errChannel)
}()
select {
case err := <-errChannel:
if err != nil {
conn.Close()
return nil, nil, err
}
case <-ctx.Done():
conn.Close()
return nil, nil, ctx.Err()
}
// The negotiated protocol can be either of the following:
// 1. h2: When the server supports ALPN. Only HTTP/2 can be negotiated since
// it is the only protocol advertised by the client during the handshake.
// The tls library ensures that the server chooses a protocol advertised
// by the client.
// 2. "" (empty string): If the server doesn't support ALPN. ALPN is a requirement
// for using HTTP/2 over TLS. We can terminate the connection immediately.
np := conn.ConnectionState().NegotiatedProtocol
if np == "" {
if envconfig.EnforceALPNEnabled {
conn.Close()
return nil, nil, fmt.Errorf("credentials: cannot check peer: missing selected ALPN property")
}
logger.Warningf("Allowing TLS connection to server %q with ALPN disabled. TLS connections to servers with ALPN disabled will be disallowed in future grpc-go releases", cfg.ServerName)
}
tlsInfo := TLSInfo{
State: conn.ConnectionState(),
CommonAuthInfo: CommonAuthInfo{
SecurityLevel: PrivacyAndIntegrity,
},
}
id := credinternal.SPIFFEIDFromState(conn.ConnectionState())
if id != nil {
tlsInfo.SPIFFEID = id
}
return credinternal.WrapSyscallConn(rawConn, conn), tlsInfo, nil
}
func (c *tlsCreds) ServerHandshake(rawConn net.Conn) (net.Conn, AuthInfo, error) {
conn := tls.Server(rawConn, c.config)
if err := conn.Handshake(); err != nil {
conn.Close()
return nil, nil, err
}
cs := conn.ConnectionState()
// The negotiated application protocol can be empty only if the client doesn't
// support ALPN. In such cases, we can close the connection since ALPN is required
// for using HTTP/2 over TLS.
if cs.NegotiatedProtocol == "" {
if envconfig.EnforceALPNEnabled {
conn.Close()
return nil, nil, fmt.Errorf("credentials: cannot check peer: missing selected ALPN property")
} else if logger.V(2) {
logger.Info("Allowing TLS connection from client with ALPN disabled. TLS connections with ALPN disabled will be disallowed in future grpc-go releases")
}
}
tlsInfo := TLSInfo{
State: cs,
CommonAuthInfo: CommonAuthInfo{
SecurityLevel: PrivacyAndIntegrity,
},
}
id := credinternal.SPIFFEIDFromState(conn.ConnectionState())
if id != nil {
tlsInfo.SPIFFEID = id
}
return credinternal.WrapSyscallConn(rawConn, conn), tlsInfo, nil
}

Users can configure their own custom credentials by implementing the TransportCredentials interface. This should allow Talos OS to copy gRPC's TLS creds implementation and remove the TLS enforcement parts without too much effort. Would this work for your project?

@DmitriyMV
Copy link
Author

DmitriyMV commented Oct 24, 2024

Thats a lot of code we will need to copy and keep with upstream in the future.

Maybe there is another way? Add a public method to the tlsCreds struct type which disables this check (or allows to customize it) but do not export it in TransportCredentials interface. That way people in need can do a type assertion on returned TransportCredentials and customize the behavior, while others will never see this change and TransportCredentials interface remains unaffected.

And at most it will add one more field to the tlsCreds struct.

@smira
Copy link

smira commented Oct 24, 2024

Other gRPC language implementations like Java, C don't provide dial/server options (channel args in c) to disable ALPN. So we don't want to to provide such an option in Go.

I don't think it exactly applies here, as the problem is that Go client accepted ALPN-less servers for years, so the issue was hard to detect.

👍 to @DmitriyMV suggestion

@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior and removed Status: Requires Reporter Clarification Type: Question labels Oct 24, 2024
@purnesh42H
Copy link
Contributor

We will keep the GRPC_ENFORCE_ALPN_ENABLED flag for few more releases but are not reconsidering to remove it. When we do decide to remove the flag, we will provide a custom creds which doesn't enforce ALPN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. Type: Feature New features or improvements in behavior
Projects
None yet
Development

No branches or pull requests

5 participants