
Dedicated TChannel Client Per Callee Service #778

Merged · 25 commits · Aug 13, 2021

Conversation

@isopropylcyanide (Contributor) commented Jul 29, 2021

Summary

PR Status: Ready | Type: Feature | Impact level: High

Internal JIRA: https://t3.uberinternal.com/browse/EDGE-326 (ERD attached in the ticket)

Release 1.0.0 - 2021-08-05

Changed

  • BREAKING gateway.Channel has been renamed to gateway.ServerTChannel to distinguish between client and server TChannels.

  • BREAKING gateway.TChannelRouter has been renamed to gateway.ServerTChannelRouter.

Added

  • A new boolean flag dedicated.tchannel.client is introduced. When set to true, each TChannel client creates a dedicated connection instead of registering on the default shared server connection.
    Clients need not do anything, as the default behaviour is to use a shared connection. (#778) See the config sketch after this list.

  • A new field has been added to the gateway to propagate the TChannelSubLoggerLevel for clients with dedicated connections.
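For reference, a minimal config sketch combining the two settings above (the key names are the ones discussed in this PR; the level value is illustrative):

dedicated.tchannel.client: true
subLoggerLevel.tchannel: info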


Reviewer Notes

  • Only 6-7 files need review; the rest are generated code or tests.
  • Estimated time to review: 20 mins, given that you've read the internal ERD :)

Description

This PR adds / changes the following:

  • Changes the way Edge TChannel clients connect with the sidecar. Instead of using a common gateway channel and router, we create a dedicated channel for each client, controlled by the flag dedicated.tchannel.client: true (diagram attached in the PR; see the sketch after this list).

  • Shutdown hooks now include the shutdown for clients in a separate goroutine.

  • Refactors the existing gateway channel and router to more readable names.

  • Does not change templates for tchannel endpoints. They still get registered on the common server router. Only clients get new routers.

  • Fixes some formatting inconsistencies and typos.
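A minimal sketch of the per-client channel selection, assuming the flag semantics described above (pickClientChannel and its parameters are illustrative, not the actual generated code):

import (
	"strconv"

	tchannel "github.com/uber/tchannel-go"
)

// pickClientChannel sketches how a generated client could choose between the
// shared server channel (default) and a dedicated channel per client.
func pickClientChannel(
	dedicated bool,
	sharedServerChannel *tchannel.Channel,
	localServiceName, peerIP string,
	peerPort int32,
) (*tchannel.Channel, error) {
	if !dedicated {
		// Flag false or unset: keep the existing shared-connection behaviour.
		return sharedServerChannel, nil
	}
	// dedicated.tchannel.client: true — this client owns its channel and peer.
	ch, err := tchannel.NewChannel(localServiceName, nil)
	if err != nil {
		return nil, err
	}
	ch.Peers().Add(peerIP + ":" + strconv.Itoa(int(peerPort)))
	return ch, nil
}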

Motivation & Context

  • Currently, all TChannel clients share the common gateway channel when talking to the sidecar (diagram attached in the PR).

  • Maintaining a dedicated connection ensures that, in the event of a misbehaving downstream, requests to other “healthy” TChannel clients do not time out. Only the backlog of requests queued for the unhealthy service is affected if the timeout expires. This makes the request layer in Edge more fault tolerant; as a Tier 0 service, this offers significant platform stability.

  • Since each queue would be separate and multiple router-sidecar goroutines would be reading requests in parallel, we expect good improvements in request latency.

  • Slight, though insignificant, improvements may be seen in CPU (more utilization) and memory (reduced pressure from requests in the queue).

How was this tested?

Tested this version of Zanzibar in Uber's Edge Gateway and verified that:

  • Multiple tchannel clients in Edge get their own channels and are able to make RPCs
  • Statsd metrics are emitted
tchannel.client.running+client=bazService,dc=unknown,env=test,service=test-gateway
      Value:v2.MetricValue{MetricType:2, Count:0, Gauge:1, Timer:0},..., 
  • Shutdown hooks are called

Also tested that conditional client generation works: when dedicated.tchannel.client is false or unset, the existing behaviour is preserved.

Benchmarks

Verdict: no considerable change in ops/sec; zero change in memory allocations.

Branch

========
time -p sh -c "go test -run _NONE_ -bench . -benchmem -benchtime 7s -cpu 2 ./test/... | grep -v '^ok ' | grep -v '\[no test files\]' | grep -v '^PASS'"
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  108061	     67386 ns/op	   21593 B/op	     165 allocs/op
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test/endpoints/baz
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   36069	    232823 ns/op	   85245 B/op	     716 allocs/op
   36259	    229871 ns/op	   80946 B/op	     713 allocs/op
   39594	    213469 ns/op	   74620 B/op	     671 allocs/op
   38223	    213605 ns/op	   74171 B/op	     657 allocs/op
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test/endpoints/contacts
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   42102	    190982 ns/op	   73774 B/op	     559 allocs/op
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test/endpoints/googlenow
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   52650	    155615 ns/op	   63501 B/op	     441 allocs/op
real 96.87
user 121.16
sys 39.36
========

Master

========
time -p sh -c "go test -run _NONE_ -bench . -benchmem -benchtime 7s -cpu 2 ./test/... | grep -v '^ok ' | grep -v '\[no test files\]' | grep -v '^PASS'"
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  126316	     56844 ns/op	   21978 B/op	     165 allocs/op
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test/endpoints/baz
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   35053	    228300 ns/op	   88222 B/op	     716 allocs/op
   36102	    226970 ns/op	   83794 B/op	     713 allocs/op
   39868	    213884 ns/op	   76952 B/op	     671 allocs/op
   39051	    214590 ns/op	   76357 B/op	     657 allocs/op
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test/endpoints/contacts
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   42651	    195224 ns/op	   78067 B/op	     559 allocs/op
goos: darwin
goarch: amd64
pkg: github.com/uber/zanzibar/test/endpoints/googlenow
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   51500	    151183 ns/op	   60568 B/op	     441 allocs/op
real 98.17
user 129.94
sys 42.91
========

@coveralls commented Jul 29, 2021

Coverage Status

Coverage increased (+0.004%) to 69.58% when pulling 3720769 on per-client-tchannel-conn into 52402cd on master.

@isopropylcyanide isopropylcyanide changed the title TChannel Client gets its own connection Dedicated Client Per TChannel Callee Jul 29, 2021
@isopropylcyanide isopropylcyanide changed the title Dedicated Client Per TChannel Callee Dedicated TChannel Client Per Callee Service Jul 29, 2021
ServerTChannelRouter *TChannelRouter
TChannelSubLoggerLevel zapcore.Level
Tracer opentracing.Tracer
JSONWrapper jsonwrapper.JSONWrapper

isopropylcyanide (Contributor, Author):

The diff here might make it hard to understand what has changed in the struct. Refer below

Channel -> ServerTChannel
TChannelRouter -> ServerTChannelRouter
+ TChannelSubLoggerLevel
+ ClientTChannels

@@ -800,6 +803,7 @@ func (gateway *Gateway) setupTChannel(config *StaticConfig) error {
if !ok {
return errors.Errorf("unknown sub logger level for tchannel server: %s", levelString)
}
gateway.TChannelSubLoggerLevel = level

isopropylcyanide (Contributor, Author):

Required when dedicated.tchannel.client: true, in which case we need this config while creating clients during the init phase.

Contributor:

qq: How is logger level related to TChannel?

isopropylcyanide (Contributor, Author) commented Aug 9, 2021:

Here's how.

  • When we create a new TChannel, we need to specify a tchannel.Logger.
  • To pass this, we create a new TChannelLogger value.
  • To create this value, we create a sublogger based on gateway.logger which essentially clones the original logger with a different level (since we might not want verbose TChannel logs)
  • The level we pass to this logger (debug|info|...) is determined by gateway.TChannelSubLoggerLevel
  • gateway.TChannelSubLoggerLevel is populated from subLoggerLevel.tchannel: <> in the yaml
func NewTChannelLogger(logger *zap.Logger) tchannel.Logger {}
func (gateway *Gateway) SubLogger(name string, level zapcore.Level) *zap.Logger {}
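Schematically, the wiring looks like this (a sketch only, using the two helpers above; newClientChannel and the sub logger name "tchannelClient" are illustrative, not the actual generated code):

// Sketch: the sub logger level flows into the dedicated client channel's logger.
func newClientChannel(gateway *Gateway, serviceName string) (*tchannel.Channel, error) {
	subLogger := gateway.SubLogger("tchannelClient", gateway.TChannelSubLoggerLevel)
	return tchannel.NewChannel(serviceName, &tchannel.ChannelOptions{
		Logger: NewTChannelLogger(subLogger),
	})
}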

@@ -62,7 +62,7 @@ func TestCallMetrics(t *testing.T) {
},
)

numMetrics := 12
numMetrics := 13
isopropylcyanide (Contributor, Author) commented Aug 5, 2021:

This is because we added a new gauge, tchannel.clients.running.

isopropylcyanide force-pushed the per-client-tchannel-conn branch from 7f3728c to 4ee84ed on August 6, 2021 08:35
@@ -422,7 +425,7 @@ func (gateway *Gateway) Shutdown() {
swg.Add(1)
go func() {
defer swg.Done()
if err := gateway.shutdownTChannelServer(ctx); err != nil {
if err := gateway.shutdownTChannelServerAndClients(ctx); err != nil {
ec <- errors.Wrap(err, "error shutting down tchannel server")
Contributor:

The error string can also be improved

isopropylcyanide (Contributor, Author):

Updated


service.shadow.env.override.enable: true
dedicated.tchannel.client: true
Contributor:

qq: How have we handled the default behaviour? Does it remain the same? Should the example be for default / old behaviour or new behaviour?

isopropylcyanide (Contributor, Author):

  • Default behaviour: Same as before
  • The example-gateway sample has the new behaviour; the selective-gateway sample has the default behaviour.

Refactor log string
var channel *tchannel.Channel

// If dedicated.tchannel.client : true, each tchannel client will create a
// dedicated connection with local Muttley, else it will use a shared connection
Contributor:

nit: let's not mention "local Muttley" here.

isopropylcyanide (Contributor, Author) commented Aug 9, 2021:

Agreed. Missed it here. Changed.

} else {
channel = createNewTchannelForClient(deps, serviceName)
channel.Peers().Add(ip + ":" + strconv.Itoa(int(port)))
gateway.ClientTChannels[serviceName] = channel
Contributor:

Is this ok given one service can have multiple clients?

isopropylcyanide (Contributor, Author):

Didn't get it. The client here is the local tchannel client for each backend.

isopropylcyanide (Contributor, Author):

As discussed offline, this is the required behaviour, e.g. presentation services where multiple tchannel clients talk to the same presentation backend.

for serviceName, clientTchannel := range gateway.ClientTChannels {
go func(service string, ch *tchannel.Channel) {
gateway.Logger.Info(fmt.Sprintf("Closing tchannel client for [%v]", service))
ch.Close()
Contributor:

This could lead to a situation where the client is closed but the server is not, leading to failures in pending outbound calls during shutdown (similar to #764).

Can we close all the servers first, wait for their server.close() to complete, and then shut down the clients?
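Something along these lines, as a sketch of the suggested ordering (closeServer stands in for whatever the server shutdown ends up being; the client-closing loop mirrors the PR's code):

import (
	"context"
	"fmt"
	"sync"

	tchannel "github.com/uber/tchannel-go"
	"go.uber.org/zap"
)

// shutdownServerThenClients sketches the ordering: drain the server first,
// then close the dedicated client channels in parallel and wait for them.
func shutdownServerThenClients(
	ctx context.Context,
	closeServer func(context.Context) error,
	clientTChannels map[string]*tchannel.Channel,
	logger *zap.Logger,
) error {
	// 1. Shut the server down and wait, so in-flight requests can finish their
	//    outbound calls while the client channels are still open.
	if err := closeServer(ctx); err != nil {
		return err
	}
	// 2. Only then close each per-client channel.
	var wg sync.WaitGroup
	for serviceName, clientTchannel := range clientTChannels {
		wg.Add(1)
		go func(service string, ch *tchannel.Channel) {
			defer wg.Done()
			logger.Info(fmt.Sprintf("Closing tchannel client for [%v]", service))
			ch.Close()
		}(serviceName, clientTchannel)
	}
	wg.Wait()
	return nil
}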

Contributor:

+1

isopropylcyanide (Contributor, Author):

Good catch. Updating

@@ -180,7 +194,35 @@ func {{$exportName}}(deps *module.Dependencies) Client {
}
}

func initializeDynamicChannel(deps *module.Dependencies, headerPatterns []string, altChannelMap map[string]*tchannel.SubChannel, re ruleengine.RuleEngine) ([]string, ruleengine.RuleEngine) {
func createNewTchannelForClient(deps *module.Dependencies, serviceName string) *tchannel.Channel {
Contributor:

this doesn't depend on the template, so let's move this out so you don't have to generate this several times during the code-gen step.

isopropylcyanide (Contributor, Author) commented Aug 11, 2021:
That will lead to an explosion of parameters in tchannel_client.go, as it needs a tracer, sub logger, etc.

I found a better place for it in gateway.go which will also clean up the defaultDependencies struct a bit and would avoid repeated codegen. Thanks for the input.
