-
Notifications
You must be signed in to change notification settings - Fork 771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Outgoing Connection Metrics #1343
Conversation
pbsmetrics/go_metrics.go
Outdated
@@ -70,6 +70,11 @@ type AdapterMetrics struct { | |||
BidsReceivedMeter metrics.Meter | |||
PanicMeter metrics.Meter | |||
MarkupMetrics map[openrtb_ext.BidType]*MarkupDeliveryMetrics | |||
ConnSuccess metrics.Counter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you suspect, we may not need all of this information. Here's my take:
ConnSuccess
is the default assumption. I don't think we need to log that counter. This also looks like the sum ofConnCreated
andConnReused
which can be computed easily in queries if desired.ConnError
is likely already surfaced with other metrics. I don't remember exactly. I think it either turns into a panic or an adapter error. Thoughts on if this is worth a counter?ConnCreated
seems very useful in tracking connection management.ConnReused
seems very useful in tracking connection management.
There's also two metrics I'd like to throw out for consideration:
- Connection wait time. This one is a bit tricky and might not easily be done with the built-in tracers. The idea is to measure the time when the connection request is first made until a connection is either created or reused. This will tell us how bad the congestion in the connection pool is and how much delay is being introduced. This would be expensive, for sure... likely 12 additional metrics (at least for Prometheus) per adapter... which is now roughly 840 in total. Still, I think it would be super useful information.
- Dns resolution wait time. I don't think this is worthwhile to record per adapter, but might be interesting in aggregate for hosts to understand how their DNS resolution layer is performing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I agree with @SyntaxNode. We only need to know if a new connection was created for a bidder vs if an existing connection was reused. Either of those would mean that we were able to connect with the bidder successfully. Note that we already have a metric called,
adapter_errors
which should record any type of adapter related error including not being able to connect to the adapter. -
I think adding connection wait time metric is worth it because we wanna make sure that our suspicion that creating new connections is the cause of increased latencies during spikey traffic, is indeed true. I believe you should be able to do that by recording the timestamp in
GetConn()
andGotConn()
and then subtracting the two. This will represent the entire time taken to create a new connection (including DNS lookup and TLS handshake time). Although, when we record this, we wanna make sure that we're recording it only when we're creating a new connection and not when we are reusing an existing one. -
One other thing to consider is redirects. In case of redirects, these trace functions may be called multiple times which will then in turn record these metrics multiple time for each redirect connection as well. This could lead to misleading numbers in our metrics. This one is a little tricky and will require a little more thought as to how we can handle this.
pbsmetrics/prometheus/prometheus.go
Outdated
@@ -178,6 +185,11 @@ func NewMetrics(cfg config.PrometheusMetrics) *Metrics { | |||
"Count of requests labeled by adapter, if has a cookie, and if it resulted in bids.", | |||
[]string{adapterLabel, cookieLabel, hasBidsLabel}) | |||
|
|||
metrics.adapterConnections = newCounter(cfg, metrics.Registry, | |||
"adapter_connections", | |||
"Count of connections to bidder hosts labeled by their adapters. Tracks if connection was successful, connection was reused, or if connection was left idle", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be separated out into their own metrics. I don't think the correlation serves meaningful purposes here. Some of the labels are mutually exclusive (success vs error, reused vs created) and knowing the relation between reused vs created and idle doesn't seem very interesting.
@@ -61,7 +61,7 @@ func TestMetricCountGatekeeping(t *testing.T) { | |||
// Verify Per-Adapter Cardinality | |||
// - This assertion provides a warning for newly added adapter metrics. Threre are 40+ adapters which makes the | |||
// cost of new per-adapter metrics rather expensive. Thought should be given when adding new per-adapter metrics. | |||
assert.True(t, perAdapterCardinalityCount <= 22, "Per-Adapter Cardinality") | |||
assert.True(t, perAdapterCardinalityCount <= 54, "Per-Adapter Cardinality") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, see? With having these as labels on the same adapter the cardinality is very high for the information we're recording and we're adding 32 new metrics instead of just 5 per adapter.
Update with version 0.115.0 changes
Update with latest changes
Below is the A new connection and connection wait time
DNS lookup metrics
|
pbsmetrics/go_metrics.go
Outdated
ConnError: metrics.NilCounter{}, | ||
ConnCreated: metrics.NilCounter{}, | ||
ConnReused: metrics.NilCounter{}, | ||
ConnIdleTime: &metrics.NilTimer{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what would be the use case for ConnIdleTime
metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it'd be useful to know which connections are more efficient. I'm not sure if I'm going for a "nice to have" here instead of a "must have" or we "definitely need" this. Do think we should discard them?
pbsmetrics/go_metrics.go
Outdated
@@ -234,6 +244,11 @@ func makeBlankAdapterMetrics() *AdapterMetrics { | |||
BidsReceivedMeter: blankMeter, | |||
PanicMeter: blankMeter, | |||
MarkupMetrics: makeBlankBidMarkupMetrics(), | |||
ConnError: metrics.NilCounter{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can skip ConnError
metric as we're only calling the RecordAdapterConnections
from the GotConn
method and the fact that the code got to that piece of the code means that we were able to successfully get a connection. Also, we have another metric called adapter_errors
which I believe should record connection related errors for adapters as well. We might wanna check what is a connection error reported as in the adapter_error
metric and perhaps add a new error type if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty good catch! The error metrics were not getting updated at all. I can remove them if you want, for the moment I left them there because I think it'd be interesting to know whether or not an adapter endpoint is having connection issues. But I do realize this is probably just a "nice to have" and not a "we definitely need this". We can discuss this in standup if you want and I can removed them if you want me to.
exchange/bidder.go
Outdated
// This function adds an httptrace.ClientTrace object to the context so, if connection with the bidder | ||
// endpoint is established, we can keep track of whether the connection was newly created, reused, and | ||
// even know if it was left idle and for how much time. | ||
func (bidder *bidderAdapter) AddClientTrace(ctx context.Context) context.Context { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind having this method defined on the bidderAdapter
receiver and also with keeping it exported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let me go back to the whiteboard. Maybe we can modify the context in exchange/exchange.go
.
pbsmetrics/go_metrics_test.go
Outdated
|
||
"github.com/prebid/prebid-server/config" | ||
"github.com/prebid/prebid-server/openrtb_ext" | ||
metrics "github.com/rcrowley/go-metrics" | ||
"github.com/stretchr/testify/assert" | ||
) | ||
|
||
const OneSecond time.Duration = time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time.Second
is already a constant, any reason why we wanna define it as a constant again over here with a diff name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Corrected
|
||
// RecordDNSTime mock | ||
func (me *MetricsEngineMock) RecordDNSTime(dnsLookupTime time.Duration) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we call the Called
function for this method? Looks like something you might have missed by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary for TestRecordDNSTime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still recommend to add the calls, so future tests can rely on that behavior if needed in the future.
pbsmetrics/prometheus/prometheus.go
Outdated
@@ -63,6 +70,10 @@ const ( | |||
requestStatusLabel = "request_status" | |||
requestTypeLabel = "request_type" | |||
successLabel = "success" | |||
|
|||
adapterConnErrorLabel = "adapter_connection_error" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've defined separate metrics for ConnError
, ConnReused
and ConnCreated
, any reason why we also need labels? Also, they don't seem to be used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty good catch. This is a leftover from the previous commit. Removed
exchange/bidder.go
Outdated
@@ -321,6 +324,7 @@ func (bidder *bidderAdapter) doRequest(ctx context.Context, req *adapters.Reques | |||
} | |||
httpReq.Header = req.Headers | |||
|
|||
ctx = bidder.AddClientTrace(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a feature flag around this option? Just in case it doesn't work as expected for some reason in prod or if a host would like to disable the feature to save a little bit in memory usage. If so, the prometheus preload should also only preload the metrics if the feature flag is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. Let me know your thoughts with the addition
Modified according to feedback. Kept both the error connection metrics and the connection idle time. Added a configuration boolean flag to disable the adapter connection metrics. Ran the server locally and posted a request to check out if the metrics work as expected. These are the metrics before and after said request to the appnexus bidder.
|
config/config.go
Outdated
@@ -675,6 +680,7 @@ func SetupViper(v *viper.Viper, filename string) { | |||
v.SetDefault("http_client_cache.idle_connection_timeout_seconds", 60) | |||
// no metrics configured by default (metrics{host|database|username|password}) | |||
v.SetDefault("metrics.disabled_metrics.account_adapter_details", false) | |||
v.SetDefault("metrics.disabled_metrics.adapter_connections_metrics", false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open Question: Should make this disabled by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, set to true
so they are disabled by default.
exchange/bidder.go
Outdated
|
||
type bidderMetrics struct { | ||
engine pbsmetrics.MetricsEngine | ||
disabled config.DisabledMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would normally expect a property called disabled
to be a bool. disabledMetrics
perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes total sense. Corrected.
exchange/bidder.go
Outdated
@@ -334,6 +347,7 @@ func (bidder *bidderAdapter) doRequest(ctx context.Context, req *adapters.Reques | |||
} | |||
|
|||
} | |||
bidder.me.engine.RecordAdapterFailedConnections(bidder.BidderName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about this. A connection may fail due to our context deadline, which I don't think should be reported as a connection error. Are there other ways for ctxhttp.Do
to error which has nothing to do with retrieving a connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. This metric will be removed per our conversation
pbsmetrics/go_metrics.go
Outdated
am.ConnError = metrics.GetOrRegisterCounter(fmt.Sprintf("%[1]s.%[2]s.connections_error", adapterOrAccount, exchange), registry) | ||
am.ConnCreated = metrics.GetOrRegisterCounter(fmt.Sprintf("%[1]s.%[2]s.connections_created", adapterOrAccount, exchange), registry) | ||
am.ConnReused = metrics.GetOrRegisterCounter(fmt.Sprintf("%[1]s.%[2]s.connections_reused", adapterOrAccount, exchange), registry) | ||
am.ConnIdleTime = metrics.GetOrRegisterTimer(fmt.Sprintf("%[1]s.%[2]s.connection_idle_time", adapterOrAccount, exchange), registry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was asked before, but I forgot the answer. Per-adapter metrics are very expensive, especially for timers. Adding a timer for Prometheus results in nearly 1,000 new unique metrics. We'll use ConnWaitTime
to tweak connection pools and so I think it earns it worth. What will we use ConnIdleTime
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per our conversation, ConnIdleTime
was removed
|
||
// RecordDNSTime mock | ||
func (me *MetricsEngineMock) RecordDNSTime(dnsLookupTime time.Duration) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still recommend to add the calls, so future tests can rely on that behavior if needed in the future.
|
||
type bidderMetrics struct { | ||
engine pbsmetrics.MetricsEngine | ||
disabledMetrics config.DisabledMetrics | ||
} | ||
|
||
func (bidder *bidderAdapter) requestBid(ctx context.Context, request *openrtb.BidRequest, name openrtb_ext.BidderName, bidAdjustment float64, conversions currencies.Conversions, reqInfo *adapters.ExtraRequestInfo) (*pbsOrtbSeatBid, []error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've added BidderName
to the bidderAdapter
struct, can we please remove the name
parameter from this function args list as it is redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exchange/legacy.go
's implementation of requestBid
does use this parameter. Do you think it is better to keep it? I know it's unnecessary in every implementation of requestBid
out there except for exchange/legacy.go
, but for the moment, it is better to keep it to not break anything.
Another option is to eliminate bidderAdapter
's BidderName
field and move ctx = bidder.addClientTrace(ctx)
upwards from doRequest
to requestBid(ctx context.Context, request *openrtb.BidRequest, bidAdjustment float64, conversions currencies.Conversions, reqInfo *adapters.ExtraRequestInfo)
in order to make use of its name
argument. Though I bet it is better not to rely on name
given that in the future legacy.go
will eventually be dropped and we want less arguments in function signatures. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok! Yeah, requestBid is part of the interface so I guess we'll have to let it be for now.
config/config.go
Outdated
|
||
// True if we don't want to collect metrics about the connections prebid | ||
// server establishes with bidder servers such as the number of connections | ||
// that were created or reused and the connection idle and waiting times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might wanna remove the connection idle time
from this comment as well as we're no longer recording that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
exchange/bidder.go
Outdated
Client: client, | ||
me: me, | ||
config: bidderAdapterConfig{ | ||
DebugConfig: cfg.Debug, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider naming this as just Debug
as the struct name itself has config
in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
exchange/bidder.go
Outdated
me: me, | ||
config: bidderAdapterConfig{ | ||
DebugConfig: cfg.Debug, | ||
LogConnectionMetrics: cfg.Metrics.Disabled.AdapterConnectionMetrics, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider naming this DisableConnMetrics
or something along those lines because if this value is true, we won't record the connection metrics but the name LogConnectionMetrics
indicates that we will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pbsmetrics/config/metrics.go
Outdated
func (me *DummyMetricsEngine) RecordAdapterConnections(bidderName openrtb_ext.BidderName, info httptrace.GotConnInfo, obtainConnectionTime time.Duration) { | ||
} | ||
|
||
// Times the DNS resolution process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please consider changing this comment to be RecordDNSTime as a noop
for consistency purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pbsmetrics/go_metrics.go
Outdated
@@ -268,6 +275,11 @@ func makeBlankAdapterMetrics() *AdapterMetrics { | |||
PanicMeter: blankMeter, | |||
MarkupMetrics: makeBlankBidMarkupMetrics(), | |||
} | |||
if disabledMetrics.AdapterConnectionMetrics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be if !disabledMetrics.AdapterConnectionMetrics
? If disabledMetrics.AdapterConnectionMetrics
is set to true then that means we don't want to record those metrics, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally right. That's a bug
// Setup mock server | ||
respStatus := 200 | ||
respBody := "{\"bid\":false}" | ||
server := httptest.NewServer(mockHandler(respStatus, "getBody", respBody)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider passing the getBody
param as an empty string here rather than the literal getBody
. From what I understand you don't care about get body here and just post body. An empty strings represents that better IMHO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many other tests do the exact same. I'm leaving it for consistency. A future PR can probably clean, refactor and make the code a bit easier to read in the entire file.
exchange/bidder_test.go
Outdated
Body: []byte("{\"key\":\"val\"}"), | ||
Headers: http.Header{}, | ||
}, | ||
bidResponse: mockBidderResponse, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just have the bidResponse
here to be an empty adapters.BidResponse
struct as we don't really care about the bid response in this case. Or the other thing you can do is extract this single bid response to a global variable in this test file because I am sure this same response is used elsewhere in this test file too. This will reduce the redundant code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Looks better. Done.
} | ||
|
||
// Run test | ||
bidder.doRequest(context.Background(), &adapters.RequestData{Method: "POST", Uri: "http://www.example.com/"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we have the above test testing RecordAdapterConnections
also just go through doRequest
rather than requestBid
as you're introducing the change in the doRequestImpl
function and so testing on that level for unit tests should be enough, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it hurt to go through that extra layer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, for unit tests it is better if we keep the tests close to the functions that are being modified because say tomorrow if somebody makes some changes to the requestBid
function that has nothing to do directly with the connection metrics but can potentially lead to the failure of this test.
pbsmetrics/go_metrics_test.go
Outdated
}{ | ||
{ | ||
description: "Five second DNS lookup time", | ||
in: testIn{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this test, I guess we can skip wrapping the dnsLookupDuration
and expDuration
in testIn
and testOut
structs as they have only those single fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Modified
pbsmetrics/go_metrics_test.go
Outdated
|
||
m.RecordDNSTime(test.in.dnsLookupDuration) | ||
|
||
assert.Equal(t, test.out.expDuration.Nanoseconds(), m.DNSLookupTimer.Sum(), "Test [%d] incorrect DNS lookup time", i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add the test description
as part of the error message here rather than the test case index so that the reader doesn't necessarily have to come back here to map the index to the test case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index was supposed to be for testing purposes. Modified.
} | ||
|
||
// Run test | ||
bidder.doRequest(context.Background(), &adapters.RequestData{Method: "POST", Uri: "http://www.example.com/"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, for unit tests it is better if we keep the tests close to the functions that are being modified because say tomorrow if somebody makes some changes to the requestBid
function that has nothing to do directly with the connection metrics but can potentially lead to the failure of this test.
Add metrics to track the number of connections per host and differentiate between an actively used connection vs an idle connection.
Technical Reference:
https://medium.com/@deeeet/trancing-http-request-latency-in-golang-65b2463f548c
https://blog.golang.org/http-tracing