-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dns: rate limit DNS resolution requests #2760
Conversation
Addresses #2402. |
resolver/dns/dns_resolver_test.go
Outdated
} | ||
|
||
gotResolveCnt := tr.cnt[target] | ||
wantResolveCnt := 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also verify that we get two resolves eventually, but no more than two.
Maybe shorten the delay to avoid the test taking longer while it's just sleeping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Let me know if you feel the time.Duration
used in this test is too short/long.
PTAL. |
resolver/dns/dns_resolver_test.go
Outdated
// this loop, we expect to see two and only two resolution requests making it | ||
// past the rate limiter. | ||
start := time.Now() | ||
end := start.Add(750 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're now doing the following:
T = 0: ResolveNow - Resolves immediately
T = 1 - 499ms: ResolveNow (squashed)
T = 500: Resolves now
T = 500-750ms: ResolveNow (squashed)
T = 1s: Resolves now
I think you want to stop calling ResolveNow around 400ms or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timer created in Build() was adding some flakiness to the number of resolution requests made (2 vs 3). So, modified the test a little to handle that, and made sure the test passed for 50 runs.
resolver/dns/dns_resolver_test.go
Outdated
|
||
// The actual number of resolution requests made could be 2 or 3, because | ||
// when the resolver is built in Build(), we also create a timer initialized | ||
// to time.Duration(0) and which is eventually reset to defaultFreq (30 mins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this. Something sounds wrong. I don't see where Build
in the DNS (name) path of the dnsResolver will do a name resolution automatically. And even then, it should be predictable what happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to this in Build() where we initialize a timer:
d := &dnsResolver{
freq: b.minFreq,
backoff: backoff.Exponential{MaxDelay: b.minFreq},
host: host,
port: port,
ctx: ctx,
cancel: cancel,
cc: cc,
t: time.NewTimer(0),
rn: make(chan struct{}, 1),
disableServiceConfig: opts.DisableServiceConfig,
}
And in watcher()
where we select on a bunch of things:
func (d *dnsResolver) watcher() {
defer d.wg.Done()
for {
select {
case <-d.ctx.Done():
return
case <-d.t.C:
case <-d.rn:
}
And for the first iteration of this loop, sometimes we read from d.t.C and sometimes from d.rn and that changes the number of resolution requests (2 or 3, respectively).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha. In that case, let's wait until the first resolution happens before starting the ResolveNow
storm - we should be able to do that with signals instead of sleeps, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually - I think the problem is when we reset the timer - we should do a Stop()
and drain the channel before resetting, in case the timer elapses at the same time a ResolveNow
call comes in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
But I also needed a select
while attempting to drain the channel. Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that doesn't sound right. As the documentation of Reset
says, if t.Stop()
returns false
you should be able to unconditionally drain the timer channel. (Unless it's being select
ed on concurrently in another goroutine.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the t.Stop()
call as mentioned in the other comment seems to have helped. Thanks.
resolver/dns/dns_resolver_test.go
Outdated
if calls <= 2 { | ||
t.Fatalf("expected to make more than two calls to ResolveNow, made only %d", calls) | ||
} | ||
time.Sleep(1 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not fond of sleeps in tests when they can be avoided.
Maybe we can instrument the dnsResolver a bit more so that we can inject signals through channels in tests instead of relying on time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could essentially let the first resolution request happening as part of d.t.C run its course, and then start the ResolveNow()
storm, and expect exactly two requests. Does that seem reasonable to you? But we would still need a sleep, before the ResolveNow()
storm, to make sure that the first resolution happened and we got past the sleep for minDNSRate
.
Any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple ideas:
-
Factor the sleeping part of the code into a function that we replace when testing with something that blocks on a channel instead. (I don't really like stubbing out too much real code when testing, but it could work.)
-
Instead of sleeping and checking what happened when we wake up, call
ResolveNow
frequently in a loop in one goroutine (with ~1ms sleep between calls). The main Test goroutine waits until it sees X resolutions happen (this can be signaled by the testClientConn
closing a channel). Then it checks the time elapsed since start and ensures it's at leastX*minDNSResRate
. Then we can set poll interval to something small like 10ms, and X to something like 5. That should not be flaky and will complete quickly.
I think I would prefer #2. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I restructured the test a little so that the test could be notified of resolver events, and therefore doesn't have to sleep anymore. I made sure it is not flaky with -count=100.
* Stop the timer in watcher() before resetting. * Avoid sleeping in the newly added test, but use channels to signal resolution events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL.
resolver/dns/dns_resolver_test.go
Outdated
if calls <= 2 { | ||
t.Fatalf("expected to make more than two calls to ResolveNow, made only %d", calls) | ||
} | ||
time.Sleep(1 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I restructured the test a little so that the test could be notified of resolver events, and therefore doesn't have to sleep anymore. I made sure it is not flaky with -count=100.
resolver/dns/dns_resolver.go
Outdated
result, sc := d.lookup() | ||
if !d.t.Stop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see what's going on here. How about moving this Stop
call to inside the case <-d.rn
above instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Stop the timer in the watcher() only when we are unblocked because of a read on the resolveNow channel.
resolver/dns/dns_resolver_test.go
Outdated
select { | ||
case tr.ch <- struct{}{}: | ||
default: | ||
// Do not block when the test is not reading from the channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means LookupHost
events will be lost, which sounds like it could lead to test bugs. If this is a problem in practice, we could increase the buffer size to avoid the loss. Though I would expect 1 should always be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about don't initialize tr.ch
except for the one new test that cares about LookupHost
, and only write to the channel if it is non-nil here? (E.g. pass a channel to replaceNetFunc
which may be nil.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
resolver/dns/dns_resolver_test.go
Outdated
} | ||
} | ||
}() | ||
wg.Wait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can eliminate the waitgroup and the third goroutine by using the main goroutine for one of the tasks:
go func() {
// loop calling ResolveNow.
}()
to := time.NewTimer(time.Second)
gotCalls := 0
for gotCalls != wantCalls {
select {
<-tr.ch: gotCalls++
<-to.C: t.Fatalf("Timed out waiting for %v calls after 1 second; got %v", wantCalls, gotCalls)
}
}
close(done)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks.
resolver/dns/dns_resolver_test.go
Outdated
// end times, and expect the duration elapsed to be in the interval | ||
// {2*dnsResRate, 3*dnsResRate} | ||
start := time.Now() | ||
wantCalls := 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const wantCalls = 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
resolver/dns/dns_resolver_test.go
Outdated
if gotCalls != wantCalls { | ||
t.Fatalf("resolve count mismatch for target: %q = %+v, want %+v\n", target, gotCalls, wantCalls) | ||
} | ||
if elapsed < 2*dnsResRate || elapsed > 3*dnsResRate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wantCalls*dnsResRate || (wantCalls+1)*dnsResRate
(and below -- to make it easier:)
if min, max := wantCalls*dnsResRate, (wantCalls+1)*dnsResRate; elapsed < min || elapsed > max {
t.Fatalf(...,_ min, max)
}
..and now that I think about it. If you set the timer suggested above to max
then you don't even need to check against max
here - the test will already have failed if it takes too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks.
resolver/dns/dns_resolver_test.go
Outdated
|
||
wantAddrs := []resolver.Address{{Addr: "1.2.3.4" + colonDefaultPort}, {Addr: "5.6.7.8" + colonDefaultPort}} | ||
var gotAddrs []resolver.Address | ||
var cnt int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: cnt
can be declared inside the loop (minimize scope).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the last two small things.
resolver/dns/dns_resolver_test.go
Outdated
select { | ||
case tr.ch <- struct{}{}: | ||
default: | ||
// Do not block when the test is not reading from the channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about don't initialize tr.ch
except for the one new test that cares about LookupHost
, and only write to the channel if it is non-nil here? (E.g. pass a channel to replaceNetFunc
which may be nil.)
Fixes #2402