
dns: rate limit DNS resolution requests #2760

Merged: 13 commits into grpc:master, May 2, 2019

Conversation

easwars (Contributor) commented Apr 9, 2019:

Fixes #2402


}

gotResolveCnt := tr.cnt[target]
wantResolveCnt := 1
dfawley (Member):

We should also verify that we get two resolves eventually, but no more than two.

Maybe shorten the delay to avoid the test taking longer while it's just sleeping?

easwars (Contributor Author):

Done. Let me know if you feel the time.Duration used in this test is too short/long.

easwars commented Apr 11, 2019:

PTAL.

@dfawley dfawley self-assigned this Apr 11, 2019
// this loop, we expect to see two and only two resolution requests making it
// past the rate limiter.
start := time.Now()
end := start.Add(750 * time.Millisecond)
dfawley (Member):

You're now doing the following:

T = 0: ResolveNow - Resolves immediately
T = 1 - 499ms: ResolveNow (squashed)
T = 500: Resolves now
T = 500-750ms: ResolveNow (squashed)
T = 1s: Resolves now

I think you want to stop calling ResolveNow around 400ms or so.

easwars (Contributor Author):

The timer created in Build() was adding some flakiness to the number of resolution requests made (2 vs 3). So I modified the test a little to handle that, and made sure the test passed for 50 runs.


// The actual number of resolution requests made could be 2 or 3, because
// when the resolver is built in Build(), we also create a timer initialized
// to time.Duration(0) and which is eventually reset to defaultFreq (30 mins)
dfawley (Member):

I don't understand this. Something sounds wrong. I don't see where Build in the DNS (name) path of the dnsResolver will do a name resolution automatically. And even then, it should be predictable what happens.

easwars (Contributor Author):

I was referring to this in Build() where we initialize a timer:

	d := &dnsResolver{
		freq:                 b.minFreq,
		backoff:              backoff.Exponential{MaxDelay: b.minFreq},
		host:                 host,
		port:                 port,
		ctx:                  ctx,
		cancel:               cancel,
		cc:                   cc,
		t:                    time.NewTimer(0),
		rn:                   make(chan struct{}, 1),
		disableServiceConfig: opts.DisableServiceConfig,
	}

And in watcher() where we select on a bunch of things:

func (d *dnsResolver) watcher() {
	defer d.wg.Done()
	for {
		select {
		case <-d.ctx.Done():
			return
		case <-d.t.C:
		case <-d.rn:
		}

And for the first iteration of this loop, sometimes we read from d.t.C and sometimes from d.rn and that changes the number of resolution requests (2 or 3, respectively).

dfawley (Member):

Aha. In that case, let's wait until the first resolution happens before starting the ResolveNow storm - we should be able to do that with signals instead of sleeps, though.

dfawley (Member):

Actually - I think the problem is when we reset the timer - we should do a Stop() and drain the channel before resetting, in case the timer elapses at the same time a ResolveNow call comes in.

https://golang.org/pkg/time/#Timer.Reset

easwars (Contributor Author):

Done.
But I also needed a select while attempting to drain the channel. Is that expected?

dfawley (Member):

No, that doesn't sound right. As the documentation of Reset says, if t.Stop() returns false you should be able to unconditionally drain the timer channel. (Unless it's being selected on concurrently in another goroutine.)

easwars (Contributor Author):

Moving the t.Stop() call as mentioned in the other comment seems to have helped. Thanks.

if calls <= 2 {
	t.Fatalf("expected to make more than two calls to ResolveNow, made only %d", calls)
}
time.Sleep(1 * time.Second)
dfawley (Member):

I'm not fond of sleeps in tests when they can be avoided.

Maybe we can instrument the dnsResolver a bit more so that we can inject signals through channels in tests instead of relying on time?

easwars (Contributor Author):

We could essentially let the first resolution request (the one triggered by d.t.C) run its course, and then start the ResolveNow() storm, and expect exactly two requests. Does that seem reasonable to you? But we would still need a sleep before the ResolveNow() storm, to make sure that the first resolution happened and that we got past the sleep for minDNSResRate.

Any other ideas?

dfawley (Member):

A couple ideas:

  1. Factor the sleeping part of the code into a function that we replace when testing with something that blocks on a channel instead. (I don't really like stubbing out too much real code when testing, but it could work.)

  2. Instead of sleeping and checking what happened when we wake up, call ResolveNow frequently in a loop in one goroutine (with ~1ms sleep between calls). The main Test goroutine waits until it sees X resolutions happen (this can be signaled by the test ClientConn closing a channel). Then it checks the time elapsed since start and ensures it's at least X*minDNSResRate. Then we can set poll interval to something small like 10ms, and X to something like 5. That should not be flaky and will complete quickly.

I think I would prefer #2. WDYT?

easwars (Contributor Author):

Done.
I restructured the test a little so that the test could be notified of resolver events, and therefore doesn't have to sleep anymore. I made sure it is not flaky with -count=100.

@dfawley dfawley assigned easwars and unassigned dfawley Apr 24, 2019
* Stop the timer in watcher() before resetting.
* Avoid sleeping in the newly added test, but use channels to signal
  resolution events.
easwars (Contributor Author) left a comment:

PTAL.


@dfawley dfawley assigned dfawley and unassigned easwars Apr 25, 2019
result, sc := d.lookup()
if !d.t.Stop() {
dfawley (Member):

Oh, I see what's going on here. How about moving this Stop call to inside the case <-d.rn above instead?

easwars (Contributor Author):

Done.

Stop the timer in the watcher() only when we are unblocked because of a
read on the resolveNow channel.
select {
case tr.ch <- struct{}{}:
default:
	// Do not block when the test is not reading from the channel.
dfawley (Member):

This means LookupHost events will be lost, which sounds like it could lead to test bugs. If this is a problem in practice, we could increase the buffer size to avoid the loss. Though I would expect 1 should always be sufficient.

easwars (Contributor Author):

Ack.

dfawley (Member):

How about don't initialize tr.ch except for the one new test that cares about LookupHost, and only write to the channel if it is non-nil here? (E.g. pass a channel to replaceNetFunc which may be nil.)

easwars (Contributor Author):

Done.

}
}
}()
wg.Wait()
dfawley (Member):

You can eliminate the waitgroup and the third goroutine by using the main goroutine for one of the tasks:

go func() {
	// loop calling ResolveNow.
}()
to := time.NewTimer(time.Second)
gotCalls := 0
for gotCalls != wantCalls {
	select {
	case <-tr.ch:
		gotCalls++
	case <-to.C:
		t.Fatalf("Timed out waiting for %v calls after 1 second; got %v", wantCalls, gotCalls)
	}
}
close(done)

easwars (Contributor Author):

Done. Thanks.

// end times, and expect the duration elapsed to be in the interval
// [2*dnsResRate, 3*dnsResRate].
start := time.Now()
wantCalls := 2
dfawley (Member):

const wantCalls = 2

easwars (Contributor Author):

Done.

if gotCalls != wantCalls {
	t.Fatalf("resolve count mismatch for target: %q = %+v, want %+v\n", target, gotCalls, wantCalls)
}
if elapsed < 2*dnsResRate || elapsed > 3*dnsResRate {
dfawley (Member):

wantCalls*dnsResRate || (wantCalls+1)*dnsResRate (and below -- to make it easier:)

if min, max := wantCalls*dnsResRate, (wantCalls+1)*dnsResRate; elapsed < min || elapsed > max {
	t.Fatalf(..., min, max)
}

..and now that I think about it. If you set the timer suggested above to max then you don't even need to check against max here - the test will already have failed if it takes too long.

easwars (Contributor Author):

Done. Thanks.


wantAddrs := []resolver.Address{{Addr: "1.2.3.4" + colonDefaultPort}, {Addr: "5.6.7.8" + colonDefaultPort}}
var gotAddrs []resolver.Address
var cnt int
dfawley (Member):

Nit: cnt can be declared inside the loop (minimize scope).

easwars (Contributor Author):

Done.

easwars commented Apr 26, 2019:

PTAL.

dfawley (Member) left a comment:

LGTM modulo the last two small things.

@dfawley dfawley changed the title Rate limit DNS resolution requests. dns: rate limit DNS resolution requests May 2, 2019
@dfawley dfawley added the Type: Feature New features or improvements in behavior label May 2, 2019
@dfawley dfawley added this to the 1.21 Release milestone May 2, 2019
@dfawley dfawley merged commit 5ed5cba into grpc:master May 2, 2019
@easwars easwars deleted the 2402-v2 branch May 14, 2019 23:39
@lock lock bot locked as resolved and limited conversation to collaborators Nov 10, 2019