Add taskList throttling to allow users to limit activities executed per second #432
Conversation
service/matching/taskListManager.go
Outdated
func (rl *rateLimiter) shouldUpdate(maxDispatchPerSecond *float64) bool {
    rl.RLock()
    defer rl.RUnlock()
    if maxDispatchPerSecond == nil {
No need to put this check under the lock.
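For illustration, a minimal sketch of what that might look like with the nil check done before the read lock is taken (this omits the TTL handling shown in the next hunk and is not the PR's final code):

func (rl *rateLimiter) shouldUpdate(maxDispatchPerSecond *float64) bool {
    // A nil limit never triggers an update and checking it touches no shared
    // state, so the check can happen before the lock.
    if maxDispatchPerSecond == nil {
        return false
    }
    rl.RLock()
    defer rl.RUnlock()
    return *maxDispatchPerSecond < *rl.maxDispatchPerSecond
}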
Makes sense
service/matching/taskListManager.go
Outdated
        rl.ttl.Reset(rl.ttlRaw)
        return true
    default:
        return *maxDispatchPerSecond < *rl.maxDispatchPerSecond
I think it is the only place that needs the lock.
service/matching/taskListManager.go
Outdated
    if ok, _ := c.rateLimiter.TryConsume(1); !ok {
        // Note: Is a better error counter more appropriate here? We might not want high sevs for this
        c.metricsClient.IncCounter(scope, metrics.PollErrorsCounter)
        return nil, createServiceBusyError("TaskList dispatch exceeded limit")
We want to block consumers on the long poll and return empty polls to maintain the configured rate. We don't want them to receive any busy errors.
Good point. Fixed.
service/matching/matchingEngine.go
Outdated
        e.taskListsLock.RUnlock()
        return result, nil
    }
    e.taskListsLock.RUnlock()
    mgr := newTaskListManager(e, taskList, e.config)
    e.taskListsLock.Lock()
    mgr := newTaskListManager(e, taskList, e.config, maxDispatchPerSecond)
move this after your RUnlock() on line 178
Looks like we had some duplicate code here checking the map. Cleaning it up.
service/matching/taskListManager.go
Outdated
    return newTaskListManagerWithRateLimiter(e, taskList, config, rl)
}

// Only for tests
Is this comment accurate? newTaskListManager uses this helper.
This function is for use in tests; will clarify.
service/matching/taskListManager.go
Outdated
@@ -323,6 +405,12 @@ func (c *taskListManagerImpl) getTask(ctx context.Context) (*getTaskResult, error) {
        }()
    }

    // Wait till long poll expiration for token. If token acquired, proceed.
    if ok := c.rateLimiter.Consume(1, c.config.LongPollExpirationInterval); !ok {
        // Note: Is a better error counter more appropriate here? We might not want high sevs for this
Yes, emit a metric to indicate the error type, i.e. being rate limited.
service/matching/taskListManager.go
Outdated
    String() string
}

func newTaskListManager(e *matchingEngineImpl, taskList *taskListID, config *Config) taskListManager {
type rateLimiter struct {
    maxDispatchPerSecond *float64
common.TokenBucket doesn't support rates lower than one RPS, so there is no point in taking a float64. We likely want to support rates lower than one RPS, so consider switching to the golang.org/x/time/rate limiter.
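For reference, a small self-contained sketch of a fractional rate using golang.org/x/time/rate, which is presumably the limiter meant here (the 0.5 RPS figure is only an example):

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // 0.5 tokens per second, i.e. one dispatch every two seconds, with a burst of 1.
    limiter := rate.NewLimiter(rate.Limit(0.5), 1)
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    for i := 0; i < 2; i++ {
        if err := limiter.Wait(ctx); err != nil {
            fmt.Println("wait cancelled:", err)
            return
        }
        fmt.Println("dispatched at", time.Now().Format(time.RFC3339))
    }
}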
Great catch. I started out with a float assumption, but thinking about it a bit more, is there a real use case for wanting it?
service/matching/taskListManager.go
Outdated
    maxDispatchPerSecond *float64
    tokenBucket          common.TokenBucket
    ttl                  *time.Timer
    sync.RWMutex
by convention, anonymous members / mutex that protects whole struct go at the beginning of the struct
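In other words, something like the following layout (same fields as the diff above, only reordered; a sketch, not the final struct):

type rateLimiter struct {
    sync.RWMutex // protects all fields below

    maxDispatchPerSecond *float64
    tokenBucket          common.TokenBucket
    ttl                  *time.Timer
}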
service/matching/taskListManager.go
Outdated
@@ -323,6 +405,12 @@ func (c *taskListManagerImpl) getTask(ctx context.Context) (*getTaskResult, error) {
        }()
    }

    // Wait till long poll expiration for token. If token acquired, proceed.
    if ok := c.rateLimiter.Consume(1, c.config.LongPollExpirationInterval); !ok {
the caller context deadline could technically be less than the LongPollExpirationInterval, this probably needs to be fixed later.
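One possible shape for that fix, capping the wait at the caller's deadline when it is shorter (a sketch using the names from the diff above; the actual follow-up may differ):

    // Wait no longer than the caller's context deadline or the configured
    // long-poll expiration interval, whichever comes first.
    timeout := c.config.LongPollExpirationInterval
    if deadline, ok := ctx.Deadline(); ok {
        if remaining := time.Until(deadline); remaining < timeout {
            timeout = remaining
        }
    }
    if ok := c.rateLimiter.Consume(1, timeout); !ok {
        // handle the throttled poll as discussed above (metric + empty poll)
    }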
Good catch, fixing.
service/matching/taskListManager.go
Outdated
}

func (rl *rateLimiter) Consume(count int, timeout time.Duration) bool {
    rl.RLock()
I don't like the fact that we hold the lock while we wait on a caller-provided timeout, i.e. if there are 20 goroutines, 19 blocked on Consume() and one blocked on UpdateMaxDispatch(), the latter can starve for a long time (and meanwhile, the context deadline could even be exceeded). Here is an alternative: don't hold the lock within the Consume/TryConsume methods; instead, use atomic.Value to hold the token bucket reference. Whenever you change the token bucket, just atomically set the Value to the new token bucket. That way, callers who hold the old token bucket will stick to the old limit, and newer callers will use the new bucket.
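A rough sketch of the atomic.Value approach being described (the type and method shapes here are illustrative, and it assumes the bucket exposes the same Consume(count, timeout) signature as the wrapper in the diff):

type rateLimiter struct {
    // bucket holds a tokenBucketHolder; swapping it never blocks readers.
    bucket atomic.Value
}

type tokenBucketHolder struct {
    tb common.TokenBucket
}

func (rl *rateLimiter) UpdateMaxDispatch(tb common.TokenBucket) {
    // Publish a new bucket atomically; in-flight Consume calls finish
    // against the bucket they already loaded.
    rl.bucket.Store(tokenBucketHolder{tb: tb})
}

func (rl *rateLimiter) Consume(count int, timeout time.Duration) bool {
    holder := rl.bucket.Load().(tokenBucketHolder)
    return holder.tb.Consume(count, timeout)
}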
Interesting. I would say let's not optimize for that yet and keep it simple; changes are usually rare. Definitely something we can revisit if we hit performance issues.
Madhu, this is more than just an optimization. It's a bit of correctness as well, in that we can block the caller for way more than the deadline on lock contention. With atomic.Value you will end up with exactly the same number of lines of code, so IMO let's just do it right the first time around and forget about this.
And FYI, it's not about the "value changing": UpdateMaxDispatch is called regardless for every Poll() request.
To clarify, it will not affect correctness. Remember that all Polls will converge on the same throttling value; if one Poll times out, the next one will update it. Also, it's not about lines of code: locks are far more common, easier to understand, and don't involve type casts. Good read: https://texlution.com/post/golang-lock-free-values-with-atomic-value/
Also, yes, UpdateMaxDispatch is called for every Poll, but write locks are only acquired if the value is changing. Otherwise, only read locks are acquired. Check the shouldUpdate function.
I understand that locks are the more common / standard pattern, but atomic.Values aren't that complicated either. In this case, it's clearly the right fit IMO. As for correctness, I understand polls will converge, but you will block the caller for more than the caller-provided timeout (deadline). Anyway, will leave it up to you.
Also, yes, UpdateMaxDispatch is called for every Poll, but write locks are only acquired if the value is changing
Yup, you are right.
service/matching/taskListManager.go
Outdated
type rateLimiter struct {
    maxDispatchPerSecond *float64
    tokenBucket          common.TokenBucket
    ttl                  *time.Timer
Consider just using epoch time to check for TTL expiration instead of using a timer, i.e. store the expiration as now().nanos + ttl and, within UpdateMaxDispatch, just check whether time.Now() > expiration.
Timers use the global heap within the Go runtime (lock-protected, AFAIK), i.e. they are more expensive than time.Now().
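A rough sketch of the epoch-based variant (field names here are made up for illustration):

type rateLimiter struct {
    sync.RWMutex
    maxDispatchPerSecond *float64
    tokenBucket          common.TokenBucket
    ttlNanos             int64 // how long a lower limit is honored
    expiresAt            int64 // unix nanos; after this, any new value is accepted
}

func (rl *rateLimiter) shouldUpdate(maxDispatchPerSecond *float64) bool {
    if maxDispatchPerSecond == nil {
        return false
    }
    now := time.Now().UnixNano()
    rl.RLock()
    defer rl.RUnlock()
    if now > rl.expiresAt {
        return true // TTL elapsed; expiresAt is reset under the write lock in the update path
    }
    return *maxDispatchPerSecond < *rl.maxDispatchPerSecond
}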
I would keep it simple here; I'm not sure the perf difference is significant enough to make it more complicated. Also, the timer uses channels internally, with no lock as far as I can see: https://golang.org/src/time/sleep.go
Madhu, storing the expiry as an epoch is a well-understood pattern on the server side. And given that this is just a couple of lines of code change and it buys us efficiency, why wait until there is a perf issue? Let's just do it and be done with it.
Timer is the recommended pattern and simple/easy to understand. What efficiency does epoch buy us? What is the perf advantage here exactly?
Recommended? See golang/go#15133.
service/matching/taskListManager.go
Outdated
@@ -42,20 +42,95 @@ const (
    done time.Duration = -1
)

// NOTE: Is this good enough for stress tests?
const (
    _maxDispatchDefault = 100000.0
[ not a blocker ] consider renaming this to _taskListOutboundRPS for clarity
service/matching/taskListManager.go
Outdated
// NOTE: Is this good enough for stress tests?
const (
    _maxDispatchDefault = 100000.0
    _dispatchLimitTTL   = time.Second
[ not a blocker ] consider renaming this to _taskListRateLimiterTTL
common/metrics/defs.go
Outdated
@@ -625,6 +626,7 @@ var MetricDefs = map[ServiceIdx]map[int]metricDefinition{
    LeaseFailureCounter:           {metricName: "lease.failures"},
    ConditionFailedErrorCounter:   {metricName: "condition-failed-errors"},
    RespondQueryTaskFailedCounter: {metricName: "respond-query-failed"},
    ThrottleErrorCounter:          {metricName: "throttle.errors"},
I think we shouldn't treat these like errors. What about throttle.empty-polls?
Empty polls sounds misleading. It's useful to differentiate between truly empty polls, where no error is returned, and polls rejected because the throttling limit was reached. I think it's still an error, but one we might not want in the same monitoring classification. On the server side this should not cause any tickets, but users should be able to set alerts on it; there could be real issues if there are spikes in throttling errors.
service/matching/taskListManager.go
Outdated
    c.metricsClient.IncCounter(scope, metrics.ThrottleErrorCounter)
    msg := fmt.Sprintf("TaskList dispatch exceeded limit: %s", err.Error())
    // TODO: It's the client that's busy, not the server. Doesn't fit into any other error.
    return nil, createServiceBusyError(msg)
I don't think we want to return a service-busy error. Also, it is going to be returned even if there is no backlog, since we throttle before trying to match. From the poller's point of view it is ErrNoTasks.
Same as above, I want to differentiate between truly no tasks and tasks being throttled. Either way I need a new thrift type so the user can react to it. ErrNoTasks is not a type, it just has a specific string message.
service/matching/matchingEngine.go
Outdated
    e.taskListsLock.Lock()
    if result, ok := e.taskLists[*taskList]; ok {
        result.UpdateMaxDispatch(maxDispatchPerSecond)
If we aren't going to switch to atomics, consider moving this outside the lock so that we don't block addition of new task lists when another task list is blocked on UpdateMaxDispatch for multiple seconds if/when they roll out a change in the poll limit.
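A sketch of what calling UpdateMaxDispatch outside the map lock could look like (borrowing the names from the diff; not the final code):

    e.taskListsLock.RLock()
    mgr, ok := e.taskLists[*taskList]
    e.taskListsLock.RUnlock()
    if ok {
        // The per-task-list limiter is updated without holding the top-level map lock,
        // so a slow UpdateMaxDispatch cannot block creation of other task lists.
        mgr.UpdateMaxDispatch(maxDispatchPerSecond)
        return mgr, nil
    }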
service/matching/taskListManager.go
Outdated
@@ -323,6 +409,16 @@ func (c *taskListManagerImpl) getTask(ctx context.Context) (*getTaskResult, error) {
        }()
    }

    // Wait till long poll expiration for token. If token acquired, proceed.
    stopWatch := c.metricsClient.StartTimer(scope, metrics.PollThrottleLatency)
    err := c.rateLimiter.Wait(ctx, c.config.LongPollExpirationInterval)
I've just realized that this approach is not going to work for the following scenario (see the sketch after the list):
- Rate limit of 10 polls per second
- No tasks are scheduled for a minute.
- Rate limiter will let 600 pollers to go through.
- Tasks are scheduled with 100 per second.
- For 6 seconds polls will return 100 tasks per second.
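To illustrate with golang.org/x/time/rate: keeping the limiter's burst small caps how many idle-period tokens can accumulate, at the cost of not absorbing bursts at all (this is one possible mitigation, not necessarily the fix that was adopted):

    // 10 polls per second with a burst of 1: an idle minute banks at most one token,
    // instead of letting ~600 pollers through at once.
    limiter := rate.NewLimiter(rate.Limit(10), 1)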
service/matching/matchingEngine.go
Outdated
    e.taskListsLock.Lock()
    if result, ok := e.taskLists[*taskList]; ok {
        result.UpdateMaxDispatch(maxDispatchPerSecond)
I'm a little worried we are switching to a write lock on this hot path. Can we figure out a way that does not require acquiring the write lock on the top-level map for TaskListMgr?
There is one way (sketched below):
- RLock and check the map.
- If the manager is there, RUnlock and update max dispatch.
- If it is not there, RUnlock, take the write lock, check the map again, and only create and store the manager if it is still missing.
So we will have some duplicate code, since we check the map twice. But that way, if the manager is already in the map, we never take the write lock. It makes the code uglier but avoids the write lock in the majority of cases.
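A rough sketch of that read-then-upgrade shape (names follow the matchingEngine diff above; only the locking pattern is the point):

    // Fast path: read lock only.
    e.taskListsLock.RLock()
    if mgr, ok := e.taskLists[*taskList]; ok {
        e.taskListsLock.RUnlock()
        mgr.UpdateMaxDispatch(maxDispatchPerSecond)
        return mgr, nil
    }
    e.taskListsLock.RUnlock()

    // Slow path: take the write lock and re-check, since another goroutine
    // may have created the manager in the meantime.
    e.taskListsLock.Lock()
    if mgr, ok := e.taskLists[*taskList]; ok {
        e.taskListsLock.Unlock()
        mgr.UpdateMaxDispatch(maxDispatchPerSecond)
        return mgr, nil
    }
    mgr := newTaskListManager(e, taskList, e.config, maxDispatchPerSecond)
    e.taskLists[*taskList] = mgr
    e.taskListsLock.Unlock()
    return mgr, nil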
We already have that logic in GetTaskList. I'm not sure if this new method is needed. If you move the logic of setting the throttling limit to GetTask, then we can simplify this code a lot.
Removed the Update method, but FYI most of this code is still needed for task list creation.
service/matching/matchingEngine.go
Outdated
@@ -341,10 +351,14 @@ pollLoop:
    }

    taskList := newTaskListID(domainID, taskListName, persistence.TaskListTypeActivity)
    var maxDispatch *float64
    if req.PollRequest.TaskListMetadata != nil {
nit: use "request.TaskListMetadata"
service/matching/matchingEngine.go
Outdated
@@ -341,10 +351,14 @@ pollLoop:
    }

    taskList := newTaskListID(domainID, taskListName, persistence.TaskListTypeActivity)
    var maxDispatch *float64
    if req.PollRequest.TaskListMetadata != nil {
        maxDispatch = req.PollRequest.TaskListMetadata.MaxTasksPerSecond
nit: use "request.TaskListMetadata"
service/matching/taskListManager.go
Outdated
type taskListManager interface {
    Start() error
    Stop()
    AddTask(execution *s.WorkflowExecution, taskInfo *persistence.TaskInfo) error
    GetTaskContext(ctx context.Context) (*taskContext, error)
    SyncMatchQueryTask(ctx context.Context, queryTask *queryTaskInfo) error
    CancelPoller(pollerID string)
    UpdateMaxDispatch(maxDispatchPerSecond *float64)
Why do we need a separate API for this? Can you pass in the additional parameter to GetTaskContext?
The task list is constructed before getTaskContext. Max dispatch is a property of the task list, so it makes more sense to add it as part of the task list fetch.
Done.
service/matching/taskListManager.go
Outdated
    if !rsv.OK() {
        scope := metrics.MatchingTaskListMgrScope
        c.metricsClient.IncCounter(scope, metrics.AddThrottleCounter)
        return nil, fmt.Errorf("cannot add to tasklist, limit exceeded")
Can we define a specific error for this? We also need to make sure we don't return an error back to the caller when this happens, and instead store the task to the database.
Are you suggesting returning nil here and not an error at all? Or returning a specific error type that is handled in the handler code? For the second option, we would add a new error type in thrift, and Max was fairly determined that this should not be considered an error, which is why I also named the metric a counter instead of an error. Some history from a previous thread with Max, when I had used a ServiceBusyError classification for this:
mfateev on Nov 28, 2017 Member
I don't think we want to return service busy error. Also it is going to be returned even if there is no backlog as we throttle before trying to match. From the poller point of view it is ErrNoTasks.
@madhuravi
madhuravi on Nov 28, 2017 Member
Same as above, want to differentiate between truly no tasks vs tasks being throttled. Either way I need a new thrift type so the user can react to it. ErrNoTasks is not a type, it just has a specific string message.
Pushed a change for this. Let me know what you think.
service/matching/taskListManager.go
Outdated
c.logger.Warn("Tasklist manager context is cancelled, shutting down") | ||
break deliverBufferTasksLoop | ||
} | ||
c.logger.Warn("Unable to send tasks for poll, limit exceeded") |
I'm confused here. I thought Wait would block indefinitely until new tokens are available. What other error will cause Wait to unblock?
It should really only be that error. I'm simply handling this scenario so as to not let it fall through.
func (c *taskListManagerImpl) getTasksPump() {
    defer close(c.taskBuffer)
    c.startWG.Wait()

    go c.deliverBufferTasksForPoll()
Do we need another goroutine per task list? We already have 2.
How come? I see two goroutines: one for the task pump and one to move tasks to poll, so it is 2 with this change.
They do need to happen in parallel, so we can't bundle them into one goroutine.
Spoke offline.
        return nil, errPumpClosed
    case result := <-c.tasksForPoll:
        if result.syncMatch {
            c.metricsClient.IncCounter(scope, metrics.PollSuccessWithSyncCounter)
Remove any counters which are not used anymore.
What is not used? We still want a metric for successful poll and whether it is due to sync match or not.
        if !ok { // Task list getTasks pump is shutdown
            break deliverBufferTasksLoop
        }
        c.tasksForPoll <- &getTaskResult{task: task}
We should measure the perf impact of this change, considering we are now serializing all tasks through an unbuffered channel. Can you try to add a bench test to cover this scenario?
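A hedged sketch of what such a bench test could look like; it only exercises the unbuffered-channel handoff from the diff, and the setup is a stand-in rather than the real matching-engine fixtures:

// BenchmarkTasksForPollChannel measures the cost of funneling every task
// through an unbuffered channel, which is the new dispatch path.
func BenchmarkTasksForPollChannel(b *testing.B) {
    tasksForPoll := make(chan *getTaskResult) // unbuffered, as in the PR
    done := make(chan struct{})
    go func() {
        for range tasksForPoll {
            // simulate a poller draining tasks
        }
        close(done)
    }()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        tasksForPoll <- &getTaskResult{}
    }
    close(tasksForPoll)
    <-done
}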
Are you talking about the preference for buffered tasks over sync tasks? We discussed that and came to the conclusion that even before this change we already preferred buffered tasks over sync tasks.
I think it is going to prioritize sync match just because sync match uses multiple goroutines to call trySyncMatch while deliverBufferTasksToPoll is called from a single goroutine.
Sync match is not in multiple goroutines. Do you mean that there are several parallel calls to AddTask, whereas only one goroutine per host adds the buffered tasks? That is true, and that behavior was true before this code change as well.
            break deliverBufferTasksLoop
        }
        c.tasksForPoll <- &getTaskResult{task: task}
    case <-c.deliverBufferShutdownCh:
Looks like we have 2 separate ways to shutdown this channel.
Spoke offline, this is to satisfy existing tests. There is a comment as well.
service/matching/taskListManager.go
Outdated
@@ -50,6 +50,8 @@ const (
    _defaultTaskDispatchRPSTTL = 60 * time.Second
)

var errAddTasklistThrottled = fmt.Errorf("cannot add to tasklist, limit exceeded")
use errors.New
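i.e., something like the following (same message as in the diff):

var errAddTasklistThrottled = errors.New("cannot add to tasklist, limit exceeded")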
Fix one minor comment and then you can merge it into master.
Fixes #50