
@jazev-stripe (Contributor) commented Jul 26, 2024

Summary

This PR exposes an activity_max_tasks_per_second option when constructing a new Temporal::Worker instance. The setting populates the max_tasks_per_second field on PollActivityTaskQueue RPCs to the server, which has the effect of limiting how fast the server hands out activity tasks (both new activities and retries of already-running activities).

This same option is exposed in other SDKs; for example, in Java it is available on WorkerOptions via setMaxTaskQueueActivitiesPerSecond. The doc comment added in this PR was based on the doc comment in the Java SDK. The semantics of the values activity_max_tasks_per_second can take (a non-negative decimal number) and of when it is set on the wire (only if it is > 0) were copied directly from the Java SDK's implementation (although they diverge slightly from the Go SDK).
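
To make those semantics concrete, here is a minimal sketch of the value handling described above (illustrative only, not the gem's actual internals; the helper name is made up): the value must be a non-negative number, and it is only placed on the poll request's metadata when it is greater than zero.

def build_task_queue_metadata(activity_max_tasks_per_second)
  # Reject negative values outright, mirroring the Java SDK's validation.
  if activity_max_tasks_per_second && activity_max_tasks_per_second.negative?
    raise ArgumentError, 'activity_max_tasks_per_second must be a non-negative number'
  end

  # Only send the field on the wire when the value is > 0.
  return nil unless activity_max_tasks_per_second&.positive?

  { max_tasks_per_second: activity_max_tasks_per_second }
end

build_task_queue_metadata(0.5) # => { max_tasks_per_second: 0.5 }
build_task_queue_metadata(0)   # => nil (field omitted from the RPC)
build_task_queue_metadata(nil) # => nil (option not configured)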

Motivation

Allow Ruby workers to limit the rate at which activities are processed, in order to control the impact on downstream systems.

Test plan

Unit tests

The existing unit tests and example unit tests pass, and the new tests I added validate that, at each step, the option makes its way down to the underlying gRPC connection:

bundle exec rspec spec/
cd examples && bundle exec rspec spec/

Manual tests

I also manually validated that the activity task rate limit works as expected. To do so, I modified the examples/bin/worker script to include a rate-limit on the Worker.new call:

worker = Temporal::Worker.new(binary_checksum: `git show HEAD -s --format=%H`.strip, activity_max_tasks_per_second: 0.5)

Then, I ran the AsyncHelloWorldWorkflow workflow with 10 activities and validated that the workflow took around 20-30 seconds to complete (10 activities ÷ 0.5 activities/second = 20 seconds, plus a bit of overhead/inaccuracy):

cd examples
docker-compose up
bin/worker # with special code patch
bin/trigger AsyncHelloWorldWorkflow 10

[screenshot: workflow timeline showing the rate-limited run completing in roughly 20-30 seconds]

Repeating the test without the change to examples/bin/worker (i.e. an unlimited activity task queue) shows that the rate-limit makes a meaningful difference:

[screenshot: workflow timeline for the same run without the rate limit]

@cj-cb merged commit 0d3a8bb into coinbase:master on Jul 30, 2024
@gytisgreitai

@jazev-stripe can you elaborate a bit more on how this should work?

I've launched a single worker:

Temporal::Worker.new(
  activity_max_tasks_per_second: 0.1,
  activity_thread_pool_size: 1,
)

and have one activity + one workflow:

class HelloWorldWorkflow < Temporal::Workflow
  def execute
    HelloWorldActivity.execute!
  end
end

class HelloWorldActivity < Temporal::Activity
  def execute
    Net::HTTP.get_response(URI.parse('http://localhost:8000'))

    "Hello World"
  end
end

Then I launched a simple Python server locally on port 8000 and immediately scheduled 20 workflows with a simple (1..20).each do ... loop. What I'm seeing is a bit strange to me:

::ffff:127.0.0.1 - - [16/Aug/2024 15:23:01] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:01] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:04] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:04] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:08] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:08] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:12] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:12] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:16] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:20] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:24] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:28] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:23:40] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:24:20] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:25:00] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:25:40] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:26:20] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:27:00] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:27:40] "GET / HTTP/1.1" 200 -
::ffff:127.0.0.1 - - [16/Aug/2024 15:28:20] "GET / HTTP/1.1" 200 -

I would somehow expect it to run once every 10 seconds, but what I get is two parallel activities every 4 seconds, which then seem to slow down to almost 40 seconds between runs?

@jazev-stripe (Contributor, Author) commented Aug 16, 2024

I would somehow expect it to run once every 10 seconds, but what I get is two parallel activities every 4 seconds, which then seem to slow down to almost 40 seconds between runs?

The rate-limiting is done on the server side, so unfortunately I think this is just a result of the rate-limiting not being very accurate for very small values (and possibly also not accurate at short time scales). I'm not familiar with the server implementation, but if you ask in the Temporal community Slack they would probably be able to give you more details.
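
For what it's worth, here's a quick sanity check against the timestamps in your log (just a throwaway script; the timestamps are copied from your output above). The early ~4-second gaps are followed by ~40-second gaps, which looks consistent with a limiter that smooths the rate over a longer window rather than strictly enforcing one task every 10 seconds:

require 'time'

# Request timestamps copied from the Python server log above (16 Aug 2024).
timestamps = %w[
  15:23:01 15:23:01 15:23:04 15:23:04 15:23:08 15:23:08 15:23:12 15:23:12
  15:23:16 15:23:20 15:23:24 15:23:28 15:23:40 15:24:20 15:25:00 15:25:40
  15:26:20 15:27:00 15:27:40 15:28:20
].map { |t| Time.parse("2024-08-16 #{t}") }

# Gaps between consecutive requests, in seconds.
gaps = timestamps.each_cons(2).map { |a, b| (b - a).to_i }
puts gaps.inspect
# => [0, 3, 0, 4, 0, 4, 0, 4, 4, 4, 4, 12, 40, 40, 40, 40, 40, 40, 40]

puts format('average gap: %.1f s', gaps.sum.to_f / gaps.size)
# => average gap: 16.8 s
# (the configured 0.1 tasks/second implies a 10 s average gap in steady state)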

If I'm right, you should be able to reproduce the same results with any of the official SDKs, since they all use the same mechanism to implement activity task queue rate-limiting (the same protobuf field set when polling for activity tasks).

I also noticed a similar phenomenon when testing locally ("plus a bit of overhead/inaccuracy"):

it took around 20-30 seconds (10 activities ÷ 0.5 activities/second = 20 seconds, plus a bit of overhead/inaccuracy) for the workflow to complete...

One other thing to keep in mind is that the "current" value of the rate limit seems to be stored on the server side in a way that takes on the order of minutes for a new value to propagate. So make sure you haven't run a worker on the same task queue without a rate limit shortly before your test; in other words, run your test with a "clean slate" on that task queue so it isn't affected by anything else (you could even wipe the server state before running the test to be sure).
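
One simple way to get that clean slate is to point both the worker and the trigger at a task queue name that has never been used before, so no previously propagated rate limit applies. A rough sketch (the host/port/namespace values are placeholders for a local dev setup; adapt them to your configuration):

require 'securerandom'

# A fresh task queue name per test run, so no stale server-side rate limit applies.
TASK_QUEUE = "rate-limit-test-#{SecureRandom.hex(4)}"

Temporal.configure do |config|
  config.host       = 'localhost'    # placeholder: local dev server
  config.port       = 7233
  config.namespace  = 'ruby-samples' # placeholder namespace
  config.task_queue = TASK_QUEUE
end

worker = Temporal::Worker.new(
  activity_max_tasks_per_second: 0.1,
  activity_thread_pool_size: 1
)
worker.register_workflow(HelloWorldWorkflow)
worker.register_activity(HelloWorldActivity)
worker.start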
