Rate Limit and Retry for Models #1734
Conversation
@grll Nice work! Looks like we still have some failing tests, let me know if you'd like some guidance there. As for the code, can we please reduce the duplication a bit by moving the …
+1 this is nicely done.
@grll @DouweM what do you think about having a GCRA implementation as implemented by something like https://github.com/ZhuoZhuoCrayon/throttled-py ?
It's a more efficient variant of the leaky bucket algorithm without its downsides (i.e. it doesn't need a background "drip" process). See https://brandur.org/rate-limiting
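For reference, a minimal sketch of the GCRA idea (my own illustration based on the article above, not throttled-py's implementation): instead of a background drip, the limiter stores a single "theoretical arrival time" (TAT) per key and compares it to the clock on each request.

```python
import time


class GCRASketch:
    """Toy single-key GCRA limiter; assumes one process and no concurrent callers."""

    def __init__(self, rate_per_sec: float, burst: int = 1):
        self.emission_interval = 1.0 / rate_per_sec          # time "cost" of one request
        self.burst_tolerance = self.emission_interval * burst
        self.tat = 0.0                                        # theoretical arrival time

    def allow(self) -> bool:
        now = time.monotonic()
        tat = max(self.tat, now)
        # Reject if granting this request would push us past the burst window.
        if tat - now > self.burst_tolerance - self.emission_interval:
            return False
        self.tat = tat + self.emission_interval               # consume one slot
        return True


limiter = GCRASketch(rate_per_sec=2, burst=2)
print([limiter.allow() for _ in range(4)])  # first two pass, the rest are limited
```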
Hey, thanks for the review! I will have a look; I was indeed struggling a bit with the tests in CI while they were working fine locally, so I need to investigate a bit more. I can also have a look at the refactor you suggested!
Hey, thanks for the suggestion. The initial goal with using aiolimiter was to keep things as simple as possible, but we could instead integrate with throttled-py in order to let users choose from various algorithms / Redis backends as well.
@grll I am the developer of throttled-py. throttled-py provides flexible rate limiting strategies and storage backend configurations. I am very willing to participate in the discussion and implementation of this PR.
Hey @ZhuoZhuoCrayon, thanks for jumping in, and great work on throttled-py. One concern I have is that asyncio support is limited in the current implementation of throttled-py. What is the current status around asyncio?
@grll Features are in the development branch and a stable version will be released this weekend.
throttled-py asynchronous support has been officially released in v2.1.0. The core API is the same for synchronous and asynchronous code; just replace `from throttled import ...` with `from throttled.asyncio import ...`:

```python
import asyncio

from throttled.asyncio import RateLimiterType, Throttled, rate_limiter, store, utils

throttle = Throttled(
    using=RateLimiterType.GCRA.value,
    quota=rate_limiter.per_sec(1_000, burst=1_000),
    store=store.MemoryStore(),
)


async def call_api() -> bool:
    result = await throttle.limit("/ping", cost=1)
    return result.limited


async def main():
    benchmark: utils.Benchmark = utils.Benchmark()
    denied_num: int = sum(await benchmark.async_serial(call_api, 100_000))
    print(f"❌ Denied: {denied_num} requests")


if __name__ == "__main__":
    asyncio.run(main())
```

I think GCRA has lower performance overhead and smoother throttling. At the same time, making the storage backend optional can increase subsequent scalability. Anyway, adding throttling and retries to the model is very cool, thank you!
@ZhuoZhuoCrayon Thanks a ton, throttled looks great! @grll Are you up for changing this PR to use throttled instead? |
Sure happy to! |
Hi @grll,
@ZhuoZhuoCrayon thanks for the reminder, busy times. I am on it now! I will switch to throttled and fix the CI :)
I have made the changes, but it seems throttled is not working as expected, unless I am doing something wrong:

```python
#!/usr/bin/env python3
import asyncio
import time

from throttled.asyncio import RateLimiterType, Throttled, rate_limiter, store


async def test_throttled():
    # Create a throttle that allows 2 requests per second
    throttle = Throttled(
        using=RateLimiterType.GCRA.value,
        quota=rate_limiter.per_sec(2),
        store=store.MemoryStore(),
    )

    print("Testing Throttled rate limiter...")
    start_time = time.time()

    # Make 5 sequential requests
    for i in range(5):
        await throttle.limit('default', cost=1)
        print(f"Request {i+1} completed at {time.time() - start_time:.2f}s")

    total_time = time.time() - start_time
    print(f"\nTotal time for 5 requests at 2/sec: {total_time:.2f}s")
    print(f"Expected time: ~2.0s (5 requests / 2 per sec - 1)")


if __name__ == "__main__":
    asyncio.run(test_throttled())
```

Produces:
In function call mode, `limit` returns a `RateLimitResult` immediately. You can use `RateLimitResult.limited` to check whether the request was limited:

```python
# Make 5 sequential requests
for i in range(5):
    result = await throttle.limit('default', cost=1)
    # Here's the problem.
    assert result.limited
```

You can specify a `timeout` to enable wait-and-retry mode:

```python
# Make 5 sequential requests
for i in range(5):
    # Set timeout=5 to enable wait-and-retry (max wait 5 seconds); returns the final RateLimitResult.
    result = await throttle.limit('default', cost=1, timeout=5)
    assert not result.limited
```

In addition, in decorator and context manager modes, triggering rate limiting will throw LimitedError:

```python
#!/usr/bin/env python3
import asyncio
import time

from throttled.asyncio import RateLimiterType, Throttled, rate_limiter, store, exceptions


async def test_throttled():
    # Create a throttle that allows 2 requests per second
    throttle = Throttled(
        key='default',
        # You can use `timeout` to enable wait-retry mode.
        # timeout=1,
        using=RateLimiterType.GCRA.value,
        quota=rate_limiter.per_sec(2),
        store=store.MemoryStore(),
    )

    print("Testing Throttled rate limiter...")
    start_time = time.time()

    # Make 5 sequential requests
    for i in range(5):
        # If no timeout is set, exceptions.LimitedError will be thrown on the third execution.
        async with throttle:
            print(f"Request {i+1} completed at {time.time() - start_time:.2f}s")

    total_time = time.time() - start_time
    print(f"\nTotal time for 5 requests at 2/sec: {total_time:.2f}s")


if __name__ == "__main__":
    asyncio.run(test_throttled())
```
@grll

```python
#!/usr/bin/env python3
import asyncio
import time

from throttled.asyncio import RateLimiterType, Throttled, rate_limiter, store


async def test_throttled():
    # Create a throttle that allows 2 requests per second
    throttle = Throttled(
        using=RateLimiterType.GCRA.value,
        quota=rate_limiter.per_sec(2),
        store=store.MemoryStore(),
    )

    print("Testing Throttled rate limiter...")
    start_time = time.time()

    # Make 5 sequential requests
    for i in range(5):
        # ⏳ Set timeout=0.5 to enable wait-and-retry (max wait 0.5 second)
        await throttle.limit('default', cost=1, timeout=0.5)
        print(f"Request {i+1} completed at {time.time() - start_time:.2f}s")

    total_time = time.time() - start_time
    print(f"\nTotal time for 5 requests at 2/sec: {total_time:.2f}s")
    print(f"Expected time: ~1.5s")


if __name__ == "__main__":
    asyncio.run(test_throttled())
```
Replace aiolimiter with throttled-py (>=2.2.0) as the rate limiting library for RateLimitedModel. Throttled provides more flexible rate limiting with different algorithms (GCRA, FixedWindow) and better control over blocking behavior.
- Change from aiolimiter.AsyncLimiter to throttled.asyncio.Throttled
- Update usage from context manager to explicit limit() calls
- Add key, cost, and timeout parameters to request methods
- Set default timeout to 30s to enable blocking behavior
- Update docstring with new usage example

The timeout parameter is crucial: without it or with -1, the rate limiter returns immediately instead of waiting for capacity.

- Replace AsyncLimiter with Throttled in all tests
- Remove unused imports (AsyncMock, patch)
- Update rate limiter configurations to use GCRA algorithm
- Fix concurrent requests test to verify actual rate limiting
- Adjust timing expectations to account for GCRA burst behavior

The concurrent requests test now properly verifies that rate limits are enforced by measuring actual execution time.

- Add test for RuntimeError when retryer is exhausted without exceptions
- Test FailingModel properties (system, model_name) to cover lines 34, 38
- Test SuccessResponseStream properties (model_name, timestamp) to cover lines 90, 94
- Achieve 100% test coverage for rate_limited.py and test_rate_limited.py

- Use string 'gcra' instead of RateLimiterType.GCRA enum
- Import MemoryStore from throttled.asyncio.store
- Replace unused loop variable 'i' with '_'
- Fix __aiter__ override to properly implement async iteration protocol
- Update docstring examples to match correct usage
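To illustrate the `timeout` point in the commit notes above, a small sketch based on the throttled.asyncio API shown earlier in this thread (not the PR's actual code):

```python
import asyncio

from throttled.asyncio import Throttled, rate_limiter


async def main() -> None:
    throttle = Throttled(using='gcra', quota=rate_limiter.per_sec(2))

    # Without a timeout (or with -1), limit() returns a RateLimitResult immediately,
    # so the caller has to check result.limited itself.
    non_blocking = await throttle.limit('default', cost=1)

    # With a timeout, limit() waits up to that many seconds for capacity before
    # returning the final RateLimitResult.
    blocking = await throttle.limit('default', cost=1, timeout=30)

    print(non_blocking.limited, blocking.limited)


asyncio.run(main())
```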
@DouweM looks like we are ready for a review! PS: That's definitely the toughest CI I have ever seen, but I guess coming from typing experts, no surprise here. Anyway, I got to the bottom of it.
```python
raise RuntimeError('Model request failed after all retries')
else:
    if self.limiter:
        await self.limiter.limit(key, cost, timeout)
```
Could we consider verifying the rate limit result (`RateLimitResult.limited`) at this point, and raising a LimitedError (or another exception compliant with the pydantic-ai specification) when the rate limit is still exceeded after the timeout?
For example:
```python
if self.limiter:
    result = await self.limiter.limit(key, cost, timeout)
    # 💡 Check if the limit has been exceeded.
    if result.limited:
        raise RuntimeError('Rate limit exceeded.')
return await super().request(...)
```

I am concerned that skipping the check and still executing the request after exceeding the rate limit may cause the model to encounter unpredictable third-party exceptions.
Yes, good point, I will adjust.
@DouweM any chance we could get a review on this soon? I will address the point above, but wanted to also combine it with your feedback if possible.
```python
model_request_parameters: ModelRequestParameters,
key: str = 'default',
cost: int = 1,
timeout: float | None = 30.0,
```
Would it make sense to move these to the initializer? Or could they be values on the limiter that's passed? Users don't call request and request_stream directly, so they wouldn't have an obvious way to set these values.
Maybe we can move the parameters to the initialization stage:
```python
throttle = Throttled(
    using='gcra',
    quota=rate_limiter.per_sec(1_000, burst=1_000),
    # store can be omitted, the global MemoryStore is provided by default.
    # store=MemoryStore(),
    key='default',
    cost=1,
    timeout=30,
)
```

```python
messages: list[ModelMessage],
model_settings: ModelSettings | None,
model_request_parameters: ModelRequestParameters,
key: str = 'default',
```
What's the implication of this value being the same for all models (unless it's overwritten)? Should we use a different value for each model/agent?
@grll @DouweM The target of rate limiting is the LLM API, which is different for each model and needs to be independent. Can we use self.model_name as the default key unless a specific key already exists for the limiter?

```python
if self.limiter:
    # 💡 Priority: self.limiter.key > self.model_name.
    result = await self.limiter.limit(self.limiter.key or self.model_name)
    ...
```
@ZhuoZhuoCrayon I think that makes sense, but let's add in the provider as well: `{self.system}:{self.model_name}`
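A tiny illustration of that key-derivation idea (hypothetical helper name; the provider and model values are made-up examples, not from the PR):

```python
def default_rate_limit_key(limiter_key: str | None, system: str, model_name: str) -> str:
    """Prefer an explicit limiter key, otherwise derive a provider-qualified one."""
    return limiter_key or f'{system}:{model_name}'


# Each provider/model pair gets its own bucket unless a shared key is set explicitly.
assert default_rate_limit_key(None, 'openai', 'gpt-4o') == 'openai:gpt-4o'
assert default_rate_limit_key('shared', 'openai', 'gpt-4o') == 'shared'
```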
| "opentelemetry-api>=1.28.0", | ||
| "typing-inspection>=0.4.0", | ||
| "tenacity>=9.1.2", | ||
| "throttled-py>=2.2.0" |
@Kludex What do you think of these being included by default? Should we put them in an optional dependency group?
Let's add these to a new optional dependency group called `rate_limiting`, and add a `try:`/`except ImportError:` block to `models/rate_limited.py` like we have on `providers/openai.py` that suggests installing with that dependency group.
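Roughly what that guard could look like (a sketch following the described pattern; the exact error message and install extra are assumptions):

```python
try:
    from throttled.asyncio import Throttled, rate_limiter
except ImportError as _import_error:  # pragma: no cover
    raise ImportError(
        'Please install `throttled-py` to use the rate-limited model, '
        'you can use the `rate_limiting` optional group - '
        '`pip install "pydantic-ai-slim[rate_limiting]"`'
    ) from _import_error
```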
```python
from pydantic_ai.usage import Usage


class FailingModel(Model):
```
We already do some testing with fake failing models in tests/models/test_fallback.py. Could we borrow the approach taken there using FunctionModel?
Can you update your uv, reset uv.lock to the version from main, and run make install again? We should see only a handful of changes here for the new packages, not "2,071 additions, 2,049 deletions" :)
@ZhuoZhuoCrayon What exactly did you want our take on? 😄 I assumed the conversation between you and @grll about how …
@grll I had a chat about this with @Kludex, and we've got a few thoughts!
If we try to separately 1) allow request retries and 2) handle API rate limits, I think we'd end up with a different implementation than what we got here by conflating them as they are. What do you think? My bad for letting you run in this direction for a bit before properly thinking it through 😅
Hey @DouweM no worries at all, this week is really packed with events, but I have it in my todos to review your proposal to split. One thing that came to mind: OpenAI is one thing, but I am not sure all providers populate retry headers in a consistent and similar way, so we might have challenges there.
Those headers are widely used. Every serious provider should have them.
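For context, a minimal, hypothetical sketch of honoring a `Retry-After` header from an HTTP 429 response (not taken from any pydantic-ai code; the helper name is illustrative):

```python
import email.utils
import time

import httpx


def retry_after_seconds(response: httpx.Response, default: float = 1.0) -> float:
    """Parse Retry-After, which may be a delay in seconds or an HTTP-date."""
    value = response.headers.get('retry-after')
    if value is None:
        return default
    if value.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 30"
        return float(value)
    # HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2025 07:28:00 GMT"
    parsed = email.utils.parsedate_to_datetime(value)
    return max(0.0, parsed.timestamp() - time.time())
```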
@grll I'm closing this PR as it's been a few weeks with no response; if you're still interested in implementing this feel free to reopen it! |
@grll If we could make it use the rate limit headers, definitely! |
#2282 implements support for HTTP 429 with respect for the …
Add a quite popular feature request: Rate Limit and Retry support for models.
It is implemented as a wrapper like the InstrumentedModel. It leverages `aiolimiter` for a simple implementation of the leaky bucket algorithm for rate limiting, while retry leverages the `tenacity` library.

Usage:
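The original usage snippet did not survive extraction; below is a hypothetical sketch of what wrapping a model could look like, based only on the description above and the commit notes in this thread (the `RateLimitedModel` constructor arguments and module path are assumptions, not the PR's actual API):

```python
from aiolimiter import AsyncLimiter
from tenacity import AsyncRetrying, stop_after_attempt, wait_exponential

from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.models.rate_limited import RateLimitedModel  # assumed module path

# Wrap any model: requests pass through the leaky-bucket limiter and are retried
# via tenacity when they fail. Constructor arguments are illustrative assumptions.
model = RateLimitedModel(
    OpenAIModel('gpt-4o'),
    limiter=AsyncLimiter(max_rate=10, time_period=60),  # at most 10 requests per minute
    retryer=AsyncRetrying(stop=stop_after_attempt(3), wait=wait_exponential()),
)
agent = Agent(model)
```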