Some AI providers have rate limits on top of the model context limit. These rate limits can be:

- max number of requests per minute
- max requests per day
- max tokens per minute
- max images per minute
- and probably others.
This is the case for OpenAI and Anthropic providers.
These limits depend heavily on the plan the user has subscribed to. The problem with these limits is that they can be lower than the actual model context limit. As a consequence, depending on how you use gptme, you can easily hit the token rate limit while still being far below the model context limit, especially because gptme can chain multiple requests quickly. It would be great to handle these limits gracefully, even when the limit is low.
There is already a process to truncate the messages when the context limit is exceeded, but as said before, the per-minute rate can be relatively low on some plans (30,000 tokens on the first paid plan, for instance). It would be great to use this process, or something similar, for the rate limit.
I think with gptme the most common situations are:

- hitting the max tokens per minute because we sent multiple requests in a row
- hitting the max tokens per minute because the log is bigger than the limit
Here are some thoughts on how to solve this issue in gptme.
Solution 1 - catch the exception and retry
We could catch the RateLimitError and, when it happens, retry with exponential backoff. This would solve the case where you exceed the rate limit by sending requests too fast, but it doesn't work when the message log is too big for a single request. This is the solution described by OpenAI.
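A minimal sketch of what this could look like with the openai Python client (the helper name and retry/backoff parameters are just placeholders for illustration, not part of gptme):

```python
import random
import time

import openai


def complete_with_backoff(client: openai.OpenAI, max_retries: int = 5, **request_kwargs):
    """Call the chat completions endpoint, retrying on RateLimitError
    with exponential backoff and jitter (hypothetical helper)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request_kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # wait 2^attempt seconds plus jitter before retrying
            time.sleep(2**attempt + random.uniform(0, 1))


# usage:
# client = openai.OpenAI()
# response = complete_with_backoff(
#     client,
#     model="gpt-4o",
#     messages=[{"role": "user", "content": "hello"}],
# )
```

Note that, as far as I know, the official openai and anthropic Python clients already retry rate-limited requests a couple of times by default (configurable via `max_retries`), but that alone doesn't help when the log itself is bigger than the per-minute token budget.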
Solution 2 - Track the limit with the response headers
OpenAI and Anthropic return the current rate limits in response headers. This seems to be the recommended way to track token consumption. We could keep the current limits in a shared context (maybe on the current model?) and check them while preparing the message, to make the right decision: either wait before sending the request, or reduce the log, depending on which limit is exceeded.
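For reference, a rough sketch of reading those headers with the openai client's raw-response helpers (the header names are the `x-ratelimit-*` ones OpenAI documents; Anthropic has `anthropic-ratelimit-*` equivalents; the model name and how gptme would store the values are assumptions):

```python
import openai

client = openai.OpenAI()

# with_raw_response exposes the underlying HTTP response (and thus the
# headers), while .parse() still returns the usual completion object.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
)
completion = raw.parse()

# OpenAI reports the current limits in x-ratelimit-* headers;
# Anthropic uses anthropic-ratelimit-* equivalents.
remaining_requests = int(raw.headers["x-ratelimit-remaining-requests"])
remaining_tokens = int(raw.headers["x-ratelimit-remaining-tokens"])
reset_tokens = raw.headers["x-ratelimit-reset-tokens"]  # e.g. "6m0s"

# gptme could stash these values in a shared context and, before the next
# request, compare the estimated token count of the log against
# remaining_tokens: wait until the reset if only the rate budget is low,
# or truncate/reduce the log if the log itself is larger than the budget.
print(remaining_requests, remaining_tokens, reset_tokens)
```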