Handle the RateLimitError when AI provider has API limitations #250

Open
jrmi opened this issue Nov 10, 2024 · 2 comments
Comments

@jrmi
Contributor

jrmi commented Nov 10, 2024

Some AI providers have rate limits on top of the model's context limit. These rate limits can be:

  • Max number of requests per minute
  • Max requests per day
  • Max tokens per minute
  • Max images per minute
  • and probably others.

This is the case for both the OpenAI and Anthropic providers.

These limits depend heavily on the plan the user has subscribed to. The problem with these limits is that they can be lower than the actual model context limit. As a consequence, depending on how you use gptme, you can easily hit the token rate limit while still well below the model context limit, especially because gptme can chain multiple requests quickly. It would be great to gracefully handle these limits, even when the limit is low.

There is already a process to truncate the messages when the context limit is exceeded, but as said before, the per-minute rate limit can be relatively low on some plans (30 000 tokens for the first paid plan, for instance). It would be great to reuse this process, or something similar, for the rate limit.

I think with gptme the most common situations are:

  • hitting the max tokens per minute because we sent multiple requests in a row
  • hitting the max tokens per minute because the log is bigger than the limit
@jrmi
Contributor Author

jrmi commented Nov 10, 2024

Here are some thoughts on how to solve this issue in gptme.

Solution 1 - Catch the exception and retry

We could catch the RateLimitError and, when it happens, simply retry with exponential backoff. This would solve the case where you exceed the rate limit by sending requests too fast, but it doesn't work when the message log is too big for a single request. This is the solution described by OpenAI (see the sketch below).
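
A minimal sketch of what this could look like, assuming the openai v1 Python SDK; the helper name and retry parameters are illustrative, not existing gptme code:

```python
import random
import time

import openai


def complete_with_backoff(client: openai.OpenAI, max_retries: int = 5, **request_kwargs):
    """Retry a chat completion with exponential backoff on rate limits."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request_kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after max_retries attempts
            # Exponential backoff with jitter, as OpenAI's docs suggest.
            time.sleep(delay * (1 + random.random()))
            delay *= 2
```

Usage would be a drop-in wrapper around the existing call site, e.g. `complete_with_backoff(client, model="gpt-4o", messages=messages)`.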

Solution 2 - Track the limit with the response headers

OpenAI and Anthropic return the current rate-limit status in response headers. This seems to be the recommended way to track token consumption. We could keep the current limits in a shared context (maybe on the current model?) and check them while preparing the message to make the right decision: either wait before sending the request, or reduce the log, depending on which limit is exceeded.
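
A rough sketch of the idea, assuming the openai v1 SDK's `with_raw_response` accessor and OpenAI's documented `x-ratelimit-*` headers (Anthropic exposes similar `anthropic-ratelimit-*` headers); the shared `RateLimitState` is hypothetical:

```python
from dataclasses import dataclass

import openai


@dataclass
class RateLimitState:
    """Hypothetical shared context holding the last-seen rate-limit headers."""
    remaining_tokens: int | None = None
    remaining_requests: int | None = None


state = RateLimitState()


def complete_tracked(client: openai.OpenAI, **request_kwargs):
    """Send a request and record the provider's rate-limit headers."""
    raw = client.chat.completions.with_raw_response.create(**request_kwargs)
    headers = raw.headers
    if "x-ratelimit-remaining-tokens" in headers:
        state.remaining_tokens = int(headers["x-ratelimit-remaining-tokens"])
    if "x-ratelimit-remaining-requests" in headers:
        state.remaining_requests = int(headers["x-ratelimit-remaining-requests"])
    return raw.parse()  # the parsed ChatCompletion object


# Before the next request, the caller could then decide to wait until the
# limit resets, or to truncate the message log, e.g.:
# if state.remaining_tokens is not None and state.remaining_tokens < estimated_tokens:
#     ...wait, or reduce the log...
```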

What do you think?

@jrmi
Contributor Author

jrmi commented Nov 10, 2024

Another quick fix could be to add a CLI option to force the max token count per request. It could help save money as well :D
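
A hypothetical sketch of such an option, using click (which gptme's CLI is built on); the flag name `--max-tokens-per-request` and the truncation hook are illustrative, not existing gptme flags:

```python
import click


@click.command()
@click.option(
    "--max-tokens-per-request",
    type=int,
    default=None,
    help="Cap the number of tokens sent in a single request.",
)
def main(max_tokens_per_request: int | None):
    """Sketch: thread the cap into the existing message-truncation step."""
    if max_tokens_per_request is not None:
        click.echo(f"Truncating log to {max_tokens_per_request} tokens per request")
        # ...reuse the context-limit truncation with this lower ceiling...


if __name__ == "__main__":
    main()
```

This would effectively reuse the existing truncation process (Solution 2's fallback) with a user-chosen ceiling instead of the model's context limit.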
