OpenAI: Migrate to async client and enhance API support #219
base: main
Conversation
Major changes:
- Migrate to the async OpenAI client to support query cancellation and timeouts
- Add client caching using the global dictionary (GD) to improve performance
- Migrate to using raw responses to minimize type conversions and improve performance
- Add comprehensive support for all OpenAI API parameters
- Add support for client create/destroy methods

Implementation details:
- Replace the sync OpenAI client with AsyncOpenAI for better control flow
- Implement client caching in GD to reuse connections
- Add query cancellation support using asyncio
- Remove the list_models and embed function implementations from openai.py to consolidate API handling
- Move functionality directly into the SQL functions for consistency
- Return raw API responses to minimize conversions
- Add complete OpenAI API parameter support across all functions
- Standardize parameter naming with a leading underscore
- Update OpenAI and tiktoken package versions

Package updates:
- openai: 1.44.0 -> 1.51.2
- tiktoken: 0.7.0 -> 0.8.0

Breaking changes:
- Functions now return raw JSON responses instead of parsed objects
- Functions marked as parallel unsafe due to HTTP API constraints
- Parameter names now prefixed with an underscore for consistency
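To make the described pattern concrete, here is a rough, self-contained sketch (not the extension's actual code; the model name, prompt, and timeout are placeholders) of an AsyncOpenAI call that returns the raw response and can be cancelled or timed out via asyncio:

```python
import asyncio

import openai


async def chat_completion_raw(client: openai.AsyncOpenAI, timeout: float = 30.0) -> str:
    # Run the request as a task so the caller can cancel it (e.g. when the
    # backend notices the query was interrupted) or bound it with a timeout.
    task = asyncio.create_task(
        client.chat.completions.with_raw_response.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": "hello"}],
        )
    )
    try:
        raw = await asyncio.wait_for(task, timeout=timeout)
    except (asyncio.TimeoutError, asyncio.CancelledError):
        task.cancel()  # make sure the in-flight HTTP request is abandoned
        raise
    return raw.text  # raw JSON body, no client-side model parsing


if __name__ == "__main__":
    client = openai.AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    print(asyncio.run(chat_completion_raw(client)))
```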
Thank you so much for this PR! I did a preliminary check and have a bunch of questions. I just want to understand the motivation/reasoning behind some decisions. Also we'll have to decide on the json vs vector return of some of these functions. I think you are right we'll need both sets of functions. Let me ask some of my colleagues about the naming conventions we want to use here.
projects/extension/ai/openai.py (outdated)
return openai.AsyncOpenAI(**client_kwargs)
def get_or_create_client(plpy, GD: Dict[str, Any], api_key: str = None, api_key_name: str = None, base_url: str = None) -> Any:
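For readers following along, a minimal sketch of how a function like the one above can cache the client in GD (a hypothetical simplification, not the PR's actual implementation; the cache-key scheme and the unused plpy handling are assumptions):

```python
from typing import Any, Dict, Optional

import openai


def get_or_create_client(
    plpy,  # kept to mirror the signature above; the real function may use it to resolve api_key_name
    GD: Dict[str, Any],
    api_key: Optional[str] = None,
    api_key_name: Optional[str] = None,
    base_url: Optional[str] = None,
) -> openai.AsyncOpenAI:
    # Hypothetical cache key: one client per (api key name, base URL) combination.
    cache_key = f"openai_client:{api_key_name or 'default'}:{base_url or 'default'}"
    client = GD.get(cache_key)
    if client is None:
        # First call in this session: build the client once and keep it in GD
        # so later calls reuse it instead of paying the setup cost again.
        client = openai.AsyncOpenAI(api_key=api_key, base_url=base_url)
        GD[cache_key] = client
    return client
```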
Do you have any numbers showing that creating the client is expensive (and thus worth storing in GD)? Does this allow connection reuse or something? And if it's the latter, how/when do connections get closed? Is there a keepalive timeout?
Storing the client in GD adds a good amount of complexity, and I'd like to find out what we are gaining/losing for it.
Yup, benchmarks here: #116 (comment)
Also note that there is a known issue where the 2nd (and 3rd, etc.) call to the client through the API has an extra ~40 ms delay that doesn't happen when I have this code running outside of a PL/Python environment (noted in the thread above). I really should have mentioned that directly in the PR; I'll edit it to note that this still needs to be tracked down. Once that is fixed, the benchmark numbers should look much better.
Even with the above issue, this is still much faster and uses less CPU than the original implementation, where we recreate the client.
Note specifically the CPU reduction. Recreating the client is heavy on CPU; I know this from past projects, but the benchmarks also bear this out.
I believe the connection is closed after the request is complete, and the client becomes ready for the next call. If the request is cancelled early, we attempt to kill things gracefully.
@cevian Alright, so when removing the _underscore prefix we will need to make changes to the … Let me know what to go with.
@Tostino I believe we used …
Fix naming conflicts with Postgres reserved words. Reverted parallel safety changes. Went from unsafe -> safe for functions. Reverted function volatility changes.
@cevian Alright, all changes made. I also noticed that when I rebased I had accidentally committed changes to the ai--0.4.0.sql file, so I got that reverted. This still has the performance problem we need to dig into before it's merged, but at least any of the other issues can be discussed and fixed in the meantime.
@Tostino I am still not convinced we need …
I'm good with that solution. Seems to solve the problem statement I was originally trying to solve. Will get it done tonight.
… client_extra_args parameter to all relevant functions that interact with the client (other than the `_simple` function that I think needs a rethinking).
Well... some kid went and ripped out my neighborhood's internet interconnection wiring last night, so I was slightly delayed. I tested to make sure the client_extra_args were being passed through properly, and they seem to be with my initial "kick the tires" tests.
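Roughly, the pass-through being tested here can be pictured like this (a sketch under the assumption that client_extra_args arrives as a JSON object of client constructor options; the helper name is illustrative, not the extension's actual function):

```python
import json
from typing import Any, Dict, Optional

import openai


def make_client(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    client_extra_args: Optional[str] = None,  # JSON string, e.g. '{"timeout": 10, "max_retries": 3}'
) -> openai.AsyncOpenAI:
    kwargs: Dict[str, Any] = {}
    if api_key is not None:
        kwargs["api_key"] = api_key
    if base_url is not None:
        kwargs["base_url"] = base_url
    if client_extra_args:
        # Any extra constructor options are merged in verbatim and forwarded to the client.
        kwargs.update(json.loads(client_extra_args))
    return openai.AsyncOpenAI(**kwargs)
```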
I should have a little time to try to figure out that 2nd-run issue this week (or at least attempt it; I'm not a Python dev, so I'm not used to the profiling tools in this space). @cevian Is there anything else you see that needs attention at this point?
Hey @Tostino, is:
a problem with the current state as well, or was it introduced by your PR? If it's a current issue, could you open a GitHub issue for it and we can tackle that in a separate PR.
@alejandrodnm No, the current state is just a much slower overall call time every time (I believe it was roughly 25-30 ms/call) and much higher CPU usage. It's not a current issue; it was introduced by the PR. Sorry, the holidays had me a bit busy. I'll get back to this as soon as I can.
@Tostino don't worry. Just wanted to see if we could support you better. You've put a lot of effort into this, and we really appreciate it.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* docs: add voyage ai quickstart
* chore: review comments
* chore: update images and review comments
…ale#293)
* docs: fix psql command and postgres data mount
* docs: align to compose command
Signed-off-by: Matthew Peveler <mpeveler@timescale.com>
Bizarrely if you configure but don't use the pip cache, the setup-python action fails in its "post" step. See: actions/setup-python#436
* feat: load api keys from db in self hosted vectorizer
* chore: make env var override db setting
* chore: add debug logs
…escale#312)
Adds a script to evaluate and compare different embedding models using the Paul Graham essays dataset. The script:
- Supports evaluation of multiple embedding models (Nomic, OpenAI, BGE)
- Generates diverse question types (short, long, direct, implied, unclear)
- Measures retrieval accuracy by checking if source chunks are in the top-K results
- Provides detailed performance metrics by question type
- Includes a step-by-step evaluation process with CSV outputs for analysis
- The evaluation framework is configurable and can be extended to test additional embedding models. Results are saved to CSV files for further analysis.
* feat: add sqlalchemy vectorizer field
* chore: add installing extras to ci
* chore: simplify interface, add simple docs
* feat: allow arbitrary primary keys on parent model
* docs: update docs with simplified vectorizer field
* chore: rename VectorizerField to Vectorizer
* chore: update alembic exclusion mechanism
* docs: update docs with review comments
* chore: align automatic table name with create_vectorizer
* chore: add option for any relationship properties
* chore: setup class event based rather than lazy so relationship works on first query
* chore: update to embedding_relationship
* chore: refactor tests add vcr mocks
* chore: rename to vectorizer_model; cleanup
* chore: fix uv lock
* chore: remove dummy key
* Code for evaluating open source embedding models
* Update examples/finding_best_open_source_embedding_model/best_embedding_model_rag_app.ipynb
Co-authored-by: James Guthrie <JamesGuthrie@users.noreply.github.com>
Signed-off-by: Hervé Ishimwe <75611379+ihis-11@users.noreply.github.com>
* Changed from GPT 4o-mini to Llama3.2 & Addressed the comments
---------
Signed-off-by: Hervé Ishimwe <75611379+ihis-11@users.noreply.github.com>
Co-authored-by: James Guthrie <JamesGuthrie@users.noreply.github.com>
Signed-off-by: Matthew Peveler <mpeveler@timescale.com>
Well, I spent some time rebasing this branch this morning because I had a bit of time. It looks like recent changes have broken the build process again on my machine, so no progress was made towards fixing whatever the issue is. I've got a baby on the way in the next few days, so I likely won't get back to this for a while. This PR is likely dead unless someone else wants to take it up.
Congrats on the baby 🙏🏼 sending all the good vibes to you so everything goes perfectly (I've got 2, so I know how it is). I can try taking it up later, or maybe someone from the team will. Just one thing: could you leave a comment explaining what we'll need to do next? Thanks, and congrats again.
@alejandrodnm Very appreciated. This is the first, so I'm sure I'll be in for some surprises. I figured out the changes to the dev process and got the build working again on my machine, so I could make one last attempt at it. Anyway, I believe it has to do with this: openai/openai-python#1596 (and more details here: BerriAI/litellm#6592 (comment)), but I can't confirm at this point. The issue can be observed by using this profiled execute_with_cancellation method in the …
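For anyone trying to reproduce the measurement, a profiled wrapper along these lines shows where the time goes (a hypothetical reconstruction using cProfile, not the extension's actual execute_with_cancellation method):

```python
import asyncio
import cProfile
import io
import pstats


def execute_with_cancellation_profiled(coro):
    """Run an async OpenAI call under cProfile and print the hottest frames."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        return asyncio.run(coro)
    finally:
        profiler.disable()
        out = io.StringIO()
        pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(25)
        print(out.getvalue())
```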
Here is my example query:
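The original query isn't preserved here; a hypothetical equivalent issued from Python (assuming psycopg and an openai_chat_completion function in the ai schema taking a model name and a messages jsonb, which may not match the exact signature in this branch) might look like:

```python
import json

import psycopg  # any Postgres driver works; psycopg 3 shown here

SQL = """
select ai.openai_chat_completion(
    'gpt-4o-mini',
    %s::jsonb
)
"""

with psycopg.connect("dbname=postgres") as conn:
    messages = [{"role": "user", "content": "hello"}]
    row = conn.execute(SQL, (json.dumps(messages),)).fetchone()
    print(row[0])  # raw JSON response from the API
```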
And you can use this dummy server for testing which removes all the server-side delay:
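Something along these lines works as a stand-in (a hypothetical mock of the /v1/chat/completions endpoint, not the server actually used; point the client's base_url at http://127.0.0.1:8000/v1):

```python
# Minimal mock of the chat completions endpoint that responds instantly.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockOpenAIHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))  # drain request body
        body = json.dumps({
            "id": "chatcmpl-mock",
            "object": "chat.completion",
            "created": 0,
            "model": "mock-model",
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": "ok"},
                "finish_reason": "stop",
            }],
            "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), MockOpenAIHandler).serve_forever()
```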
And finally, here is the profile I collected from the above: openai-profile-01.txt
The 2nd (3rd, 4th, etc.) runs being slow is only true if we are reusing the client. We can temporarily just disable reusing the client and force recreation every time until an upstream fix is in place. Another option is to just ignore it for now, because most requests will be hitting a remote endpoint with lengthy latency, so the little bit extra won't matter anyway.
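If disabling reuse ends up being the stopgap, the switch can stay tiny (a sketch; CACHE_CLIENTS is a hypothetical flag, not an existing setting in the extension):

```python
CACHE_CLIENTS = False  # hypothetical kill switch until the upstream issue is resolved


def get_client(GD, make_client, cache_key):
    if not CACHE_CLIENTS:
        # Recreate per call: higher CPU, but avoids the ~45 ms stall on reused clients.
        return make_client()
    if cache_key not in GD:
        GD[cache_key] = make_client()
    return GD[cache_key]
```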
So I spent more time today and replaced the async implementation with an equivalent sync implementation. It made absolutely no difference in behavior between async and sync: the second call to the client (even when called twice within the same function call) always has that extra ~45 ms delay with both implementations.

When caching the client:
- ~6 s / 100 calls to openai_chat_completion
- 5% CPU per connection (e.g. can easily have 100+ connections to the server running inference without killing performance)

When creating a new client:
- ~3 s / 100 calls to openai_chat_completion
- 100% CPU per connection (e.g. can only have NUM_CORES connections to the server before saturating the CPU)

So even though each individual connection finishes its calls twice as fast when creating a new client, the server gets overloaded by just a handful of users calling these functions. I hate that Python performance is just so damn terrible to begin with that this is something we have to spend time optimizing. With that said, I think the only real option is to stick with caching the client (just like it is now in this branch), because it doesn't knock over otherwise healthy servers when users decide to actually use the functions.
Uh, looks like I made a mistake in rebasing and pulled in some unexpected changes somehow. Gotta get that fixed right now.
Edit: Not sure... but on this PR it says there are 26k lines changed, which is very wrong.