
OpenAI: Migrate to async client and enhance API support #219

Open
Tostino wants to merge 68 commits into main from Tostino:fix_openai_redux

Conversation

Tostino

@Tostino Tostino commented Nov 13, 2024

Major changes:

  • Migrate to async OpenAI client to support query cancellation and timeouts
  • Add client caching using global dictionary (GD) to improve performance
  • Migrate to using raw responses to minimize type conversions and improve performance
  • Add comprehensive support for all OpenAI API parameters
  • Add support for client create/destroy methods

Implementation details:

  • Replace sync OpenAI client with AsyncOpenAI for better control flow
  • Implement client caching in GD to reuse connections (see the sketch after this description)
  • Add query cancellation support using asyncio
  • Remove list_models and embed function implementations from openai.py to consolidate API handling
  • Move functionality directly into the SQL functions for consistency
  • Return raw API responses to minimize conversions
  • Add complete OpenAI API parameter support across all functions
  • Standardize parameter naming with leading underscore
  • Update OpenAI and tiktoken package versions

Package updates:

  • openai: 1.44.0 -> 1.51.2
  • tiktoken: 0.7.0 -> 0.8.0

Breaking changes:

  • Functions now return raw JSON responses instead of parsed objects
  • Functions marked as parallel unsafe due to HTTP API constraints
  • Parameter names now prefixed with underscore to reduce naming conflicts

Known issues:

  • Inconsistent performance inside the plpython environment.
    • The first call to the endpoint is quick (3ms), and then every call after it is very delayed (40ms). Need to figure out what is happening here. I cannot reproduce outside of plpython.
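
A minimal sketch of the caching and cancellation approach described above. It is illustrative only: the cache-key scheme and the simplified function bodies are assumptions, not this branch's exact code (the real get_or_create_client also accepts api_key_name, and is_query_cancelled is defined elsewhere in openai.py).

import asyncio
from typing import Any, Dict

import openai


def get_or_create_client(plpy, GD: Dict[str, Any], api_key: str, base_url: str = None) -> openai.AsyncOpenAI:
    # Cache one AsyncOpenAI client per (api_key, base_url) in plpython's GD
    # so repeated calls in a session reuse the client instead of rebuilding it.
    cache_key = f"openai_client:{api_key}:{base_url}"  # illustrative key scheme
    client = GD.get(cache_key)
    if client is None:
        client = openai.AsyncOpenAI(api_key=api_key, base_url=base_url)
        GD[cache_key] = client
    return client


def execute_with_cancellation(plpy, client, async_func, **kwargs) -> Dict[str, Any]:
    # Run the async API call while periodically polling Postgres for a
    # cancellation request; is_query_cancelled() lives elsewhere in openai.py.
    async def main():
        task = asyncio.create_task(async_func(client, kwargs))
        while not task.done():
            if is_query_cancelled(plpy):
                task.cancel()
                raise plpy.SPIError("Query cancelled by user")
            await asyncio.sleep(0.01)
        return await task

    return asyncio.get_event_loop().run_until_complete(main())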

@Tostino Tostino requested a review from a team as a code owner November 13, 2024 16:54
Collaborator

@cevian cevian left a comment

Thank you so much for this PR! I did a preliminary check and have a bunch of questions. I just want to understand the motivation/reasoning behind some decisions. Also we'll have to decide on the json vs vector return of some of these functions. I think you are right we'll need both sets of functions. Let me ask some of my colleagues about the naming conventions we want to use here.

Review thread on projects/extension/sql/idempotent/001-openai.sql (outdated, resolved)

return openai.AsyncOpenAI(**client_kwargs)

def get_or_create_client(plpy, GD: Dict[str, Any], api_key: str = None, api_key_name: str = None, base_url: str = None) -> Any:
Collaborator

Do you have any numbers showing that creating the client is expensive (and thus worth storing in GD)? Does this allow connection reuse or something? And if it's the latter, how/when do connections get closed? Is there a keepalive timeout?

Storing the client in GD adds a good amount of complexity, and I'd like to find out what we are gaining/losing from it.

Author

@Tostino Tostino Nov 14, 2024


Yup, benchmarks here: #116 (comment)

Also note that there is a known issue where the 2nd (and 3rd, etc.) call through the client to the API has some extra 40ms delay that doesn't happen when I run this code outside of a pl/python environment (noted in the thread above). I really should have mentioned that directly in the PR; I will edit it to note that the cause still needs to be identified. Once that is fixed, the benchmark numbers should look much better.

Even with the above issue, this is still much faster and uses less CPU than the original implementation, where we recreate the client on every call.

Note specifically the CPU reduction. Recreating the client is heavy on CPU; I know this from past projects, but the benchmarks also bear this out.

I believe the connection is closed after the request completes, and the client becomes ready for the next call. If the request is cancelled early, we attempt to shut things down gracefully.

Four more review threads on projects/extension/sql/idempotent/001-openai.sql (outdated, resolved)
@Tostino
Author

Tostino commented Nov 14, 2024

@cevian Alright, so when removing the _underscore prefix we will need to change the user argument so that it differs from the OpenAI API / Python client: user is a reserved word in Postgres, so it's an invalid argument name, along with text, which you already changed to text_input.

Let me know what to go with.

@cevian
Collaborator

cevian commented Nov 14, 2024

@Tostino I believe we used openai_user before. Let's stick with that.
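
To make the rename concrete, a hedged sketch of the mapping; the helper below is hypothetical and the actual plpython bodies in this PR may differ. The SQL functions expose openai_user, and the plpython side forwards it as the OpenAI user parameter.

def build_chat_kwargs(model: str, messages: list, openai_user: str = None, **extra) -> dict:
    # `user` can't be used as a SQL parameter name (Postgres reserved word),
    # so the SQL-facing argument is `openai_user` and we map it back here.
    kwargs = {"model": model, "messages": messages, **extra}
    if openai_user is not None:
        kwargs["user"] = openai_user  # parameter name expected by the OpenAI API
    return kwargs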

Fix naming conflicts with Postgres reserved words.

Reverted parallel safety changes. Went from unsafe -> safe for functions.

Reverted function volatility changes.
@Tostino
Author

Tostino commented Nov 14, 2024

@cevian Alright, all changes made. Also noticed that when I rebased things I had accidentally committed changes to the ai--0.4.0.sql file, so I got that reverted.

This still has the performance problem we need to dig into before it's merged, but the other issues can at least be discussed and fixed in the meantime.

@cevian
Collaborator

cevian commented Nov 14, 2024

@Tostino I am still not convinced we need ai.openai_client_create() as a public function. Can we just pass any client options we need as a jsonb value to the other functions via a client_extra_args parameter?

@Tostino
Author

Tostino commented Nov 14, 2024

I'm good with that solution. It solves the problem I was originally trying to address. Will get it done tonight.

… client_extra_args parameter to all relevant functions that interact with the client (other than the `_simple` function that I think needs a rethinking).
@Tostino
Author

Tostino commented Nov 15, 2024

Well... some kid went and ripped out my neighborhood's internet interconnection wiring last night, so I was slightly delayed.

Tested to make sure the client_extra_args are being passed through properly, and they seem to be based on my initial "kick the tires" tests.
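
For anyone picking this up later, roughly what the pass-through looks like. This is a sketch under assumptions: the helper name is made up, and plpython hands the jsonb value to the function as a JSON string.

import json
from typing import Any, Dict

import openai


def make_client(api_key: str, base_url: str = None, client_extra_args: str = None) -> openai.AsyncOpenAI:
    # Merge any extra constructor options supplied as jsonb from SQL
    # (e.g. '{"timeout": 30, "max_retries": 0}') into the client kwargs.
    client_kwargs: Dict[str, Any] = {"api_key": api_key}
    if base_url is not None:
        client_kwargs["base_url"] = base_url
    if client_extra_args is not None:
        client_kwargs.update(json.loads(client_extra_args))
    return openai.AsyncOpenAI(**client_kwargs)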

@Tostino
Author

Tostino commented Nov 19, 2024

I should have a little time to try to figure out that 2nd-run issue this week (or at least attempt it; I'm not a Python dev, so I'm not used to the profiling tools in this space).

@cevian Is there anything else you see that needs attention at this point?

@alejandrodnm
Contributor

Hey @Tostino, is this:

The first call to the endpoint is quick (3ms), and then every call after it is very delayed (40ms). Need to figure out what is happening here. I cannot reproduce outside of plpython.

a problem with the current state as well, or was it introduced by your PR?

If it's a current issue, could you open a GitHub issue for it so we can tackle it in a separate PR?

@Tostino
Author

Tostino commented Dec 3, 2024

@alejandrodnm No, the current state just has a much slower overall call time on every call (I believe it was roughly 25-30ms/call) and much higher CPU usage. This particular issue isn't present in the current state; it was introduced by the PR.

Sorry, holidays had me a bit busy. Will get back to this as soon as I can.

@alejandrodnm
Contributor

@Tostino don't worry. Just wanted to see if we could support you better. You've put a lot of effort into this, and we really appreciate it.

MasterOdin and others added 22 commits December 17, 2024 10:22
Signed-off-by: Matthew Peveler <mpeveler@timescale.com>
Bizarrely if you configure but don't use the pip cache, the
setup-python action fails in its "post" step.

See: actions/setup-python#436
* feat: load api keys from db in self hosted vectorizer

* chore: make env var override db setting

* chore: add debug logs
…escale#312)

Adds a script to evaluate and compare different embedding models using the Paul Graham essays dataset.

The script:
- Supports evaluation of multiple embedding models (Nomic, OpenAI, BGE)
- Generates diverse question types (short, long, direct, implied, unclear)
- Measures retrieval accuracy by checking if source chunks are in top-K results
- Provides detailed performance metrics by question type
- Includes step-by-step evaluation process with CSV outputs for analysis
- The evaluation framework is configurable and can be extended to test additional
  embedding models. Results are saved to CSV files for further analysis.
* feat: add sqlalchemy vectorizer field

* chore: add installing extras to ci

* chore: simplify interface, add simple docs

* feat: allow arbitrary primary keys on parent model

* docs: update docs with simplified vectorizer field

* chore: rename VectorizerField to Vectorizer

* chore: update alembic exclusion mechanism

* docs: update docs with review comments

* chore: align automatic table name with create_vectorizer

* chore: add option for any relationship properties

* chore: setup class event based rather than lazy so relationship works on first query

* chore: update to embedding_relationship

* chore: refactor tests add vcr mocks

* chore: rename to vectorizer_model; cleanup

* chore: fix uv lock

* chore: remove dummy key
* Code for evaluating open source embedding models

* Update examples/finding_best_open_source_embedding_model/best_embedding_model_rag_app.ipynb

Co-authored-by: James Guthrie <JamesGuthrie@users.noreply.github.com>
Signed-off-by: Hervé Ishimwe <75611379+ihis-11@users.noreply.github.com>

* Changed from GPT 4o-mini to Llama3.2 & Addressed the comments

---------

Signed-off-by: Hervé Ishimwe <75611379+ihis-11@users.noreply.github.com>
Co-authored-by: James Guthrie <JamesGuthrie@users.noreply.github.com>
Signed-off-by: Matthew Peveler <mpeveler@timescale.com>
@CLAassistant

CLAassistant commented Dec 26, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 10 committers have signed the CLA.

✅ Tostino
✅ adolsalamanca
✅ cevian
✅ jgpruitt
✅ smoya
❌ github-actions[bot]
❌ Askir
❌ jackyliang
❌ ihis-11
❌ MasterOdin
You have signed the CLA already but the status is still pending? Let us recheck it.

@Tostino
Author

Tostino commented Dec 26, 2024

Well, spent some time rebasing this branch this morning because I had a bit of time. It looks like recent changes have broken the build process again on my machine, so no progress made towards fixing whatever the issue is.

I've got a baby on the way in the next few days, so I likely won't get back to this now for a while. This PR is likely dead unless someone else wants to take it up.

@alejandrodnm
Contributor

Well, spent some time rebasing this branch this morning because I had a bit of time. It looks like recent changes have broken the build process again on my machine, so no progress made towards fixing whatever the issue is.

I've got a baby on the way in the next few days, so I likely won't get back to this now for a while. This PR is likely dead unless someone else wants to take it up.

Congrats on the baby 🙏🏼 sending all the good vibes to you so everything goes perfectly (I've got 2, so I know how it is).

I can try taking it up later, or maybe someone from the team will. Just one thing, could you leave a comment explaining what we'll need to do next?

Thanks, and congrats again.

@Tostino
Author

Tostino commented Dec 27, 2024

@alejandrodnm Very appreciated. This is the first, so I'm sure I'll be in for some surprises.

I figured out the changes with the dev process and got the build working again on my machine so I could make one last attempt at it.

Anyway, I believe it has to do with this: openai/openai-python#1596 (more details here: BerriAI/litellm#6592 (comment)), but I can't confirm at this point.
Essentially, the httpx project has some issues that need to be addressed. OpenAI said they would try to prioritize fixing the issues upstream about 6 months ago; it's still an open issue.

The issue can be observed by using this profiled version of the execute_with_cancellation method in openai.py:

# Imports needed to run this profiled variant (plpy and is_query_cancelled
# come from the plpython environment and the rest of openai.py):
import asyncio
import cProfile
import io
import pstats
from typing import Any, Awaitable, Callable, Dict

import openai

def execute_with_cancellation(plpy, client: openai.AsyncOpenAI, async_func: Callable[[openai.AsyncOpenAI, Dict[str, Any]], Awaitable[Dict[str, Any]]], **kwargs) -> Dict[str, Any]:

    async def main():
        # 1. Start the profiler
        profiler = cProfile.Profile()
        profiler.enable()

        # 2. Create your async task
        task = asyncio.create_task(async_func(client, kwargs))

        # 3. Periodically check for query cancellation
        while not task.done():
            if is_query_cancelled(plpy):
                task.cancel()
                raise plpy.SPIError("Query cancelled by user")
            await asyncio.sleep(0.01)

        # 4. Wait for the result
        response = await task

        # 5. Stop the profiler
        profiler.disable()

        # 6. Print the stats to PostgreSQL logs
        stream = io.StringIO()
        sortby = pstats.SortKey.CUMULATIVE  # or pstats.SortKey.TIME, etc.
        stats = pstats.Stats(profiler, stream=stream).sort_stats(sortby)
        stats.print_stats()

        plpy.info("=== cProfile Results for OpenAI Call ===")
        plpy.info(stream.getvalue())

        return response

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(main())

Here is my example query:

SELECT ai.openai_chat_complete( 
            api_key=>'none'
            , base_url => 'http://localhost:8001/v1'
            , model=>'Qwen/Qwen2.5-0.5B-Instruct'
            , max_tokens=>1
            , messages=>jsonb_build_array
              ( jsonb_build_object('role', 'user', 'content', 'reply with `hello` and nothing else.')
              )
            ) FROM generate_series(1,3);

And you can use this dummy server for testing, which removes all server-side delay:

from fastapi import FastAPI, Request
from pydantic import BaseModel
from typing import List, Optional
import logging
from datetime import datetime
import time

# Configure logging
logging.basicConfig(
    format='%(asctime)s - %(levelname)s - %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger(__name__)

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    n: Optional[int] = None
    stream: Optional[bool] = None
    stop: Optional[List[str]] = None
    max_tokens: Optional[int] = None
    presence_penalty: Optional[float] = None
    frequency_penalty: Optional[float] = None
    logit_bias: Optional[dict] = None
    user: Optional[str] = None

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = (time.time() - start_time) * 1000
    logger.info(
        f"Path: {request.url.path} | "
        f"Method: {request.method} | "
        f"Processing Time: {process_time:.2f}ms"
    )
    return response

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    logger.info(f"Received chat completion request for model: {request.model}")

    response = {
        "id": "chat-aabeda70641c4c88ac523bddb544ad87",
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "usage": {
            "total_tokens": 39,
            "prompt_tokens": 38,
            "completion_tokens": 1
        },
        "object": "chat.completion",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "hello",
                    "tool_calls": [],
                    "function_call": None
                },
                "logprobs": None,
                "stop_reason": None,
                "finish_reason": "length"
            }
        ],
        "created": int(time.time()),
        "prompt_logprobs": None,
        "system_fingerprint": None
    }

    logger.info(
        f"Completed request | "
        f"Tokens used: {response['usage']['total_tokens']} | "
        f"Response length: {len(response['choices'][0]['message']['content'])}"
    )

    return response

if __name__ == "__main__":
    import uvicorn
    logger.info("Starting server...")
    uvicorn.run("benchmark_pgai_server:app", host="0.0.0.0", port=8001, workers=4)

And finally, here is the profile I collected from the above: openai-profile-01.txt

The 2nd (3rd, 4th, etc.) runs being slow only happens when we reuse the client. We could temporarily disable reusing the client and force recreation every time until an upstream fix is in place. Another option is to just ignore it for now, because most requests will be hitting a remote endpoint with enough latency that the little bit extra won't matter anyway.
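
If we went the temporary-disable route, the gate could be as small as this (sketch only; the environment-variable name and wrapper are hypothetical, and get_or_create_client is this branch's existing helper):

import os

import openai

# Hypothetical escape hatch: set PGAI_OPENAI_REUSE_CLIENT=false to force a
# fresh client per call until the upstream httpx latency issue is fixed.
REUSE_CLIENT = os.environ.get("PGAI_OPENAI_REUSE_CLIENT", "true").lower() == "true"


def get_client(plpy, GD, api_key, base_url=None):
    if not REUSE_CLIENT:
        return openai.AsyncOpenAI(api_key=api_key, base_url=base_url)  # no caching
    return get_or_create_client(plpy, GD, api_key=api_key, base_url=base_url)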

@Tostino
Author

Tostino commented Dec 28, 2024

So I spent more time today and replaced the async implementation with an equivalent sync implementation. It made absolutely no difference in behavior between async and sync: the second call to the client (even when called just twice within the same function call) always has that extra ~45ms delay with both implementations.

When caching the client:

  1. (call 1) Initial client setup is slow (~20ms) and takes a lot of CPU + Using that client for a (local) call the first time is very quick (~1ms)
  2. (call 2) No client setup (fetched from cache) ( ~1ms) + Further calls are slow (~45ms) but don't take much CPU
  3. (call 3) No client setup (fetched from cache) ( ~1ms) + Further calls are slow (~45ms) but don't take much CPU
  4. (call...) ...
  5. (call 99) No client setup (fetched from cache) ( ~1ms) + Further calls are slow (~45ms) but don't take much CPU
  6. (call 100) No client setup (fetched from cache) ( ~1ms) + Further calls are slow (~45ms) but don't take much CPU

~6s per 100 calls to openai_chat_completion, ~5% CPU per connection (i.e. you can easily have 100+ connections to the server running inference without killing performance)

When creating a new client:

  1. (call 1) Client setup slow and takes a lot of CPU every call (~20ms) + Using that client for a (local) call the first time is very quick (~1ms)
  2. (call 2) Client setup slow and takes a lot of CPU every call (~20ms) + Using that client for a (local) call the first time is very quick (~1ms)
  3. (call 3) Client setup slow and takes a lot of CPU every call (~20ms) + Using that client for a (local) call the first time is very quick (~1ms)
  4. (call ...) ...
  5. (call 99) Client setup slow and takes a lot of CPU every call (~20ms) + Using that client for a (local) call the first time is very quick (~1ms)
  6. (call 100) Client setup slow and takes a lot of CPU every call (~20ms) + Using that client for a (local) call the first time is very quick (~1ms)

~3s per 100 calls to openai_chat_completion, ~100% CPU per connection (i.e. you can only have NUM_CORES connections to the server before saturating the CPU)

So even though each individual connection finishes its calls twice as fast when a new client is created each time, the server gets overloaded by just a handful of users calling these functions.
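
A rough back-of-envelope from the numbers above: the cached client averages ~60ms/call at ~5% CPU per connection, so on the order of 20 connections fit on one core before CPU saturates; recreating the client averages ~30ms/call at ~100% CPU per connection, so a single connection saturates a core. Caching trades roughly 2x per-call latency for roughly 20x more concurrent connections per core.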

I hate that python performance is just so damn terrible to begin with that this is something we have to spend time optimizing.

With that said, I think the only real option is to stick with caching the client (just like this branch does now), because it doesn't knock over otherwise healthy servers when users decide to actually use the functions.
I'm sure someone can figure out what is going on with httpx and why it introduces extra latency... but I sure failed to find a solution.

@Tostino
Author

Tostino commented Dec 29, 2024

Uh, looks like I made a mistake in rebasing and pulled in some unexpected changes somehow. Gotta get that fixed right now.

Edit: Not sure...
My branch has the correct number of files / lines changed when I compare it to main: main...Tostino:pgai:fix_openai_redux

But on this PR it says there are 26k lines changed, which is very wrong.
