
Port to Quart #503

Merged
merged 8 commits into Azure-Samples:main on Aug 9, 2023
Conversation

@pamelafox (Collaborator):

Purpose

This PR ports the app to Quart, a Python ASGI web framework.
Flask is a WSGI framework, which is synchronous: when a request comes in, the server must finish processing it before it can handle another request. We have added support for multiple gunicorn workers/threads, but they can still easily be blocked by a long-running request. Each /chat request makes 4 API calls, which take from 0.2 to 7 seconds each.

Quart is designed with the same API as Flask, with a few changes needed for asynchronous support. Most of my changes are the addition of await/async. For the Azure SDK calls, I used the async versions of the clients (imported from .aio). For the OpenAI API calls, I used the async versions of the functions (acreate instead of create).
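
To illustrate the shape of these changes, here is a minimal sketch of a Quart route using the async clients. The route, index, credential, and deployment names are hypothetical, not the repo's actual code:

```python
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient  # async client from the .aio package
from quart import Quart, jsonify, request

app = Quart(__name__)

# Hypothetical endpoint/index/key, for illustration only.
search_client = SearchClient(
    endpoint="https://example.search.windows.net",
    index_name="docs",
    credential=AzureKeyCredential("<search-key>"),
)

@app.post("/chat")
async def chat():
    body = await request.get_json()  # request parsing is awaited in Quart

    # Async search call; the aio client's results are iterated with `async for`.
    r = await search_client.search(body["question"], top=3)
    sources = [doc["content"] async for doc in r]

    # acreate instead of create: the call yields control while waiting on the
    # API, so the worker can serve other requests in the meantime.
    completion = await openai.ChatCompletion.acreate(
        engine="chat",  # hypothetical Azure OpenAI deployment name
        messages=[
            {"role": "system", "content": "Answer using only these sources:\n" + "\n".join(sources)},
            {"role": "user", "content": body["question"]},
        ],
    )
    return jsonify({"answer": completion["choices"][0]["message"]["content"]})
```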

For deployment on App Service, I use gunicorn with the uvicorn worker.
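
For reference, a hedged sketch of what that pairing can look like in a gunicorn config file (gunicorn.conf.py is plain Python); the specific values here are assumptions, not the repo's actual configuration:

```python
# gunicorn.conf.py (sketch): run the Quart ASGI app under gunicorn's
# process manager using uvicorn's ASGI-capable worker class.
import multiprocessing

bind = "0.0.0.0:8000"
worker_class = "uvicorn.workers.UvicornWorker"
# A common starting point; tune per App Service SKU.
workers = (multiprocessing.cpu_count() * 2) + 1
timeout = 230  # App Service cuts requests off at 230 seconds anyway
```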

Does this introduce a breaking change?

[X] Yes
[ ] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[X] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • `app/start.sh` should start the app
  • `azd up` should deploy the app
  • `python3 -m pytest` should run the tests

@pamelafox pamelafox marked this pull request as ready for review August 6, 2023 00:06
```diff
  filter=filter,
- query_type=QueryType.SEMANTIC,
+ query_type=QueryType.SIMPLE,
```
Collaborator:

Why this change?

@pamelafox (Author):

Good catch; that was unintentionally committed. It's a change I make locally when I run out of semantic search credits for the month. Reverted.

@pamelafox (Author):

Btw, the creator of Quart took a look at this PR and says the Quart usage looks good to him:
https://fosstodon.org/@pgjones/110842739565496118

@srbalakr srbalakr requested a review from chuwik August 7, 2023 17:14
Collaborator:

Can this be removed?

@pamelafox (Author):

This was something I added for mypy; it was confused without it. Generally Python packages should have an __init__.py.
See https://docs.python.org/3/tutorial/modules.html#packages
Let me know if you think there's a reason not to have one.

Collaborator:

Got it. My impression was that it would be implicitly created at runtime, and the app worked previously without it. Let's keep it if that's the good practice.

```diff
  else:
-     results = [doc[self.sourcepage_field] + ": " + nonewlines(doc[self.content_field]) for doc in r]
+     results = [doc[self.sourcepage_field] + ": " + nonewlines(doc[self.content_field]) async for doc in r]
```
Collaborator:

Curious: does async iteration yield faster results? Our list size does not exceed 10 or 20 at most!

@pamelafox (Author):

I assume it's not a big change in performance, but I had to make the change for API compatibility reasons once I started using the aio client. I've done a little load testing so far but no deep profiling; I need to try @anthony's OpenTelemetry integration.
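
A toy illustration of that point: `async for` doesn't speed up the loop itself, it just lets the event loop do other work while each result is awaited. The fake result generator below is ours, not the SDK's:

```python
import asyncio

async def fake_search_results():
    # Stands in for the aio SearchClient's async paged results.
    for i in range(3):
        await asyncio.sleep(0.1)  # simulates a network round trip per item/page
        yield {"sourcepage": f"page{i}.pdf", "content": f"text {i}"}

async def main():
    r = fake_search_results()
    # Same comprehension shape as the diff above, with `async for`.
    results = [doc["sourcepage"] + ": " + doc["content"] async for doc in r]
    print(results)

asyncio.run(main())
```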

@srbalakr srbalakr self-requested a review August 7, 2023 19:15
@vrajroutu commented Aug 8, 2023:

@pamelafox Could you please merge the streaming feature into the quart branch? Streaming is significantly faster and begins generating responses immediately, as opposed to waiting for the entire response to be generated. Additionally, I've observed a few issues when enabling streaming alongside asynchronous processing.
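
For context, streaming out of a Quart route would look roughly like the sketch below with the pre-1.0 openai client. This is an assumption about the approach, not the PR's (or any later branch's) actual code, and the route and deployment names are hypothetical:

```python
import openai
from quart import Quart, request

app = Quart(__name__)

@app.post("/chat_stream")
async def chat_stream():
    body = await request.get_json()

    async def generate():
        # stream=True makes acreate return an async generator of delta chunks.
        chunks = await openai.ChatCompletion.acreate(
            engine="chat",  # hypothetical deployment name
            messages=body["messages"],
            stream=True,
        )
        async for chunk in chunks:
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

    # Quart can stream an async generator directly as the response body.
    return generate(), 200, {"Content-Type": "text/plain"}
```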

@BitcoinCode:

@pamelafox Thank you, this is really useful. Could you please shed some light on the following, as it's important for me to understand clearly:

1. What are the 4 API calls made when a user submits a question?
2. How many approaches does the application use? To the best of my understanding it uses the chatreadretrieveread.py file?
3. Is there any technical documentation that would help in understanding how the application works end to end?

Appreciate your help on this.

@BitcoinCode:

I've implemented the Quart changes by @pamelafox, but the performance is still the same: if two users are using the application, it answers some questions, then just gets stuck on generating an answer, and then shows the JSON error. The only thing I can think of is that the Azure subscription is in the EU but the repo is deployed to East US; would this contribute to the issue?

I have also upgraded the App Service plan from B1 to B3 and manually scaled out to 3 instances, as autoscale is greyed out.

Is there anything else I can do to improve performance, assuming around 30 users will be accessing the app at the same time?

```
Failed to forward request to http://000.000.000.00:8000.
Encountered a System.Threading.Tasks.TaskCanceledException
exception after 300221.453ms with message:
The request was canceled due to the configured
HttpClient.Timeout of 300 seconds elapsing.
Check application logs to verify the application is properly handling HTTP traffic.
```

Appreciate your help.

(Screenshots: request timeout and JSON error.)

@pamelafox (Author):

I've been doing additional load testing on this branch, and though I think it's an improvement, there are still some hurdles to getting 0% errors.

First I ran into the issue described here: openai/openai-python#371

There were two suggested workarounds there: setting the session yourself and using tenacity to retry. I implemented the second suggestion and re-ran the load tests. I no longer encountered the "Connection reset" error, but I still had errors with responses taking too long and timing out.
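
For anyone hitting the same error, a minimal sketch of that tenacity approach, assuming the pre-1.0 openai client; the wrapper name and retry policy are ours, not the PR's exact code:

```python
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

# Retry only on connection errors; back off randomly between attempts.
@retry(
    retry=retry_if_exception_type(openai.error.APIConnectionError),
    wait=wait_random_exponential(min=1, max=10),
    stop=stop_after_attempt(3),
)
async def chat_completion_with_retry(**kwargs):
    return await openai.ChatCompletion.acreate(**kwargs)
```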

I realized that App Service has a hard limit of 230 seconds, which corresponded with when errors occurred:
https://learn.microsoft.com/en-us/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds

I then tried specifying a request_timeout parameter for the OpenAI API request, but that caused every request to time out, due to this issue:
openai/openai-python#387

My proposal for next steps is to implement retry on the client in JavaScript, but I am still discussing this with other Azure folks for ideas on optimal configuration.

@pamelafox (Author):

@BitcoinCode What region is your OpenAI instance deployed in? My load tests have been in South Central US so far, and I have been recommended to try one of these regions:
Australia East
Canada East
East United States 2
Japan East
United Kingdom South

I'll re-provision to one of those and re-run load tests on this branch.

@vrajroutu commented Aug 9, 2023:

@pamelafox I have also experimented with this branch. It performed satisfactorily for me, demonstrating noticeable improvements in performance. However, I am streaming the responses, which could pose challenges when using Quart and asynchronous requests.

@pamelafox (Author):

@vrajroutu I'm working on adding streaming (as an opt-in option) to Quart in another branch, as you requested. I'm curious what issues you foresee with adding streaming?
Also, glad to hear you've seen improvements with this branch!

@BitcoinCode:

> @BitcoinCode What region is your OpenAI instance deployed in? My load tests have been in South Central US so far, and I have been recommended to try one of these regions: Australia East, Canada East, East United States 2, Japan East, United Kingdom South.
>
> I'll re-provision to one of those and re-run load tests on this branch.

My OpenAI instance is in East US. This is interesting, as I have another OpenAI instance in West Europe, and that instance doesn't seem to have this issue even with Flask. I will implement Quart on it just in case.

Thank you.

@BitcoinCode:

> @pamelafox I have also experimented with this branch. It performed satisfactorily for me, demonstrating noticeable improvements in performance. However, I am streaming the responses, which could pose challenges when using Quart and asynchronous requests.

Would you please share the code for streaming the responses?

@pamelafox pamelafox merged commit 9da71ef into Azure-Samples:main Aug 9, 2023
@pamelafox (Author):

This is now merged. If you want to incorporate these changes in an existing branch, make sure you run `azd up`, since this requires re-provisioning to update the `appCommandLine`. To avoid re-indexing docs, you can delete the `postprovision` hooks section from `azure.yaml`.

@pamelafox pamelafox deleted the quart branch August 9, 2023 21:55
@pamelafox (Author):

@BitcoinCode I will create a PR when the streaming functionality is ready. You can watch this repo to be notified.

@BitcoinCode:

What worked for me as error mitigation and improved the user experience is the following:

1. I added request_timeout=15 to all acreate calls.
2. I added a function to click the retry button automatically before the answer error component appears.

By doing the above, users don't see the error and the retry happens automatically. It's still in the testing phase, but the results are a lot better: the request no longer takes 4 to 5 minutes to time out, and 15 seconds seems to be good enough to get a response.
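
For concreteness, a sketch of that first change with the pre-1.0 openai client; the function and deployment names are hypothetical:

```python
import openai

async def ask_with_timeout(messages):
    # request_timeout caps the call at 15 seconds, so a stuck request fails
    # fast and a client-side retry can kick in instead of waiting minutes.
    return await openai.ChatCompletion.acreate(
        engine="chat",  # hypothetical deployment name
        messages=messages,
        request_timeout=15,
    )
```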

@ShantanuNair:

> I've been doing additional load testing on this branch, and though I think it's an improvement, there are still some hurdles to getting 0% errors.
>
> First I ran into the issue described here: openai/openai-python#371
>
> There were two suggested workarounds there: setting the session yourself and using tenacity to retry. I implemented the second suggestion and re-ran the load tests. I no longer encountered the "Connection reset" error, but I still had errors with responses taking too long and timing out.
>
> I realized that App Service has a hard limit of 230 seconds, which corresponded with when errors occurred: https://learn.microsoft.com/en-us/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds
>
> I then tried specifying a request_timeout parameter for the OpenAI API request, but that caused every request to time out, due to this issue: openai/openai-python#387
>
> My proposal for next steps is to implement retry on the client in JavaScript, but I am still discussing this with other Azure folks for ideas on optimal configuration.

This was super helpful, @pamelafox. I'm running into these issues via LangChain, which uses the openai-python client. Can I ask: why didn't you go with the first solution of setting the session yourself? And has this issue been solved in openai-python? (I ask because of this PR.)

@pamelafox (Author):

@ShantanuNair I did actually end up using the first approach of setting the session:

```python
async with aiohttp.ClientSession() as s:
```

I didn't add it to the stream endpoint yet, as it doesn't seem needed based on my load tests so far, but I will keep testing.
I do not believe there has been a resolution in openai-python, based on what I've read of their changelog.
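
For anyone following along, a fuller sketch of that session-setting approach, assuming the pre-1.0 openai client's aiosession contextvar; the wrapper and deployment name are ours, not the repo's exact code:

```python
import aiohttp
import openai

async def chat(messages):
    # A fresh aiohttp session per request avoids reusing connections from a
    # stale pool; openai-python (pre-1.0) picks it up via its aiosession contextvar.
    async with aiohttp.ClientSession() as session:
        openai.aiosession.set(session)
        try:
            return await openai.ChatCompletion.acreate(
                engine="chat",  # hypothetical deployment name
                messages=messages,
            )
        finally:
            openai.aiosession.set(None)  # don't leave a closed session behind
```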

@ShantanuNair:

@pamelafox Appreciate your help! I thought the PR I linked allows you to pass in a ClientSession, similar to what you're doing? You are setting a new session on every request, right?

> I then tried specifying a request_timeout parameter for the OpenAI API request, but that caused every request to time out, due to this issue: https://github.com/openai/openai-python/pull/387

It makes me wonder how people are using the library in long-running chains in production. Also, do you ever notice a ton of 502 Bad Gateway errors when you send multiple requests with a large prompt token count? I do with 11-12k token inputs on gpt-3.5-turbo-16k, and I assumed it's because of LangChain's tenacity retries failing due to openai issue #371.

HughRunyan pushed a commit to RMI/RMI_chatbot that referenced this pull request Mar 26, 2024
* Quart draft

* Fix ask and test

* Quart deploying now

* Use semantic

* Get tests working

* Revert simple

* Typing fixes

* dont use pipe
vuculescu pushed a commit to vuculescu/azure-search-openai-demo that referenced this pull request Jan 27, 2025