[Bug] AWS Connection.py Using Private Botocore get_response Which Expects JSON But is Being Passed XML #1726
Comments
Hey @o-nikolas 👋, we also offer priority support for our sponsors.
I am willing to lean toward the official public-API-based refactor. Is it possible to share the release notes?
Nice :) We also think this is the "best" solution. I am not sure what you are asking by "release notes", though. You can find the documentation of boto3 here. Boto3 interacts with all (or at least most) AWS services; you can find the documentation specific to SQS here. The documentation is pretty well done, with specifications for all APIs along with examples. Feel free to ping us if you need any help/guidance.
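For reference, a minimal sketch of consuming messages through the public boto3 SQS API; the region and queue URL below are placeholders, not values from this thread:

```python
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

sqs = boto3.client("sqs", region_name="us-east-1")

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,  # long polling
)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # Delete the message after successful processing so it is not redelivered.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```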
I didn't know that we were using a non-public API!
Can you AWS experts have a look at this issue? celery/celery#8015 (comment)
Happy to help on this indeed, but it is not related to this issue, right? At least, to me, it does not look related. Have you, by any chance, started to work on the
The above issue is not the same as this one, since the cause of this issue was rolled back. However, after a quick look, the above does seem like another issue related to hand-crafting HTTP requests instead of simply using the boto3 SDK.
@auvipy - I noticed that this issue is marked for the 5.3.x release, which is due by June 30th. I'm curious whether the Kombu community is planning to address this in the next release. If not, is there anything we can do to get this prioritized?
We have a lot on the plate already, and this is a big overhaul, but I will try. The 5.3.x milestone will have several bug-fix releases.
@auvipy, I am from the Amazon MWAA team. I experimented with this and managed to do a PoC for a solution by making the async SQS connection call boto3 directly. Basically, my solution involves updating the `receive_message` method as follows:

```python
def receive_message(
    self, queue_url, number_messages=1, visibility_timeout=None,
    attributes=('ApproximateReceiveCount',), wait_time_seconds=None,
    callback=None
):
    def make_request(
        queue_url, number_messages, wait_time_seconds, visibility_timeout,
        attributes, callback
    ):
        # Build the arguments for boto3's SQS ReceiveMessage call.
        kwargs = {
            "QueueUrl": queue_url,
            "MaxNumberOfMessages": number_messages,
            "MessageAttributeNames": attributes,
            "WaitTimeSeconds": wait_time_seconds,
        }
        if visibility_timeout:
            kwargs["VisibilityTimeout"] = visibility_timeout
        resp = self.sqs_connection.receive_message(**kwargs)
        if callback:
            callback(resp)

    # Let kombu's event loop call `make_request` asynchronously.
    return self.hub.call_soon(
        make_request,
        queue_url, number_messages, wait_time_seconds, visibility_timeout,
        attributes, callback,
    )
```

I tested this and confirmed it fixes the issue. However, one thing that still needs to be tested is multi-threading and multi-processing. Based on my testing, the same process and thread execute both the …
Looking forward to hearing from you soon. I am happy to discuss this with you over a meeting to move things faster.
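For the multi-threading question above, one hypothetical way to verify the single-threaded behaviour (a debugging aid sketched here, not part of the PoC) is to log the process and thread identifiers at the scheduling site and inside `make_request`:

```python
import os
import threading

def log_context(label):
    # Print which process/thread executes this step.
    print(f"{label}: pid={os.getpid()} thread={threading.get_ident()}")

# Call log_context("schedule") right before `hub.call_soon(...)` and
# log_context("make_request") at the top of `make_request`; identical
# values confirm the hub runs both on the same thread and process.
```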
I think you have done a great job here. In the future we will switch to an asyncio-based approach, but I think we can start with what you have discussed so far. kombu/celery was designed back in 2009, when there was no standard async; the async features here are mostly monkey-patched versions which are not so reliable, so we would be better off with something more mainstream and standard. If you can come up with a draft PR, it will be easier for me to analyze it further alongside you.
Also, this could be a great alternative for the future: https://github.com/terrycain/aioboto3
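For reference, a minimal sketch of what SQS long polling could look like with aioboto3; this illustrates the library's async style under assumed placeholder values, and is not a proposed kombu patch:

```python
import asyncio

import aioboto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

async def poll() -> None:
    session = aioboto3.Session()
    async with session.client("sqs", region_name="us-east-1") as sqs:
        # The await yields control to the event loop while the long poll
        # (up to 20 seconds here) is in flight, instead of blocking it.
        resp = await sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            print(msg["Body"])

asyncio.run(poll())
```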
Thanks for the confirmation. I will publish a PR for this soon. One question I would like to ask is with regard to clean-up. I see a lot of unnecessary functions in the SQS async client. I am willing to delete them, as opposed to spending time re-implementing them using boto3. I confirmed they are not being used within Celery, but can you confirm that external dependencies are not supposed to access those functions directly? (There is still the concern that some libraries might use them regardless, e.g. what happened with kombu accessing internal functions of botocore, but I still believe it is better to clean up the code.)
It is always better to use the public API. Also, it would be great to get help from AWS folks to improve SQS support in kombu & celery; currently we are lacking on that front.
I will need to discuss this with my manager before making any commitment. Do you have anything specific you want to get improved, or is this just in general?
If you guys could guide us regarding best practices for SQS and the related parts of kombu & celery, it would be great.
…)" This reverts commit 862d0bc.
TL;DR - The use of boto3 in celery#1759 resulted in relying on blocking (synchronous) HTTP requests, which caused the performance issue reported in celery#1783.

`kombu` previously used to craft AWS requests manually, as explained in detail in celery#1726, which resulted in an outage when botocore temporarily changed the default protocol to JSON (before rolling back due to the impact on celery and airflow). To fix the issue, I submitted celery#1759, which changes `kombu` to use `boto3` instead of manually crafting AWS requests. This way, when boto3 changes the default protocol, kombu won't be impacted.

While working on celery#1759, I did extensive debugging to understand the multi-threading nature of kombu. What I discovered is that there isn't actual multi-threading in the true sense of the word, but an event loop that runs on the same thread and process and orchestrates the communication with SQS. As such, it didn't appear to me that there was anything to worry about in my change, and the testing I did didn't discover any issue. However, it turns out that while kombu's event loop doesn't have actual multi-threading, its [reliance on pycurl](https://github.com/celery/kombu/blob/main/kombu/asynchronous/http/curl.py#L48) (and thus libcurl) meant that the requests to AWS were being done asynchronously. On the other hand, boto3 requests are always done synchronously, i.e. they are blocking requests.

The above meant that my change introduced blocking on the event loop of kombu. This is fine in most cases, since requests to SQS are pretty fast. However, when long polling is used, a call to SQS's ReceiveMessage can last up to 20 seconds (depending on the user configuration).

To solve this problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I borrowed the implementation of `get_response` from botocore and changed the code such that the protocol is hard-coded to `query`. This way, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own requests and, after my change, parses them with a protocol hard-coded to match those crafted requests.

This shouldn't be the final solution; it is more of a workaround that does the job for now. There are two problems with this approach:

1. It doesn't address the fundamental problem discussed in celery#1726, namely that `kombu` uses parts of botocore that are effectively internal, such as the `StreamingBody` class.
2. It still makes an assumption, namely that the communication protocol is the `query` protocol. While this is true, and likely to remain true for some time, in theory nothing stops SQS (the backend, not the client) from changing the default protocol, rendering the hard-coding of the protocol to `query` in the new `get_response` method problematic.

As such, the long-term solution should be to rely completely on boto3 for any communication with AWS, while ensuring that all requests are async in nature (non-blocking). This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
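The long-polling hazard described above generalizes: any synchronous SDK call made on an event-loop thread stalls everything else scheduled on that loop. As a hedged sketch of the non-blocking direction this message points to (using plain asyncio rather than kombu's hub, with a placeholder region), the blocking boto3 call can be pushed onto a worker thread:

```python
import asyncio

import boto3

def receive_blocking(sqs, queue_url):
    # boto3 is synchronous: with long polling enabled, this call can
    # block for up to WaitTimeSeconds (20 seconds here).
    return sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )

async def receive_nonblocking(queue_url):
    sqs = boto3.client("sqs", region_name="us-east-1")
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker
    # thread, so the event loop keeps servicing other tasks meanwhile.
    return await asyncio.to_thread(receive_blocking, sqs, queue_url)
```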
TL;DR - (The background is the same as in the message above.) To solve the blocking problem, I rolled back my earlier changes and, instead, to address the issue reported in celery#1726, I changed the `AsyncSQSConnection` class such that it crafts either a `query` or a `json` request depending on the protocol used by the SQS client. Thus, when botocore changes the default protocol of SQS to JSON, kombu won't be impacted, since it crafts its own requests using whichever protocol the client is configured with. This shouldn't be the final solution; it is more of a workaround that does the job for now. The final solution should be to rely completely on boto3 for any communication with AWS, while ensuring that all requests are async in nature (non-blocking). This, however, is a fundamental change that requires a lot of testing, in particular performance testing.
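A hedged sketch of the protocol-dependent request crafting described above (simplified, not the actual `AsyncSQSConnection` code): botocore exposes the protocol on the client's service model, and the request headers and body can be built accordingly.

```python
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

sqs = boto3.client("sqs", region_name="us-east-1")
protocol = sqs.meta.service_model.protocol  # 'query' or 'json'

if protocol == "json":
    # AWS JSON protocol: operation named in X-Amz-Target, JSON body.
    headers = {
        "Content-Type": "application/x-amz-json-1.0",
        "X-Amz-Target": "AmazonSQS.ReceiveMessage",
    }
    body = '{"QueueUrl": "%s", "MaxNumberOfMessages": 1}' % QUEUE_URL
else:
    # AWS Query protocol: form-encoded Action/Version parameters,
    # with an XML response body.
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    body = f"Action=ReceiveMessage&Version=2012-11-05&QueueUrl={QUEUE_URL}"
```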
Hello Kombu folks!
tl;dr
Kombu is using an internal/private method, get_response, from Botocore, which recently changed behaviour due to other Botocore updates to the wire protocol used for SQS; this breaks Kombu.
The change has been temporarily reverted, but Kombu cannot depend on stable behaviour of this method going forward, as the changes will soon be applied again.
Full Context/Explanation
The implementation of the AWS asynchronous connection is currently using the non-public method `get_response` from `botocore`; here is an example:

kombu/kombu/asynchronous/aws/connection.py, lines 232 to 236 in 3e098dc

`get_response` is vendored in here:

kombu/kombu/asynchronous/aws/ext.py, lines 1 to 9 in 3e098dc
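For context, the vendoring amounts to importing botocore's private helper directly; a sketch of the pattern (the actual `ext.py` contents are at the permalink above):

```python
# kombu/asynchronous/aws/ext.py (pattern sketch)
# `get_response` lives in botocore's internals and is not part of its
# public, stability-guaranteed API, which is the crux of this issue.
from botocore.response import get_response  # noqa: F401
```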
Botocore recently switched to JSON for the wire protocol it uses to communicate with SQS here (note: the commit URL doesn't expand nicely, so you'll need to follow the link).
Kombu manually builds AWS Query API requests, whose responses are XML, instead of using something like the boto3 SDK client, to communicate with SQS:

kombu/kombu/asynchronous/aws/connection.py, lines 185 to 203 in 3e098dc
The resulting XML responses are then passed to the aforementioned `get_response` for parsing, which now expects JSON due to the Botocore change linked above. In this situation, `get_response` returns a response object with an empty body, which Kombu reads as a legitimate empty result. This means that a request to get messages, for example, will always return empty; it fails silently.

Botocore has reverted the change here, but this is only a temporary fix, as JSON is the preferred wire protocol.
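To make the silent failure concrete, here is a hedged, simplified illustration (not Kombu's exact code path) of an XML payload being handed to botocore's JSON protocol parser: the body cannot be parsed as JSON, so no `Messages` key ever appears in the result.

```python
import botocore.session
from botocore.parsers import ResponseParserFactory

session = botocore.session.get_session()
op = session.get_service_model("sqs").operation_model("ReceiveMessage")

# An XML (Query-protocol) response body, as returned for Kombu's
# hand-crafted requests.
xml_body = b"<ReceiveMessageResponse><ReceiveMessageResult>...</ReceiveMessageResult></ReceiveMessageResponse>"

# The parser a JSON-protocol service model would select.
parser = ResponseParserFactory().create_parser("json")
parsed = parser.parse(
    {"status_code": 200, "headers": {}, "body": xml_body},
    op.output_shape,
)
print(parsed.get("Messages"))  # None: the XML payload is silently dropped
```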
Possible Paths Forward

1. Refactor Kombu to use the official public boto3/botocore API instead of hand-crafting requests.
   - pros: Seamless support from a public interface which will handle the wire protocol migration smoothly.
   - cons: This is a large refactoring, and presumably there is some historical reason, which we're unaware of, for why Kombu hand-crafts requests; it is a complex piece of code and there was likely a reason for the effort to build it this way.

2. Vendor in `get_response` (which was done in this PR here).
   - pros: Smaller scope for the change, which could likely be completed sooner; the code worked in the past.
   - cons: More complex hand-crafted request/response handling remaining in Kombu rather than simply using the boto3 SDK.
Thanks in advance for your time 🙏
CC: