Skip to content

HTTP interface updates (part 2) #131

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Feb 16, 2023
Merged

HTTP interface updates (part 2) #131

merged 23 commits into from
Feb 16, 2023

Conversation

jonemo
Copy link
Contributor

@jonemo jonemo commented Feb 4, 2023

Updates the HTTP client related interfaces and implementations to match the most recent spec.

A few high level notes for orientation:

  • fields & fieldlists interfaces and implementation #122 was a precurser PR covering only HTTP headers/fields. This PR starts using the classes added in fields & fieldlists interfaces and implementation #122.
  • Much of the line count of the diff is due to trivial renaming, for example request.url to request.destination.
  • The spec defines the request.body of requests and responses as an iterable of byte chunks. Because we currently focus on async clients, I chose to make this an AsyncIterable. This caused me a lot of head scratching and led to smithy_python.async_utils.async_list which can be used to turn a plain list into a value for request.body.
  • The sync awscrt client has been removed and everything in smithy_python._private.http.crt got consolidated accordingly. Updating crt.py to conform to the new interface is a big part of the diff.
  • Some scattered use of exceptions in the http module is now consolidated to use smithy_python.exceptions.SmithyHTTPException.
  • I removed any test code that builds on integration tests pytest fixture for running local http server #128 from this PR.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jonemo jonemo force-pushed the http-interface-update-2 branch 2 times, most recently from 0986894 to cb87f50 Compare February 7, 2023 07:24
"""An asynchronous HTTP client interface."""

def __init__(self, *, client_config: HttpRequestConfiguration | None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the return type for __init__ functions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just run a test that does AST crawling and validates these kinds of things for us so it can be automated. It doesn't matter which way we go, it should just be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlm6693 If this matters to you, please update the repo's config so that mypy doesn't flag it up as an error. Once that's working I will gladly drop all the return type annotations for __init__s. Currently, if I remove these everywhere, I get this:

➜  smithy-python ✗ ./pants check ::
11:38:14.74 [ERROR] Completed: Typecheck using MyPy - mypy failed (exit code 1).
python-packages/smithy-python/smithy_python/_private/http/crt.py:31: error: Function is missing a return type annotation
python-packages/smithy-python/smithy_python/_private/http/crt.py:31: note: Use "-> None" if function does not return a value
python-packages/smithy-python/smithy_python/_private/http/crt.py:41: error: Function is missing a return type annotation
python-packages/smithy-python/smithy_python/_private/http/crt.py:41: note: Use "-> None" if function does not return a value
python-packages/smithy-python/smithy_python/_private/http/crt.py:185: error: Call to untyped function "_AWSCRTEventLoop" in typed context
python-packages/smithy-python/smithy_python/_private/http/crt.py:205: error: Call to untyped function "AwsCrtHttpResponse" in typed context
Found 4 errors in 1 file (checked 30 source files)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok I'd seen a comment about this before. I don't really care one way or the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous discussion was here: #122 (comment)

There and here, I initially dropped it in response to the comment just to get hit with a mypy error and bringing it back.

first byte over an established, open connection before timing out.
"""

force_http_2: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd bias against having booleans like this for configuration. It will likely be more scalable in the long term to have a single protocol enum that supports a subset like HTTP/1.1, HTTP/2, Auto at the start. That will enable us to integrate QUIC(H/3) without a new configuration option and having to worry about how they all interact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you are describing approximately matches the spec for HttpVersionPolicy in the "Connections and Connection Pooling" doc of the reference architecture specification. For this PR I have set the boundary of what's included at the "Smithy Requests and Responses" doc (and still ended up with this huge diff).

This HttpClientConfiguration only exists in this PR because I wanted to keep the client-level config concept that already existed in smithy_python._private.crt.AwsCrtHttpSessionConfig. It matches conceptually to the ConnectConfiguration in "Connections and Connection Pooling", and I would prefer if we defer the details of this to a future task where that doc gets implemented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. If we're going to wait on doing the implementation properly, I'd almost prefer we keep this as an empty class then with a TODO comment rather than doing something we'd consider wrong.

"""An asynchronous HTTP client interface."""

def __init__(self, *, client_config: HttpRequestConfiguration | None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just run a test that does AST crawling and validates these kinds of things for us so it can be automated. It doesn't matter which way we go, it should just be consistent.



@pytest.fixture
def aws_request() -> HTTPRequest:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have anything aws_* related in the smithy_python package. Let's try to keep naming/example domains generic unless we're in aws-smithy-python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change the name if you think that's necessary.

Note that the only thing AWS specific about this is that the request gets made to an AWS owned domain. I think it's generally a good idea to use hosts that we trust to remain available and to not do weird things with request logs in tests.

I also proposed an alternative that would do away with the need to make external requests at all in #128.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave the URL and change the fixture to something like sample_request then. While I agree with the domain safety concern, I'd like to keep as many references out of the smithy-python client as possible. This is mostly to prevent muddying the barrier between both packages.

We've already had a couple PRs that struggled with that and I'd prefer to avoid any visual queues that might perpetuate the problems if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 518cd6a

Comment on lines 22 to 23
config = AwsCrtHttpClientConfig()
session = AioHttpClient(client_config=config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Why are we passing a CRT config to an AioHttpClient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note about decoupling this package from AWS.

Copy link
Contributor Author

@jonemo jonemo Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we passing a CRT config to an AioHttpClient?

That's a mistake. However, it highlights that there currently isn't much of a point in having different [HTTPClientName]Config classes for each http client. Maybe that will start being important in future, or maybe I should just consolidate this into a single HTTPClientConfig.

Same note about decoupling this package from AWS.

Is this a general statement about moving the AWS Common Runtime HTTP client from smithy-python to aws-smithy-python? I didn't consider that because the CRT http client was already here. But is the CRT's http client truly AWS specific? Or is it just another generic HTTP client that can exist next to aiohttp and any others we will implement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since writing the comment above I read the "Connections and Connection Pooling" doc again and got reminded that all the details of the various config objects are covered there. I wrote up a separate task to implement all these config objects.

Comment on lines 108 to 112
def test_uri_password_but_no_username() -> None:
uri = URI(host="test.aws.dev", password="def")
assert uri.password == "def"
# the password is ignore if no username is present
# the password is ignored if no username is present
assert uri.netloc == "test.aws.dev"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? I don't think there's anything precluding this, I know Basic auth specifically allows for a password without an explicit username.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference architecture doc mentions RFC 3987 as the guideline for URI. Section 3.2.1 declares the concept of any part of the URI being a "password" deprecated:

Use of the format "user:password" in the userinfo field is deprecated.

While not explicitly prohibited, my interpretation of this sentence is that something other than the empty string needs to appear before the first (optional) colon character:

The userinfo subcomponent may consist of a user name and, optionally, scheme-specific information about how to gain authorization to access the resource.

Of course, that's the RFC and then there's reality. I erred towards staying close to the RFC since that's what the RA doc references.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I know we've had this issue even recently in regards to proxies that still use userinfo for auth. Corporations consider it "safe" because it never leaves their internal firewall. There's also libraries that will strip this from the visible URL and encode it in the Authorization header.

If we don't support it, it should be all or nothing. This specific case of not allowing :password is an odd middle ground. I'm in favor of making it possible (but difficult to do wrong) rather than completely removing the functionality. I'll concede on that though if there's a strong opinion to the opposite.

@jonemo jonemo force-pushed the http-interface-update-2 branch from cb87f50 to fd2a1c2 Compare February 8, 2023 00:47
@jonemo jonemo force-pushed the http-interface-update-2 branch from fd2a1c2 to 264671b Compare February 8, 2023 00:49
@jonemo jonemo force-pushed the http-interface-update-2 branch from 264671b to 76d57fb Compare February 8, 2023 01:06
Comment on lines 165 to 186
@dataclass(kw_only=True)
class HttpClientConfiguration:
"""Client-level HTTP configuration.

:param force_http_2: How long, in seconds, the client will attempt to read the
first byte over an established, open connection before timing out.
"""

force_http_2: bool = False
Copy link
Contributor

@JordonPhillips JordonPhillips Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for this are wrong. But also, configuration should ideally be per-request. For libraries that have http version as session configuration (such as aiohttp), the implementation of HttpClient could just construct and cache the necessary underlying sessions as needed. That being said, do we even need this yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d113c5e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You fixed the docs but didn't address the remainder of the comment.

@dataclass
class Endpoint(interfaces.http.Endpoint):
url: interfaces.http.URI
headers: interfaces.http.HeadersList = field(default_factory=list)
url: interfaces.URI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May as well fix this while we're updating all of this anyway

Suggested change
url: interfaces.URI
uri: interfaces.URI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced "url" with "uri" everywhere (where it's possible) in 3db2f53

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this was originally specified as URL and not URI is because an endpoint in the context of HTTP is explicitly a location. It constrains the usage of the broader URI concept which is an arbitrary identifier.

We can move to uri here if we feel it's less confusing but we're leaking some context from outside the transport stack into the HTTP module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move to uri here if we feel it's less confusing but we're leaking some context from outside the transport stack into the HTTP module.

In this context we could also have a discussion about what should be inside vs outside the HTTP module. I made a half-hearted attempt to sort things into the right place by moving interfaces.http.URI to interfaces.URI but then stopped to not make the diff even bigger. If I understand correctly, all of the following should be outside the HTTP module:

  1. URI
  2. Fields, Field, FieldPosition
  3. Endpoint, EndpointParam, EndpointResolver

I can take care of (1) and (2) in this PR or in a follow-up PR if you want. (3) should probably be left for whoever implements the Endpoints component of the reference architecture, I assume there will be a separate Endpoints module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that all makes sense. The concepts of URI/Fields can definitely live externally from the HTTP module. Otherwise we'll be re-implementing them for things like WebSockets, MQTT, etc.

I'm also fine creating a rough shape for Endpoint right now and deferring proper integration until implementation starts.

Copy link
Contributor Author

@jonemo jonemo Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the change to move URI and everything related to Fields out of the http modules. If I pushed this to the branch here it would increase the PR diff to 1,110 additions and 829 deletions. I don't think that's what we want, so I'll leave that for an immediate follow-on PR after this one is merged.

(Note to self: The commit with those changes is ff6a879.)

url=self._serialize_url_without_query(request.destination),
params=parse_qs(request.destination.query),
headers=headers_list,
data=await request.consume_body(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't assume that the body is going to fit into memory.

aiohttp's documentation for this parameter is actually wrong because it says:

This can be a FormData object or anything that can be passed into FormData

But that would be an absurd api if true because it would mean that aiohttp only supports bodies with the content types multipart/form-data or application/x-www-form-urlencoded. If you look at the code that handles the body you can see it actually delegates to this registry thing that supports delegating to a bunch of different body handlers depending on the type passed in. Fortunately for us, one of those handlers takes an AsyncIterable, so we can just hand it off directly.

Suggested change
data=await request.consume_body(),
data=request.body,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @JordonPhillips for doing the detective work on this. Your suggestion is included in 3ae2bea

async def async_list(lst: Iterable[_ListEl]) -> AsyncIterable[_ListEl]:
"""Turn an Iterable into an AsyncIterable"""
for x in lst:
await sleep(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've actually found this function useful in writing tests. Its not exactly straightforward to write an async iterable from scratch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that this question is specifically about the sleep: I am not certain it is necessary in any realistic case where this function is used. The general idea is that it makes sure that control is handed back to the event loop at some point while iterating.

# Conflicts:
#	codegen/smithy-python-codegen/src/main/java/software/amazon/smithy/python/codegen/HttpProtocolTestGenerator.java
#	codegen/smithy-python-codegen/src/main/java/software/amazon/smithy/python/codegen/integration/HttpBindingProtocolGenerator.java
endpoint.url.host = context.transport_request.url.host + endpoint.url.host
context._transport_request.url = endpoint.url
context._transport_request.headers.extend(endpoint.headers)
if not endpoint.uri.path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're setting a falsy value here, I think this should check if its None instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dummy comment to request changes. accidentally hit approve button 🙃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may be able to handle that in a future PR but since we're not functionally changing the behavior here, we should try to avoid additional scope creep.

@dlm6693 dlm6693 self-requested a review February 9, 2023 14:59
endpoint.url.host = context.transport_request.url.host + endpoint.url.host
context._transport_request.url = endpoint.url
context._transport_request.headers.extend(endpoint.headers)
if not endpoint.uri.path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dummy comment to request changes. accidentally hit approve button 🙃

Copy link
Contributor

@dlm6693 dlm6693 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions and one minor formatting nit, but I think this is looking pretty good!

Comment on lines +72 to 75
def _serialize_uri_without_query(self, uri: interfaces.URI) -> str:
"""Serialize all parts of the URI up to and including the path."""
components = (uri.scheme, uri.host, uri.path or "", "", "", "")
return urlunparse(components)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular advantage to doing it this way vs. constructing a new URI with no query and calling .build on it? Wondering if doing it that way may be 'better' but it probably doesn't matter.

Comment on lines +127 to +130
def reason(self) -> str | None:
"""Optional string provided by the server explaining the status."""
# TODO: See how CRT exposes reason.
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this was explained before, but could you provide additional context on why this is being emptied out? Will this be a part of the follow-up PR you're working on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "emptied out"? The interface prescribes having a reason field. My understanding is that this is the string that a server optionally returns in the same line as the numeric HTTP status code. The awscrt Python package doesn't return this reason string, at least not in an obvious place. I also don't see it getting dealt with in the awscrt C code. I left this TODO comment here to signal that this is something that maybe could be dug into. I did not make a followup task specifically for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be a part of the follow-up PR you're working on?

The only followup PR I am currently committed to is the one mentioned in this comment. Apart from that, I am not currently assigned to any tasks in this repo.

@jonemo jonemo requested a review from dlm6693 February 15, 2023 20:09
Copy link
Contributor

@JordonPhillips JordonPhillips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still one bit un-addressed but I'm fine deferring that to a follow-up.

@jonemo jonemo merged commit 21a9e01 into develop Feb 16, 2023
@jonemo jonemo deleted the http-interface-update-2 branch February 16, 2023 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants