Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid requirement that AWS Secret Manager JSON values be urlencoded. #25432

Conversation

dwreeves
Copy link
Contributor

@dwreeves dwreeves commented Jul 31, 2022

This PR addresses #25104


Changes

1. JSON secrets do not need to be URL-encoded when full_url_mode=False.

^ The whole reason for this PR. This is the main behavior that is being implemented.

Specifically, this behavior is implemented for when get_connection() is called. This method returns a Optional[Connection] object. The Connection() can be built either using a uri=?, or with kwargs corresponding with the parts of the URI, e.g. host=?, port=?, etc. In the former case, URL-encoding is required; in the latter case, it is not.

Users will receive a DeprecationWarning if the code detects that their secrets are URL-encoded (more on that in section 2 below). In most cases, the user should be able to simply decode their secret and everything will continue working normally.

I tried to make this change, as well as the other changes, as quietly as possible for typical Airflow runtimes. Basically, if the user isn't doing anything weird, the only thing they will be required to do is change their secrets from being URL-encoded to decoded.

To support this behavior, a few additional things were either needed to be changed, or made a lot of sense to change. The rest of this message will describe what those additional changes are.

2. Added secret_values_are_urlencoded=? kwarg for SecretsManagerBackend.__init__, albeit most users do not need to touch this.

@potiuk suggested adding a kwarg for assisting in migration. I wanted to avoid this if necessary because it is very obtrusive and causes a negative user experience.

Is it possible to avoid needing to reconfigure the secrets manager backend? Yes, in most cases!

What I dicovered is that if decoding the URL-encoded values is idempotent, the user needs to decode their secrets, but the backend_kwargs config option does not need to be adjusted. The reason why idempotency obviates a need to touch the config is that idempotency means the intended value of the secret is unambiguous once the values are decoded. This is a pretty big win from a user experience perspective because adjusting the Airflow configuration can be a nuisance in practice (e.g. a developer might need to get a system administrator involved).

I add a longer explanation in the comments for the code:

# When ``unquote(v) == v``, then removing unquote won't affect the user, regardless of
# whether or not ``v`` is URL-encoded. For example, "foo bar" is not URL-encoded. But
# because decoding it doesn't affect the value, then it will migrate safely when
# ``unquote`` gets removed.
#
# When parameters are URL-encoded, but decoding is idempotent, we need to warn the user
# to un-escape their secrets. For example, if "foo%20bar" is a URL-encoded string, then
# decoding is idempotent because ``unquote(unquote("foo%20bar")) == unquote("foo%20bar")``.
#
# In the rare situation that value is URL-encoded but the decoding is _not_ idempotent,
# this causes a major issue. For example, if "foo%2520bar" is URL-encoded, then decoding is
# _not_ idempotent because ``unquote(unquote("foo%2520bar")) != unquote("foo%2520bar")``
#
# This causes a problem for migration because if the user decodes their value, we cannot
# infer that is the case by looking at the decoded value (from our vantage point, it will
# look to be URL-encoded.)
#
# So when this uncommon situation occurs, the user _must_ adjust the configuration and set
# ``parameters_are_urlencoded`` to False to migrate safely. In all other cases, we do not
# need the user to adjust this object to migrate; they can transition their secrets with
# the default configuration.

The kwarg is needed in the very rare case that the kwarg is not idempotent, i.e. the string literal for the decoded secret contains a %. In this case, the user will need to manually set secret_values_are_urlencoded to False.

3. Send DeprecationWarning in a niche situation for get_conn_value().

get_conn_value() is allowed to return a JSON as of 2.3.0.

However, in the unlikely case that someone is both (1) using get_conn_value() directly, and (2) is using full_url_mode=False, we want to warn them that they will no longer receive a URL in the future.

  1. The base implementation of get_connection() will generate a Connection object when get_conn_value() returns a JSON-- or more conceptually, when the secret is a valid JSON.

  2. When get_conn_value() returns a JSON, get_connection() is able to create a Connection object from the JSON.
    a. Crucially, this does not require URL-encoding for the base Airflow APIs.

This is really challenging to do elegantly if the method SecretsManagerBackend.get_conn_value() needs to retain 100% backwards compatibility. By that I mean: if it is never allowed to return a JSON string representation of the secret.

For example, if _get_secret() returns a string '{"host": "foo", "conn_type": "postgres", "schema": "mydb"}', then the current behavior is that SecretsManagerBackend.get_conn_value() will return a string 'postgres://:@foo/mydb'.

Under the base class's implementation, BaseSecretsBackend.get_conn_value() is allowed to return a JSON string. But SecretsManagerBackend never does that. If the behavior of the overridden is relaxed to allow for JSON strings as per the base implementation, then this becomes a little easier to do without writing complete spaghetti.

For "typical" Airflow API usage there is no harm because get_conn_value() is not typically called directly. However, this is a pretty big package, and lots of people do lots of things you might not expect, so I opted to go with smoothly transitioning away from returning a URI.

4. Deprecate ast.literal_eval (i.e. support for dict reprs) for JSON decoding.

I want to be diplomatic, but I also want to get to the point, so please do not interpret this negatively. Here is the original code:

try:
    secret_string = self._get_secret(self.connections_prefix, conn_id)
    # json.loads gives error
    secret = ast.literal_eval(secret_string) if secret_string else None
except ValueError:  # 'malformed node or string: ' error, for empty conns
    connection = None
    secret = None

^ This is speculation, but upon review, it appears that ast.literal_eval may have been added to the PR because one of the test cases contains a trailing comma in a JSON, and the author forced the test to pass using ast.literal_eval instead of removing the trailing comma and using json.loads, which would be the more typical thing to do.

There are two reasons to remove ast.literal_eval and just use json.loads. The first reason is a bit more philosophical, which is that the API should not support an odd implementation and should not hand-hold for mistakes at this level of simplicity.

The second reason is to provide a little more consistency across the Airflow API:

  • {'conn_type': 'postgres', 'host': 'postgres'} is not a valid secret elsewhere in Airflow, but {"conn_type": "postgres", "host": "postgres"} is.
  • get_conn_value() is allowed to return a JSON string, but not a dict repr. When we migrate toward get_conn_value() returning a JSON string, we should discourage use of dict reprs.

The original PR that implemented the ast.literal_eval() approach (#15104) was primarily focused on adding more support for various aliases for keys. Overall, this a fine addition to the code. For various reasons, maintainers should be committed to retaining this feature. However, the use of ast.literal_eval was never part of the discussion for this specification; there was not a PR specifically devoted to migrating json.loads to ast.literal_eval.

Removing it also should not be disruptive to the vast majority of people, since:

  • The Secrets Manager UI defaults to parsing manually input key-value pairs as a JSON.
  • If you input a custom string that is a dict repr but not a JSON, the UI will show an error. It would be unusual for a user to submit this as a secret, since the user would be ignoring the UI's warnings. (I just tested and confirmed manually that this is the case by creating a secret that is {'foo': 'bar'}.)

5. Support extra being either a JSON string or a dict

AWS Secrets Manager allows for storage of arbitrary strings. For example, both of these are valid secrets to store within the Secrets Manager:

{"extra": "{\"foo\": \"bar\"}"}

and

{"extra": {"foo": "bar"}}

Right now, the former is supported but not the latter. There doesn't seem to be a good reason why the latter should not be supported, other than that the AWS UI will fail to parse key-value pairs. But it's still a valid secret!

6. Added some type annotations.

Some function signatures were missing type annotations, so I added them in. All of the type annotations for overridden methods adhere to the signatures from the base implementation of the class.

@potiuk
Copy link
Member

potiuk commented Aug 4, 2022

Can you please make tests pass before anyone deeply dives?

@potiuk
Copy link
Member

potiuk commented Aug 4, 2022

Also doc build fail

@dwreeves
Copy link
Contributor Author

dwreeves commented Aug 4, 2022

Sorry about that, I'll get to this later today.

@dwreeves dwreeves force-pushed the 25104-remove-SecretsManagerBackend-json-url-encoding-requirement branch from f78b016 to 43286d4 Compare August 4, 2022 17:07
@dwreeves
Copy link
Contributor Author

dwreeves commented Aug 5, 2022

@potiuk Tests passing. Sorry about that; I had a typo in one of the tests I added, and I'm still not entirely sure why the Sphinx issue happened, but it's fixed.

@potiuk
Copy link
Member

potiuk commented Aug 5, 2022

@o-nikolas @ferruzzi @vincbeck -> I'd also love to hear your opinion on that one.

@dwreeves dwreeves force-pushed the 25104-remove-SecretsManagerBackend-json-url-encoding-requirement branch from 7ef3e3e to eb25b18 Compare August 6, 2022 12:35
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@potiuk
Copy link
Member

potiuk commented Aug 6, 2022

Let me know @vincbeck if you have more comments, otherwise I merge it before releasing providers.

@dwreeves
Copy link
Contributor Author

dwreeves commented Aug 6, 2022

Thank you for the good feedback @vincbeck, and thank you for the approval @potiuk!

@potiuk potiuk merged commit ae7bf47 into apache:main Aug 7, 2022
@vincbeck
Copy link
Contributor

vincbeck commented Aug 8, 2022

I know the PR is already merged but LGTM :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants