-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable json serialization for secrets backend #19857
Enable json serialization for secrets backend #19857
Conversation
Ah nice. Indeed Airflow connection URI IS trricky. Comment while you are at it @dstandish, as this seems very much related I think this does not solve a specific case that several of our users mentioned. And I am just thinking whether we maybe also try to solve it systemically rather than telling the users "Write your custom Secrets Backend" of "use _CMD" There is a case where the users already have processes and tools to automatically rotate their credentials, but only when they are standalone "values" - not part of URI, not part of dictionary. The credentials (passwords/tokens etc). are the only things that change when they are rotated - all the rest - user, host, extras remain as they were. That for me is really, really valid use case - where part of the connection (the non-secret one) is 'static' but credential is dynamic. And the tools that the organisation has treats those two separately. Example issue #19217 (but there were few other similar discussions). I do not have a "perfect" solution for that but I thought about an options of joinng the two options to allow to keep connection "metadata" in our metadata DB and kee[ "secret" credentials in Secrets. That seems to use both types of storages in the way they were designed for. One possible solution I could imagine the case that you configure your connection in Airflow DB and in the place of password you put (
Then airflow could combine thet two and retrieve from the metadata DB for most of the connection and combine it with credential(s) retrieved from the secrets manager. I think it should be rather easy to implement something like that (and have a flag in secrets which would allow to combine metadata + secrets). We already ALWAYS have both - metadata available and possible to use when secret is enabled, so operationally it's not a big change. WDYT? |
@potiuk You don't necessary need to write a custom secrets backend in order to have secrets rotation. E.g. if you want to use aws secrets manager's built-in secrets rotation capabilities, the existing backend now supports it. and presumably with most of the backends we have, you could implement rotation with processes external to airflow. that example:
is that really all that common? i think i'm a bit skeptical that we should get more involved in secrets / connections management / rotation. i think it might be best to leave it to the external tools. and pulling part of a connection from secrets backend and another part of it from another secrets backend... that could get confusing pretty quickly, particularly when considering the diversity in the structure of connections in airflow. e.g. it's not just separately, what do you think of the general idea of this PR though? should i proceed with it you think? i think i would also like to add abilitity to load from cli using json instead of URI (i.e. putting json on same level as URI) and, ultimately i think it's best to deprecate airflow URI but that could be more controversial. Already we have some secrets backends supporting json values, and we could continue to add support on a piecemeal basis, but i figure we should just make it first class citizen and standardize the approach. I considered possibly using a config key in |
Yeah. I know that, but one problem I have with it is that it is AWS only and you have to add quite a lot of code to support similar patterns for other secrets. I thought about something more "generic" . Maybe we should apply the approach of AWS for all other backends? Maybe we can find a similar, generic-enough solution for all of them. I think your proposal is attempt to do so, but I think (see below) it might be that it looks at the problem from a wrong angle.
I have no "hard" data on that one. Just anecdotal evidence. But when I take off my "Airflow" hat and start to think as a user (in this case a person who operates organisational secrets and has to organize management. security, rotation etc.), I believe our approach is very "airflow centric" and in fact "selfish". If I were such a person who manages "credentials" for an organisation, I'd be quite pissed of at Airflow. Why? Secrets storages are "shared" resources for an organisation. IMHO we should not think about them as "general storage" which adapts to the client but they are "centerpiece" of managing identities, secrets, passwords etc. by the organisations. And often follow a rigorous structure of secrets stored there, that is decided on an managed on "organisational" level. For example it is important, I believe, that if there is a service that can be acesses by many clients, the "secret" (password/token) is kept in the secret manager only once - and it's the clients should adapt to where and how it should be retrieved from. It simplifies rotation, security, access audits, security crisis responses etc. The big problem with the current approach of Airflow is that Airflow expects the secret to be in a format that only Airflow understands. And it's the same - whether it's URI or dictionary in your proposal. No other system that needs the same password for the same external service will be able to use it (unles they specifically implement support for "airflow-formatted-secret"). This basically means that if the same token needs to be used by another (non-Airlfow) client, the organisation will have to rotate it in two places - once in the format that is required by Airflow, and once in the format that is understood by the other client. The AWS implementaiton is I think the manifestation of that issue - the reason why we have configurable fields is because we wanted to tap into "non airflow formats" of storing the passwords/token. Where I think the only "universal" assumption we can make is that "password" or "token" is stored as "single, standalone secret". Not as part of URI, not as part of dictionary of specific structure. This is one of the reasons why for example LDAP clients are so configurable - because the clients have to be able to adapt to the organisational schema chosen by the organisation, not the other way round. AWS implementation (with full_url_mode = false) now is pretty much what LDAP clients do. Looking at it from the organisation that has to maintain many systems, I think it's very "selfish" (and self-centered) from Airflow side to expect the organisation to store the "token/password" in the format that only Airflow understands.
Absolutely agree. And I am not about managing the connection. The proposal I made with `SECRET:' is probably not good (for the reason you explained below. It was really one of the ways how to retrieve the password as single "secret" and plug-in the existing Airflow approach. But the AWS approach is indeed better (though more complex implementation and more difficult to generalize to all secret backends). Maybe we can find a solution that will:
Agree. My idea is quite bad from that point of view. Though it is generic enough - it is not limited to password only - the "SECRET://" pattern could be used in any of the fields of the connection (including extras).
I think it depends on whether you (and others) would follow my proposed line of thinking. As explained above - it still requires from the organisation to keep the structure that only Airflow understands. I think this direction is not good in general, I think we should implement a way to retrieve secrets similarly to what AWS does rather than introduce yet-another format of storing secrets, but If I am the only one who thinks this is wrong and won't convince others, then yeah - it's certainly better than URI, so I am not going to block it.
The "standardize" part is precisely what worries me as we are trying to impose "airflow-standard" on something that I feel we should not (i.e. centralized secret storage). But - I might be as well wrong. Maybe it is common practice to store secrets in proprietary formats. Maybe it's simply worth to check it with a number of (big corporate) users who have their experiences with that. |
That is not what I'm attempting to do here. JSON is simply a better, friendlier way to store creds than URI. And the cat is out of the bag with allowing JSON storage of connections. So, I think we should embrace it and allow it uniformly. What's the alternative? Do you think we should disallow entirely the use of json as secrets serialization format? Or perhaps allow it on a case-by-case basis? Or, perhaps for every backend we should make a json version such e.g. EnvironmentVariablesJSONBackend and SystemManagerParameterStoreJSONBackend etc? Separately, I do recognize that having one place to store creds retrievable from many systems can be nice. Using JSON can actually help in this regard. I've actually done this where jupyter and airflow pulled from same creds in aws systems manager. But that's not the problem I'm trying to solve in this PR. |
Sure - don't get me wrong. I do see the incrememental improvement of using json, and I am not at all against it, I just think that it does not solve "real" (and different) problem of some organisations which might want to keep secrets in their own structure rather than following Airflow ways. It was not my intention to block it or contest it - it was more to raise attention to the fact that it might not be solving a more 'painful" issue for our users. Which is perfectly fine - it does not have to. It solves other issues and I agree it is an improvement over the current URI form. However I do think there is not a lot we should do now about the bigger (IMHO) issue (unless someone wants to take on the task on trying to generalize the approach that is implemented in AWS secret backend). And It's good we had this discussion because I had a chance to write down my thoughts about it and realise that what we really need to do is to add a little chapter to simply describe the status-quo and guide the users who have this kind of problem to roll their own backends (which I proposed in #19859) For now this is I think perfectly fine to keep it this way (and I'd love to hear other's comments on the "json format" as well). |
As a result of discussion in apache#19857, I propose to add this short chapter to respond to anticipated need of organisations to keep the connections in format that is not Airflow-exclusive. I think it would be good to explicitly state what is the Airflow approach in this case (i.e. either using existing capabilities of secret backends when they are there - for example in AWS - or rolling your own backend, possibly by extending the community provided ones if the flexibility is not implemented by the community provided backend.
#19859) * Add a short chapter focusing on adapting secret format for connections As a result of discussion in #19857, I propose to add this short chapter to respond to anticipated need of organisations to keep the connections in format that is not Airflow-exclusive. I think it would be good to explicitly state what is the Airflow approach in this case (i.e. either using existing capabilities of secret backends when they are there - for example in AWS - or rolling your own backend, possibly by extending the community provided ones if the flexibility is not implemented by the community provided backend.
4bb8b70
to
a380c35
Compare
ok @potiuk if you're interested would welcome your feedback on the approach taken here w.r.t. adding json support for any backend that serializes to a string value. |
apache#19859) * Add a short chapter focusing on adapting secret format for connections As a result of discussion in apache#19857, I propose to add this short chapter to respond to anticipated need of organisations to keep the connections in format that is not Airflow-exclusive. I think it would be good to explicitly state what is the Airflow approach in this case (i.e. either using existing capabilities of secret backends when they are there - for example in AWS - or rolling your own backend, possibly by extending the community provided ones if the flexibility is not implemented by the community provided backend.
#19859) * Add a short chapter focusing on adapting secret format for connections As a result of discussion in #19857, I propose to add this short chapter to respond to anticipated need of organisations to keep the connections in format that is not Airflow-exclusive. I think it would be good to explicitly state what is the Airflow approach in this case (i.e. either using existing capabilities of secret backends when they are there - for example in AWS - or rolling your own backend, possibly by extending the community provided ones if the flexibility is not implemented by the community provided backend. (cherry picked from commit 399ae0b)
5a29c6c
to
496294c
Compare
5c99a19
to
df116f4
Compare
0602607
to
35a81d4
Compare
35a81d4
to
3a44361
Compare
6ec3018
to
1dd2b6b
Compare
b90bd3d
to
5fb1d0f
Compare
OK finally have the docs in, i think, a decent place. Really required a lot of work to try to get things where they feel reasonably coherent (talking bout the managing connections doc). I have attached the rendered doc for easier review. |
|
||
.. _cli-export-connections: | ||
|
||
Exporting Connections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is relocated from managing connections doc (and linked to from there)
688f1e3
to
931b90a
Compare
Previously in general we could only store connections in the Airflow URI format. With this change we can serialize as json. The Airflow URI format can be very tricky to work with and although we have for some time had a convenience method Connection.get_uri, using json is just simpler.
931b90a
to
85beb36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are at this, I think that we should mask the extra field
i'd rather not add that to this one. it's big enough already and that seems very much like a separate change. |
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
thanks @potiuk happy to get this one over the finish line 🎉 |
🚀 |
I'm starting to see warnings raised about We have several references in the code to the deprecated method we should fix the references to avoid the warrnings |
ok thanks @eladkal -- i can take a look at this. can you give me any examples of where you are seeing this? |
@dstandish example:
It produce the warning (some other unrelated warnings as well):
|
got it @eladkal |
In apache#19857 we enabled storing connections as JSON instead of URI and renamed get_conn_uri to get_conn_value to be consistent with this change. The method get_conn_uri is now deprecated and should warn when used.
TLDR
Add support for serializing connections as JSON instead of the Airflow Connection URI format.
Example:
Overview
Previously in general we could only store connections in the Airflow URI format. With this change we can serialize as json.
The Airflow URI format can be very tricky to work with and although we have for some time had a convenience method Connection.get_uri, using json is just simpler.
We already have some secrets backends supporting json values, and we could continue to add support on a piecemeal basis, but i figure we should just make it first class citizen and standardize the logic.
Approach
rename
get_conn_uri
-->get_conn_value
More accurately, we add
get_conn_value
and deprecateget_conn_uri
.The goal of this rename is only to make the name appropriately generic -- not to initiate deprecation of airflow's URI format.
I replace
get_conn_uri
with the more generically-namedget_conn_value
, with the assumption that the value may not be the URI representation but instead a JSON representation.Secrets backends which have implemented
get_conn_uri
will continue to work, but the method is deprecated and should be removed at the next opportunity.The methods
get_conn_value
andget_conn_uri
will be called in order so that the new method is tried first and if not implemented we fall back to the old method.add method
deserialize_connection
This method handles construction of the Connection object from the string value returned from
get_conn_value
. This provides a single location for parsing a string as a Connection.added option
conn-json
to cli connection add command