Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable json serialization for secrets backend #19857

Merged
merged 1 commit into from
Mar 15, 2022

Conversation

dstandish
Copy link
Contributor

@dstandish dstandish commented Nov 27, 2021

TLDR

Add support for serializing connections as JSON instead of the Airflow Connection URI format.

Example:

    export AIRFLOW_CONN_MY_PROD_DATABASE='{
        "conn_type": "my-conn-type",
        "login": "my-login",
        "password": "my-password",
        "host": "my-host",
        "port": 1234,
        "schema": "my-schema",
        "extra": {
            "param1": "val1",
            "param2": "val2"
        }
    }'

Overview

Previously in general we could only store connections in the Airflow URI format. With this change we can serialize as json.

The Airflow URI format can be very tricky to work with and although we have for some time had a convenience method Connection.get_uri, using json is just simpler.

We already have some secrets backends supporting json values, and we could continue to add support on a piecemeal basis, but i figure we should just make it first class citizen and standardize the logic.

Approach

rename get_conn_uri --> get_conn_value

More accurately, we add get_conn_value and deprecate get_conn_uri.

The goal of this rename is only to make the name appropriately generic -- not to initiate deprecation of airflow's URI format.

I replace get_conn_uri with the more generically-named get_conn_value, with the assumption that the value may not be the URI representation but instead a JSON representation.

Secrets backends which have implemented get_conn_uri will continue to work, but the method is deprecated and should be removed at the next opportunity.

The methods get_conn_value and get_conn_uri will be called in order so that the new method is tried first and if not implemented we fall back to the old method.

add method deserialize_connection

This method handles construction of the Connection object from the string value returned from get_conn_value. This provides a single location for parsing a string as a Connection.

added option conn-json to cli connection add command

@dstandish dstandish marked this pull request as draft November 27, 2021 15:16
@potiuk
Copy link
Member

potiuk commented Nov 27, 2021

Ah nice. Indeed Airflow connection URI IS trricky.

Comment while you are at it @dstandish, as this seems very much related

I think this does not solve a specific case that several of our users mentioned. And I am just thinking whether we maybe also try to solve it systemically rather than telling the users "Write your custom Secrets Backend" of "use _CMD"

There is a case where the users already have processes and tools to automatically rotate their credentials, but only when they are standalone "values" - not part of URI, not part of dictionary. The credentials (passwords/tokens etc). are the only things that change when they are rotated - all the rest - user, host, extras remain as they were. That for me is really, really valid use case - where part of the connection (the non-secret one) is 'static' but credential is dynamic. And the tools that the organisation has treats those two separately.

Example issue #19217 (but there were few other similar discussions).

I do not have a "perfect" solution for that but I thought about an options of joinng the two options to allow to keep connection "metadata" in our metadata DB and kee[ "secret" credentials in Secrets. That seems to use both types of storages in the way they were designed for.

One possible solution I could imagine the case that you configure your connection in Airflow DB and in the place of password you put (/connections/my_secret_password is the "key" to use when querying Secret Backend):

SECRET:/connections/my_secret_password

Then airflow could combine thet two and retrieve from the metadata DB for most of the connection and combine it with credential(s) retrieved from the secrets manager.

I think it should be rather easy to implement something like that (and have a flag in secrets which would allow to combine metadata + secrets). We already ALWAYS have both - metadata available and possible to use when secret is enabled, so operationally it's not a big change.

WDYT?

@dstandish
Copy link
Contributor Author

dstandish commented Nov 27, 2021

@potiuk You don't necessary need to write a custom secrets backend in order to have secrets rotation. E.g. if you want to use aws secrets manager's built-in secrets rotation capabilities, the existing backend now supports it. and presumably with most of the backends we have, you could implement rotation with processes external to airflow.

that example:

And the tools that the organisation has treats those two separately

is that really all that common?

i think i'm a bit skeptical that we should get more involved in secrets / connections management / rotation. i think it might be best to leave it to the external tools.

and pulling part of a connection from secrets backend and another part of it from another secrets backend... that could get confusing pretty quickly, particularly when considering the diversity in the structure of connections in airflow. e.g. it's not just password that might be "secret" and in need of rotation..


separately, what do you think of the general idea of this PR though? should i proceed with it you think? i think i would also like to add abilitity to load from cli using json instead of URI (i.e. putting json on same level as URI) and, ultimately i think it's best to deprecate airflow URI but that could be more controversial.

Already we have some secrets backends supporting json values, and we could continue to add support on a piecemeal basis, but i figure we should just make it first class citizen and standardize the approach.

I considered possibly using a config key in airflow.cfg to flip the default expectation from URI to json. And we could do this and put it in [secrets]. But the problem is then you don't have the flexibility to use json in one backend, URI in another etc.

@potiuk
Copy link
Member

potiuk commented Nov 27, 2021

@potiuk You don't necessary need to write a custom secrets backend in order to have secrets rotation. E.g. if you want to use aws secrets manager's built-in secrets rotation capabilities, the existing backend now supports it. and presumably with most of the backends we have, you could implement rotation with processes external to airflow.

Yeah. I know that, but one problem I have with it is that it is AWS only and you have to add quite a lot of code to support similar patterns for other secrets. I thought about something more "generic" . Maybe we should apply the approach of AWS for all other backends? Maybe we can find a similar, generic-enough solution for all of them. I think your proposal is attempt to do so, but I think (see below) it might be that it looks at the problem from a wrong angle.

is that really all that common?

I have no "hard" data on that one. Just anecdotal evidence.

But when I take off my "Airflow" hat and start to think as a user (in this case a person who operates organisational secrets and has to organize management. security, rotation etc.), I believe our approach is very "airflow centric" and in fact "selfish". If I were such a person who manages "credentials" for an organisation, I'd be quite pissed of at Airflow. Why?

Secrets storages are "shared" resources for an organisation. IMHO we should not think about them as "general storage" which adapts to the client but they are "centerpiece" of managing identities, secrets, passwords etc. by the organisations. And often follow a rigorous structure of secrets stored there, that is decided on an managed on "organisational" level. For example it is important, I believe, that if there is a service that can be acesses by many clients, the "secret" (password/token) is kept in the secret manager only once - and it's the clients should adapt to where and how it should be retrieved from. It simplifies rotation, security, access audits, security crisis responses etc.

The big problem with the current approach of Airflow is that Airflow expects the secret to be in a format that only Airflow understands. And it's the same - whether it's URI or dictionary in your proposal. No other system that needs the same password for the same external service will be able to use it (unles they specifically implement support for "airflow-formatted-secret"). This basically means that if the same token needs to be used by another (non-Airlfow) client, the organisation will have to rotate it in two places - once in the format that is required by Airflow, and once in the format that is understood by the other client.

The AWS implementaiton is I think the manifestation of that issue - the reason why we have configurable fields is because we wanted to tap into "non airflow formats" of storing the passwords/token. Where I think the only "universal" assumption we can make is that "password" or "token" is stored as "single, standalone secret". Not as part of URI, not as part of dictionary of specific structure. This is one of the reasons why for example LDAP clients are so configurable - because the clients have to be able to adapt to the organisational schema chosen by the organisation, not the other way round. AWS implementation (with full_url_mode = false) now is pretty much what LDAP clients do.

Looking at it from the organisation that has to maintain many systems, I think it's very "selfish" (and self-centered) from Airflow side to expect the organisation to store the "token/password" in the format that only Airflow understands.

i think i'm a bit skeptical that we should get more involved in secrets / connections management / rotation. i think it might be best to leave it to the external tools.

Absolutely agree. And I am not about managing the connection. The proposal I made with `SECRET:' is probably not good (for the reason you explained below. It was really one of the ways how to retrieve the password as single "secret" and plug-in the existing Airflow approach. But the AWS approach is indeed better (though more complex implementation and more difficult to generalize to all secret backends). Maybe we can find a solution that will:

  • apply to all secrets (like your dictionary proposal)
  • will have the configurabiliy that AWS secret backend has
  • does not force the organisation to keep "credentials" as part of the structure that is Airflow-specific

and pulling part of a connection from secrets backend and another part of it from another secrets backend... that could get confusing pretty quickly, particularly when considering the diversity in the structure of connections in airflow. e.g. it's not just password that might be "secret" and in need of rotation..

Agree. My idea is quite bad from that point of view. Though it is generic enough - it is not limited to password only - the "SECRET://" pattern could be used in any of the fields of the connection (including extras).

separately, what do you think of the general idea of this PR though? should i proceed with it you think? i think i would also like to add abilitity to load from cli using json instead of URI (i.e. putting json on same level as URI) and, ultimately i think it's best to deprecate airflow URI but that could be more controversial.

I think it depends on whether you (and others) would follow my proposed line of thinking. As explained above - it still requires from the organisation to keep the structure that only Airflow understands. I think this direction is not good in general, I think we should implement a way to retrieve secrets similarly to what AWS does rather than introduce yet-another format of storing secrets, but If I am the only one who thinks this is wrong and won't convince others, then yeah - it's certainly better than URI, so I am not going to block it.

already we have some secrets backends supporting json values, and we could continue to add support on a piecemeal basis, but i figure we should just make it first class citizen and standardize

The "standardize" part is precisely what worries me as we are trying to impose "airflow-standard" on something that I feel we should not (i.e. centralized secret storage).

But - I might be as well wrong. Maybe it is common practice to store secrets in proprietary formats. Maybe it's simply worth to check it with a number of (big corporate) users who have their experiences with that.

@dstandish
Copy link
Contributor Author

Maybe we should apply the approach of AWS for all other backends? Maybe we can find a similar, generic-enough solution for all of them. I think your proposal is attempt to do so, but I think (see below) it might be that it looks at the problem from a wrong angle.

That is not what I'm attempting to do here. JSON is simply a better, friendlier way to store creds than URI. And the cat is out of the bag with allowing JSON storage of connections. So, I think we should embrace it and allow it uniformly.

What's the alternative? Do you think we should disallow entirely the use of json as secrets serialization format? Or perhaps allow it on a case-by-case basis? Or, perhaps for every backend we should make a json version such e.g. EnvironmentVariablesJSONBackend and SystemManagerParameterStoreJSONBackend etc?

Separately, I do recognize that having one place to store creds retrievable from many systems can be nice. Using JSON can actually help in this regard. I've actually done this where jupyter and airflow pulled from same creds in aws systems manager. But that's not the problem I'm trying to solve in this PR.

@potiuk
Copy link
Member

potiuk commented Nov 28, 2021

Sure - don't get me wrong. I do see the incrememental improvement of using json, and I am not at all against it, I just think that it does not solve "real" (and different) problem of some organisations which might want to keep secrets in their own structure rather than following Airflow ways. It was not my intention to block it or contest it - it was more to raise attention to the fact that it might not be solving a more 'painful" issue for our users. Which is perfectly fine - it does not have to. It solves other issues and I agree it is an improvement over the current URI form.

However I do think there is not a lot we should do now about the bigger (IMHO) issue (unless someone wants to take on the task on trying to generalize the approach that is implemented in AWS secret backend). And It's good we had this discussion because I had a chance to write down my thoughts about it and realise that what we really need to do is to add a little chapter to simply describe the status-quo and guide the users who have this kind of problem to roll their own backends (which I proposed in #19859)

For now this is I think perfectly fine to keep it this way (and I'd love to hear other's comments on the "json format" as well).

potiuk added a commit to potiuk/airflow that referenced this pull request Nov 28, 2021
As a result of discussion in apache#19857, I propose to add this short
chapter to respond to anticipated need of organisations to keep the
connections in format that is not Airflow-exclusive. I think it would
be good to explicitly state what is the Airflow approach in this case
(i.e. either using existing capabilities of secret backends when
they are there  - for example in AWS - or rolling your own backend,
possibly by extending the community provided ones if the flexibility
is not implemented by the community provided backend.
potiuk added a commit that referenced this pull request Nov 29, 2021
#19859)

* Add a short chapter focusing on adapting secret format for connections

As a result of discussion in #19857, I propose to add this short
chapter to respond to anticipated need of organisations to keep the
connections in format that is not Airflow-exclusive. I think it would
be good to explicitly state what is the Airflow approach in this case
(i.e. either using existing capabilities of secret backends when
they are there  - for example in AWS - or rolling your own backend,
possibly by extending the community provided ones if the flexibility
is not implemented by the community provided backend.
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch from 4bb8b70 to a380c35 Compare November 30, 2021 16:13
@dstandish
Copy link
Contributor Author

ok @potiuk if you're interested would welcome your feedback on the approach taken here w.r.t. adding json support for any backend that serializes to a string value.

dillonjohnson pushed a commit to dillonjohnson/airflow that referenced this pull request Dec 1, 2021
apache#19859)

* Add a short chapter focusing on adapting secret format for connections

As a result of discussion in apache#19857, I propose to add this short
chapter to respond to anticipated need of organisations to keep the
connections in format that is not Airflow-exclusive. I think it would
be good to explicitly state what is the Airflow approach in this case
(i.e. either using existing capabilities of secret backends when
they are there  - for example in AWS - or rolling your own backend,
possibly by extending the community provided ones if the flexibility
is not implemented by the community provided backend.
jedcunningham pushed a commit that referenced this pull request Dec 7, 2021
#19859)

* Add a short chapter focusing on adapting secret format for connections

As a result of discussion in #19857, I propose to add this short
chapter to respond to anticipated need of organisations to keep the
connections in format that is not Airflow-exclusive. I think it would
be good to explicitly state what is the Airflow approach in this case
(i.e. either using existing capabilities of secret backends when
they are there  - for example in AWS - or rolling your own backend,
possibly by extending the community provided ones if the flexibility
is not implemented by the community provided backend.

(cherry picked from commit 399ae0b)
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch 2 times, most recently from 5a29c6c to 496294c Compare December 13, 2021 15:58
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch from 5c99a19 to df116f4 Compare December 16, 2021 17:47
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch from 0602607 to 35a81d4 Compare January 11, 2022 21:26
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch from 35a81d4 to 3a44361 Compare January 28, 2022 18:09
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch 2 times, most recently from 6ec3018 to 1dd2b6b Compare February 9, 2022 23:50
@dstandish dstandish marked this pull request as ready for review February 10, 2022 03:53
@dstandish dstandish marked this pull request as ready for review March 4, 2022 21:24
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch 2 times, most recently from b90bd3d to 5fb1d0f Compare March 5, 2022 04:24
@dstandish
Copy link
Contributor Author

OK finally have the docs in, i think, a decent place. Really required a lot of work to try to get things where they feel reasonably coherent (talking bout the managing connections doc).

I have attached the rendered doc for easier review.
Managing connections.zip


.. _cli-export-connections:

Exporting Connections
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is relocated from managing connections doc (and linked to from there)

@dstandish dstandish force-pushed the secrets-backend-store-as-json branch from 688f1e3 to 931b90a Compare March 6, 2022 04:52
Previously in general we could only store connections in the Airflow URI format.  With this change we can serialize as json.  The Airflow URI format can be very tricky to work with and although we have for some time had a convenience method Connection.get_uri, using json is just simpler.
@dstandish dstandish force-pushed the secrets-backend-store-as-json branch from 931b90a to 85beb36 Compare March 6, 2022 17:36
Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are at this, I think that we should mask the extra field

@dstandish
Copy link
Contributor Author

While we are at this, I think that we should mask the extra field

i'd rather not add that to this one. it's big enough already and that seems very much like a separate change.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Mar 15, 2022
@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@dstandish dstandish merged commit 774ca08 into apache:main Mar 15, 2022
@dstandish dstandish deleted the secrets-backend-store-as-json branch March 15, 2022 21:23
@dstandish
Copy link
Contributor Author

dstandish commented Mar 15, 2022

thanks @potiuk happy to get this one over the finish line 🎉

@kaxil
Copy link
Member

kaxil commented Mar 15, 2022

🚀

@eladkal
Copy link
Contributor

eladkal commented Mar 16, 2022

I'm starting to see warnings raised about get_conn_uri:
[2022-03-16, 18:06:38 UTC] {logging_mixin.py:115} WARNING - /opt/***/***/models/connection.py:420 PendingDeprecationWarning: Method get_conn_uriis deprecated. Please useget_conn_value.

We have several references in the code to the deprecated method we should fix the references to avoid the warrnings

@dstandish
Copy link
Contributor Author

ok thanks @eladkal -- i can take a look at this.

can you give me any examples of where you are seeing this?

@eladkal
Copy link
Contributor

eladkal commented Mar 16, 2022

@dstandish example:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


default_args = {
    'start_date': datetime(2022, 3, 15),
    'owner': 'airflow',
    'retries': 1
}


def test_hook():
    s3_hook = S3Hook()
    s3_hook.load_file(filename='bla', key='blaa', bucket_name='fff')


dag = DAG(dag_id='my_dag', default_args=default_args)

py_test = PythonOperator(task_id='python_task', python_callable=test_hook, dag=dag)

It produce the warning (some other unrelated warnings as well):

[2022-03-16, 19:26:09 UTC] {logging_mixin.py:115} WARNING - /opt/***/***/secrets/base_secrets.py:95 PendingDeprecationWarning: This method is deprecated. Please use `***.secrets.environment_variables.EnvironmentVariablesBackend.get_conn_value`.
[2022-03-16, 19:26:09 UTC] {logging_mixin.py:115} WARNING - /opt/***/***/models/connection.py:420 PendingDeprecationWarning: Method `get_conn_uri` is deprecated. Please use `get_conn_value`.
[2022-03-16, 19:26:11 UTC] {logging_mixin.py:115} WARNING - /opt/***/***/models/dag.py:1084 SADeprecationWarning: Query.value() is deprecated and will be removed in a future release.  Please use Query.with_entities() in combination with Query.scalar() (deprecated since: 1.4)

@dstandish
Copy link
Contributor Author

got it @eladkal
i think i see the issue, and i'll work on a fix today

dstandish added a commit to astronomer/airflow that referenced this pull request Mar 19, 2022
In apache#19857 we enabled storing connections as JSON instead of URI and renamed get_conn_uri to get_conn_value to be consistent with this change.  The method get_conn_uri is now deprecated and should warn when used.
dstandish added a commit that referenced this pull request Mar 24, 2022
…#22348)

In #19857 we enabled storing connections as JSON instead of URI and renamed get_conn_uri to get_conn_value to be consistent with this change.  The method get_conn_uri is now deprecated and should warn when used.
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label Apr 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers area:secrets full tests needed We need to run full set of tests for this PR to merge provider:amazon AWS/Amazon - related issues type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants