[AIRFLOW-6260] Drive _cmd config options by env var #6801
Conversation
(force-pushed 400abac to 0c80319)
This looks like it's causing failures in the rest of the config tests; can you check what's going on? You'll likely have to run them locally, as the test output doesn't include the full error for some reason.
I wanted to reuse "testsection" as a config section, but it's checked in ways that are incompatible with my tests (I want to have a non-env option in
(force-pushed 0e08b76 to 6381361)
@NBardelot this is a code change, so you will need to create a Jira ticket for it (so it won't be [AIRFLOW-XXX]).
This improves the ability to configure Airflow using Kubernetes best practices. You can, for example, provide AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD referencing a shell script that computes the connection string using Kubernetes Secrets, and that script can be provided to the container using a ConfigMap. A unit test is added to check that an option that should NOT be overridden by a command is read only from the configuration.
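For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of how an env-var-driven _cmd option can be resolved. The function name, the precedence order, and the example connection string are illustrative assumptions; Airflow's real logic lives in airflow.configuration.AirflowConfigParser and differs in detail:

```python
import os
import subprocess
from typing import Optional

def get_option(section: str, key: str) -> Optional[str]:
    """Resolve a config option, letting an env var ending in _CMD
    name a command whose stdout becomes the value (sketch only)."""
    env_name = f"AIRFLOW__{section.upper()}__{key.upper()}"
    # A plain env var takes precedence over the _CMD variant here.
    if env_name in os.environ:
        return os.environ[env_name]
    cmd = os.environ.get(env_name + "_CMD")
    if cmd is not None:
        # e.g. a script mounted from a ConfigMap that reads a Secret file
        return subprocess.check_output(cmd, shell=True, text=True).strip()
    return None

os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD"] = \
    "echo postgresql://user:pw@db/airflow"
print(get_option("core", "sql_alchemy_conn"))
# → postgresql://user:pw@db/airflow
```

In a Kubernetes deployment the command would typically be a script provided via a ConfigMap that assembles the connection string from Secret files mounted into the container, so no clear-text credential ever appears in the configuration itself.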
Codecov Report
Coverage diff (master vs. #6801):

| Metric   | master | #6801  |
|----------|--------|--------|
| Coverage | ?      | 84.29% |
| Files    | ?      | 676    |
| Lines    | ?      | 38362  |
| Branches | ?      | 0      |
| Hits     | ?      | 32338  |
| Misses   | ?      | 6024   |
| Partials | ?      | 0      |
Continue to review full report at Codecov.
I have some doubts about that one. While I understand where it comes from, it has a number of drawbacks:

I wonder what other committers think about it?
I think all your points come down to evaluating the core concept of sensitive-data management in containers. IMHO that's not really a debate (or it would imply having it with the Kubernetes project itself): it is a best practice that sensitive data should only be stored in Secrets. Secrets and ConfigMaps are managed alongside the application's deployment: you would provide the same ConfigMaps and Secrets to all your Airflow components, the same way you would provide them all with the same environment variables or airflow.cfg file.

With the current Airflow design, the only clean way to convey a configuration that includes sensitive data (user/password for the broker, the database, etc.) to Airflow containers in Kubernetes is to provide the whole airflow.cfg file as a Secret. That comes with huge drawbacks:
It is far better to manage your database password in only one Secret, and the script that computes the connection string as a ConfigMap that does not contain the secret in clear text.
@NBardelot: I agree that in the container world that is how you keep secrets, and I understand your need. But we seem to be adding a new generic feature that we might have to support in all future versions of Airflow. If people start using it, we should be careful, because that feature can be used in totally unexpected ways. I can imagine (for example) someone implementing a round-robin script that assigns different

I don't like the "over-generic" approach here; it seems we are killing a fly with a cannon gun (as we say in Polish). But again, this is more to start a discussion than to completely abandon the approach. Maybe I am wrong and others have another opinion. I would love to hear what others think (@ashb, @kaxil, @feluelle, @Fokko, @dimberman, @mik-laj?). Maybe that qualifies for a devlist discussion and should be moved there?

I totally agree that we should never keep many secrets in the same config file together with other settings, but my question is: why do you want to keep secrets in the .cfg file (or, in your case, in an ENV variable) at all? Are there settings that you want to keep secret that you cannot set otherwise? Which ones?

Currently the default way for Airflow to keep connection secrets is to store them in the connection database (which, for security reasons, can be encrypted with a symmetric Fernet key so the secrets are not visible via the UI), not to keep them in the .cfg file or in environment variables (the environment variables are just a convenience, mostly for testing and debugging). This is also security-by-obscurity in a way (but it protects from revealing the password in case someone gets access just to the DB). Maybe we should have a similar approach for anything secret that currently comes from the .cfg file?

But maybe we could use some of the "vault" APIs instead, whether via Secrets in Kubernetes, KMS, Apache Vault, etc. We had a discussion about it on the devlist, I recall. Maybe that's a better approach?
Please consider that this patch doesn't fundamentally change what already exists in Airflow. You can already achieve the same behaviour by providing an airflow.cfg file that has a _cmd option calling a script; you could even write the full script as a one-liner in the _cmd option. This patch only makes the process smoother, and logical from the "you can override options with env vars" point of view.

Plus, think of it as a way to enable convention over configuration for sensitive data in the Helm chart. One could modify the existing Helm templates and values to standardize the mount path of secrets in Airflow containers, instead of setting multiple environment variables with clear-text sensitive data in the Helm values (see the airflow.mapenvsecrets Helm value in the stable Helm Airflow chart). For the moment the Helm values must be secured in some way because of those sensitive values. In a Kubernetes paradigm I think there should not even be a need for this: Secrets are already there for this purpose, with the platform being responsible for choosing how they are stored (in a vault or whatever), because that is not really the concern of the person deploying Airflow.
Never heard that one in Polish (half-Pole here ^^), but it translates just as well into French :)
Geeee. Bad me... I entirely forgot the _cmd option in Airflow. So it seems we are already past the dangerous line where the user can not only write arbitrary Python code in DAGs but also arbitrary bash scripts in the configuration :). Still, what I see is that we allow that only for those "secret-related" cases. From the docs:
So if we allow this ENV variable thing here, it should also be limited to those variables, IMHO.
What is the French version then? :)
Yep, this list of commands is the reason why, after @ashb asked me to add a test, I wrote one to ensure that only _cmd options are executed, and non-cmd ones are not. In French we have: « écraser une mouche avec un marteau » ("to smash a fly with a hammer") ;-)
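The safeguard discussed above (only "secret-related" options may be command-driven, and other options must never be executed) can be sketched as a whitelist check. The option set and function name below are illustrative assumptions; the actual list of command-capable options varies by Airflow version:

```python
import os
import subprocess
from typing import Optional

# Options for which a _cmd override is allowed (illustrative subset;
# the real set lives in Airflow's configuration module).
SENSITIVE_OPTIONS = {
    ("core", "sql_alchemy_conn"),
    ("core", "fernet_key"),
    ("celery", "broker_url"),
    ("celery", "result_backend"),
    ("smtp", "smtp_password"),
}

def get_cmd_option(section: str, key: str) -> Optional[str]:
    """Run the *_CMD env var only for whitelisted sensitive options."""
    if (section, key) not in SENSITIVE_OPTIONS:
        return None  # never execute commands for arbitrary options
    cmd = os.environ.get(f"AIRFLOW__{section.upper()}__{key.upper()}_CMD")
    if cmd is None:
        return None
    return subprocess.check_output(cmd, shell=True, text=True).strip()

os.environ["AIRFLOW__CELERY__BROKER_URL_CMD"] = "echo redis://broker:6379/0"
os.environ["AIRFLOW__WEBSERVER__BASE_URL_CMD"] = "echo should-not-run"
print(get_cmd_option("celery", "broker_url"))   # command is executed
print(get_cmd_option("webserver", "base_url"))  # None: not whitelisted
```

The whitelist is what keeps the feature from becoming a general code-execution hook: a _CMD env var for a non-sensitive option is simply ignored rather than run.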
Ah cool, indeed. Then I happily approve this one. Merci beaucoup!
This improves the ability to configure Airflow using Kubernetes best practices. You can, for example, provide AIRFLOW__CORE__SQL_ALCHEMY_CONN_CMD referencing a shell script that computes the connection string using Kubernetes Secrets, and that script can be provided to the container using a ConfigMap. A unit test is added to check that an option that should NOT be overridden by a command is read only from the configuration. (cherry picked from commit b3e8470)