Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow providers to add default connections #32048

Open
1 task done
eladkal opened this issue Jun 21, 2023 · 22 comments
Open
1 task done

Allow providers to add default connections #32048

eladkal opened this issue Jun 21, 2023 · 22 comments
Labels
area:core kind:task A task that needs to be completed as part of a larger issue

Comments

@eladkal
Copy link
Contributor

eladkal commented Jun 21, 2023

Body

Currently we can add default connections only by adding entry to:

def create_default_connections(session: Session = NEW_SESSION):

This means only Airflow core release can add the new default connection which is not in sync with providers.
Examples:
#31533 (comment)

#31787 (comment)

This is a pain because providers support earlier Airflow version (for example right now the minimum Airlfow version supported by providers is Airflow 2.4.0 so even if we add new connection to core and release it with Airflow 2.7.0 the users of 2.4.0 won't have it)

The task:
We need somehow to make it so that providers can add default connections (and extract the existed default provider connections from core)

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@eladkal eladkal added area:core kind:task A task that needs to be completed as part of a larger issue labels Jun 21, 2023
@potiuk
Copy link
Member

potiuk commented Jun 21, 2023

I am not sure if this is really needed. As discussed in #31533 (comment) airflow db init is mostly a "try and test" feature.

There are a a number of things that we should be aware of:

Do not use airflow db init as it can create a lot of default connections, charts, etc. which are not required in production DB.

  • "Syncing" involves high complexity which has a number of problems to solve.

The objective of the airflow db init command is different than what is suggested here.

In the ideal world people will not have those connections created in production and we had absolutely no intention to synchronize those connections with anything. There are a number of problems and considerations to solve if we want to do any kind of any attempt of production-level synchronization. All of them are described and extensively discussed here: #31875

I think the root cause of the problem is the airflow db init name, that suggest that it should be use as first thing when you set-up the database, where it is only intended to be use if you want to "play" with airflow when you first try it.

IMHO we should continue the path that the airflow db init is not production ready, and no synchronisation intended feature. I would rather deprecate the airflow db init and rename it to create-default-connections or similar.

@eladkal
Copy link
Contributor Author

eladkal commented Jun 21, 2023

How do you see create-default-connections works if you don't update Airflow version?
My goal is that providers will be able to add new connections in the UI without forcing user to update Airflow version

@potiuk
Copy link
Member

potiuk commented Jun 21, 2023

My goal is that providers will be able to add new connections in the UI without forcing user to update Airflow version

This is already happening. You do not need to run init, for the connection to appear in the Connection drop-down list. One has nothing to do with the other. You can still manually create the "default" connection with the right type just after installing provider.

@potiuk
Copy link
Member

potiuk commented Jun 21, 2023

You have to remember that if you follow the advice:

Do not use airflow db init as it can create a lot of default connections, charts, etc. which are not required in production DB.

you have to already today manually create the default connections that you need after installation.

They won't and should not be created automatically. In fact the advice above specifically mentions that creating the default connections automatically is wrong and you should manually create them when you need a connection of a given type.

@eladkal
Copy link
Contributor Author

eladkal commented Jun 21, 2023

They won't and should not be created automatically. In fact the advice above specifically mentions that creating the default connections automatically is wrong and you should manually create them when you need a connection of a given type.

I agree.
Lets say that we have this manual command as you describe which users will run ad-hoc when it is needed (say creating a new dev account or something like that)
The open(?) question is what is the source used to generate these is it

def create_default_connections(session: Session = NEW_SESSION):
like it is today? or do you see each provider responsible of it's own default connections ?

@potiuk
Copy link
Member

potiuk commented Jun 21, 2023

We could get provider to contribute it, yes. It could be exposed via get_provider_info and ready via ProvidersManager -once we know what would be the recommended use pattern for those.

@eladkal
Copy link
Contributor Author

eladkal commented Jun 21, 2023

The tasks needs to be done are:

  1. Start mailing list discussion to rethink what is the goal of default connections. What is their purpose? Right now airflow db init will generate all default connections. Is there a use case user would need all of them? why?

2a. If the result of this discussion is that there is no value in it then the action item is to deprecate airflow db init which means also we are deprecating the list of default connections.

2b. If the result of this discussion is that there is value then what we should discuss is:
renaming airflow db init -> airflow db create-default-connections and then discuss the question I raised in the beginning of this thread: Where should the default connections be defined (in providers or in core)?

This issue will be considered completed when either (2a) or (2b) is completed.

@hussein-awala
Copy link
Member

I don't think we really need it, because:

  1. as @potiuk mentioned above, the connections types in the webserver UI are loaded from the hooks, ensuring that the most important requirement is already met.
  2. there is an Airflow configuration to enable/disable adding the default connection database.load_default_connections and in its description we explain that it's just to see how the connection looks like, and most of the users set it to False:

    Whether to load the default connections that ship with Airflow. It’s good to get started, but you probably want to set this to False in a production environment

renaming airflow db init -> airflow db create-default-connections

This command init the database if it dosen't exist, upgrade it if exist, load the default pools and synchronize log_filename_template with the db, why do you want to deprecate it or rename airflow db create-default-connections? IMO we can just change the default value of database.load_default_connections to False, and we will stop loading them by default, and the users can still load them by setting this configuration to True. WDYT?

@potiuk
Copy link
Member

potiuk commented Jun 21, 2023

IMO we can just change the default value of database.load_default_connections to False, and we will stop loading them by default, and the users can still load them by setting this configuration to True. WDYT?

That's - unfortunately - pretty breaking change. We can most likely deprecate init and raise warning and rename it to smth else, but changing it's behaviour would be a breaking change. There are a number of cases where people have the workflows where they depend on 'db init' behaviour and they rely on 'default_connections' being created.

This is unfortunate historical thing, but unfortunatley I see that there is quite high probability we will break someone's workflow - generally outside of the "Hyrum's law" edge case.

But my proposal above is that we do not have to "fix" its behaviour for providers - the connection will not be automatically created when we install new providers, and we do not have to care about back compatibility there, because it never worked this way. But we can deprecate it and discurage even more rather than break things that are broken even more (And clarify it in the docs, deprecation, and renaming the command).

And yes - I think we do not HAVE to change the behaviour of having the connections in the core. While we might not like the fact that adding new provider means new "default" conneciton created in the core, we do not have to change it. One of the options (if we agree this should be treated as a "quick/test only" solution) is to leave those connections in the core. We will have to anyway leave the list of extras in the core. There is no escape from the fact that we will have to list the providers we can install by having an extra - this is already one thing that is impossible to remove from the core. We need to have a list of providers in the core that we have extras for. currently it is done by generating (via pre-commit) 'generated/provider_dependencies.json') - and it used currently as the "source of the truth" for list of extras apache airflow..

If we agree that "default connections" created in the db is a purely test setting and change the perception of that, we can leave them in the core - we can even change them to also be read and configured from a json core, or even add an entry in "generated/provider_dependencies.json" and create them based on it (as long as we make that file part of the airflow package).

That only reinforces my " let's clarify what "default_connections" are - are they test only? useful for prod? - what the 'db init' is for (and rename it if we agree 'db init' should not be used to initialize the db).

But I think we cannot (Without bumping Airflow 2 to Airflow 3) change the way "airflow db init" behaves by default (at most it can raise a warning)

@hussein-awala
Copy link
Member

Yes I see that this can affect some users.

But I think we cannot (Without bumping Airflow 2 to Airflow 3) change the way "airflow db init" behaves by default (at most it can raise a warning)

I think we can deprecate just the part which create the connections (not all the init method), ask the users to set it to False manually, and create a new alternative command to create the default connections. Something similar to

diff --git a/airflow/cli/cli_config.py b/airflow/cli/cli_config.py
index 0c69571fea..7811662841 100644
--- a/airflow/cli/cli_config.py
+++ b/airflow/cli/cli_config.py
@@ -1566,6 +1566,12 @@ DB_COMMANDS = (
         func=lazy_load_command("airflow.cli.commands.db_command.initdb"),
         args=(ARG_VERBOSE,),
     ),
+    ActionCommand(
+        name="create-default-connections",
+        help="Create default connections",
+        func=lazy_load_command("airflow.cli.commands.db_command.create_default_connections"),
+        args=(ARG_VERBOSE,),
+    ),
     ActionCommand(
         name="check-migrations",
         help="Check if migration have finished",
diff --git a/airflow/cli/commands/db_command.py b/airflow/cli/commands/db_command.py
index 64d54cc22e..db044c0ee7 100644
--- a/airflow/cli/commands/db_command.py
+++ b/airflow/cli/commands/db_command.py
@@ -246,3 +246,8 @@ def drop_archived(args):
         table_names=args.tables,
         needs_confirm=not args.yes,
     )
+
+
+def create_default_connections(args):
+    """Create default connections"""
+    db.create_default_connections()
diff --git a/airflow/utils/db.py b/airflow/utils/db.py
index a76f0d4f67..17236c6932 100644
--- a/airflow/utils/db.py
+++ b/airflow/utils/db.py
@@ -732,6 +732,13 @@ def initdb(session: Session = NEW_SESSION, load_connections: bool = True):
         _create_db_from_orm(session=session)
     # Load default connections
     if conf.getboolean("database", "LOAD_DEFAULT_CONNECTIONS") and load_connections:
+        warnings.warn(
+            "Creating default connections via the command `airflow db init` is deprecated.\n"
+            "Please use `airflow db create-default-connections` instead and "
+            "set database.load_default_connections to False.",
+            DeprecationWarning,
+            stacklevel=2,
+        )
         create_default_connections(session=session)
     # Add default pool & sync log_template
     add_default_pool_if_not_exists(session=session)

Then we can remove it after 2 or 3 minor versions if the Airflow core release policy allows us to do like that, otherwise in Airflow 3 as you said.

But if the command airflow db upgrade can replace the command airflow db init (except the part of default connections), we can deprecate the whole command db init and warn the user to use the two commands db upgrade + db create-default-connections instead.

@utkarsharma2
Copy link
Contributor

How do we decide which providers would have default connections out of the box or in the past, since I don't see it for all of them? were there any criteria to be met?

Also, I think we need to document this behavior as well, as discussed in comment.

@potiuk
Copy link
Member

potiuk commented Jun 23, 2023

How do we decide which providers would have default connections out of the box or in the past, since I don't see it for all of them? were there any criteria to be met?

Also, I think we need to document this behavior as well, as discussed in comment.

It's historical. We have no rules.

@potiuk
Copy link
Member

potiuk commented Jun 23, 2023

But yes. I am all for clarifying the behaviour, getting airflow db init to be renamed and defining when we create default conneciotns (and whether they should be coming from providers, or it's enouhg to be in the core).

It would be great if someone leads it and proposes a solution/approves and fixes it :).

@eladkal
Copy link
Contributor Author

eladkal commented Jul 7, 2023

Created #32420 to handle the db init rename.

This issue will focus on the mechanism of adding connections to the default connection list (in other words what the db init / db create-default-connections will do):
Only from core?
discovering connections from providers?
maybe the concept of default connections is not really needed to begin with?

@Adaverse
Copy link
Contributor

Adaverse commented Jul 7, 2023

Currently, airflow db init does two main tasks -

(a) This command init the database if it dosen't exist, upgrade it if exist, load the default pools and synchronizes log_filename_template with the db
(b) Create default connections

We all agree to remove the (b) part of the db init and move it to db create-airflow-connections.

Also, there are some suggestions to remove db init entirely in favor of db upgrade + db create-airflow-connections

IMO it is best to leave db upgrade as it is. Just remove the (b) part of db init to the db create-airflow-connections and leave the (a) part in db init untouched.
The reason not to touch db upgrade is that it does exactly what its name suggests. The reason for keeping db init with only (a) part is that it comes closer to what its name suggests.
Also, changing things this way brings minimum changes from user perspective i.e just add db create-airflow-connections in the flow if default connections are needed and keep everything the way it is.

Let me know what you guys think so that we can proceed with #32420

@eladkal
Copy link
Contributor Author

eladkal commented Jul 8, 2023

Also, there are some suggestions to remove db init entirely in favor of db upgrade + db create-airflow-connections

This is what we want but it can happen only in Airflow 3.
thus for now we only deprecate db init
The db init should not have any logic in it other than invoking db upgrade + db create-airflow-connections

@utkarsharma2
Copy link
Contributor

Ya, we can only do this in a major release.

Also, would it make sense to change the behavior of db create-airflow-connections and only create connections for the providers that are installed for a user?

@potiuk
Copy link
Member

potiuk commented Jul 8, 2023

Yes. We can't really change db init behaviour. But we can do something else. We can deprecate (and hide) init and add a new initialize command that should only work on empty db (and add the create-default- connections of course).

and yes - create-default-connecrions should only eventually create connections for the installed providers. This is what the follow-up should be about. So maybe indeed it should all be done as a single step (or at least we have to make sure that 'add-default-connections is never released in the state wher it would all all connections independently which providers are installed

@Adaverse
Copy link
Contributor

Adaverse commented Aug 3, 2023

create-default-connecrions should only eventually create connections for the installed providers

We could get provider to contribute it, yes. It could be exposed via get_provider_info and ready via ProvidersManager -once we know what would be the recommended use pattern for those.

Since #32420 has been addressed, its time for the followup. I think it is better to get some suggestions from mailing list discussions?

Let me know what you guys think.

@potiuk
Copy link
Member

potiuk commented Aug 3, 2023

I think it's quite easy straightforward - follow the way configs were moved to providers. This is a very similar case.

  1. add a "default_connections" config somewhere in provider.yaml (likely "connection-types" should have an additional, optional fields "default-connection" (should also be defined in provider_info.schema.json and provider_yaml.schema.json) - expand ProivdersManager to also be aware of those default connection (likely just changing schemas and .yaml files will be enough becasue I think it uses dictionaries for "connection-types" - to be checked.

  2. move the connections to all providers to be read from there.

  3. add fall-back mechanism - move the list of "pre-2.8.0" (it will be released in 2.8) default connections to an embedded declarative json/yaml file to match providers with the connection that should be created if the provider does not have the "default-connection" entries in their provider-info / provider.yaml. This should cover the case where old providers are used with new Airflow. This should also raise deprecation warning when accessed ("please upgrade provider")

  4. add all the extra stuff for discover: in CLI "airflow providers" and in documentation - to have some information in our docs what are the default connections contributed by each provider. This has the added benefit that it will finally get documented - i.e from provider docs you will be able to see what default connections are created.

@Adaverse
Copy link
Contributor

Adaverse commented Aug 4, 2023

Thanks @potiuk for the elaborated steps. Will create the PR soon.

@potiuk
Copy link
Member

potiuk commented Aug 30, 2023

FY: #33909 I've added more of a "meta" Issue describing our goal about "splitting providers to subfolders" (for now - without starting discussion about splitting them out from the repository yet) .

This might help to understand how this and few other issues fit-in when it comes to a "long-term" target we have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:task A task that needs to be completed as part of a larger issue
Projects
None yet
Development

No branches or pull requests

5 participants