
Proposal: use standard imports for operators/hooks #1238

Closed
jlowin opened this issue Mar 29, 2016 · 7 comments

Labels: Roadmap (issues/PRs that are big changes on the roadmap)

Comments

jlowin (Member) commented Mar 29, 2016

Currently, Airflow uses a relatively complex import mechanism to import hooks and operators without polluting the namespace with submodules. I would like to propose that Airflow abandon that system and use standard Python importing.

Here are a few major reasons why I think the current system has run its course.

Polluting namespace

The biggest advantage of the current system, as I understand it, is that only Operators appear in the airflow.operators namespace. The submodules that actually contain the operators do not.
So, for example, airflow.operators.python_operator.PythonOperator exists, and PythonOperator appears in the airflow.operators namespace, but python_operator does not.
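For readers unfamiliar with the mechanism, here is a simplified, hypothetical sketch of what a dictionary-driven importer along these lines does (stdlib modules stand in for operator submodules; this is not Airflow's actual code):

```python
import importlib

def import_module_attrs(parent_globals, module_attrs):
    """Hoist named attributes from each listed module into parent_globals,
    silently skipping any module that fails to import."""
    imported = []
    for module_name, attr_names in module_attrs.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # missing optional dependency: silently ignored
        for attr_name in attr_names:
            parent_globals[attr_name] = getattr(module, attr_name)
            imported.append(attr_name)
    return imported

# Demo: the broken entry vanishes without any error message.
ns = {}
imported = import_module_attrs(ns, {"json": ["JSONDecoder"],
                                    "no_such_module": ["FancyOperator"]})
print(imported)  # ['JSONDecoder']
```

The classes end up at the package top level while the submodule names never enter the namespace, which is exactly the behavior described above.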

I think this sort of namespace hygiene was helpful when Airflow was a smaller project, but as the number of hooks/operators grows -- and especially as the contrib hooks/operators grow -- I'd argue that namespacing is a good thing. It provides structure and organization, as well as opportunities for documentation (through module docstrings).

In fact, I'd argue that the current namespace is itself getting quite polluted -- the only way to know what's available is to use something like IPython tab-completion to browse an alphabetical list of Operator names, or to load the source file and grok the import definitions (which no one installing from PyPI is likely to do).

Conditional imports

The second advantage of the current system is that any module that fails to import is silently ignored. This makes it easy to have optional dependencies: for example, if someone doesn't have boto installed, then they don't have an S3Hook either. The same goes for the HiveOperator.

Again, as Airflow grows and matures, I think this is a little too magic. If my environment is missing a dependency, I want to hear about it.

On the other hand, the contrib namespace sort of depends on this -- we don't want users to have to install every single dependency. So I propose that contrib modules all live in their submodules: from airflow.contrib.operators.my_operator import MyOperator. As mentioned previously, having structure and namespacing is a good thing as the project gets more complex.

Other ways to handle this include putting "non-standard" dependencies inside the operator/hook rather than the module (see HiveOperator/HiveHook), so it can be imported but not used. Another is judicious use of try/except ImportError. The simplest is to make people import things explicitly from submodules.
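The first two workarounds can be sketched as follows (all names illustrative, not Airflow's actual code):

```python
# Pattern 1: try/except ImportError at module level; fail loudly on use.
try:
    import some_optional_dep  # illustrative stand-in for e.g. boto
except ImportError:
    some_optional_dep = None

class S3HookSketch:
    def __init__(self):
        if some_optional_dep is None:
            raise ImportError("S3HookSketch requires its optional dependency")

# Pattern 2: move the import inside the hook (cf. HiveOperator/HiveHook),
# so the module always imports but instantiation fails without the dep.
class HiveHookSketch:
    def __init__(self):
        import some_optional_dep  # ImportError raised here, not at import time
```

With either pattern the module itself always imports cleanly, and the user gets a clear ImportError at the moment they actually try to use the hook.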

Operator dependencies

Right now, operators can't depend on each other if they aren't in the same file. This is for the simple reason that there is no guarantee of the order in which the operators will be loaded. It all comes down to which dictionary key gets loaded first. One day Operator B could be loaded after Operator A; the next day it might be loaded before. Consequently, A and B can't depend on each other. Worse, if a user makes two operators that do depend on each other, they won't get an error message when one fails to import.

For contrib modules in particular, this is sort of killer.

Ease of use

It's hard to set up imports for a new operator. The dictionary-based import instructions aren't obvious to new users, and errors are silently dismissed, which makes debugging difficult.

Identity

Surprisingly, airflow.operators.SubDagOperator != airflow.operators.subdag_operator.SubDagOperator. See #1168.
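The mechanism behind that mismatch can be reproduced with a stdlib module: executing the same source a second time produces a brand-new module object, so the classes it defines are distinct objects too (a sketch of the failure mode, not Airflow code):

```python
import importlib.util
from json.decoder import JSONDecoder

# Load json.decoder a second time into a fresh module object.
spec = importlib.util.find_spec("json.decoder")
copy = importlib.util.module_from_spec(spec)
spec.loader.exec_module(copy)

# Same source code, but a different class object:
print(copy.JSONDecoder is JSONDecoder)              # False
print(isinstance(copy.JSONDecoder(), JSONDecoder))  # False
```

Anything that loads an operator module outside the normal sys.modules path can produce two "copies" of a class that fail identity and isinstance checks against each other.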

Proposal

Use standard Python importing for hooks/operators/etc.

  • __init__.py files use straightforward, standard Python imports
  • major operators are available at airflow.operators.OperatorName or airflow.operators.operator_module.OperatorName.
  • contrib operators are only available at airflow.contrib.operators.operator_module.OperatorName in order to manage dependencies
  • operator authors are encouraged to use __all__ to define their module's exports
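Under this proposal, operators/__init__.py would shrink to plain imports (from airflow.operators.python_operator import PythonOperator, and so on). The effect of an explicit __all__ in an operator module can be demonstrated with a throwaway module (all names hypothetical):

```python
import sys
import types

# Build a throwaway "operator module" that declares its public API.
mod = types.ModuleType("my_operator")
exec(
    "__all__ = ['MyOperator']\n"   # explicit export list
    "class MyOperator: pass\n"     # public
    "class _Helper: pass\n",       # private: hidden from star-imports
    mod.__dict__,
)
sys.modules["my_operator"] = mod

ns = {}
exec("from my_operator import *", ns)
print("MyOperator" in ns, "_Helper" in ns)  # True False
```

Star-imports and documentation tools honor __all__, so each module stays in control of exactly what it exports.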

Possibly delete namespace afterward

  • in operators/__init__.py, run a function at the end of the file which deletes all modules from the namespace, leaving only Operators. This keeps the namespace clear but lets people use familiar import mechanisms.
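A minimal sketch of such a cleanup function, assuming the convention that everything worth keeping is a class (stdlib objects stand in for real submodules):

```python
import sys
import types

def remove_submodules(namespace):
    # Delete every module object, keeping the classes hoisted out of them.
    for name, value in list(namespace.items()):
        if isinstance(value, types.ModuleType):
            del namespace[name]

class PythonOperator:  # stand-in for a hoisted operator class
    pass

# Namespace as it might look after `from .python_operator import
# PythonOperator`: both the class and the submodule are present.
ns = {"PythonOperator": PythonOperator, "python_operator": sys}
remove_submodules(ns)
print(sorted(ns))  # ['PythonOperator']
```

Called at the end of operators/__init__.py (on globals()), this would leave only the Operator classes visible while the imports themselves stay completely standard.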

Possibly use an import function to handle ImportError gracefully

  • rewrite import_module_attrs to take one module name at a time instead of a dictionary.
bolkedebruin (Contributor)

+1

artwr (Contributor) commented Mar 29, 2016

+1

jlowin added the Roadmap label Mar 29, 2016
mistercrunch (Member)

+1

also:

  • currently we do lots of unused imports: a typical `airflow run` will import all operators, hooks, and related libs, when in reality it needs only one operator and its deps
  • I haven't looked at the latest utils refactor either, but we should do the same there
  • we need to document the full import path of operators, and maybe clean up the module names where/if needed

but:

  • that breaks the API, and lots of pipelines will need to be altered. Breaking the API means going to 2.0. Is there an option to have some sort of environment hook for monkey-patching operators into airflow.operators? Maybe the currently hard-coded data structure that does this (airflow.operators.__init__) goes away but can be passed in from airflow_settings.py; then we could get away with just an aggressive warning in the next release notes
  • individual operators depending only on hooks (as opposed to other operators) is sane, though deriving from an operator is ok

jlowin (Member, Author) commented Mar 29, 2016

@mistercrunch excellent point. I think we can take advantage of the fact that the objects imported into airflow.operators are not the same as the ones in airflow.operators.op_module. We could mark all of the top-level objects as deprecated using something like http://docs.zope.org/zope.deprecation/api.html#deprecating-objects-inside-a-module. That way people could continue to use them, but would receive a deprecation warning saying, for example, that they will be removed in 2.0.
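zope.deprecation can proxy module attributes directly; the same effect can be sketched with the stdlib warnings module (all names hypothetical, not the actual Airflow implementation):

```python
import warnings

class PigOperator:  # stand-in for airflow.operators.pig_operator.PigOperator
    pass

def deprecated_alias(cls, msg):
    # Return a subclass that emits a DeprecationWarning when instantiated;
    # zope.deprecation achieves a similar effect via module-level proxies.
    class Deprecated(cls):
        def __init__(self, *args, **kwargs):
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            super().__init__(*args, **kwargs)
    Deprecated.__name__ = cls.__name__
    return Deprecated

# What operators/__init__.py might expose at the top level:
TopLevelPigOperator = deprecated_alias(
    PigOperator,
    "Importing PigOperator from 'airflow.operators' is deprecated; "
    "import from 'airflow.operators.pig_operator' instead.",
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    TopLevelPigOperator()
print(len(caught), caught[0].category is DeprecationWarning)  # 1 True
```

Existing pipelines keep working, but every use of the old location produces a warning pointing at the new submodule path.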

mistercrunch (Member)

Sounds perfect!

@jlowin jlowin self-assigned this Mar 29, 2016
@jlowin jlowin added this to the Airflow 2.0 milestone Mar 29, 2016
jlowin (Member, Author) commented Mar 29, 2016

It works! see: jlowin@973295c

In [1]: import airflow
[2016-03-29 11:49:06,956] {__init__.py:36} INFO - Using executor SequentialExecutor

In [2]: from airflow.operators import PigOperator
/Users/jlowin/anaconda3/bin/ipython:1: DeprecationWarning: PigOperator: Importing 
PigOperator directly from 'airflow.operators' has been deprecated. Please import 
from 'airflow.operators.[operator_module]' instead. Support will be dropped entirely 
in Airflow 2.0.
  #!/bin/bash /Users/jlowin/anaconda3/bin/python.app

In [3]: PigOperator
Out[3]: pig_operator.PigOperator

In [4]: from airflow.operators.pig_operator import PigOperator

In [5]: PigOperator
Out[5]: airflow.operators.pig_operator.PigOperator

artwr (Contributor) commented Mar 29, 2016

Yep, I currently do this for apply_defaults in airflow/utils/__init__.py.
