
Support on packaged dags #8716

Closed
mfumagalli68 opened this issue May 5, 2020 · 3 comments

@mfumagalli68

I'm trying to use Apache Airflow with packaged DAGs.

I've written my code as a python package and my code depends on other libraries such as numpy, scipy etc.

This is setup.py of my custom python package:

    from setuptools import setup, find_packages
    from pathlib import Path
    from typing import List
    
    import distutils.text_file
    
    def parse_requirements(filename: str) -> List[str]:
        """Return requirements from requirements file."""
        # Ref: https://stackoverflow.com/a/42033122/
        return distutils.text_file.TextFile(filename=str(Path(__file__).with_name(filename))).readlines()
    
    
    setup(name='classify_business',
          version='0.1',
          python_requires=">=3.6",
          description='desc',
          url='https://urlgitlab/datascience/classifybusiness',
          author='Marco fumagalli',
          author_email='marco.fumagalli@mycompany.com',
          packages = find_packages(),
          license='MIT',
          install_requires=parse_requirements('requirements.txt'),
          zip_safe=False,
          include_package_data=True)
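
As an aside, `distutils` (and with it `distutils.text_file`) is deprecated in newer Pythons, so a stdlib-only replacement may be worth having. A minimal sketch that behaves similarly for simple requirements files (note it does not replicate distutils' line-joining of continuation lines):

```python
from pathlib import Path
from typing import List

def parse_requirements(filename: str) -> List[str]:
    """Return requirement lines, skipping blank lines and comments."""
    lines = Path(filename).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
```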

requirements.txt contains the packages (vertica_python, pandas, numpy, etc.) along with the versions needed for my code.

I wrote a little shell script based on the one provided in the docs for creating packaged DAGs:

    set -eu -o pipefail

    if [ $# -eq 0 ]; then
        echo "First param should be /srv/user_name/virtualenvs/name_virtual_env"
        echo "Second param should be name of temp directory"
        echo "Third param should be git url"
        echo "Fourth param should be dag zip name, i.e. dag_zip.zip, to be copied into AIRFLOW__CORE__DAGS_FOLDER"
        echo "Fifth param should be package name, i.e. classify_business"
        exit 1
    fi

    venv_path=${1}
    dir_tmp=${2}
    git_url=${3}
    dag_zip=${4}
    pkg_name=${5}

    python3 -m venv "$venv_path"
    source "$venv_path/bin/activate"
    mkdir "$dir_tmp"
    cd "$dir_tmp"

    python3 -m pip install --prefix="$PWD" "git+$git_url"

    zip -r "$dag_zip" ./*
    cp "$dag_zip" "$AIRFLOW__CORE__DAGS_FOLDER"

    cd ..
    rm -r "$dir_tmp"

The script installs my package along with its dependencies directly from GitLab, zips everything, and then moves the archive to the dags folder.

This is the content of dir_tmp before zipping:

    bin  
    lib  
    lib64  
    predict_dag.py  
    train_dag.py

Airflow doesn't seem to be able to import packages installed in lib or lib64.
I'm getting this error:

ModuleNotFoundError: No module named 'vertica_python'
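
As far as I can tell, this happens because only the zip's *root* lands on sys.path, while `pip install --prefix` nests packages under lib/pythonX.Y/site-packages inside the archive. zipimport does accept a subdirectory inside the archive as a path entry, though. A sketch demonstrating both (the module name `nested_mod` and the `python3.8` layout are made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "bundle.zip")
with zipfile.ZipFile(zpath, "w") as zf:
    # Lay the module out the way `pip install --prefix` does.
    zf.writestr("lib/python3.8/site-packages/nested_mod.py", "x = 1\n")

sys.path.insert(0, zpath)
try:
    import nested_mod
    found_at_root = True
except ModuleNotFoundError:
    found_at_root = False  # only the zip root is searched

# zipimport also accepts a subdirectory *inside* the archive as a path entry:
sys.path.insert(0, os.path.join(zpath, "lib", "python3.8", "site-packages"))
import nested_mod

print(found_at_root, nested_mod.x)
```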

I even tried to move my custom package outside of lib:

    bin
    my_custom_package
    lib  
    lib64  
    predict_dag.py  
    train_dag.py

But I'm still getting the same error.

Part of the problem, I think, lies in how to use pip to install a package into a specific location.
The Airflow example uses --install-option="--install-lib=/path/", but that's unsupported:

    Location-changing options found in --install-option: ['--install-lib']
    from command line. This configuration may cause unexpected behavior
    and is unsupported. pip 20.2 will remove support for this
    functionality. A possible replacement is using pip-level options like
    --user, --prefix, --root, and --target. You can find discussion
    regarding this at pypa/pip#7309.

Using --prefix leads to a structure like the one above, with the module-not-found error.

Using --target installs every package directly into the specified directory.
In this case I get a pandas-related error:

C extension: No module named 'pandas._libs.tslibs.conversion' not built

I guess that it's related to dynamic libraries that should be available at a system level?
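
That guess seems consistent with how zipimport works: it only looks for .py/.pyc files inside an archive, so a compiled extension (.so) shipped in the zip is simply invisible to the import system. A small sketch with a dummy file named fake_ext.so (the name and contents are made up):

```python
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "ext.zip")
with zipfile.ZipFile(zpath, "w") as zf:
    # Stand-in for a compiled extension module shipped inside the archive.
    zf.writestr("fake_ext.so", b"\x7fELF not really a shared library")

sys.path.insert(0, zpath)
try:
    import fake_ext
    loaded = True
except ModuleNotFoundError:
    loaded = False  # zipimport only considers .py/.pyc, never .so
print(loaded)
```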

I really don't know how to do that.

Thanks

@boring-cyborg

boring-cyborg bot commented May 5, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@mik-laj
Member

mik-laj commented May 5, 2020

Packaged DAGs cannot contain dynamic libraries (e.g. libz.so); these need to be available on the system if a module needs them. In other words, only pure-Python modules can be packaged.

https://airflow.readthedocs.io/en/latest/concepts.html?highlight=packaged%20dags#packaged-dags

Packaged DAGs are only a partial solution to the dependency problem, because they can only load pure-Python libraries. Complex math libraries with compiled extensions are not supported.
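
Conversely, a pure-Python module at the archive root imports fine, which is the case packaged DAGs support. A quick sketch with a made-up module named `greet`:

```python
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zpath = os.path.join(tmp, "dags.zip")
with zipfile.ZipFile(zpath, "w") as zf:
    # A pure-Python module placed at the root of the archive.
    zf.writestr("greet.py", "def hello():\n    return 'hello'\n")

sys.path.insert(0, zpath)
import greet
print(greet.hello())
```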

@mfumagalli68
Author

mfumagalli68 commented May 5, 2020

Ok.
But what should I do about --install-option="--install-lib=/path/" ?

And what's the best practice for dealing with dependency problems?

What I've seen so far:

  • PythonVirtualenvOperator. Problem: it creates and destroys the virtualenv on the fly, which isn't feasible for a project with lots of dependencies.
  • BashOperator: write a shell script that executes a Python script with the virtualenv's interpreter. Seems like a workaround.
  • DockerOperator: is this the only option we have?
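
For the second option, the operator would essentially shell out to the virtualenv's interpreter. A minimal sketch of that pattern (the helper name `run_in_venv` is hypothetical, and `sys.executable` stands in for something like /srv/user_name/virtualenvs/name_virtual_env/bin/python):

```python
import subprocess
import sys

def run_in_venv(interpreter: str, code: str) -> str:
    """Run `code` with the given interpreter and return its stdout."""
    result = subprocess.run([interpreter, "-c", code],
                            capture_output=True, text=True, check=True)
    return result.stdout

# In the real pattern, `interpreter` would point at the venv's python,
# which sees all the heavy dependencies installed there.
print(run_in_venv(sys.executable, "print('ok')"))
```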

@potiuk potiuk closed this as completed Jul 3, 2020