Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Airflow parser functionality #2418

Merged
merged 34 commits into from
Feb 7, 2022

Conversation

kiersten-stokes
Copy link
Member

@kiersten-stokes kiersten-stokes commented Jan 25, 2022

Fixes portions of #2414
Supports #2437 & #2438

Functionality Summary:

  • The new parser logic covers all cases that the previous parser logic covered
  • The new parser logic is much more robust against handling additional and edge cases, such as gathering data type information from type hints
  • The only 'regression' is that it does not gather any default values in the case of a dictionary or list-typed parameter (this should not be an issue because default values shouldn't be given in arguments for Python mutable types)

What changes were proposed in this pull request?

This PR refactors the AirflowComponentParser to use ast.parse to cover additional parsing cases cleanly. General flow is the same as before: the parser takes the contents of a file, parses it for all its classes, then parses each class for its properties (i.e. init arguments).

What is different is the method by which is does those things. Rather than using regexes, ast.parse is used to get the abstract syntax tree (AST) of the file. This provides the class definitions and their corresponding init function (including argument objects) and docstring for further parsing. Annoyingly, ast.parse produces slightly different data structures for different python versions (Python 3.7 and lower vs Python 3.8+) - the relevant cases are currently handled to the best of my knowledge. The class definitions are filtered to ensure that they extend (directly or indirectly) from the BaseOperator. For operators that extend from a class that is not defined in the same file, it is sufficient to check the import statement for operators defined in either airflow.providers.*.operators. or airflow.operators.* (see this link for details). As before, classes are looped through to provide a list of properties for each (represented by the init argument objects and any additional information gleaned from the operator class docstring parsed earlier). Note that regexes are still used to search the class docstring for 1. a description of the property, and 2. any type information available in the docstring (which is often more verbose and therefore more helpful than the type information parsed from AST).

This PR doesn't (yet) touch any of the processor classes.

Some discussion points:

  • Do we want to/how should we handle cases where arguments are of 'non-standard' data types, e.g. the PythonOperator specifies an argument of the type 'python callable' (which will default to string) This will be addressed in a follow-up PR as it falls under the umbrella of 'new functionality' and is already tracked in Address shortcomings in the Airflow parser and processor code #2414
  • Similar to the above, are there cases where an operator may not have an init function but we still want to include them as a Component? (again see some of the classes defined in python_operator.py) (different from the case where init does not have any arguments) I propose that we handle this issue in another PR; this is essentially the same case as determining how to support operators that derive from other operator classes (which is already tracked in Address shortcomings in the Airflow parser and processor code #2414)

Left to do:

  • Add log messages for parsing failure cases
  • Add a try/except around the ast.parse (and/or an even larger section)

How was this pull request tested?

  • Backend tests to be revised once functionality here is confirmed
  • Manual testing must occur to ensure that we are accurately covering as many cases as possible
    • This includes testing in Python 3.6, 3.7, 3.8, and 3.9 environments since ast.parse behavior can vary slightly between versions

One additional change made to the tests includes a new fixture that tears down the test component catalog instance created in certain tests (test_modify_component_catalogs, e.g.). In this way, if a test failure occurs before the test catalog can be removed, there are no issues in tests that occur later down the line.

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

@kiersten-stokes kiersten-stokes added the status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. label Jan 25, 2022
@kiersten-stokes kiersten-stokes added this to the 3.6.0 milestone Jan 25, 2022
@elyra-bot
Copy link

elyra-bot bot commented Jan 25, 2022

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@kiersten-stokes kiersten-stokes marked this pull request as draft January 25, 2022 17:05
@ptitzler
Copy link
Member

Do we want to/how should we handle cases where arguments are of 'non-standard' data types, e.g. the PythonOperator specifies an argument of the type 'python callable' (which will default to string)

A few thoughts about the PythonOperator:

  • In a limited way the existing support for running Python scripts approximates the PythonOperator functionality. One exception is the lack of support for xcom data exchange.
  • It would be possible to provide rudimentary support if the VPE accepts a string as text input that is treated as if it defines a valid Python function, which is then embedded in the DAG. Usability would be poor though because 1) the input would have to be copied from somewhere (like a Python editor) and 2) "business logic" would be embedded in the pipeline file, which makes it harder to re-use.
  • The PythonOperator is (roughly) equivalent to KFP Python-function based components, which Elyra currently does not support.

@kiersten-stokes kiersten-stokes marked this pull request as ready for review February 1, 2022 19:35
@akchinSTC akchinSTC removed the status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. label Feb 7, 2022
Copy link
Member

@ptitzler ptitzler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good. There's a general question/comment about how the "teardown" fixture is declared (and used), but everything else is fairly specific (and minor).

Note, I don't really have time to fully test this out but hoping others are able to spend time on this in that respect.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank Kiersten - these changes look good.

@akchinSTC akchinSTC merged commit ca04453 into elyra-ai:master Feb 7, 2022
@kiersten-stokes kiersten-stokes deleted the aa-parser-improvements branch April 1, 2022 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants