Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example DAG for demonstrating usage of GCS sensors #22808

Merged
merged 10 commits into from
Apr 8, 2022

Conversation

pankajkoti
Copy link
Member

Following GCS Sensor examples are provided as part of the change:

  1. GCSUploadSessionCompleteSensor
  2. GCSObjectUpdateSensor

Following GCS Sensors examples are provided as part of the change:
1. GCSUploadSessionCompleteSensor
2. GCSObjectUpdateSensor
@boring-cyborg
Copy link

boring-cyborg bot commented Apr 7, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

Copy link
Contributor

@phanikumv phanikumv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

tags=["example", "gcs"],
) as dag:

# [START howto_sensor_gcs_upload_session_complete_task]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this in https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/gcs.html ?

You can add it in https://github.com/apache/airflow/blob/main/docs/apache-airflow-providers-google/operators/cloud/gcs.rst which will show up in the page I linked above when new version of provider is released.

You can also build docs locally as described in https://github.com/apache/airflow/blob/main/BREEZE.rst#building-the-documentation

./breeze build-docs -- --package-filter apache-airflow-providers-google

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
@pankajkoti pankajkoti marked this pull request as draft April 7, 2022 12:09
pankajkoti and others added 5 commits April 7, 2022 17:39
The commit does the following:
1. Delete the newly created top level example_gcs.py as it was a
   wrong place for the sensors
2. Add the intended sensors of the PR to the existing example_gcs.py file
   located in airflow/cloud/example_dags directory
@pankajkoti pankajkoti marked this pull request as ready for review April 8, 2022 12:43
@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Apr 8, 2022
@github-actions
Copy link

github-actions bot commented Apr 8, 2022

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@kaxil
Copy link
Member

kaxil commented Apr 8, 2022

@kaxil
Copy link
Member

kaxil commented Apr 8, 2022

@pankajkoti Static check is failing: https://github.com/apache/airflow/runs/5885329724?check_suite_focus=true#step:11:296

I recommend installing and using pre-commit hook so this will be automatically checked when you run "git commit" -> this is explained in https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#installing-pre-commit-hooks

I ❤️ pre-commit hooks and I am sure you will love that framework too, it is very handy :)

@pankajkoti
Copy link
Member Author

pankajkoti commented Apr 8, 2022

@kaxil yes, I love it too, thank you. I missed installing it earlier. Installed now and pushed the fix: 1b94d57.

@kaxil kaxil merged commit 6aa65a3 into apache:main Apr 8, 2022
@kaxil kaxil deleted the examples-gcs-dag branch April 8, 2022 14:26
@boring-cyborg
Copy link

boring-cyborg bot commented Apr 8, 2022

Awesome work, congrats on your first merged pull request!

@kaxil
Copy link
Member

kaxil commented Apr 8, 2022

Well done. Congratulations on your first merged PR 👏

"GCP_GCS_PATH_TO_SAVED_FILE", os.path.join(temp_dir_path, "test-gcs-example-download.txt")
)

BUCKET_FILE_LOCATION = PATH_TO_UPLOAD_FILE.rpartition("/")[-1]

# Upload 'test-gcs-manual-example-upload.txt' manually in the <BUCKET_1> after triggering the DAG.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests shouldn't depend on manual steps - if the file is required we should:

  1. store it in resources
  2. upload using operator, for example LocalFilesystemToGCSOperator
  3. After the tests, remove any resources created during tests (usually it's enough to remove Bucket)

Copy link
Member Author

@pankajkoti pankajkoti Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bhirsz I wanted to do upload the file programatically like the other chain. However, the sensor task GCSUploadSessionCompleteSensor waits for change in number of files in the bucket. If we have the upload task before GCSUploadSessionCompleteSensor task, it won't detect any changes. On the other hand, if we add upload task after GCSUploadSessionCompleteSensor task, it would be blocked until GCSUploadSessionCompleteSensor task completes and does not solve the need. I am unsure how to add dependency between such tasks. Any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tricky in this case, indeed. How about starting sensors in parallel with upload_file?

Example:

    chain(
        # TEST SETUP
        create_bucket,
        upload_file,
        # TEST BODY
        [gcs_object_exists, gcs_object_with_prefix_exists],
        # TEST TEARDOWN
        delete_bucket,
    )
    chain(
        create_bucket,
        # TEST BODY
        [gcs_upload_session_complete, gcs_update_object_exists],
        delete_bucket
    )

image

We're starting the sensors and in meantime we're uploading the file - and sensors detect it:
image

image

I'm not sure though if it will not be flaky in some cases - running this in CI will show. I will update my PR with this change.

Copy link
Member Author

@pankajkoti pankajkoti Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Since, we are starting in parallel, it may happen that, the upload task is picked up before the sensor task begins and it may not detect the change as expected. Hence, added the comment to manually upload the file. :)

Also, the gcs_object_update_sensor_task needs to be activated after gcs_upload_session_complete_task (and not in parallel with it) as the object is expected to be detected by sensor and the object_update task is to confirm the manually upload has happened prior to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can alleviate it a bit with sleep (yes, the old the ugly sleep ;)). Run sensors separably and in the meantime trigger sleep (5s will suffice) and then upload task:

chain(
    # TEST SETUP
    create_bucket,
    sleep,
    upload_file,
    # TEST BODY
    [gcs_object_exists, gcs_object_with_prefix_exists],
    # TEST TEARDOWN
    delete_bucket,
)
chain(
    create_bucket,
    # TEST BODY
    gcs_upload_session_complete,
    gcs_update_object_exists,
    delete_bucket
)

And of course put some explanation in comments why we're doing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual steps are unfortunately a big no for us since the plan is to run system tests in CI in community. We can't expect people performing manual task while running them so I'm looking for an automatic solution

Copy link
Member Author

@pankajkoti pankajkoti Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I was not aware of the fact that we're migrating these example DAGs to tests. I was of the impression that the example DAGs are for references to our community on how to implement DAGs for certain operators. Understand and totally agree to your fair point that tests should not need manual interventions.
Okay, so we can try introducing sleep if we feel right about it. Also, I am new to contributing to Airflow :)

cc: @kaxil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we had two goals with the migration - to make writing system tests easier and actually ensure that are example dags are runnable (what would be point in example that doesn't work, and it's actually often the case ;)). In the old design system tests actually "wrapped" and run examples - but not all example dags were used by system tests. In the new design we're also using examples as system tests, but without separating it to different files and now example dag is the system test itself.

Any change is highly welcome and it's good to see PRs such like yours - congrats! We only missed the notification on PR (sadly Github doesn't allow for an advanced notification system) and we were not able to have this discussion before merging the PR.

@bhirsz
Copy link
Contributor

bhirsz commented Apr 12, 2022

Hi,
are you aware of AIP-47? We should now strive to use new design of the system tests. There is open PR migrating the same files you modified in this PR: #22778 . You can use it as a reference for the future. I will also migrate your changes - although I have some comments so I will also leave it in this PR (no action required, just general tips :) ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
okay to merge It's ok to merge this PR as it does not require more tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants