added MultipleFilesWebHdfsSensor #43045

Merged
merged 5 commits into from
Oct 17, 2024

eilon246810
Contributor

@eilon246810 eilon246810 commented Oct 15, 2024

I added a MultipleFilesWebHdfsSensor class to providers.apache.hdfs.sensors.web_hdfs.

The existing WebHdfsSensor can only check whether a single file exists, so checking many files requires many tasks (in my org we had 350+ sensors for a single DAG).

The new MultipleFilesWebHdfsSensor lists a whole directory and succeeds only when all the expected files have landed in HDFS.
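
The check described above amounts to a set difference between the expected file names and whatever the directory listing returns. A minimal sketch, assuming the listing is already available as a list of names; the function names and the sample file names are hypothetical and this is not the sensor's actual code:

```python
from typing import Iterable, Set


def find_missing_files(expected: Iterable[str], listed: Iterable[str]) -> Set[str]:
    """Return the expected file names that are absent from the directory listing."""
    return set(expected) - set(listed)


def all_files_landed(expected: Iterable[str], listed: Iterable[str]) -> bool:
    """True only when every expected file appears in the listing."""
    return not find_missing_files(expected, listed)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    expected = ["part-0000", "part-0001", "part-0002"]
    listed = ["part-0000", "part-0002", "_SUCCESS"]
    print(find_missing_files(expected, listed))
    print(all_files_landed(expected, listed))
```

Under this reading, a single task can stand in for many single-file sensors: each poke lists the directory once and reports success only when the missing set is empty.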

This is my first contribution so I would greatly appreciate any guidance :)


boring-cyborg bot commented Oct 15, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

Contributor

@shahar1 shahar1 left a comment

Small comment, otherwise LGTM

@romsharon98
Contributor

Great first PR!
I added a small comment about the testing, and I noticed that your static checks are failing.
You can read about configuring pre-commit here.

@romsharon98 romsharon98 merged commit 3b9e156 into apache:main Oct 17, 2024
6 checks passed

boring-cyborg bot commented Oct 17, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@kaxil
Member

kaxil commented Oct 17, 2024

This PR has caused failures in both the static checks and the tests. I am fixing them in #43122.

=========================== short test summary info ============================
FAILED providers/tests/apache/hdfs/sensors/test_web_hdfs.py::TestMultipleFilesWebHdfsSensor::test_poke - AssertionError: assert 'Files Found in directory: ' in ''
 +  where '' = <_pytest.logging.LogCaptureFixture object at 0x7f40eba6dd30>.text
FAILED providers/tests/apache/hdfs/sensors/test_web_hdfs.py::TestMultipleFilesWebHdfsSensor::test_poke_should_return_false_for_missing_file - assert 'Files Found in directory: ' in "INFO     airflow.task.operators.airflow.providers.apache.hdfs.sensors.web_hdfs.MultipleFilesWebHdfsSensor:web_hdfs.py:78 There are missing files: {'static_babynames2', 'static_babynames3', 'static_babynames1'}\n"
 +  where "INFO     airflow.task.operators.airflow.providers.apache.hdfs.sensors.web_hdfs.MultipleFilesWebHdfsSensor:web_hdfs.py:78 There are missing files: {'static_babynames2', 'static_babynames3', 'static_babynames1'}\n" = <_pytest.logging.LogCaptureFixture object at 0x7f40ebf5ca00>.text
=================== 2 failed, 17 passed, 1 warning in 12.66s ===================

@potiuk
Member

potiuk commented Oct 17, 2024

This was an interesting one: it seems that for some reason the CI workflow DID NOT run at all; only the build image workflow did. That's why it was "green".

@potiuk
Member

potiuk commented Oct 17, 2024

This is a long-known issue with GitHub that I raised with them 3 years ago. Unfortunately, there is a race condition that makes the PR "green" if the workflow has not started at all, or when it is just starting. Very poor design IMHO for GitHub Actions @kaxil @romsharon98

The only way I found to prevent such accidental merges of "green-but-incomplete" PRs is to look at the number of checks that "passed". When there are < 10, something is WRONG. But it's not really obvious, and it has happened to me more than once that I merged such a PR.
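
The heuristic above can be sketched as a small guard. This is only an illustration: the function name, the input shape (a list of check conclusions as strings), and the default threshold of 10 taken from the comment are assumptions, not an existing Airflow or GitHub tool.

```python
from typing import Sequence


def looks_green_but_incomplete(conclusions: Sequence[str], min_checks: int = 10) -> bool:
    """Flag a PR whose checks all passed but whose check count is suspiciously low.

    `conclusions` is a hypothetical list of per-check results such as "success"
    or "failure"; an empty list (no workflow started yet) is also flagged.
    """
    all_passed = all(c == "success" for c in conclusions)
    return all_passed and len(conclusions) < min_checks


if __name__ == "__main__":
    # This PR was merged with "6 checks passed", so the guard would flag it.
    print(looks_green_but_incomplete(["success"] * 6))
```

A failing check is not flagged by this guard because the PR is not "green" in the first place; the guard only targets the case where everything that ran passed but too little ran.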

R7L208 pushed a commit to R7L208/airflow that referenced this pull request Oct 17, 2024
harjeevanmaan pushed a commit to harjeevanmaan/airflow that referenced this pull request Oct 23, 2024
PaulKobow7536 pushed a commit to PaulKobow7536/airflow that referenced this pull request Oct 24, 2024
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
5 participants