AIRFLOW-3791: Dataflow - Support check status if pipeline spans on multiple jobs #4633
Support checking whether a job is already running before starting a Java job. When Dataflow creates more than one job, all of the jobs need to be tracked for status.
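To make the "already running" check concrete, here is a minimal sketch, not the code from this PR. It assumes job listings are plain dictionaries with the `name` and `currentState` fields that the Dataflow REST API returns; `ACTIVE_STATES`, `is_job_running`, and `start_if_not_running` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

# Job states treated as "still active" in this sketch (an assumption, not the PR's list).
ACTIVE_STATES = {"JOB_STATE_RUNNING", "JOB_STATE_PENDING"}


def is_job_running(jobs: List[Dict[str, str]], name: str) -> bool:
    """Return True if any job whose name starts with `name` is still active."""
    return any(
        job["name"].startswith(name) and job.get("currentState") in ACTIVE_STATES
        for job in jobs
    )


def start_if_not_running(
    jobs: List[Dict[str, str]], name: str, start_fn: Callable[[], None]
) -> bool:
    """Call start_fn() only when no active job with that name exists."""
    if is_job_running(jobs, name):
        return False  # skip the launch, a matching job is already active
    start_fn()
    return True
```

Matching on a name prefix is a design choice in this sketch: it assumes the launcher appends a unique suffix to the job name.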
Codecov Report
@@ Coverage Diff @@
## master #4633 +/- ##
=========================================
- Coverage 79.05% 79% -0.05%
=========================================
Files 489 489
Lines 30685 30728 +43
=========================================
+ Hits 24257 24278 +21
- Misses 6428 6450 +22
Why is this not being merged to master?
PTAL @kaxil
I will have a look at this over the weekend, @chaimt. Can you update the PR title and commit message with specific details, following the commit guidelines?
cc @fenglu-g What do you think?
What is the status of the PR?
This PR needs to be rebased against master to resolve the conflicts.
Please give this PR a useful title.
And the same with your commit message.
Co-Authored-By: Fokko Driesprong <fokko@driesprong.frl>
Fixed all issues.
…o AIRFLOW-3791_Dataflow
Still some merge conflicts 😭
…o AIRFLOW-3791_Dataflow
I have hit a problem with the documentation. Any help?
Fix is available:
I did not check the correctness of the implementation. I just fixed the tests.
…o AIRFLOW-3791_Dataflow
Finally. What now?
@Fokko - can we merge this?
@chaimt A few more remarks, sorry for the late response.
…o AIRFLOW-3791_Dataflow change default for check if running
…o AIRFLOW-3791_Dataflow merge redundant code of _get_job_id_from_name
@Fokko anything else?
@RosterIn - can this be merged to master?
LGTM. @ashb do you have pending comments?
None anymore. I've let all context of this PR go out of my head.
Thanks @chaimt
…ltiple jobs (apache#4633)
* AIRFLOW-3791: Dataflow Support to check if job is already running before starting java job In case dataflow creates more than one job, we need to track all jobs for status
* Update airflow/contrib/hooks/gcp_dataflow_hook.py Co-Authored-By: Fokko Driesprong <fokko@driesprong.frl>
* Update gcp_dataflow_hook.py
* Update dataflow_operator.py
* Merge branch 'AIRFLOW-3791_Dataflow' of github.com:chaimt/airflow into AIRFLOW-3791_Dataflow
* Merge branch 'AIRFLOW-3791_Dataflow' of github.com:chaimt/airflow into AIRFLOW-3791_Dataflow change default for check if running
* Merge branch 'AIRFLOW-3791_Dataflow' of github.com:chaimt/airflow into AIRFLOW-3791_Dataflow merge redundant code of _get_job_id_from_name
:rtype: list
"""
if not self._multiple_jobs and self._job_id:
    return self._dataflow.projects().locations().jobs().get(
Here is the problem. In this branch the method returns a single dictionary, but callers expect a list of dictionaries. This makes it impossible to determine the job ID:
[2019-09-06 03:20:51,974] {taskinstance.py:1042} ERROR - string indices must be integers
Traceback (most recent call last):
  File "/opt/airflow/airflow/models/taskinstance.py", line 917, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/opt/airflow/airflow/gcp/operators/dataflow.py", line 216, in execute
    self.jar, self.job_class, True, self.multiple_jobs)
  File "/opt/airflow/airflow/gcp/hooks/dataflow.py", line 372, in start_java_dataflow
    self._start_dataflow(variables, name, command_prefix, label_formatter, multiple_jobs)
  File "/opt/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 307, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/airflow/airflow/gcp/hooks/dataflow.py", line 327, in _start_dataflow
    variables['region'], self.poll_sleep, job_id, self.num_retries, multiple_jobs) \
  File "/opt/airflow/airflow/gcp/hooks/dataflow.py", line 76, in __init__
    self._jobs = self._get_jobs()
  File "/opt/airflow/airflow/gcp/hooks/dataflow.py", line 138, in _get_jobs
    self._job_id, job['name']
TypeError: string indices must be integers
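The traceback is consistent with a loop iterating over a dictionary: iterating a `dict` yields its string keys, so `job['name']` ends up indexing a string. One possible way to guard against this is to normalize the response before iterating. The sketch below is an illustration only, not the fix that was applied upstream; `as_job_list` and `find_job_id` are hypothetical helpers, and the `id` and `name` keys are assumed from the Dataflow job resource.

```python
from typing import Any, Dict, List, Optional, Union

JobResponse = Union[Dict[str, Any], List[Dict[str, Any]]]


def as_job_list(response: JobResponse) -> List[Dict[str, Any]]:
    """Wrap a single job dictionary in a list; copy lists through unchanged."""
    if isinstance(response, dict):
        return [response]
    return list(response)


def find_job_id(response: JobResponse, name: str) -> Optional[str]:
    """Return the id of the first job whose name starts with `name`, else None."""
    for job in as_job_list(response):
        if job["name"].startswith(name):
            return job["id"]
    return None
```

With this normalization, callers can loop over the result the same way regardless of whether one job or several were fetched.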
Pipelines usually spawn only one job on Dataflow, but they can also spawn multiple jobs.
Support checking whether a job is already running before starting a Java job.
When Dataflow creates more than one job, all of the jobs need to be tracked for status.
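As a rough illustration of tracking several jobs at once, the sketch below treats a pipeline as finished only when every job has reached `JOB_STATE_DONE`, and fails fast if any job has failed or been cancelled. It assumes each job is a dictionary with `name` and `currentState` keys; `FAILED_STATES` and `all_jobs_done` are hypothetical names, and the hook in this PR may aggregate states differently.

```python
from typing import Dict, List

# States treated as terminal failures in this sketch (an assumption).
FAILED_STATES = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}


def all_jobs_done(jobs: List[Dict[str, str]]) -> bool:
    """Return True only when there is at least one job and all are done.

    Raises RuntimeError as soon as any job is in a failed state.
    """
    for job in jobs:
        state = job.get("currentState")
        if state in FAILED_STATES:
            raise RuntimeError(
                "Dataflow job {} ended in state {}".format(job.get("name"), state)
            )
    # An empty list is not treated as "done": nothing has been observed yet.
    return bool(jobs) and all(
        job.get("currentState") == "JOB_STATE_DONE" for job in jobs
    )
```

A polling loop would call this repeatedly with a fresh job listing and sleep between calls until it returns `True` or raises.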
Make sure you have checked all steps below.
Jira
Tests
Commits
Documentation
Code Quality
flake8