Dataflow operator checks wrong project_id #15483
Labels
good first issue
kind:bug
provider:google
Apache Airflow version:
composer-1.16.1-airflow-1.10.15
Environment:
What happened:
First, a bit of context. We have a single instance of airflow within its own GCP project, which runs Dataflow jobs on different GCP projects.
Let's call the project that runs airflow project A, and the project where the Dataflow jobs run project D.
We recently upgraded from 1.10.14 to 1.10.15 (composer-1.14.2-airflow-1.10.14 to composer-1.16.1-airflow-1.10.15), and noticed that jobs were running successfully in the Dataflow console, but an error was thrown when airflow made the wait_for_done call to check whether a dataflow job had ended. The error reported a 403 status code from the Dataflow API when retrieving the job state.
What you expected to happen:
I noticed that the 403 was thrown when looking up the job state within project A, while I expect this lookup to happen within project D (and consequently NOT to fail, since the associated service account has the correct permissions there; it did manage to launch the job). I investigated a bit and noticed that this looks like a regression introduced when upgrading to composer-1.16.1-airflow-1.10.15. This version uses an image which automatically installs apache-airflow-backport-providers-apache-beam==2021.3.13, which backports the dataflow operator from v2. The previous version we were using installed apache-airflow-backport-providers-google==2020.11.23.
I checked the commits and changes, and noticed that this operator was last modified in 1872d87. Relevant lines from that commit are the following:
airflow/airflow/providers/google/cloud/operators/dataflow.py, lines 1147 to 1162 in 1872d87
while these are from the previous version:
airflow/airflow/providers/google/cloud/operators/dataflow.py, lines 965 to 976 in 70bf307
airflow/airflow/providers/google/cloud/hooks/dataflow.py, lines 613 to 644 in 70bf307
airflow/airflow/providers/google/cloud/hooks/dataflow.py, lines 965 to 972 in 70bf307
In the previous version, the job was started by calling start_python_dataflow, which in turn called the _start_dataflow method, which then created a local job_controller and used it to check whether the job had ended. Throughout this chain of calls, the project_id parameter was passed all the way from the initialization of the DataflowCreatePythonJobOperator down to the creation of the controller that checked for job completion.
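To make that concrete, here is a minimal, self-contained sketch of the old pattern; the names mirror the ones above, but this is only an illustration, not the actual provider code:

```python
# Minimal sketch (NOT the real provider code): in the old flow the project_id
# travels from the operator all the way down to the controller that polls
# for job completion.

class JobController:
    def __init__(self, project_id: str, job_name: str) -> None:
        self.project_id = project_id      # explicit project, i.e. project D
        self.job_name = job_name

    def wait_for_done(self) -> None:
        print(f"polling job '{self.job_name}' in project '{self.project_id}'")


class DataflowHook:
    def start_python_dataflow(self, job_name: str, project_id: str) -> None:
        # project_id is forwarded, never dropped
        self._start_dataflow(job_name=job_name, project_id=project_id)

    def _start_dataflow(self, job_name: str, project_id: str) -> None:
        # a local controller is created with the SAME project_id
        JobController(project_id=project_id, job_name=job_name).wait_for_done()


class PythonJobOperator:
    def __init__(self, job_name: str, project_id: str) -> None:
        self.job_name = job_name
        self.project_id = project_id      # project D in our setup

    def execute(self) -> None:
        DataflowHook().start_python_dataflow(
            job_name=self.job_name, project_id=self.project_id
        )


PythonJobOperator(job_name="my-job", project_id="project-d").execute()
# -> polling job 'my-job' in project 'project-d'
```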
In the latest relevant commit, this behavior was changed. The operator still receives a project_id during initialization and creates the job using the start_python_pipeline method, which receives the project_id as part of the variables parameter. However, job completion is now checked by a separate dataflow_hook.wait_for_done call, and the DataflowHook used there is not given the project_id in that call. As a result, it looks like it is using the default GCP project ID (the one the composer is running inside) and not the one used to create the Dataflow job. This explains why we can see the job launching successfully while the operator fails.
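The same toy model, adjusted to the new shape of the code, shows where the project_id gets lost; again, this is only an illustration of the pattern, not the actual provider code:

```python
# Minimal sketch (NOT the real provider code): the job is created with
# project D (only via the pipeline variables), but the completion check
# falls back to the hook's default project, i.e. project A.
from typing import Optional

DEFAULT_PROJECT = "project-a"   # the project the composer / GCP connection lives in


class BeamHook:
    def start_python_pipeline(self, variables: dict) -> None:
        # the target project travels only inside the pipeline variables
        print(f"job created in project '{variables['project']}'")


class DataflowHook:
    def wait_for_done(self, job_name: str, project_id: Optional[str] = None) -> None:
        # without an explicit project_id, the hook's default project is used
        project_id = project_id or DEFAULT_PROJECT
        print(f"polling job '{job_name}' in project '{project_id}'")


class PythonJobOperator:
    def __init__(self, job_name: str, project_id: str) -> None:
        self.job_name = job_name
        self.project_id = project_id      # project D

    def execute(self) -> None:
        BeamHook().start_python_pipeline(variables={"project": self.project_id})
        # project_id is NOT forwarded here, so the lookup happens in project A
        DataflowHook().wait_for_done(job_name=self.job_name)


PythonJobOperator(job_name="my-job", project_id="project-d").execute()
# -> job created in project 'project-d'
# -> polling job 'my-job' in project 'project-a'   (403 for our service account)
```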
I think that specifying the project_id as a parameter in the wait_for_done call may solve the issue.
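In terms of the toy sketch above, the change would amount to forwarding the operator's project_id in the completion check (illustration only; the real change would be in DataflowCreatePythonJobOperator):

```python
    def execute(self) -> None:
        BeamHook().start_python_pipeline(variables={"project": self.project_id})
        # forward the same project_id that was used to create the job
        DataflowHook().wait_for_done(job_name=self.job_name, project_id=self.project_id)
        # -> polling job 'my-job' in project 'project-d'
```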
How to reproduce it:
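Roughly, trigger a DAG like the following from a composer environment running in project A, pointing the operator at project D (the bucket, file, and region names below are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator

with DAG(
    dag_id="dataflow_cross_project_repro",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    DataflowCreatePythonJobOperator(
        task_id="run_dataflow_job",
        py_file="gs://some-bucket/pipelines/wordcount.py",
        job_name="cross-project-test",
        project_id="project-d",       # NOT the project the composer runs in
        location="europe-west1",
        options={"temp_location": "gs://some-bucket/tmp/"},
    )
```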
The Dataflow job will succeed (you can see no errors get thrown from the GCP console), but an error will be thrown in airflow logs.
Note: I am reporting a 403 because the service account associated with airflow does not have the correct permissions in project A. I suspect that, even with the correct permissions, you might get another error (maybe a 404, since there will be no job running with that ID within that project), but I have no way to test this at the moment.
Anything else we need to know:
This problem occurs every time I launch a Dataflow job on a project where the composer isn't running.