GCSToBigQueryOperator - Not generating the Unique BQ Job Name #11660

Closed · BhuviTheDataGuy opened this issue Oct 19, 2020 · 13 comments
Labels: kind:bug, provider:google

BhuviTheDataGuy commented Oct 19, 2020

I was using GoogleCloudStorageToBigQueryOperator and then switched to GCSToBigQueryOperator. When I run parallel data exports from GCS to BQ (I'm generating dynamic tasks via a for loop), it uses test-composer:us-west2.airflow_1603109319 (I think it's the node name + current timestamp) as the BQ job ID for all the tasks.

Error

ERROR - 409 POST https://bigquery.googleapis.com/bigquery/v2/projects/centili-prod/jobs: Already Exists: Job test-composer:us-west2.airflow_1603109319
Traceback (most recent call last)

This prevents the 2nd table from being imported; it has to wait a minute (for the retry in the DAG) before it goes through.

But the older operator generates a proper job ID like job_someUUID.

3 parallel table imports (a sketch of the failure mode follows this list):

  • GCSToBigQueryOperator - BQ job ID for all the load jobs: test-composer:us-west2.airflow_1603109319
  • GoogleCloudStorageToBigQueryOperator - BQ job IDs: table1 (job_NYEBXXXXXvoflDiEj2j), table2 (job_9xGl7WlVXXXXXWBriaqbhLQY), table3 (job_aqmVLXXXXXL2YqVCGAqb_5EtW)
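
A minimal sketch of the suspected failure mode (hypothetical; the actual provider code may differ), assuming the job ID is derived from the current epoch second alone:

# Hypothetical sketch of the bug: a job ID built only from the current epoch
# second is identical for every task launched within that second, so BigQuery
# rejects all but the first with "409 Already Exists".
import time

def timestamp_job_id() -> str:
    return "airflow_{}".format(int(time.time()))  # e.g. "airflow_1603109319"

ids = {timestamp_job_id() for _ in range(3)}  # three "parallel" tasks
print(ids)  # typically a single element -- all three tasks share one job ID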
BhuviTheDataGuy added the kind:bug label Oct 19, 2020
muscovitebob commented

I just started using this operator via the backports package a few days ago, and I hit this at least once in every ten invocations, making it unusable without manual supervision. I do not use a dynamic DAG, but I do have a few GCSToBigQueryOperator tasks.

[2020-10-22 08:40:58,732] {base_task_runner.py:113} INFO - Job 59262: Subtask TASKNAME [2020-10-22 08:40:58,730] {taskinstance.py:1135} ERROR - 409 POST https://bigquery.googleapis.com/bigquery/v2/projects/PROJECTNAME/jobs: Already Exists: Job PROJECTNAME:EU.airflow_1603356058@-@{"workflow": "DAGNAME", "task-id": "TASKNAME", "execution-date": "2020-10-22T08:29:42.258293+00:00"}
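
For reference, a minimal DAG sketch of the kind of setup that triggers this (bucket, dataset, and table names below are placeholders, not my real config):

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.utils.dates import days_ago

with DAG("gcs_to_bq_parallel", schedule_interval=None, start_date=days_ago(1)) as dag:
    # A few independent load tasks; any two starting within the same second
    # can collide on the timestamp-based BigQuery job ID.
    for table in ["table1", "table2", "table3"]:
        GCSToBigQueryOperator(
            task_id="load_{}".format(table),
            bucket="my-bucket",  # placeholder
            source_objects=["exports/{}/*.csv".format(table)],
            destination_project_dataset_table="my_project.my_dataset.{}".format(table),
            write_disposition="WRITE_TRUNCATE",
        )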

BhuviTheDataGuy (Author) commented

Yeah, maybe a bug. Try the previous version (GoogleCloudStorageToBigQueryOperator); it works.

muscovitebob commented

This seems to be the method that incorrectly generates the job IDs, but the format does not exactly match what I see in the logs.
Meanwhile I will indeed switch to the older operator, since you report it always generates job IDs correctly. Thanks!

eladkal added the provider:google label Oct 26, 2020
QuinRiva commented

Are you sure you're running the current backport version? I was getting this issue in 2020.6.24, but it looks like it was solved in 2020.10.5. Unfortunately, 2020.10.5 isn't compatible with the new bigquery/pubsub libraries (2.0.0), so I can't test.

jensenity commented

I'm using apache-airflow-backport-providers-google==2020.10.5 and seem to get a similar error as well.

[2020-11-08 23:31:58,200] {taskinstance.py:1150} ERROR - 409 POST https://bigquery.googleapis.com/bigquery/v2/projects/banksalad/jobs: Already Exists: Job test:US.airflow_1604878317

BhuviTheDataGuy (Author) commented

Even the BigQuery hook has the same issue. I'm not a coder :) so I'm not able to find the exact cause.

When we use contrib.bigquery there is no issue; the new providers BigQuery code has this problem across everything built on it (the hook, the BigQuery operator, GCS to BQ, the empty-table creator, BQ to BQ, etc.).

jensenity commented

Apparently they fixed it in the 2020.10.29 backport release.

jensenity commented

47b05a8

BhuviTheDataGuy (Author) commented

Awesome!!! So I'm closing this now.

#11282
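
For anyone curious, the general shape of such a fix (a sketch only; the exact code lives in the commit linked above) is to make the generated job ID collision-resistant, e.g. with a random suffix, which is consistent with the job_<random> IDs the older operator produced:

# Hypothetical helper: a uuid4 suffix keeps the ID unique even when several
# tasks start within the same second.
import time
import uuid

def unique_job_id(prefix: str = "airflow") -> str:
    return "{}_{}_{}".format(prefix, int(time.time()), uuid.uuid4().hex)

print(unique_job_id())  # e.g. "airflow_1604878317_9f2c0a..."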

muscovitebob commented

For Cloud Composer users like me: trying to use 2020.10.29 with the composer-1.12.5-airflow-1.10.10 image errors out during the Cloud Build step with a very cryptic:

The command '/bin/sh -c bash installer.sh $COMPOSER_PYTHON_VERSION  fail' returned a non-zero code: 1
ERROR
ERROR: build step 0 "gcr.io/cloud-builders/docker" failed: step exited with non-zero status: 1

A warning from earlier in the log may hint at why this happens:

+ python3 -m pipdeptree --warn fail
Warning!!! Possibly conflicting dependencies found:
* google-cloud-memcache==0.2.0
 - google-api-core [required: >=1.17.0,<2.0.0dev, installed: 1.16.0]

So for now at least, Cloud Composer users are stuck either using the old non-buggy operator or waiting for Google to patch this.
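
As a sanity check before deploying, the same conflict can be reproduced locally in the target environment (a hypothetical helper built on setuptools' pkg_resources):

# Check whether an installed requirement is satisfied in this environment;
# reproduces the pipdeptree-style warning without a Cloud Build round trip.
import pkg_resources

def check(requirement: str) -> None:
    try:
        pkg_resources.require(requirement)
        print("OK: {}".format(requirement))
    except pkg_resources.VersionConflict as exc:
        print("CONFLICT: {}".format(exc))
    except pkg_resources.DistributionNotFound as exc:
        print("MISSING: {}".format(exc))

# google-cloud-memcache 0.2.0 wants google-api-core >=1.17.0,<2.0.0dev,
# but the composer-1.12.5 image ships 1.16.0.
check("google-api-core>=1.17.0,<2.0.0dev")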

mik-laj (Member) commented Nov 10, 2020

Can you try the latest Composer image, composer-1.13.0-airflow-1.10.12, with the latest backport provider, apache-airflow-backport-providers-google==2020.10.29?

potiuk (Member) commented Nov 10, 2020

@muscovitebob I believe it is indeed a dependency problem, and it is very likely addressed in the latest version of the Composer image. From what we know, the Composer team keeps the images updated with releases of Apache Airflow and the providers, and the next image will even include the latest Google providers baked in; still, you should try to install the latest provider there. Note that there is a new version of the Google backport provider out as a release candidate (voting on it finishes on Thursday), so you might even try to install that version instead:

https://pypi.org/project/apache-airflow-backport-providers-google/2020.11.13rc1/

It has even more fixes:

Commit Committed Subject
b2a28d1 2020-11-09 Moves provider packages scripts to dev (#12082)
fcb6b00 2020-11-08 Add authentication to AWS with Google credentials (#12079)
2ef3b7e 2020-11-08 Fix ERROR - Object of type 'bytes' is not JSON serializable when using store_to_xcom_key parameter (#12172)
0caec9f 2020-11-06 Dataflow - add waiting for successful job cancel (#11501)
cf9437d 2020-11-06 Simplify string expressions (#12123)
91a64db 2020-11-04 Format all files (without excepions) by black (#12091)
fd3db77 2020-11-04 Add server side cursor support for postgres to GCS operator (#11793)
f1f1940 2020-11-04 Add DataflowStartSQLQuery operator (#8553)
41bf172 2020-11-04 Simplify string expressions (#12093)
5f5244b 2020-11-04 Add template fields renderers to Biguery and Dataproc operators (#12067)
4e8f9cc 2020-11-03 Enable Black - Python Auto Formmatter (#9550)
8c42cf1 2020-11-03 Use PyUpgrade to use Python 3.6 features (#11447)
45ae145 2020-11-03 Log BigQuery job id in insert method of BigQueryHook (#12056)
e324b37 2020-11-03 Add job name and progress logs to Cloud Storage Transfer Hook (#12014)
6071fdd 2020-11-02 Improve handling server errors in DataprocSubmitJobOperator (#11947)
2f703df 2020-10-30 Add SalesforceToGcsOperator (#10760)
e5713e0 2020-10-29 Add drain option when canceling Dataflow pipelines (#11374)
37eaac3 2020-10-29 The PRs which are not approved run subset of tests (#11828)
79cb771 2020-10-28 Fixing re pattern and changing to use a single character class. (#11857)
5a439e8 2020-10-26 Prepare providers release 0.0.2a1 (#11855)
240c7d4 2020-10-26 Google Memcached hooks - improve protobuf messages handling (#11743)
8afdb6a 2020-10-26 Fix spellings (#11825)
872b156 2020-10-25 Generated backport providers readmes/setup for 2020.10.29 (#11826)
b680bbc0b 2020-10-24 Generated backport providers readmes/setup for 2020.10.29

muscovitebob commented

> Can you try the latest Composer image, composer-1.13.0-airflow-1.10.12, with the latest backport provider, apache-airflow-backport-providers-google==2020.10.29?

Upgrading with these went well, thanks much! I did not realise there was a new release.
