
[ADAP-569] [Enhancement] only wait for Dataproc job to finish, not for the Dataproc serverless cluster to spin down #734

Closed
2 tasks done
wazi55 opened this issue May 22, 2023 · 5 comments · Fixed by #929
Labels
enhancement New feature or request

Comments

@wazi55
Contributor

wazi55 commented May 22, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When working with dbt and Dataproc, the current behavior is that dbt waits for the batch job to complete the teardown of the cluster here, which adds an extra 1–2 minutes to the computation time. Instead of waiting on the full operation, the code should only wait until the job itself is finished, by polling with the get_batch() method to check whether the job has succeeded or not.

Expected Behavior

The expected behavior is that dbt should continue on with the SQL models as soon as the Spark code succeeds, instead of waiting for the cluster to be torn down.

Steps To Reproduce

    def _submit_dataproc_job(self) -> dataproc_v1.types.jobs.Job:
        batch = self._configure_batch()
        parent = f"projects/{self.credential.execution_project}/locations/{self.credential.dataproc_region}"

        request = dataproc_v1.CreateBatchRequest(
            parent=parent,
            batch=batch,
        )
        # make the request
        operation = self.job_client.create_batch(request=request)  # type: ignore
        # this takes quite a while, waiting on GCP response to resolve
        # (not a google-api-core issue, more likely a dataproc serverless issue)
        response = operation.result(retry=self.retry)
        return response

Relevant log output

No response

Environment

- OS:
- Python:
- dbt-core:
- dbt-bigquery:

Additional Context

No response

@wazi55 wazi55 added bug Something isn't working triage labels May 22, 2023
@github-actions github-actions bot changed the title [Bug] DBT Bigquery adaptor for spark connector waits for the cluster torn down instead of job finishes, adding a few extra minutes to the computation [ADAP-569] [Bug] DBT Bigquery adaptor for spark connector waits for the cluster torn down instead of job finishes, adding a few extra minutes to the computation May 22, 2023
@dbeatty10
Contributor

Thanks for reporting this @wazi55!

Are you interested in contributing a pull request for this, by any chance?

@dbeatty10 dbeatty10 added enhancement New feature or request awaiting_response and removed bug Something isn't working triage labels May 22, 2023
@dbeatty10 dbeatty10 changed the title [ADAP-569] [Bug] DBT Bigquery adaptor for spark connector waits for the cluster torn down instead of job finishes, adding a few extra minutes to the computation [ADAP-569] [Feature] DBT Bigquery adaptor for spark connector waits for the cluster torn down instead of job finishes, adding a few extra minutes to the computation May 22, 2023
@dbeatty10
Contributor

I re-labeled this as an enhancement since I don't perceive there to be an error, flaw, failure or fault here -- this is more of an efficiency / optimization thing.

@dataders
Contributor

dataders commented Jun 8, 2023

@wazi55 thanks so much for opening! This was on my to-do list to write up after an email thread last month with BigQuery engineers.

Their recommendation was to use BatchControllerClient's .get_batch() instead of create_batch() to create DataProc jobs (relevant Google Python SDK docs).

afaict, this is not a drop-in replacement as some polling would have to be implemented to ensure that the response's state attribute (docs) is one of: SUCCEEDED, CANCELLED, or FAILED.
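A minimal sketch of what that polling might look like. This is an illustration, not the eventual implementation: the wait_for_batch helper, the state names, and the interval/timeout defaults are all my assumptions. The get_batch callable stands in for BatchControllerClient.get_batch.

```python
import time

# Terminal states for a Dataproc serverless batch, per the Batch.State
# enum mentioned above. String names are used here so the sketch works
# with either enum members (via .name) or plain strings.
TERMINAL_STATES = {"SUCCEEDED", "CANCELLED", "FAILED"}


def wait_for_batch(get_batch, name, poll_interval=10.0, timeout=1800.0):
    """Poll get_batch(name=...) until the batch reaches a terminal state,
    instead of blocking on the long-running operation's result() (which
    also waits for the cluster to spin down)."""
    deadline = time.monotonic() + timeout
    while True:
        batch = get_batch(name=name)
        # batch.state may be an enum member (with .name) or a plain string
        state = getattr(batch.state, "name", batch.state)
        if state in TERMINAL_STATES:
            return batch
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Batch {name} still {state} after {timeout}s")
        time.sleep(poll_interval)
```

The key design point is that the loop returns as soon as the job is terminal, rather than waiting for the whole CreateBatch operation to resolve.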

Perusing the docs, I also notice that there is a BatchControllerAsyncClient (docs), which might obviate the need for polling? I'm no expert here.

I'm going to ask the GCP team to validate my thinking.

@dataders dataders changed the title [ADAP-569] [Feature] DBT Bigquery adaptor for spark connector waits for the cluster torn down instead of job finishes, adding a few extra minutes to the computation [ADAP-569] [Enhancement] only wait for Dataproc job to finish, not for the Dataproc serverless cluster to spin down Jun 8, 2023
@wazi55
Contributor Author

wazi55 commented Jul 10, 2023

Hello 👋 get_batch is exactly the method I was going to suggest; basically, checking the returned Batch.state should be the way to go.

@dataders
Contributor

dataders commented Oct 9, 2023

resolved by #929
