Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor DatabricksHook #19835

Merged
merged 4 commits into from
Dec 5, 2021
Merged

Conversation

eskarimov
Copy link
Contributor

@eskarimov eskarimov commented Nov 26, 2021

related: PR #19736 and feature request #18999

The PR intends to refactor DatabricksHook and related classes as a preparation step before introducing Deferrable Operator for Databricks.

Main points:

  • Use int type for run_id and job_id. Databricks docs specifies it as integer. Also, actual responses from Databricks API return integer, not string.
  • Fix filling AAD headers.
  • Move AAD token validation into a separate function, covered with tests.
  • Change initialisation of RunState - make it explicit that run_state might be None. Currently there's a tiny comment in the code telling that result_state might be None in case a job is still running.

@eskarimov
Copy link
Contributor Author

@alexott could you check this PR, please?

@uranusjr
Copy link
Member

Could you revert the unnecessary double-quote-to-single-quote changes? They make the patch very difficult to review.

@eskarimov
Copy link
Contributor Author

Could you revert the unnecessary double-quote-to-single-quote changes? They make the patch very difficult to review.

Reverted

@@ -64,10 +66,12 @@
class RunState:
"""Utility class for the run state concept of Databricks runs."""

def __init__(self, life_cycle_state: str, result_state: str, state_message: str) -> None:
def __init__(
self, life_cycle_state: str, state_message: str, result_state: str = None, *args, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need *args, **kwargs ? Also, why to change order of the parameters? They are logical now: lifecycle -> result state -> state message.

Copy link
Contributor Author

@eskarimov eskarimov Nov 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be reviewed together with the changes for initialising the class instance out of API response:

Current version:

        state = response['state']
        life_cycle_state = state['life_cycle_state']
        # result_state may not be in the state if not terminal
        result_state = state.get('result_state', None)
        state_message = state['state_message']
        return RunState(life_cycle_state, result_state, state_message)

Proposed version:

        state = response['state']
        return RunState(**state)

Current version is basically an intermediate layer between the API response and class, extracting values out of the API response and initialising class instance. But actually the response should already represent a state, why do we need this layer then?
I see the following drawbacks with it:

  • Class signature doesn't tell that result_state might be missing if state is not terminal. Currently it's described with the comment deep in the code.
  • It tends to increase repeating code - let's say we want to introduce async class for DatabricksHook. This logic needs to be written twice. Also in case we want to change the class in the future, let's say add new property user_cancelled_or_timedout (which is already a part of the API response), then we need to change class arguments, parsing response logic and class instance initialisation everywhere it's used.
    With the proposed version, we only need to change class arguments.

With all the above, answering the questions:

  • why do we need *args, **kwargs ?

    It shows that RunState might receive other init arguments (since we don't have control over API response), see above example with user_cancelled_or_timedout in the response.

  • why to change order of the parameters? They are logical now: lifecycle -> result state -> state message.

    Just because of Python syntax, we need to put arguments with default values after required arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking - should we initialize result_state to empty string? If we leave it's None, then we need to adjust get_run_state_str to use empty string instead when result_state is None. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, when it's None it might be also treated that the class argument wasn't set at all, but we actually set it in the init, so empty string sounds like a better option. Plus we'd avoid checking the type where it's used further. Will change it, thanks! 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking loud: does it make sense to assume that state_message by default is an empty string? Then we'd keep the order of the arguments the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's safe to assume that it's empty string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it back to life_cycle_state, result_state, state_message with the latter two default to empty string

"""
# SP is outside of the workspace
if 'azure_resource_id' in self.databricks_conn.extra_dejson:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this check inside the function, because it could be called by accident (in the future). maybe call it _fill_aad_headers_if_needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think if we call it _get_aad_headers(), which would return either empty dict or a filled dict? Also we won't need input arg headers in this case.

Then we could construct headers like:

aad_headers = self._get_aad_headers()
headers = {**USER_AGENT_HEADER.copy(), **aad_headers}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I thought something like this. it's easier to use because the logic of adding headers is incorporating inside function...

Comment on lines -359 to +375
def run_now(self, json: dict) -> str:
def run_now(self, json: dict) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it break existing code? for example if people are using this result to concatenate with log string without using str?

Copy link
Contributor Author

@eskarimov eskarimov Nov 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't break existing code, actually it's opposite - if someone assumes that output is str because of the function signature, then it'd break the code, because the actual returned type is int.

@@ -522,6 +515,20 @@ def uninstall(self, json: dict) -> None:
"""
self._do_api_call(UNINSTALL_LIBS_ENDPOINT, json)

@staticmethod
def _is_aad_token_valid(aad_token: dict) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a separate function that is called from one place?

Copy link
Contributor Author

@eskarimov eskarimov Nov 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Mainly for readability to hide the details for checking that token is valid under the separate function, because it's not the main purpose of the parent function _get_aad_token
  • There's a mistake in the current function implementation: it subtracts TOKEN_REFRESH_LEAD_TIME out of the current time, while it should actually sum it. With this we fix it and cover the function with tests.

@alexott
Copy link
Contributor

alexott commented Nov 28, 2021

Otherwise, it looks good. I need to test changes on the real Databricks instances

@eskarimov
Copy link
Contributor Author

Otherwise, it looks good. I need to test changes on the real Databricks instances

Perfect, thanks a lot! I've just pushed the changes regarding empty string by default for result_state and refactoring function for AAD headers retrieval.

Comment on lines 67 to 71
def __init__(self, life_cycle_state: str, result_state: str, state_message: str) -> None:
def __init__(
self, life_cycle_state: str, state_message: str, result_state: str = '', *args, **kwargs
) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why flipping the arguments? This feels like a source for unnecessary trouble. Also, why take *args, **kwarg and ignore them? This is usually an anti-pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to the comment above, replied there earlier about the same

@eskarimov eskarimov force-pushed the 18999-refactor-databrickshook branch from e88b278 to ee49b03 Compare December 2, 2021 09:28
@eskarimov
Copy link
Contributor Author

@alexott may I kindly request your review please? (the button for re-requesting somehow doesn't work, nothing happens when I click on it)

Copy link
Contributor

@alexott alexott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants