-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor DatabricksHook #19835
Refactor DatabricksHook #19835
Conversation
@alexott could you check this PR, please? |
Could you revert the unnecessary double-quote-to-single-quote changes? They make the patch very difficult to review. |
ff936d2
to
bd6c7d5
Compare
Reverted |
@@ -64,10 +66,12 @@ | |||
class RunState: | |||
"""Utility class for the run state concept of Databricks runs.""" | |||
|
|||
def __init__(self, life_cycle_state: str, result_state: str, state_message: str) -> None: | |||
def __init__( | |||
self, life_cycle_state: str, state_message: str, result_state: str = None, *args, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need *args, **kwargs
? Also, why to change order of the parameters? They are logical now: lifecycle -> result state -> state message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be reviewed together with the changes for initialising the class instance out of API response:
Current version:
state = response['state']
life_cycle_state = state['life_cycle_state']
# result_state may not be in the state if not terminal
result_state = state.get('result_state', None)
state_message = state['state_message']
return RunState(life_cycle_state, result_state, state_message)
Proposed version:
state = response['state']
return RunState(**state)
Current version is basically an intermediate layer between the API response and class, extracting values out of the API response and initialising class instance. But actually the response should already represent a state, why do we need this layer then?
I see the following drawbacks with it:
- Class signature doesn't tell that
result_state
might be missing if state is not terminal. Currently it's described with the comment deep in the code. - It tends to increase repeating code - let's say we want to introduce async class for
DatabricksHook
. This logic needs to be written twice. Also in case we want to change the class in the future, let's say add new propertyuser_cancelled_or_timedout
(which is already a part of the API response), then we need to change class arguments, parsing response logic and class instance initialisation everywhere it's used.
With the proposed version, we only need to change class arguments.
With all the above, answering the questions:
-
why do we need *args, **kwargs ?
It shows that RunState might receive other init arguments (since we don't have control over API response), see above example with
user_cancelled_or_timedout
in the response. -
why to change order of the parameters? They are logical now: lifecycle -> result state -> state message.
Just because of Python syntax, we need to put arguments with default values after required arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking - should we initialize result_state
to empty string? If we leave it's None
, then we need to adjust get_run_state_str
to use empty string instead when result_state is None. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, when it's None
it might be also treated that the class argument wasn't set at all, but we actually set it in the init, so empty string sounds like a better option. Plus we'd avoid checking the type where it's used further. Will change it, thanks! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking loud: does it make sense to assume that state_message
by default is an empty string? Then we'd keep the order of the arguments the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's safe to assume that it's empty string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it back to life_cycle_state
, result_state
, state_message
with the latter two default to empty string
""" | ||
# SP is outside of the workspace | ||
if 'azure_resource_id' in self.databricks_conn.extra_dejson: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep this check inside the function, because it could be called by accident (in the future). maybe call it _fill_aad_headers_if_needed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think if we call it _get_aad_headers()
, which would return either empty dict or a filled dict? Also we won't need input arg headers
in this case.
Then we could construct headers like:
aad_headers = self._get_aad_headers()
headers = {**USER_AGENT_HEADER.copy(), **aad_headers}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I thought something like this. it's easier to use because the logic of adding headers is incorporating inside function...
def run_now(self, json: dict) -> str: | ||
def run_now(self, json: dict) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it break existing code? for example if people are using this result to concatenate with log string without using str
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't break existing code, actually it's opposite - if someone assumes that output is str
because of the function signature, then it'd break the code, because the actual returned type is int
.
@@ -522,6 +515,20 @@ def uninstall(self, json: dict) -> None: | |||
""" | |||
self._do_api_call(UNINSTALL_LIBS_ENDPOINT, json) | |||
|
|||
@staticmethod | |||
def _is_aad_token_valid(aad_token: dict) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a separate function that is called from one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Mainly for readability to hide the details for checking that token is valid under the separate function, because it's not the main purpose of the parent function
_get_aad_token
- There's a mistake in the current function implementation: it subtracts
TOKEN_REFRESH_LEAD_TIME
out of the current time, while it should actually sum it. With this we fix it and cover the function with tests.
Otherwise, it looks good. I need to test changes on the real Databricks instances |
Perfect, thanks a lot! I've just pushed the changes regarding empty string by default for |
def __init__(self, life_cycle_state: str, result_state: str, state_message: str) -> None: | ||
def __init__( | ||
self, life_cycle_state: str, state_message: str, result_state: str = '', *args, **kwargs | ||
) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why flipping the arguments? This feels like a source for unnecessary trouble. Also, why take *args, **kwarg
and ignore them? This is usually an anti-pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer to the comment above, replied there earlier about the same
e88b278
to
ee49b03
Compare
ee49b03
to
9d08a12
Compare
@alexott may I kindly request your review please? (the button for re-requesting somehow doesn't work, nothing happens when I click on it) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
related: PR #19736 and feature request #18999
The PR intends to refactor
DatabricksHook
and related classes as a preparation step before introducing Deferrable Operator for Databricks.Main points:
int
type forrun_id
andjob_id
. Databricks docs specifies it as integer. Also, actual responses from Databricks API return integer, not string.RunState
- make it explicit thatrun_state
might be None. Currently there's a tiny comment in the code telling that result_state might be None in case a job is still running.