-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug with null replication metrics when row is all null #706
Changes from 4 commits
bee16d3
7644b79
a3b0a16
1d94e12
f6bcf32
00ee95a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2240,7 +2240,20 @@ def _update_null_replication_metrics(self, clean_samples: Dict) -> None: | |
:param clean_samples: input cleaned dataset | ||
:type clean_samples: dict | ||
""" | ||
data = pd.DataFrame(clean_samples).apply(pd.to_numeric, errors="coerce") | ||
data = pd.DataFrame(clean_samples) | ||
|
||
# If the last row is all null, then add rows to the data DataFrame | ||
max_null_index = max( | ||
[max(i) for i in getattr(self._profile[0], "null_types_index").values()], | ||
default=0, | ||
) | ||
if max_null_index > data.index.max(): | ||
data.loc[max_null_index] = {} | ||
|
||
# Fill in missing rows with NaN and convert types to numeric | ||
data = data.reindex(range(data.index.max() + 1), fill_value=np.nan).apply( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this could be really expensive for large data. Is there another way around this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What specific error is requiring this change / where? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 good catch @JGSweets There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If a row is all null, the dataframe will be missing a row, which causes an error on the line There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or even better potentially would be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another way would be to reindex a single column in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That would work too There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any way we can do it w/o re-indexing? I think that would be the ideal state. |
||
pd.to_numeric, errors="coerce" | ||
) | ||
|
||
get_data_type = lambda profile: profile.profiles[ # NOQA: E731 | ||
"data_type_profile" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._profile[0]
Right now this is getting the first profile every time is this correct?Why do we need to add rows to the df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the last row(s) is all null, then the maximum null index of any profile should be the index of the last row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len(self._profile)
is3
and only the zero-th index has values.@tonywu315 is this always the case though? are we sure we can always just pull the zero-th profile in the
self._profile
list?