-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Implements filter and dropna #1959
Conversation
python/ray/dataframe/dataframe.py
Outdated
inplace = validate_bool_kwarg(inplace, "inplace") | ||
|
||
if axis == 1 and subset is not None: | ||
subset = self.index.isin(subset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subset = [item for item in self.index if item in subset]
Is slightly faster. Also you can probably refactor a bit by putting the top level if statement just be if subset is not None
python/ray/dataframe/dataframe.py
Outdated
columns=new_cols, | ||
index=self.index) | ||
|
||
self._col_metadata = self._col_metadata[new_cols] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't return another _IndexMetadata
python/ray/dataframe/dataframe.py
Outdated
Returns: | ||
A new dataframe with the filter applied. | ||
""" | ||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this import is local to filter?
python/ray/dataframe/dataframe.py
Outdated
|
||
if items is not None: | ||
bool_arr = labels.isin(items) | ||
elif like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible subtle errors here with empty strings/lists or other funny falsy values
python/ray/dataframe/dataframe.py
Outdated
|
||
indices_for_rows = [self.columns.index(new_col) | ||
for new_col in columns] | ||
indices_for_rows = self.columns.isin(columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same change as above
Test PASSed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
* Implement multiple axis for dropna * Add multiple axis dropna test * Fix using dummy_frame in dropna * Clean up dropna multiple axis tests * remove unnecessary axis modification * Clean up dropna tests
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few minor comments.
python/ray/dataframe/dataframe.py
Outdated
if not inplace: | ||
return result | ||
|
||
return self._update_inplace( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't return self._update_inplace
, it will return None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the docs, when inplace=True
, the function should return false. I made this a bit clearer in the code that this is intentional.
python/ray/dataframe/dataframe.py
Outdated
) | ||
|
||
axis = pd.DataFrame()._get_axis_number(axis) | ||
inplace = validate_bool_kwarg(inplace, "inplace") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should move this up higher, before the is_list_like
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
|
||
if axis == 1: | ||
new_vals = [self._col_metadata.get_global_indices(i, vals) | ||
for i, vals in enumerate(ray.get(new_vals))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can build the columns in a remote task. It might be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I move this code to a remote task, it would require serializing self._col_metadata._lengths
and self.columns
, which could potentially be expensive. The code would later block on a ray.get on the new columns generated remotely.
Given there would be a blocking operation either way, I think this is fine for now but can be revisited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where the blocking operation occurs does matter. Here the main thread will be waiting until the entire dropna
operation is complete. This happens whether you need the columns immediately or not. We should block when the user needs it, not when the user calls an operation. Given this can be an iteratively executed operation, we should not block on each call.
|
||
else: | ||
new_vals = [self._row_metadata.get_global_indices(i, vals) | ||
for i, vals in enumerate(ray.get(new_vals))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here with the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
Test FAILed. |
Jenkins, retest this please |
Test PASSed. |
Test FAILed. |
Jenkins, retest this please |
Test PASSed. |
Passes private-travis. Merged, thanks @kunalgosar ! |
* master: (21 commits) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) add pthread linking (ray-project#1986) [DataFrame] Add layer of abstraction to allow OID instantiation (ray-project#1984) ...
* master: (25 commits) [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973) Clean up syntax for supported Python versions. (ray-project#1963) [DataFrame] Implements mode, to_datetime, and get_dummies (ray-project#1956) [DataFrame] Fix dtypes (ray-project#1930) keep_dims -> keepdims (ray-project#1980) ...
* master: [DataFrame] Add direct pandas imports for MVP (ray-project#1960) Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (ray-project#2007) Expand local_dir in Trial init (ray-project#2013) Fixing ascii error for Python2 (ray-project#2009) [DataFrame] Implements df.update (ray-project#1997) [DataFrame] Implements df.as_matrix (ray-project#2001) [DataFrame] Implement quantile (ray-project#1992) [DataFrame] Impement sort_values and sort_index (ray-project#1977) [DataFrame] Implement rank (ray-project#1991) [DataFrame] Implemented prod, product, added test suite (ray-project#1994) [DataFrame] Implemented __setitem__, select_dtypes, and astype (ray-project#1941) [DataFrame] Implement diff (ray-project#1996) [DataFrame] Implemented nunique, skew (ray-project#1995) [DataFrame] Implements filter and dropna (ray-project#1959) [DataFrame] Implements df.pipe (ray-project#1999) [DataFrame] Apply() for Lists and Dicts (ray-project#1973)
This PR implements
df.filter
anddf.dropna
.