[DataFrame] Implements filter and dropna #1959

kunalgosar · 2018-04-28T01:03:19Z

This PR implements df.filter and df.dropna.

p-yang · 2018-04-28T01:44:49Z

python/ray/dataframe/dataframe.py

+        inplace = validate_bool_kwarg(inplace, "inplace")
+
+        if axis == 1 and subset is not None:
+            subset = self.index.isin(subset)


subset = [item for item in self.index if item in subset]

Is slightly faster. You can also probably refactor a bit by making the top-level if statement just: if subset is not None
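
A minimal sketch of the suggested refactor, using a plain list as a stand-in for self.index. The helper name normalize_subset is hypothetical and not from the PR, and the set used for membership tests is an added assumption (it requires hashable labels). It keeps only the labels that exist in the index, in index order, under a single "if subset is not None" guard.

```python
def normalize_subset(index, subset):
    """Return the labels of `index` that appear in `subset`, in index order.

    `index` stands in for self.index; `normalize_subset` is a hypothetical
    helper used only for illustration.
    """
    if subset is not None:
        wanted = set(subset)  # assumption: labels are hashable
        subset = [item for item in index if item in wanted]
    return subset


index = ["a", "b", "c", "d"]
print(normalize_subset(index, ["d", "b", "x"]))  # ['b', 'd']
print(normalize_subset(index, None))             # None
```

Note that this form yields labels, whereas index.isin(subset) yields a boolean mask, so the surrounding code has to expect one or the other.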

p-yang · 2018-04-28T01:48:27Z

python/ray/dataframe/dataframe.py

+                                 columns=new_cols,
+                                 index=self.index)
+
+            self._col_metadata = self._col_metadata[new_cols]


This doesn't return another _IndexMetadata

p-yang · 2018-04-28T01:52:03Z

python/ray/dataframe/dataframe.py

+        Returns:
+            A new dataframe with the filter applied.
+        """
+        import re


Any reason this import is local to filter?

p-yang · 2018-04-28T01:55:23Z

python/ray/dataframe/dataframe.py

+
+        if items is not None:
+            bool_arr = labels.isin(items)
+        elif like:


Possible subtle errors here with empty strings/lists or other funny falsy values
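
To illustrate the concern: a truthiness check such as "elif like:" treats an empty string the same as an argument that was never passed. The sketch below uses explicit "is not None" checks instead. select_labels is a hypothetical stand-in for the selection logic inside filter, and plain lists replace the pandas index.

```python
import re


def select_labels(labels, items=None, like=None, regex=None):
    """Hypothetical stand-in for the label selection inside filter.

    Compares each argument against None so that falsy-but-valid values
    (an empty list, an empty string) still count as "provided".
    """
    provided = [arg for arg in (items, like, regex) if arg is not None]
    if len(provided) != 1:
        raise TypeError("pass exactly one of items, like, or regex")

    if items is not None:
        return [label for label in labels if label in items]
    if like is not None:
        return [label for label in labels if like in str(label)]
    matcher = re.compile(regex)
    return [label for label in labels if matcher.search(str(label))]


labels = ["alpha", "beta", "gamma"]
print(select_labels(labels, items=[]))   # [] (an empty list selects nothing)
print(select_labels(labels, like=""))    # all labels, since "" is in every string
print(select_labels(labels, like="mm"))  # ['gamma']
```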

p-yang · 2018-04-28T02:01:33Z

python/ray/dataframe/dataframe.py

-
-            indices_for_rows = [self.columns.index(new_col)
-                                for new_col in columns]
+            indices_for_rows = self.columns.isin(columns)


Same change as above

AmplabJenkins · 2018-04-28T02:03:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5094/
Test PASSed.

AmplabJenkins · 2018-05-04T02:03:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5183/
Test FAILed.

AmplabJenkins · 2018-05-04T02:24:56Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5182/
Test PASSed.

AmplabJenkins · 2018-05-04T02:26:42Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5184/
Test FAILed.

AmplabJenkins · 2018-05-04T03:55:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5186/
Test PASSed.

devin-petersohn

Looks good, just a few minor comments.

devin-petersohn · 2018-05-04T07:13:19Z

python/ray/dataframe/dataframe.py

+            if not inplace:
+                return result
+
+            return self._update_inplace(


Don't return self._update_inplace; it will return None.

Based on the docs, when inplace=True, the function should return None. I made it a bit clearer in the code that this is intentional.
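
A minimal sketch of the convention under discussion, assuming pandas-style semantics in which inplace=True mutates the object and returns None. TinyFrame and its _update_inplace are toy stand-ins for illustration, not the ray.dataframe implementation.

```python
class TinyFrame:
    """Toy stand-in for a DataFrame holding one list of values."""

    def __init__(self, values):
        self.values = list(values)

    def _update_inplace(self, new_values):
        # Mutates self and has no return statement, so it returns None.
        self.values = list(new_values)

    def dropna(self, inplace=False):
        kept = [v for v in self.values if v is not None]
        if not inplace:
            return TinyFrame(kept)
        self._update_inplace(kept)
        return None  # explicit, so readers can see the None is intentional


frame = TinyFrame([1, None, 3])
copy = frame.dropna()
print(copy.values, frame.values)   # [1, 3] [1, None, 3]
print(frame.dropna(inplace=True))  # None
print(frame.values)                # [1, 3]
```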

devin-petersohn · 2018-05-04T07:13:59Z

python/ray/dataframe/dataframe.py

+            )
+
+        axis = pd.DataFrame()._get_axis_number(axis)
+        inplace = validate_bool_kwarg(inplace, "inplace")


You should move this up higher, before the is_list_like block.

devin-petersohn · 2018-05-04T07:16:24Z

python/ray/dataframe/dataframe.py

+
+        if axis == 1:
+            new_vals = [self._col_metadata.get_global_indices(i, vals)
+                        for i, vals in enumerate(ray.get(new_vals))]


You can build the columns in a remote task. It might be better.

If I move this code to a remote task, it would require serializing self._col_metadata._lengths and self.columns, which could potentially be expensive. The code would later block on a ray.get on the new columns generated remotely.

Given there would be a blocking operation either way, I think this is fine for now but can be revisited.

Where the blocking operation occurs does matter. Here the main thread will be waiting until the entire dropna operation is complete. This happens whether you need the columns immediately or not. We should block when the user needs it, not when the user calls an operation. Given this can be an iteratively executed operation, we should not block on each call.
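
The point about deferring the blocking call can be sketched without Ray. The example below uses concurrent.futures from the standard library as a stand-in for remote tasks, with future.result() playing the role of ray.get. LazyColumns and compute_columns are hypothetical names, not part of the PR.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_columns(columns, keep_mask):
    """Stand-in for the remote task that builds the surviving column labels."""
    return [col for col, keep in zip(columns, keep_mask) if keep]


class LazyColumns:
    """Holds a future and blocks only when the labels are actually read."""

    def __init__(self, future):
        self._future = future
        self._value = None

    def get(self):
        if self._value is None:
            # The only blocking call; analogous to calling ray.get here.
            self._value = self._future.result()
        return self._value


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(compute_columns, ["a", "b", "c"], [True, False, True])
    lazy = LazyColumns(future)  # returns immediately, no blocking
    # Further operations could be queued here without waiting.
    print(lazy.get())           # ['a', 'c']
```

With this shape, an iteratively called operation such as dropna only pays the wait when the caller reads the columns, not on every call.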

devin-petersohn · 2018-05-04T07:16:33Z

python/ray/dataframe/dataframe.py

+
+        else:
+            new_vals = [self._row_metadata.get_global_indices(i, vals)
+                        for i, vals in enumerate(ray.get(new_vals))]


Same here with the index.

AmplabJenkins · 2018-05-04T10:36:25Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5212/
Test FAILed.

kunalgosar · 2018-05-04T10:49:34Z

Jenkins, retest this please

AmplabJenkins · 2018-05-04T11:57:24Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5213/
Test PASSed.

AmplabJenkins · 2018-05-04T16:02:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5214/
Test FAILed.

kunalgosar · 2018-05-04T17:06:18Z

Jenkins, retest this please

AmplabJenkins · 2018-05-04T18:11:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5217/
Test PASSed.

devin-petersohn · 2018-05-04T19:21:20Z

Passes private-travis. Merged, thanks @kunalgosar!

p-yang reviewed Apr 28, 2018

kunalgosar added 6 commits May 3, 2018 17:15

implement filter

3d4ec01

begin implementation of dropna

1115453

implement dropna

e51a5f3

docs and tests

83bd854

resolving comments

c0d8df4

resolving merge

f082a5d

kunalgosar force-pushed the filter branch from ed6c9a3 to f082a5d on May 4, 2018 01:22

kunalgosar added 2 commits May 3, 2018 18:32

add error checking to dropna

32ee1e1

fix update inplace call

b94adb4

Implement multiple axis for dropna (#13)

16faf64

* Implement multiple axis for dropna
* Add multiple axis dropna test
* Fix using dummy_frame in dropna
* Clean up dropna multiple axis tests
* remove unnecessary axis modification
* Clean up dropna tests

devin-petersohn reviewed May 4, 2018

resolve comments

0d901c1

fix lint

69f9e29

devin-petersohn approved these changes May 4, 2018

devin-petersohn merged commit 4030356 into ray-project:master May 4, 2018
