Add the boundary param for sort in ray.data.Dataset #41269

veryhannibal · 2023-11-20T10:04:16Z

Why are these changes needed?

Users can specify boundaries so that the dataset is divided into blocks according to those boundaries while sorting.

Related issue number

Closes #41265

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

jiwq

Can you add the related unit tests for these changes?

stephanie-wang

Thanks for contributing this change! Overall structure looks good, but had a few comments. Also agree that we should add unit tests for this. ray/python/ray/data/tests/test_sort.py is a good place to add.

stephanie-wang · 2023-11-21T17:19:52Z

python/ray/data/dataset.py

@@ -2212,6 +2212,7 @@ def sort(
        self,
        key: Union[str, List[str], None] = None,
        descending: Union[bool, List[bool]] = False,
+        boundaries: Optional[list] = None,


Can you document the arg in the docstring? It'd also be good to specify what the type of the List element should be.

Thanks for the tip. I'll add a detailed docstring for the argument, including the element type.

stephanie-wang · 2023-11-21T17:21:08Z

python/ray/data/_internal/sort.py

+    else:
+        boundaries = [(b, ) for b in sort_key.boundaries]
+        num_mappers = len(boundaries) + 1
+        num_reducers = num_mappers


Right now, sort requires num_mappers == num_reducers == num input blocks, so instead we should add a check that the length of boundaries is equal to the length of block_list.

If the user customizes the boundaries parameter, then after the sort is executed, the block_num of the output dataset is equal to len(boundaries) + 1.

Yes, what I meant is that right now we assume that num input blocks == num output blocks, so it would be good to assert num_mappers == len(boundaries) + 1 instead of setting num_mappers = len(boundaries) + 1.

However, it does seem like num input blocks != num output blocks is working, so maybe it's okay as is. Still, we should not modify num_mappers (this should be decided based on the number of input blocks, not by the user-provided boundaries).

The original intent of user-defined boundaries is to let users decide the number of blocks in the output dataset, which is why I adjusted num_mappers here. If the user does not pass the boundaries parameter, the original logic is unaffected. 😊

python/ray/data/_internal/sort.py

Signed-off-by: lile18 <lile18@jd.com>

Detailed annotation and instructions are provided.

stephanie-wang · 2023-11-27T19:19:35Z

python/ray/data/_internal/sort.py

+        boundaries = sample_boundaries(blocks_list, sort_key, num_reducers, ctx)
+    else:
+        boundaries = [(b, ) for b in sort_key.boundaries]
+        num_mappers = len(boundaries) + 1


Suggested change

num_mappers = len(boundaries) + 1

Sorry, with the current implementation, line 222 cannot be deleted because the number of blocks in the output dataset is determined by the user-defined boundaries. For example, if I split the list L = [0, 1, 2, 3, 4, 5] and the boundaries are [2, 4], then L is divided into 3 parts: [0, 1], [2, 3], and [4, 5].

I don't think that is quite right... the number of boundaries should determine the number of reducers, while the number of input blocks determines the number of mappers, no?

For example, say the input is in two blocks L = [[0, 1, 2], [3, 4, 5]]. In your example, we will use num_mappers=2 and num_reducers=3.
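The distinction above can be illustrated with a minimal pure-Python sketch. It is not Ray's implementation: `map_partition` and `reduce_merge` are hypothetical names, and the sketch assumes that a row equal to a boundary goes to the block on its right, which matches the [0, 1], [2, 3], [4, 5] example earlier in this thread. The number of mappers comes from the input blocks, while the number of reducers comes from len(boundaries) + 1.

```python
from bisect import bisect_right
from typing import List


def map_partition(block: List[int], boundaries: List[int]) -> List[List[int]]:
    """One mapper (hypothetical helper): sort a single input block and
    split it into len(boundaries) + 1 partitions, one per reducer."""
    parts: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for row in sorted(block):
        # Assumption: a row equal to a boundary lands in the partition on its right.
        parts[bisect_right(boundaries, row)].append(row)
    return parts


def reduce_merge(mapper_outputs: List[List[List[int]]], reducer_idx: int) -> List[int]:
    """One reducer (hypothetical helper): gather partition `reducer_idx`
    from every mapper and sort the combined rows."""
    merged: List[int] = []
    for parts in mapper_outputs:
        merged.extend(parts[reducer_idx])
    return sorted(merged)


input_blocks = [[0, 1, 2], [3, 4, 5]]
boundaries = [2, 4]

num_mappers = len(input_blocks)      # 2, decided by the input blocks
num_reducers = len(boundaries) + 1   # 3, decided by the boundaries

mapper_outputs = [map_partition(block, boundaries) for block in input_blocks]
output_blocks = [reduce_merge(mapper_outputs, i) for i in range(num_reducers)]
print(output_blocks)  # [[0, 1], [2, 3], [4, 5]]
```

In this model the two counts are independent, so two input blocks can produce three output blocks without changing num_mappers.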

stephanie-wang · 2023-11-27T19:20:12Z

python/ray/data/dataset.py

@@ -2212,32 +2212,60 @@ def sort(
        self,
        key: Union[str, List[str], None] = None,
        descending: Union[bool, List[bool]] = False,
+        boundaries: List[Union[int, float]] = None,


Will it work for non-numeric columns?

Sorry, this function cannot currently process non-numeric columns. In our use case, when we encounter a non-numeric column, we first convert it to a numeric type.
For example, for a non-numeric column, we compute a hash of each value and take it modulo 3, so the column value becomes 0, 1, or 2. If boundaries is then set to [1, 2], the rows with values 0, 1, and 2 are placed in three separate blocks.
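The hash-and-modulo workaround described above might look roughly like the following sketch. The helper names `to_bucket` and `assign_blocks` are hypothetical, and the choice of `hashlib.md5` is an assumption made here because Python's built-in `hash()` for strings varies between processes. The block assignment again assumes that a value equal to a boundary goes to the block on its right.

```python
import hashlib
from bisect import bisect_right
from typing import Dict, List


def to_bucket(value: str, num_buckets: int = 3) -> int:
    """Map a string to a stable integer in [0, num_buckets) (hypothetical helper)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def assign_blocks(values: List[str], boundaries: List[int]) -> Dict[int, List[str]]:
    """Group values by the block their numeric bucket falls into."""
    blocks: Dict[int, List[str]] = {i: [] for i in range(len(boundaries) + 1)}
    for value in values:
        # With boundaries [1, 2], buckets 0, 1, and 2 map to blocks 0, 1, and 2.
        blocks[bisect_right(boundaries, to_bucket(value))].append(value)
    return blocks


names = ["alice", "bob", "carol", "dave", "erin", "frank"]
blocks = assign_blocks(names, boundaries=[1, 2])
for block_idx, members in blocks.items():
    print(block_idx, members)
```

Every value in a given block shares the same bucket number, which is what makes the numeric boundaries [1, 2] usable on a column that was originally non-numeric.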

That sounds good for now; could you just update the docstring to say that this only supports numeric columns right now?

Thanks, I have added code comments to explain that the boundaries parameter currently supports only numeric types. 😁😁😁

python/ray/data/dataset.py

stephanie-wang · 2023-11-27T19:33:58Z

python/ray/data/tests/test_sort.py

+    ds = ds.sort("id", descending, boundaries)
+    ordered_ids = x["id"].values.tolist()
+    ordered_ids.sort()
+    check_id_in_block(ds, boundaries, list(range(1000)), descending)


Thanks for adding this test! But it's a bit hard to read; could you instead follow / extend the test_sort_simple example? I think all we really need to do is add checks to make sure that ds._block_num_rows() is as expected when different boundaries are passed in.

I have the same feeling.

Thanks for the suggestion. I have added a relatively simple test to test_sort_simple.
However, test_sort_with_specified_boundaries is a more comprehensive test that covers more complex situations, such as boundary values that do not appear in the key column of the dataset. 😁😁😁

Hmm I think we can test that without needing this code? For example, something like ds.range(100).sort(boundaries=[10, 200]) would work, right?

In any case, I think the test sounds like a good idea but let's please try to simplify it. It is quite hard to read right now.

Yes, that works. I updated the unit tests, mainly by adding a few simple examples to test_sort_simple. 😁
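As a rough aid for writing such a test, the per-block row counts one would expect can be computed with a small stdlib-only model. This is a sketch under assumptions, not Ray code: `expected_block_sizes` is a hypothetical helper, it models ascending order only, and it assumes a value equal to a boundary falls into the block on its right. Whether Ray actually emits an empty block for a boundary beyond the data, such as 200 here, should be confirmed against the real output of the sort.

```python
from bisect import bisect_right
from typing import List, Sequence


def expected_block_sizes(values: Sequence[float], boundaries: List[float]) -> List[int]:
    """Count how many values fall into each of the len(boundaries) + 1 blocks."""
    sizes = [0] * (len(boundaries) + 1)
    for value in values:
        sizes[bisect_right(boundaries, value)] += 1
    return sizes


# Boundary 200 lies outside the data, so the last block is empty in this model.
print(expected_block_sizes(range(100), [10, 200]))     # [10, 90, 0]

# Float boundaries that do not appear in the data.
print(expected_block_sizes(range(100), [10.5, 50.5]))  # [11, 40, 49]
```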

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

Added a relatively simple test to test_sort_simple

Modified code annotation.

jiwq

I think the following unit test cases should be considered:

- Unsorted boundaries, such as boundaries = [15, 10, 5] or [10, 5, 15].
- Fixed data split into two parts and into three parts.
- Float-typed boundaries, which are currently not tested.
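One possible way to handle the unsorted and float cases listed above is to validate and normalize the boundaries before they are used. The sketch below is only an illustration: `normalize_boundaries` is a hypothetical helper, and this thread does not state whether the merged change sorts, deduplicates, or rejects such input.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def normalize_boundaries(boundaries: Sequence[Number]) -> List[Number]:
    """Reject non-numeric items, then return unique boundaries in ascending order."""
    for item in boundaries:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise ValueError(
                f"boundaries must contain only int or float values, got {item!r}"
            )
    return sorted(set(boundaries))


print(normalize_boundaries([15, 10, 5]))      # [5, 10, 15]
print(normalize_boundaries([10, 5, 15]))      # [5, 10, 15]
print(normalize_boundaries([10.5, 2, 10.5]))  # [2, 10.5]
```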

jiwq · 2023-12-01T08:38:43Z

python/ray/data/_internal/sort.py

@@ -209,7 +215,12 @@ def sort_impl(
    # Use same number of output partitions.
    num_reducers = num_mappers
    # TODO(swang): sample_boundaries could be fused with a previous stage.


Comments should stay close to the code they describe. In this case, I think the comment should be moved into the if block.

python/ray/data/tests/test_sort.py

Co-authored-by: Wanqiang Ji <wanqiang.ji@gmail.com> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

1. Updated unit test in test_sort_simple. 2. Added outlier detection for parameter boundaries in the ray.data.Dataset.sort function. 3. Added code comments to explain that the boundaries parameter currently only supports numeric types.

veryhannibal · 2023-12-29T08:16:07Z

@stephanie-wang @jiwq Hello, is there anything else I need to add to this pull request?

jiwq · 2024-01-02T16:54:01Z

cc @c21

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang · 2024-01-04T13:46:21Z

Sorry for the delay here. I went ahead and updated the unit tests to simplify. LGTM now.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

veryhannibal · 2024-01-05T03:02:52Z

Thanks a lot! 😁😁😁

User can specify the boundaries so the dataset will be divided into blocks according to the specified boundaries while sorting. Closes ray-project#41265 --------- Signed-off-by: lile18 <lile18@jd.com> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com> Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: lile18 <lile18@jd.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Wanqiang Ji <wanqiang.ji@gmail.com>

veryhannibal requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and Zandew as code owners November 20, 2023 10:04

jiwq reviewed Nov 20, 2023

stephanie-wang self-assigned this Nov 21, 2023

stephanie-wang requested changes Nov 21, 2023

Add the boundary param for sort in ray.data.Dataset

e8a21ac

Signed-off-by: lile18 <lile18@jd.com>

veryhannibal force-pushed the ray-41265 branch from c5ddb00 to e8a21ac Compare November 24, 2023 10:11

Signed-off-by: lile18 <lile18@jd.com>

1efd05a

Detailed annotation and instructions are provided.

stephanie-wang reviewed Nov 27, 2023

veryhannibal and others added 5 commits November 30, 2023 10:56

Update python/ray/data/dataset.py

a1b3adb

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

Update python/ray/data/dataset.py

e1fa5a4

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

Signed-off-by: lile18 <lile18@jd.com>

e9455c2

Added a relatively simple test to test_sort_simple

Merge branch 'ray-41265' of github.com:veryhannibal/ray into ray-41265

616a7ab

Signed-off-by: lile18 <lile18@jd.com>

d6f0e64

Modified code annotation.

jiwq reviewed Dec 1, 2023

veryhannibal and others added 4 commits December 5, 2023 10:51

Update python/ray/data/tests/test_sort.py

448072f

Co-authored-by: Wanqiang Ji <wanqiang.ji@gmail.com> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

Update python/ray/data/tests/test_sort.py

ab10605

Co-authored-by: Wanqiang Ji <wanqiang.ji@gmail.com> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

Update python/ray/data/tests/test_sort.py

5f7b579

Co-authored-by: Wanqiang Ji <wanqiang.ji@gmail.com> Signed-off-by: Rony Lee <43735106+veryhannibal@users.noreply.github.com>

Signed-off-by: lile18 <lile18@jd.com>

59e999e

1. Updated unit test in test_sort_simple. 2. Added outlier detection for parameter boundaries in the ray.data.Dataset.sort function. 3. Added code comments to explain that the boundaries parameter currently only supports numeric types.

lile18 added 2 commits December 5, 2023 15:33

Merge branch 'ray-41265' of github.com:veryhannibal/ray into ray-41265

0a3728c

modify code in test_sort_with_specified_boundaries

2117532

stephanie-wang added 2 commits January 4, 2024 08:43

Fix tests, lint

d9ea6c4

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

add float test

90bfd6f

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

stephanie-wang approved these changes Jan 4, 2024

fixes

e2ccb6d

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Merge branch 'master' into ray-41265

e3b686e

stephanie-wang merged commit 2603834 into ray-project:master Jan 9, 2024
8 of 9 checks passed
