
Repartition Ray dataset if number of shards is too small #283

Merged: 5 commits into ray-project:master on Jun 24, 2023

Conversation

krfricke (Collaborator)

Currently we throw an error when the number of partitions in a data source is too small for the number of workers.

However, in the case of Ray datasets, we can actually repartition the dataset ourselves.

This also ensures that our quickstart examples, such as the one at https://docs.ray.io/en/latest/train/train.html#quick-start-to-distributed-training-with-ray-train, work out of the box.
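
For readers following along, here is a minimal sketch of the idea, assuming a ray.data.Dataset and a worker count of 2; this is illustrative, not the actual xgboost_ray code:

import ray

# Load a dataset; a small file may produce fewer blocks (shards) than workers.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

num_workers = 2  # stands in for the trainer's configured worker count
if dataset.num_blocks() < num_workers:
    # Too few shards to give each worker at least one:
    # repartition instead of raising an error.
    dataset = dataset.repartition(num_workers)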

Yard1 (Member) commented Jun 23, 2023

@krfricke I am a little confused: we should not be throwing any errors, as we have moved to a split-based solution for Ray Datasets (see get_actor_shards). That's the recommended way instead of repartitioning. How is this error triggered?
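
For reference, the split-based approach looks roughly like the sketch below; get_actor_shards itself is more involved (it also handles locality hints for the training actors), so the setup here is a simplification:

import ray

dataset = ray.data.range(1000)
num_workers = 2

# Dataset.split() returns one shard per worker regardless of how many
# blocks the dataset originally had, so no repartitioning is required.
shards = dataset.split(num_workers)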

krfricke (Collaborator, Author)

The error comes up when running, for example, the quickstart example from our docs:


import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig

# Load data.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=False,
    ),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        # "tree_method": "gpu_hist",  # uncomment this to use GPU for training
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()

But I see what you mean: we do the split ourselves, so there is no need to actually call repartition. We can just check whether the data source supports automatic repartitioning.
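
Something along these lines, where the flag and function names are assumptions for illustration rather than the actual xgboost_ray data source API:

# Hypothetical sketch of the capability check; attribute and function
# names here are assumptions, not the real xgboost_ray internals.
class DataSource:
    supports_repartitioning = False

class RayDatasetSource(DataSource):
    supports_repartitioning = True

    @staticmethod
    def repartition(dataset, num_partitions):
        return dataset.repartition(num_partitions)

def maybe_repartition(source, dataset, num_workers):
    num_partitions = dataset.num_blocks()
    if num_partitions >= num_workers:
        return dataset
    if not source.supports_repartitioning:
        raise RuntimeError(
            f"Dataset has {num_partitions} partitions but "
            f"{num_workers} workers were requested."
        )
    return source.repartition(dataset, num_workers)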

I've updated the PR.

Yard1 (Member) left a comment:


Thanks!

krfricke merged commit 8668a77 into ray-project:master on Jun 24, 2023.
krfricke deleted the repartition branch on June 24, 2023 at 08:18.