Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use window_size_bytes: auto to specify automatic windowing #3076

Merged
merged 6 commits into from
Feb 14, 2023

Conversation

jeffkinnison
Copy link
Contributor

This updates RayDataset, RayDatasetManager, and RayBackend to accept "auto" as a valid input for window_size_bytes, and consolidates RayDataset creation logic to handle auto-windowing.

auto_window: If True and the dataset is larger than available memory,
automatically set window size to `<available memory> // 5`.
"""
"""Wrapper around ray.data.Dataset."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: While we're updating the doc strings, should we update it to all of the inputs and their types for this class? toally ok to skip as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. I meant to add window_size_bytes back in and fill in the rest of the docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in db4218b

# If the user does not supply a window size and the dataset is large,
# If user has specified a window size, use it as-is.
if isinstance(window_size_bytes, int):
window_size = window_size_bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just do return window_size_bytes and short circuit going to the next block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally a fan of a single exit point for a function unless some serious computation happens, but happy to refactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, can leave this as is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised in db4218b

Comment on lines +1017 to +1018
backend_config = copy.deepcopy(RAY_BACKEND_CONFIG)
backend_config["loader"] = {"window_size_bytes": window_size_bytes}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - are we always guaranteed to have the loader key in the backend config? If not, do we need to add a default value for this in the schema (possibly None)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the hood, this update uses backend._data_loader_kwargs which the RayBackend initializer sets with self._data_loader_kwargs = loader or {}. We don't actually rely on the existence of loader, but rather the backend's handling of its config.

@github-actions
Copy link

github-actions bot commented Feb 10, 2023

Unit Test Results

         6 files  ±0           6 suites  ±0   5h 0m 44s ⏱️ - 6m 10s
  3 929 tests ±0    3 892 ✔️ +1    37 💤 ±0  0  - 1 
11 784 runs  ±0  11 673 ✔️ +1  111 💤 ±0  0  - 1 

Results for commit db4218b. ± Comparison against base commit e8b1c8b.

♻️ This comment has been updated with latest results.

Comment on lines -89 to +91
if window_size_bytes:
def get_window_size_bytes(self, window_size_bytes: Optional[Union[int, Literal["auto"]]] = None) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice use of literal, we should use that more in Ludwig

Copy link
Contributor

@arnavgarg1 arnavgarg1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for cleaning up the implementation/API

@jeffkinnison jeffkinnison merged commit c408492 into master Feb 14, 2023
@jeffkinnison jeffkinnison deleted the auto-window-config branch February 14, 2023 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants