Use `window_size_bytes: auto` to specify automatic windowing #3076
ludwig/data/dataset/ray.py
Outdated
auto_window: If True and the dataset is larger than available memory,
automatically set window size to `<available memory> // 5`.
"""
"""Wrapper around ray.data.Dataset."""
nit: While we're updating the docstrings, should we update this one to cover all of the inputs and their types for this class? Totally OK to skip as well.
Ah, good catch. I meant to add `window_size_bytes` back in and fill in the rest of the docstring.
Added in db4218b
ludwig/data/dataset/ray.py
Outdated
# If the user does not supply a window size and the dataset is large,
# If user has specified a window size, use it as-is.
if isinstance(window_size_bytes, int):
    window_size = window_size_bytes
You can just do `return window_size_bytes` and short-circuit going to the next block.
I'm generally a fan of a single exit point for a function unless some serious computation happens, but happy to refactor.
That's fair, can leave this as is
Revised in db4218b
backend_config = copy.deepcopy(RAY_BACKEND_CONFIG)
backend_config["loader"] = {"window_size_bytes": window_size_bytes}
Just curious - are we always guaranteed to have the `loader` key in the backend config? If not, do we need to add a default value for this in the schema (possibly `None`)?
Under the hood, this update uses `backend._data_loader_kwargs`, which the `RayBackend` initializer sets with `self._data_loader_kwargs = loader or {}`. We don't actually rely on the existence of `loader`, but rather the backend's handling of its config.
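A minimal sketch of the pattern described in this reply, using a hypothetical stand-in class (`SketchBackend`) rather than Ludwig's actual `RayBackend`. It only models the `loader or {}` fallback quoted above, to show why a config without a `loader` key still yields usable kwargs:

```python
from typing import Any, Dict, Optional


class SketchBackend:
    """Hypothetical stand-in for RayBackend; only the loader handling is modeled."""

    def __init__(self, loader: Optional[Dict[str, Any]] = None, **kwargs: Any) -> None:
        # Mirrors the line quoted above: a missing or None loader becomes {}.
        self._data_loader_kwargs = loader or {}


# A config that sets the loader section explicitly.
with_loader = SketchBackend(loader={"window_size_bytes": "auto"})

# A config with no loader section at all.
without_loader = SketchBackend()
```

In this sketch, `without_loader._data_loader_kwargs` is an empty dict, so a lookup such as `.get("window_size_bytes")` returns `None` instead of raising.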
if window_size_bytes:
def get_window_size_bytes(self, window_size_bytes: Optional[Union[int, Literal["auto"]]] = None) -> int:
Really nice use of `Literal`, we should use that more in Ludwig.
Great work! Thanks for cleaning up the implementation/API
This updates `RayDataset`, `RayDatasetManager`, and `RayBackend` to accept `"auto"` as a valid input for `window_size_bytes`, and consolidates `RayDataset` creation logic to handle auto-windowing.
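As a rough illustration of the behavior described in this PR, the standalone function below resolves a `window_size_bytes` setting. It is a sketch, not the code in `ludwig/data/dataset/ray.py`: `resolve_window_size_bytes`, `dataset_size_bytes`, and `available_memory_bytes` are hypothetical names, the `// 5` divisor is taken from the earlier `auto_window` docstring in the diff, and returning `0` to mean "no windowing" is an assumption of this example.

```python
from typing import Literal, Optional, Union

WindowSize = Optional[Union[int, Literal["auto"]]]


def resolve_window_size_bytes(
    window_size_bytes: WindowSize,
    dataset_size_bytes: int,
    available_memory_bytes: int,
) -> int:
    """Return a window size in bytes; 0 means "no windowing" (assumed here)."""
    window_size = 0
    if isinstance(window_size_bytes, int):
        # An explicit integer from the user is used as-is.
        window_size = window_size_bytes
    elif window_size_bytes == "auto":
        # Window only when the dataset does not fit in available memory.
        if dataset_size_bytes > available_memory_bytes:
            window_size = available_memory_bytes // 5
    elif window_size_bytes is not None:
        raise ValueError(f"Unsupported window_size_bytes: {window_size_bytes!r}")
    # Single exit point, as preferred in the review thread.
    return window_size
```

For example, `resolve_window_size_bytes("auto", 1000, 100)` gives `20`, while the same call with a dataset of 50 bytes gives `0` because the dataset already fits in memory.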