-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move examples out, merge base/ upward #494
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Few minor comments:
- Can you please update the PR description with details on this PR is about?
- The
docs/source/how_to_guides/dataset_conversion_to_mds_format.md
line 22 still reference../../../streaming/base/converters/README.md
. Can you please change that?
@@ -128,7 +128,7 @@ def _check_and_find(streams_local: List[str], streams_remote: List[Union[str, No | |||
f'Reused local directory: {streams_local} vs ' + | |||
f'{their_locals}. Provide a different one. If using ' + | |||
f'a unique local directory, try deleting the local directory and ' + | |||
f'call `streaming.base.util.clean_stale_shared_memory()` only once ' + | |||
f'call `streaming.util.clean_stale_shared_memory()` only once ' + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change where if the user calls streaming.base.util.clean_stale_shared_memory()
, it will fail since the base
module no longer exists, and this API is being used outside. Is there a way where the user can call streaming.base.util.clean_stale_shared_memory()
but will see a deprecation warning stating that the API path has been changed to streaming.util.clean_stale_shared_memory()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old code, this was not exported from either streaming/__init__.py
or streaming/base/__init__.py
, so I thought this is not covered by the no breaking changes rule?
In any case, I can mirror the old file tree, where each base/
file just does ye olde redirect(streaming/base/<path> -> streaming/<path>)
, for everything since it would be hard to justify only doing that one method?
@@ -0,0 +1,154 @@ | |||
# Copyright 2023 MosaicML Streaming authors | |||
# SPDX-License-Identifier: Apache-2.0 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like this file does NOT fit well here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I agree. Are people using these classes? if so, can we put this back in a streaming/vision
directory or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm.
How much torchvisioning have you done? https://pytorch.org/vision/main/_modules/index.html 👀 They're basically all VisionDataset
s -> this is a standard PyTorch idiom and is like 1/3 as basey as StreamingDataset
, which is basey enough for me :)
And recall the time RR had to key off some composer behavior on whether a streaming dataset was vision.
But yeah, we should only keep base classes in streaming/
, so we need to delete StreamingDataset
because it's not the base class, it descends from Array
along with LocalDataset
, to which we may add more subclasses in the future for different use cases (c.f. that user request about a lazy map-style non-streaming Streaming dataset, etc).
Also these files are pretty diverse already, I don't think a (modality).py sticks out?
I totally see moving the Streaming-specific innovation convert_image_class_dataset()
out to a examples/vision/base.py
, but the rest is a la torchvision/datasets/vision.py, let's keep it in a streaming/vision.py
I think?
array.py
batching/
compression.py
constant.py
converters/
dataloader.py
dataset.py
distributed.py
format/
hashing.py
__init__.py
local.py
partition/
py.typed
sampling.py
shared/
shuffle/
spanner.py
storage/
stream.py
util.py
_version.py
vision.py
world.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just had some minor comments and agree with Karan's points, otherwise looks good
@@ -0,0 +1,154 @@ | |||
# Copyright 2023 MosaicML Streaming authors | |||
# SPDX-License-Identifier: Apache-2.0 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I agree. Are people using these classes? if so, can we put this back in a streaming/vision
directory or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on our offline discussion, adding a request change flag here to merge this PR in a separate branch rather than main
if we have to do another patch release. Once the Delta Table changes land in, we can create a big PR to main
and merge it. For now, we can develop all the stuff in a separate branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me.
- Can we add a
warnings.warn('', DeprecationWarning, stacklevel=2)
in methodclean_stale_shared_memory()
stating that the import has been moved fromstreaming.base.util.clean_stale_shared_memory()
tostreaming.util.clean_stale_shared_memory()
and the old import will be removed in Streaming 0.9? - Also, going forward, you will have to run the doc locally to ensure it passes since the doc CI does not run on non-main branches. The steps is mentioned here.
```{include} ../../../streaming/text/convert/README.md | ||
:start-line: 8 | ||
``` | ||
[examples/text/](../examples/text/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like README.md
is missing in ../examples/text/
directory! Do you know what the doc will look like after this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It was at the convert/
subdir level. I went this way for, can improve in subsequent PRs.
```{include} ../../../streaming/vision/convert/README.md | ||
:start-line: 8 | ||
``` | ||
[examples/vision/](../examples/vision/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It was at the convert/
subdir level. I went this way for now, can improve in subsequent PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some minor comments. also, do the examples notebooks all work with the code reorg? want to make sure we got all the changes right
@@ -8,7 +8,7 @@ | |||
|
|||
import numpy as np | |||
|
|||
from streaming.multimodal.webvid import StreamingInsideWebVid | |||
from examples.multimodal.webvid.read import StreamingInsideWebVid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure that examples
is included when installing the mosaicml-streaming
package, like in a new venv or something? Ran into a similar issue with the simulator and want to make sure that the reorg doesn't affect this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, was worrying about that. How did you handle this wrt simulator?
Tentatively, since I am repurposing examples/
, you would think it should be safe, and notebooks/
and benchmarks/
might be a problem.
Following up, when I look at setup.py
, there is this line
packages=setuptools.find_packages(exclude=['tests*']),
which seems to suggest we're fine? What say you.
@@ -7,7 +7,7 @@ | |||
import numpy as np | |||
from PIL import Image | |||
|
|||
from streaming.vision.convert.base import convert_image_class_dataset | |||
from streaming.vision import convert_image_class_dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as per prev review, is the function convert_image_class_dataset
living in a separate directory or just in vision.py
? just wanted to make sure this is in the right place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out.
Argument from symmetry: since it writes sample pairs as x
y
, which then get accessed by StreamingVisionDataset
as sample pairs of x
y
, let us have both (but it should be renamed symmetrically).
Argument from expedience: this is minor and we will still have time to fix in subsequent PRs before release, so let's move forward.
I'd like to then do a quick followup PR to rewrite that ridiculous method to this (or so):
def convert_vision_dataset(
*,
samples: Iterable[Dict[str, Any]],
shuffle: bool = True,
seed: int = 9176,
out_root: str,
split: Optional[str] = None,
hashes: Optional[Sequence[str]] = None,
size_limit: int = 1 << 25,
show_progress: bool = True,
encoding: str = 'png') -> None:
Finally, surely the single-letter column name thing was a bad idea too, not least because it makes it look like column name length matters to time and/or serialized size...
How about this?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving the PR for now so the upcoming PR is unblocked and let's fix the doc build in the upcoming PR.
* scripts/ -> benchmarks/. * examples/ -> notebooks/. * streaming/multimodal/ -> examples/multimodal/ (reorganized). * streaming/text/ -> examples/text (reorganized). * streaming/vision/base.py -> streaming/base/vision.py. * Switch streaming/base/vision.py to kwargs. * streaming/vision/ -> examples/vision/. * Update pyproject.toml. * And .pre-commit-config.yaml. * Fix headers. * Collapse "base/": streaming/base/ -> streaming/. * Fil imports re: collapsing the `base/` dirs upward. * Fixes (imports and indentation). * Update test_streaming_remote.py to not rely on any specific SD example subclasses * Fix pypyroject config. * Update paths. * Fix. * More examples/ moves. * Comma-tailing args. * Fix links. * More fixes. * Fix missing license. * How about this for import redirects... * Or this... * Improve redirect deprecation warning. * examples/ tree: __init__ imports and __all__'s.^ * benchmarks/ tree: __init__ imports and __all__'s * notebooks/ tree: __init__ imports and __all__'s. * Add notebooks/ symlink to docs/source. * Add benchmarks, examples, and notebooks trees to document_modules. * Also add benchmarks symlink. Or should we only symlink to notebooks/?
Context: breaking PR 488 into around ten sub-PRs, of which this is the first.
When streaming repo started, it was laid out like
The big picture was let’s, among other goals, exhaustively implement all the well-known datasets out there, starting with at least mirroring the repertoire of torchvision and other torchmodality datasets but doing it all in MDS.
I don’t think we have a good picture of how streaming is used in totality, but it seems like the pattern is that people tend to own the data prep/serialization and any custom StreamingDataset subclass work/deserialization entirely themselves. They want to train on their own data.
The datasets we built into Streaming as “batteries included” are not exhaustive, getting even good coverage is impossible, and suspect they are not being used directly generally speaking, but instead as examples or what you can do with streaming.
On that basis,
Overview:
Old code org:
New code org: