Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move examples out, merge base/ upward #494

Merged
merged 36 commits into from
Nov 18, 2023
Merged

Move examples out, merge base/ upward #494

merged 36 commits into from
Nov 18, 2023

Conversation

knighton
Copy link
Contributor

@knighton knighton commented Nov 2, 2023

Context: breaking PR 488 into around ten sub-PRs, of which this is the first.

When streaming repo started, it was laid out like

streaming/
    base/
    text/
    vision/
    (future: audio/?)
    (future: multimodal/?)

The big picture was let’s, among other goals, exhaustively implement all the well-known datasets out there, starting with at least mirroring the repertoire of torchvision and other torchmodality datasets but doing it all in MDS.

I don’t think we have a good picture of how streaming is used in totality, but it seems like the pattern is that people tend to own the data prep/serialization and any custom StreamingDataset subclass work/deserialization entirely themselves. They want to train on their own data.

The datasets we built into Streaming as “batteries included” are not exhaustive, getting even good coverage is impossible, and suspect they are not being used directly generally speaking, but instead as examples or what you can do with streaming.

On that basis,

  • We are not in the “implementing all the datasets” business, we have users now to drive what we are doing, that work belongs to them
  • However, all your base continues to belong to us
  • Our SD subclasses are useful as examples. Jettison every streaming subdir not named base/ using them to populate top level examples/ dir, how about one examples/ subdir per end-to-end example
  • Rename streaming/base/ -> streaming/
  • Some of scripts/ should move to examples/ too
  • Public imports that we all use, at the streaming/ level, continue to work unchanged

Overview:

    examples/ -> notebooks/
    streaming/{multimodal, vision, text}/ -> examples/
    scripts/ -> mostly benchmarks/, except for examples/
    streaming/base/format/base/ -> streaming/format/
    streaming/base/ -> streaming/

Old code org:

.
├── docs
│   └── source
│       ├── examples -> ../../examples
│       ├── fundamentals
│       ├── getting_started
│       ├── how_to_guides
│       ├── _static
│       │   ├── css
│       │   ├── images
│       │   └── js
│       └── _templates
│           └── sidebar
├── examples
├── regression
├── scripts
│   ├── compression
│   ├── epoch
│   ├── hashing
│   ├── partition
│   ├── samples
│   ├── serialization
│   ├── shuffle
│   └── webvid
├── simulation
│   ├── core
│   ├── interfaces
│   └── testing
├── streaming
│   ├── base
│   │   ├── batching
│   │   ├── converters
│   │   ├── format
│   │   │   ├── base
│   │   │   ├── json
│   │   │   ├── mds
│   │   │   └── xsv
│   │   ├── partition
│   │   ├── shared
│   │   ├── shuffle
│   │   └── storage
│   ├── multimodal
│   │   └── convert
│   │       ├── laion
│   │       │   └── laion400m
│   │       └── webvid
│   ├── text
│   │   └── convert
│   │       └── enwiki
│   │           ├── mds
│   │           └── tfrecord
│   └── vision
│       └── convert
└── tests
    ├── base
    │   └── converters
    └── common

New code org:

.
├── benchmarks
│   ├── compression
│   ├── epoch
│   ├── hashing
│   ├── partition
│   ├── samples
│   ├── serialization
│   └── shuffle
├── docs
│   └── source
│       ├── examples -> ../../examples
│       ├── fundamentals
│       ├── getting_started
│       ├── how_to_guides
│       ├── _static
│       │   ├── css
│       │   ├── images
│       │   └── js
│       └── _templates
│           └── sidebar
├── examples
│   ├── multimodal
│   │   ├── laion400m
│   │   └── webvid
│   │       ├── scripts
│   │       └── write
│   ├── text
│   │   ├── c4
│   │   ├── enwiki_tok
│   │   │   ├── mds
│   │   │   └── tfrecord
│   │   ├── enwiki_txt
│   │   └── pile
│   └── vision
│       ├── ade20k
│       ├── cifar10
│       ├── coco
│       └── imagenet
├── notebooks
├── regression
├── simulation
│   ├── core
│   ├── interfaces
│   └── testing
├── streaming
│   ├── batching
│   ├── converters
│   ├── format
│   │   ├── json
│   │   ├── mds
│   │   └── xsv
│   ├── partition
│   ├── shared
│   ├── shuffle
│   └── storage
└── tests
    ├── base
    │   └── converters
    └── common

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Few minor comments:

  • Can you please update the PR description with details on this PR is about?
  • The docs/source/how_to_guides/dataset_conversion_to_mds_format.md line 22 still reference ../../../streaming/base/converters/README.md. Can you please change that?

examples/text/c4/README.md Outdated Show resolved Hide resolved
examples/vision/ade20k/__init__.py Outdated Show resolved Hide resolved
@@ -128,7 +128,7 @@ def _check_and_find(streams_local: List[str], streams_remote: List[Union[str, No
f'Reused local directory: {streams_local} vs ' +
f'{their_locals}. Provide a different one. If using ' +
f'a unique local directory, try deleting the local directory and ' +
f'call `streaming.base.util.clean_stale_shared_memory()` only once ' +
f'call `streaming.util.clean_stale_shared_memory()` only once ' +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change where if the user calls streaming.base.util.clean_stale_shared_memory(), it will fail since the base module no longer exists, and this API is being used outside. Is there a way where the user can call streaming.base.util.clean_stale_shared_memory() but will see a deprecation warning stating that the API path has been changed to streaming.util.clean_stale_shared_memory()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old code, this was not exported from either streaming/__init__.py or streaming/base/__init__.py, so I thought this is not covered by the no breaking changes rule?

In any case, I can mirror the old file tree, where each base/ file just does ye olde redirect(streaming/base/<path> -> streaming/<path>), for everything since it would be hard to justify only doing that one method?

pyproject.toml Show resolved Hide resolved
@@ -0,0 +1,154 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

Copy link
Collaborator

@karan6181 karan6181 Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like this file does NOT fit well here!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I agree. Are people using these classes? if so, can we put this back in a streaming/vision directory or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.

How much torchvisioning have you done? https://pytorch.org/vision/main/_modules/index.html 👀 They're basically all VisionDatasets -> this is a standard PyTorch idiom and is like 1/3 as basey as StreamingDataset, which is basey enough for me :)

And recall the time RR had to key off some composer behavior on whether a streaming dataset was vision.

But yeah, we should only keep base classes in streaming/, so we need to delete StreamingDataset because it's not the base class, it descends from Array along with LocalDataset, to which we may add more subclasses in the future for different use cases (c.f. that user request about a lazy map-style non-streaming Streaming dataset, etc).

Also these files are pretty diverse already, I don't think a (modality).py sticks out?

I totally see moving the Streaming-specific innovation convert_image_class_dataset() out to a examples/vision/base.py, but the rest is a la torchvision/datasets/vision.py, let's keep it in a streaming/vision.py I think?

array.py
batching/
compression.py
constant.py
converters/
dataloader.py
dataset.py
distributed.py
format/
hashing.py
__init__.py
local.py
partition/
py.typed
sampling.py
shared/
shuffle/
spanner.py
storage/
stream.py
util.py
_version.py
vision.py
world.py

Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just had some minor comments and agree with Karan's points, otherwise looks good

@@ -0,0 +1,154 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I agree. Are people using these classes? if so, can we put this back in a streaming/vision directory or something?

Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our offline discussion, adding a request change flag here to merge this PR in a separate branch rather than main if we have to do another patch release. Once the Delta Table changes land in, we can create a big PR to main and merge it. For now, we can develop all the stuff in a separate branch.

@knighton knighton changed the base branch from main to dev November 14, 2023 20:07
Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me.

  • Can we add a warnings.warn('', DeprecationWarning, stacklevel=2) in method clean_stale_shared_memory() stating that the import has been moved from streaming.base.util.clean_stale_shared_memory() to streaming.util.clean_stale_shared_memory() and the old import will be removed in Streaming 0.9?
  • Also, going forward, you will have to run the doc locally to ensure it passes since the doc CI does not run on non-main branches. The steps is mentioned here.

```{include} ../../../streaming/text/convert/README.md
:start-line: 8
```
[examples/text/](../examples/text/)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like README.md is missing in ../examples/text/ directory! Do you know what the doc will look like after this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It was at the convert/ subdir level. I went this way for, can improve in subsequent PRs.

```{include} ../../../streaming/vision/convert/README.md
:start-line: 8
```
[examples/vision/](../examples/vision/)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It was at the convert/ subdir level. I went this way for now, can improve in subsequent PRs.

Copy link
Collaborator

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor comments. also, do the examples notebooks all work with the code reorg? want to make sure we got all the changes right

@@ -8,7 +8,7 @@

import numpy as np

from streaming.multimodal.webvid import StreamingInsideWebVid
from examples.multimodal.webvid.read import StreamingInsideWebVid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure that examples is included when installing the mosaicml-streaming package, like in a new venv or something? Ran into a similar issue with the simulator and want to make sure that the reorg doesn't affect this as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, was worrying about that. How did you handle this wrt simulator?

Tentatively, since I am repurposing examples/, you would think it should be safe, and notebooks/ and benchmarks/ might be a problem.

Following up, when I look at setup.py, there is this line

    packages=setuptools.find_packages(exclude=['tests*']),

which seems to suggest we're fine? What say you.

@@ -7,7 +7,7 @@
import numpy as np
from PIL import Image

from streaming.vision.convert.base import convert_image_class_dataset
from streaming.vision import convert_image_class_dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per prev review, is the function convert_image_class_dataset living in a separate directory or just in vision.py? just wanted to make sure this is in the right place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out.

Argument from symmetry: since it writes sample pairs as x y, which then get accessed by StreamingVisionDataset as sample pairs of x y, let us have both (but it should be renamed symmetrically).

Argument from expedience: this is minor and we will still have time to fix in subsequent PRs before release, so let's move forward.

I'd like to then do a quick followup PR to rewrite that ridiculous method to this (or so):

def convert_vision_dataset(
    *,
    samples: Iterable[Dict[str, Any]],
    shuffle: bool = True,
    seed: int = 9176,
    out_root: str,
    split: Optional[str] = None,
    hashes: Optional[Sequence[str]] = None,
    size_limit: int = 1 << 25,
    show_progress: bool = True,
    encoding: str = 'png') -> None:

Finally, surely the single-letter column name thing was a bad idea too, not least because it makes it look like column name length matters to time and/or serialized size...

@knighton
Copy link
Contributor Author

Overall looks good to me.

* Can we add a `warnings.warn('', DeprecationWarning, stacklevel=2)` in method `clean_stale_shared_memory()` stating that the import has been moved from `streaming.base.util.clean_stale_shared_memory()` to `streaming.util.clean_stale_shared_memory()` and the old import will be removed in Streaming 0.9?

* Also, going forward, you will have to run the doc locally to ensure it passes since the doc CI does not run on non-main branches. The steps is mentioned [here](https://github.com/mosaicml/streaming/blob/main/CONTRIBUTING.md#submitting-a-contribution).

How about this?

-    warn(f'Please update your imports: {old_fqdn} has moved to {new_fqdn}.')
+    warn(f'Please update your imports: {old_fqdn} has moved to {new_fqdn}.',
+         DeprecationWarning,
+         stacklevel=2)

Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the PR for now so the upcoming PR is unblocked and let's fix the doc build in the upcoming PR.

@knighton knighton merged commit 7f5d160 into dev Nov 18, 2023
1 check passed
@knighton knighton deleted the james/reorg branch November 18, 2023 01:51
karan6181 pushed a commit that referenced this pull request Jan 26, 2024
* scripts/ -> benchmarks/.

* examples/ -> notebooks/.

* streaming/multimodal/ -> examples/multimodal/ (reorganized).

* streaming/text/ -> examples/text (reorganized).

* streaming/vision/base.py -> streaming/base/vision.py.

* Switch streaming/base/vision.py to kwargs.

* streaming/vision/ -> examples/vision/.

* Update pyproject.toml.

* And .pre-commit-config.yaml.

* Fix headers.

* Collapse "base/": streaming/base/ -> streaming/.

* Fil imports re: collapsing the `base/` dirs upward.

* Fixes (imports and indentation).

* Update test_streaming_remote.py to not rely on any specific SD example subclasses

* Fix pypyroject config.

* Update paths.

* Fix.

* More examples/ moves.

* Comma-tailing args.

* Fix links.

* More fixes.

* Fix missing license.

* How about this for import redirects...

* Or this...

* Improve redirect deprecation warning.

* examples/ tree: __init__ imports and __all__'s.^

* benchmarks/ tree: __init__ imports and __all__'s

* notebooks/ tree: __init__ imports and __all__'s.

* Add notebooks/ symlink to docs/source.

* Add benchmarks, examples, and notebooks trees to document_modules.

* Also add benchmarks symlink. Or should we only symlink to notebooks/?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants