Move examples out, merge base/ upward #494

knighton · 2023-11-02T16:35:37Z

Context: breaking PR 488 into around ten sub-PRs, of which this is the first.

When the streaming repo started, it was laid out like this:

streaming/
    base/
    text/
    vision/
    (future: audio/?)
    (future: multimodal/?)

The big-picture goal, among others, was to exhaustively implement all the well-known datasets out there, starting by at least mirroring the repertoire of torchvision and other torchmodality datasets, but doing it all in MDS.

I don’t think we have a good picture of how streaming is used in totality, but it seems like the pattern is that people tend to own the data prep/serialization and any custom StreamingDataset subclass work/deserialization entirely themselves. They want to train on their own data.

The datasets we built into Streaming as “batteries included” are not exhaustive, and getting even good coverage is impossible. I suspect they are generally not being used directly, but instead serve as examples of what you can do with streaming.

On that basis:

* We are not in the “implementing all the datasets” business. We have users now to drive what we are doing, and that work belongs to them.
* However, all your base continues to belong to us.
* Our SD subclasses are useful as examples. Jettison every streaming/ subdir not named base/, using them to populate a top-level examples/ dir. How about one examples/ subdir per end-to-end example?
* Rename streaming/base/ -> streaming/.
* Some of scripts/ should move to examples/ too.
* Public imports that we all use, at the streaming/ level, continue to work unchanged.

Overview:

    examples/ -> notebooks/
    streaming/{multimodal, vision, text}/ -> examples/
    scripts/ -> mostly benchmarks/, except for examples/
    streaming/base/format/base/ -> streaming/format/
    streaming/base/ -> streaming/

Old code org:

.
├── docs
│   └── source
│       ├── examples -> ../../examples
│       ├── fundamentals
│       ├── getting_started
│       ├── how_to_guides
│       ├── _static
│       │   ├── css
│       │   ├── images
│       │   └── js
│       └── _templates
│           └── sidebar
├── examples
├── regression
├── scripts
│   ├── compression
│   ├── epoch
│   ├── hashing
│   ├── partition
│   ├── samples
│   ├── serialization
│   ├── shuffle
│   └── webvid
├── simulation
│   ├── core
│   ├── interfaces
│   └── testing
├── streaming
│   ├── base
│   │   ├── batching
│   │   ├── converters
│   │   ├── format
│   │   │   ├── base
│   │   │   ├── json
│   │   │   ├── mds
│   │   │   └── xsv
│   │   ├── partition
│   │   ├── shared
│   │   ├── shuffle
│   │   └── storage
│   ├── multimodal
│   │   └── convert
│   │       ├── laion
│   │       │   └── laion400m
│   │       └── webvid
│   ├── text
│   │   └── convert
│   │       └── enwiki
│   │           ├── mds
│   │           └── tfrecord
│   └── vision
│       └── convert
└── tests
    ├── base
    │   └── converters
    └── common

New code org:

.
├── benchmarks
│   ├── compression
│   ├── epoch
│   ├── hashing
│   ├── partition
│   ├── samples
│   ├── serialization
│   └── shuffle
├── docs
│   └── source
│       ├── examples -> ../../examples
│       ├── fundamentals
│       ├── getting_started
│       ├── how_to_guides
│       ├── _static
│       │   ├── css
│       │   ├── images
│       │   └── js
│       └── _templates
│           └── sidebar
├── examples
│   ├── multimodal
│   │   ├── laion400m
│   │   └── webvid
│   │       ├── scripts
│   │       └── write
│   ├── text
│   │   ├── c4
│   │   ├── enwiki_tok
│   │   │   ├── mds
│   │   │   └── tfrecord
│   │   ├── enwiki_txt
│   │   └── pile
│   └── vision
│       ├── ade20k
│       ├── cifar10
│       ├── coco
│       └── imagenet
├── notebooks
├── regression
├── simulation
│   ├── core
│   ├── interfaces
│   └── testing
├── streaming
│   ├── batching
│   ├── converters
│   ├── format
│   │   ├── json
│   │   ├── mds
│   │   └── xsv
│   ├── partition
│   ├── shared
│   ├── shuffle
│   └── storage
└── tests
    ├── base
    │   └── converters
    └── common

karan6181

Looks great. A few minor comments:

Can you please update the PR description with details on what this PR is about?
docs/source/how_to_guides/dataset_conversion_to_mds_format.md line 22 still references ../../../streaming/base/converters/README.md. Can you please change that?

examples/text/c4/README.md

examples/vision/ade20k/__init__.py

karan6181 · 2023-11-07T01:04:15Z

streaming/shared/prefix.py

@@ -128,7 +128,7 @@ def _check_and_find(streams_local: List[str], streams_remote: List[Union[str, No
                            f'Reused local directory: {streams_local} vs ' +
                            f'{their_locals}. Provide a different one. If using ' +
                            f'a unique local directory, try deleting the local directory and ' +
-                            f'call `streaming.base.util.clean_stale_shared_memory()` only once ' +
+                            f'call `streaming.util.clean_stale_shared_memory()` only once ' +


This is a breaking change where if the user calls streaming.base.util.clean_stale_shared_memory(), it will fail since the base module no longer exists, and this API is being used outside. Is there a way where the user can call streaming.base.util.clean_stale_shared_memory() but will see a deprecation warning stating that the API path has been changed to streaming.util.clean_stale_shared_memory()?

In the old code, this was not exported from either streaming/__init__.py or streaming/base/__init__.py, so I thought this is not covered by the no breaking changes rule?

In any case, I can mirror the old file tree, where each base/ file just does ye olde redirect(streaming/base/<path> -> streaming/<path>), for everything since it would be hard to justify only doing that one method?
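For readers following the thread, the redirect idea can be sketched roughly as below. This is not the code from this PR: `make_redirect` and the `demo_pkg` module names are hypothetical, and the sketch assumes a module-level `__getattr__` (PEP 562, Python 3.7+) is an acceptable way to forward lookups from the old path to the new one.

```python
import types
import warnings


def make_redirect(old_fqdn: str, new_module: types.ModuleType) -> types.ModuleType:
    """Return a stand-in for the module at ``old_fqdn`` that forwards to ``new_module``."""
    new_fqdn = new_module.__name__
    shim = types.ModuleType(old_fqdn)

    def __getattr__(name: str):
        # Called only for names the shim itself lacks (PEP 562), which covers
        # every public name because the shim defines nothing of its own.
        warnings.warn(
            f'Please update your imports: {old_fqdn} has moved to {new_fqdn}.',
            DeprecationWarning,
            stacklevel=2)
        return getattr(new_module, name)

    shim.__getattr__ = __getattr__
    return shim


# Hypothetical stand-in for the relocated module (the demo_pkg names are made up).
new_util = types.ModuleType('demo_pkg.util')
new_util.clean_stale_shared_memory = lambda: 'cleaned'

# Stand-in for the module at the old location.
old_util = make_redirect('demo_pkg.base.util', new_util)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = old_util.clean_stale_shared_memory()
```

The old call path keeps working and each lookup through it emits one `DeprecationWarning`. In a real package the shim would sit at the old file path, so existing imports resolve to it.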

pyproject.toml

karan6181 · 2023-11-07T01:25:21Z

streaming/vision.py

@@ -0,0 +1,154 @@
+# Copyright 2023 MosaicML Streaming authors
+# SPDX-License-Identifier: Apache-2.0
+


Feels like this file does NOT fit well here!

+1, I agree. Are people using these classes? if so, can we put this back in a streaming/vision directory or something?

Hmmm.

How much torchvisioning have you done? https://pytorch.org/vision/main/_modules/index.html 👀 They're basically all VisionDatasets -> this is a standard PyTorch idiom and is like 1/3 as basey as StreamingDataset, which is basey enough for me :)

And recall the time RR had to key off some composer behavior on whether a streaming dataset was vision.

But yeah, we should only keep base classes in streaming/, so we need to delete StreamingDataset because it's not the base class, it descends from Array along with LocalDataset, to which we may add more subclasses in the future for different use cases (cf. that user request about a lazy map-style non-streaming Streaming dataset, etc).

Also these files are pretty diverse already, I don't think a (modality).py sticks out?

I totally see moving the Streaming-specific innovation convert_image_class_dataset() out to an examples/vision/base.py, but the rest is a la torchvision/datasets/vision.py, so let's keep it in a streaming/vision.py I think?

array.py batching/ compression.py constant.py converters/ dataloader.py dataset.py distributed.py format/ hashing.py __init__.py local.py partition/ py.typed sampling.py shared/ shuffle/ spanner.py storage/ stream.py util.py _version.py vision.py world.py

snarayan21

just had some minor comments and agree with Karan's points, otherwise looks good

snarayan21 · 2023-11-07T17:18:21Z

streaming/vision.py

@@ -0,0 +1,154 @@
+# Copyright 2023 MosaicML Streaming authors
+# SPDX-License-Identifier: Apache-2.0
+


+1, I agree. Are people using these classes? if so, can we put this back in a streaming/vision directory or something?

karan6181

Based on our offline discussion, I'm adding a request-changes flag here so that this PR is merged into a separate branch rather than main, in case we have to do another patch release. Once the Delta Table changes land, we can create one big PR to main and merge it. For now, we can develop everything in a separate branch.

karan6181

Overall looks good to me.

* Can we add a `warnings.warn('', DeprecationWarning, stacklevel=2)` in method `clean_stale_shared_memory()` stating that the import has been moved from `streaming.base.util.clean_stale_shared_memory()` to `streaming.util.clean_stale_shared_memory()` and the old import will be removed in Streaming 0.9?
* Also, going forward, you will have to run the doc build locally to ensure it passes, since the doc CI does not run on non-main branches. The steps are mentioned [here](https://github.com/mosaicml/streaming/blob/main/CONTRIBUTING.md#submitting-a-contribution).

karan6181 · 2023-11-14T20:43:52Z

docs/source/how_to_guides/dataset_conversion_to_mds_format.md

-```{include} ../../../streaming/text/convert/README.md
-:start-line: 8
-```
+[examples/text/](../examples/text/)


Looks like README.md is missing in ../examples/text/ directory! Do you know what the doc will look like after this change?

Hmm. It was at the convert/ subdir level. I went this way for now, can improve in subsequent PRs.

karan6181 · 2023-11-14T20:43:59Z

docs/source/how_to_guides/dataset_conversion_to_mds_format.md

-```{include} ../../../streaming/vision/convert/README.md
-:start-line: 8
-```
+[examples/vision/](../examples/vision/)


Hmm. It was at the convert/ subdir level. I went this way for now, can improve in subsequent PRs.

snarayan21

just some minor comments. also, do the examples notebooks all work with the code reorg? want to make sure we got all the changes right

snarayan21 · 2023-11-14T21:37:16Z

examples/multimodal/webvid/scripts/bench_inside.py

@@ -8,7 +8,7 @@

 import numpy as np

-from streaming.multimodal.webvid import StreamingInsideWebVid
+from examples.multimodal.webvid.read import StreamingInsideWebVid


Can we make sure that examples is included when installing the mosaicml-streaming package, like in a new venv or something? Ran into a similar issue with the simulator and want to make sure that the reorg doesn't affect this as well.

Oh yeah, was worrying about that. How did you handle this wrt simulator?

Tentatively, since I am repurposing examples/, you would think it should be safe, and notebooks/ and benchmarks/ might be a problem.

Following up, when I look at setup.py, there is this line

packages=setuptools.find_packages(exclude=['tests*']),

which seems to suggest we're fine? What say you.

snarayan21 · 2023-11-14T21:39:32Z

examples/vision/cifar10/write_fake.py

@@ -7,7 +7,7 @@
 import numpy as np
 from PIL import Image

-from streaming.vision.convert.base import convert_image_class_dataset
+from streaming.vision import convert_image_class_dataset


as per prev review, is the function convert_image_class_dataset living in a separate directory or just in vision.py? just wanted to make sure this is in the right place

Thanks for pointing this out.

Argument from symmetry: since it writes sample pairs as x y, which then get accessed by StreamingVisionDataset as sample pairs of x y, let us have both (but it should be renamed symmetrically).

Argument from expedience: this is minor and we will still have time to fix in subsequent PRs before release, so let's move forward.

I'd like to then do a quick followup PR to rewrite that ridiculous method to this (or so):

    def convert_vision_dataset(
        *,
        samples: Iterable[Dict[str, Any]],
        shuffle: bool = True,
        seed: int = 9176,
        out_root: str,
        split: Optional[str] = None,
        hashes: Optional[Sequence[str]] = None,
        size_limit: int = 1 << 25,
        show_progress: bool = True,
        encoding: str = 'png',
    ) -> None:

Finally, surely the single-letter column name thing was a bad idea too, not least because it makes it look like column name length matters to time and/or serialized size...

knighton · 2023-11-17T10:52:22Z

> Overall looks good to me.
>
> * Can we add a `warnings.warn('', DeprecationWarning, stacklevel=2)` in method `clean_stale_shared_memory()` stating that the import has been moved from `streaming.base.util.clean_stale_shared_memory()` to `streaming.util.clean_stale_shared_memory()` and the old import will be removed in Streaming 0.9?
>
> * Also, going forward, you will have to run the doc locally to ensure it passes since the doc CI does not run on non-main branches. The steps are mentioned [here](https://github.com/mosaicml/streaming/blob/main/CONTRIBUTING.md#submitting-a-contribution).

How about this?

-    warn(f'Please update your imports: {old_fqdn} has moved to {new_fqdn}.')
+    warn(f'Please update your imports: {old_fqdn} has moved to {new_fqdn}.',
+         DeprecationWarning,
+         stacklevel=2)
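
As a side note on why `stacklevel=2` matters in the diff above, here is a small standalone sketch using only the standard `warnings` module. The function names are made up for illustration and are not part of Streaming.

```python
import warnings


def old_api_default() -> None:
    # Default stacklevel=1: the warning is attributed to this warn() call.
    warnings.warn('old_api has moved.', DeprecationWarning)


def old_api_stacklevel2() -> None:
    # stacklevel=2: the warning is attributed to the line that called this function.
    warnings.warn('old_api has moved.', DeprecationWarning, stacklevel=2)


def reported_line(func) -> int:
    """Call ``func`` and return the line number its warning is attributed to."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        func()  # With stacklevel=2, this is the line that gets reported.
    return caught[0].lineno


line_default = reported_line(old_api_default)
line_caller = reported_line(old_api_stacklevel2)
```

With the default, the reported location is inside the deprecated function itself. With `stacklevel=2` it is the caller's line, which is the code the user actually needs to change. By default Python only displays a `DeprecationWarning` when it is attributed to code in `__main__`, so the attribution also affects whether users see the message at all.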

karan6181

Approving the PR for now so the upcoming PR is unblocked and let's fix the doc build in the upcoming PR.

* scripts/ -> benchmarks/.
* examples/ -> notebooks/.
* streaming/multimodal/ -> examples/multimodal/ (reorganized).
* streaming/text/ -> examples/text (reorganized).
* streaming/vision/base.py -> streaming/base/vision.py.
* Switch streaming/base/vision.py to kwargs.
* streaming/vision/ -> examples/vision/.
* Update pyproject.toml.
* And .pre-commit-config.yaml.
* Fix headers.
* Collapse "base/": streaming/base/ -> streaming/.
* Fil imports re: collapsing the `base/` dirs upward.
* Fixes (imports and indentation).
* Update test_streaming_remote.py to not rely on any specific SD example subclasses
* Fix pypyroject config.
* Update paths.
* Fix.
* More examples/ moves.
* Comma-tailing args.
* Fix links.
* More fixes.
* Fix missing license.
* How about this for import redirects...
* Or this...
* Improve redirect deprecation warning.
* examples/ tree: __init__ imports and __all__'s.^
* benchmarks/ tree: __init__ imports and __all__'s
* notebooks/ tree: __init__ imports and __all__'s.
* Add notebooks/ symlink to docs/source.
* Add benchmarks, examples, and notebooks trees to document_modules.
* Also add benchmarks symlink. Or should we only symlink to notebooks/?

knighton added 14 commits November 2, 2023 00:07

scripts/ -> benchmarks/.

4e05d24

examples/ -> notebooks/.

eb65a8e

streaming/multimodal/ -> examples/multimodal/ (reorganized).

026947e

streaming/text/ -> examples/text (reorganized).

c39af6f

streaming/vision/base.py -> streaming/base/vision.py.

bb9f913

Switch streaming/base/vision.py to kwargs.

2d72f3a

streaming/vision/ -> examples/vision/.

61b0e53

Update pyproject.toml.

19dbda5

And .pre-commit-config.yaml.

56b4ff6

Fix headers.

38e0433

Collapse "base/": streaming/base/ -> streaming/.

144a7f6

Fil imports re: collapsing the base/ dirs upward.

1f6e9cf

Fixes (imports and indentation).

3c9b2cb

Update test_streaming_remote.py to not rely on any specific SD example subclasses

1f088be

knighton requested review from karan6181 and bandish-shah as code owners November 2, 2023 16:35

knighton added 5 commits November 2, 2023 09:35

Merge branch 'main' into james/reorg

0a116f6

Merge branch 'james/reorg' of github.com:mosaicml/streaming into james/reorg

13e4f34

Fix pypyroject config.

d7d79b1

Update paths.

d8946fd

Merge branch 'main' into james/reorg

a5becc5

karan6181 requested review from snarayan21 and XiaohanZhangCMU November 7, 2023 00:42

karan6181 reviewed Nov 7, 2023

knighton added 2 commits November 6, 2023 21:38

Fix.

4d1d0a0

Merge branch 'james/reorg' of github.com:mosaicml/streaming into james/reorg

1df1975

snarayan21 reviewed Nov 7, 2023

knighton added 2 commits November 9, 2023 17:44

More examples/ moves.

37a4dda

Comma-tailing args.

0da4cfd

knighton added 5 commits November 10, 2023 16:41

Fix links.

f345675

More fixes.

c04bbac

Merge branch 'main' into james/reorg

ae3331f

Fix missing license.

2adf2ed

How about this for import redirects...

8066ca1

karan6181 requested changes Nov 14, 2023

Or this...

1b8dabf

knighton changed the base branch from main to dev November 14, 2023 20:07

karan6181 reviewed Nov 14, 2023

snarayan21 reviewed Nov 14, 2023

Improve redirect deprecation warning.

36438ec

knighton added 6 commits November 17, 2023 03:33

examples/ tree: __init__ imports and __all__'s.^

1a1d913

benchmarks/ tree: __init__ imports and __all__'s

2d5dd8a

notebooks/ tree: __init__ imports and __all__'s.

1e21dd7

Add notebooks/ symlink to docs/source.

a775620

Add benchmarks, examples, and notebooks trees to document_modules.

a457740

Also add benchmarks symlink. Or should we only symlink to notebooks/?

f92db3b

karan6181 approved these changes Nov 18, 2023

knighton merged commit 7f5d160 into dev Nov 18, 2023

knighton deleted the james/reorg branch November 18, 2023 01:51
