Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add utitlity to visualize datapipe graphs #330

Closed
wants to merge 6 commits into from

Conversation

pmeier
Copy link
Contributor

@pmeier pmeier commented Mar 25, 2022

Closes #299.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 25, 2022
@pmeier pmeier requested a review from NivekT March 25, 2022 13:11
@pmeier pmeier marked this pull request as ready for review March 25, 2022 13:11
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noob question: Wondering do we have a potential extensibility using graphviz to visualize the attributes for each Node?

torchdata/datapipes/utils/_visualization.py Outdated Show resolved Hide resolved
@pmeier
Copy link
Contributor Author

pmeier commented Mar 25, 2022

Wondering do we have a potential extensibility using graphviz to visualize the attributes for each Node?

Sure. Everything we put in Node.__str__ will be displayed. For example, we could add the name of the functions of a Filter or Mapper to the node. Doing this would turn

from torchdata.datapipes.iter import IterableWrapper


def no_map(x):
    return x


def no_filter(x):
    return True


dp1, dp2, dp3 = IterableWrapper([]).fork(3)
dp1 = dp1.map(no_map)
dp2 = dp2.filter(no_filter)
dp = dp1.zip(dp2, dp3)

into

Digraph gv

This is why I suggested #284 (comment). This would reduce Node.__str__ to return str(self.dp).

@pmeier
Copy link
Contributor Author

pmeier commented Mar 28, 2022

@ejguan in #237 (comment)

Wondering what would be the visualized graph with such circular reference?

from torchdata.datapipes.iter import IterDataPipe, IterableWrapper

class CustomIterDataPipe(IterDataPipe):
    def noop(self, x):
        return x + 1

    def __init__(self):
        self._dp = IterableWrapper([1, 2, 4]).map(self.noop)

    def __iter__(self):
        yield from self._dp

graph

Due to the circular reference, we either need to use traverse(..., only_datapipe=True) as suggested in #330 (comment) or fix the traversal similar to what I proposed in #237 (comment). Otherwise, I don't think we need extra handling for "circular" datapipes.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.
@NivekT What do you think? We have inline doc for to_graph. Do we need an example for users shown in README or another place?

docs/source/conf.py Show resolved Hide resolved
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not getting back to this earlier. Overall, it looks very good to me!

I tried playing around with it to see if there are cases where the output is not sensible. This is the only case where things may be confusing, and I think we may want to show the ChildDataPipe (perhaps renamed to something else like DemuxChild1 or something). Admittedly, I think the code snippet below may not be representative of the usual use cases of DataPipes in a pipeline:

from torchdata.datapipes.iter import IterableWrapper
from torchdata.datapipes.utils import to_graph

dp = IterableWrapper(range(10))
dp = dp.map(lambda x: x + 1)
dp = dp.filter(lambda _: True)
dp1, dp2 = dp.demux(num_instances=2, classifier_fn=lambda x: x % 0 == 0)
dp3 = dp1.zip(dp2)
g = to_graph(dp3)
g.view()

Screen Shot 2022-05-26 at 6 26 16 PM

We can also apply the function on a few torchvision data pipelines and see what they look like.

@pmeier
Copy link
Contributor Author

pmeier commented May 27, 2022

I wrote a small script to test this on the torchvision datasets. Assuming this PR is merged and you have a nightly from torchdata and torchvision installed, this is the result.

very easy

CIFAR10

cifar10

easy

MNIST

mnist

medium

VOC

voc

difficult

COCO

coco_annotations_instances

CelebA

celeba

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a small script to test this on the torchvision datasets. Assuming this PR is merged and you have a nightly from torchdata and torchvision installed, this is the result.

very easy

CIFAR10

easy

MNIST

medium

VOC

difficult

COCO
CelebA

I wrote a small script to test this on the torchvision datasets. Assuming this PR is merged and you have a nightly from torchdata and torchvision installed, this is the result.

very easy

CIFAR10

easy

MNIST

medium

VOC

difficult

COCO
CelebA

Those graphs look complicated but sensible to me. Unless there are ways to simplify the pre-processing logic, I think they look good. I think the remaining part is the edge case with _ChildDataPipes combining back together immediately.

@pmeier
Copy link
Contributor Author

pmeier commented Jun 3, 2022

@NivekT I've added a debug: bool = False flag as explained in #330 (comment). This gives the following results for your example in #330 (review):

debug=False (default)

example

debug=True

example
`

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is good to land as it is. I will follow up with a different PR of documentations.

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pmeier pmeier deleted the graphviz branch June 9, 2022 06:40
@NivekT NivekT mentioned this pull request Jun 9, 2022
NivekT pushed a commit that referenced this pull request Jun 9, 2022
Summary:
Closes #299.

Pull Request resolved: #330

Reviewed By: ejguan

Differential Revision: D36990978

Pulled By: NivekT

fbshipit-source-id: 423a52e6d59da59826d1422676d28287537dd902
NivekT added a commit that referenced this pull request Jun 9, 2022
…507)

* add utility to visualize datapipe graphs (#330)

Summary:
Closes #299.

Pull Request resolved: #330

Reviewed By: ejguan

Differential Revision: D36990978

Pulled By: NivekT

fbshipit-source-id: 423a52e6d59da59826d1422676d28287537dd902

* Update examples and tutorial after automatic sharding has landed (#505)

Summary:
Pull Request resolved: #505

After the `DataLoader` automatic sharding for DataPipe PR has landed in core, this PR is updating the relevant example and tutorial

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin, ejguan

Differential Revision: D37040383

Pulled By: NivekT

fbshipit-source-id: 446b9e6a0ae804b1a0908a26058680f9b3c2f080

Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Print out a graph of DataPipe from the Dictionary returned from traverse function
4 participants