Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] Ensure all DataPipes Meet Testing Requirements #106

Open
NivekT opened this issue Nov 10, 2021 · 5 comments
Open

[DataPipe] Ensure all DataPipes Meet Testing Requirements #106

NivekT opened this issue Nov 10, 2021 · 5 comments

Comments

@NivekT
Copy link
Contributor

NivekT commented Nov 10, 2021

🚀 Feature

We have many tests for existing DataPipes (both in PyTorch Core and TorchData). However, over time, they have become less organized. Moreover, as the testing requirements expand, older DataPipes may not have tests to cover the newly added requirements.

This issue aims to track the status of tests for all DataPipes.

Motivation

We want to ensure test coverage for all DataPipe is complete to reduce bugs and unexpected behavior.

Alternative

We also should create some testing templates for IterDataPipe and MapDataPipe that can be widely applied.

IterDataPipe Tracker

X - Done
NA - Not Applicable
Blank - Not Done/Unclear

Test definitions:
Functional - unit test to ensure that the DataPipe works properly with various input arguments
Reset - DataPipe can be reset/restart after being read
__len__ - the __len__ method is implemented whenever possible (or explicitly not implemented)
Serializable - DataPipe is serializable
Graph (future) - can be traversed as part of a DataPipe graph
Snapshot (future) - can be saved/loaded as a checkpoint/snapshot

Name Module Functional Test Reset __len__ Serializable (Pickable) Graph Snapshot
Batcher Core X X X X
Collator Core X X X X
Concater Core X X X X
Demultiplexer Core X X X X
FileLister Core X X X X
FileOpener Core X X X X
Filter Core X X X X
Forker Core X X X X
Grouper Core X X X
IterableWrapper Core X X X X
Mapper Core X X X X
Multiplexer Core X X X X
RoutedDecoder Core X X X X
Sampler Core X X X X
Shuffler Core X X X X
StreamReader Core X X X X
UnBatcher Core X X X
Zipper Core X X X X
BucketBatcher Data X X X X
CSVDictParser Data X X X X
CSVParser Data X X X X
Cycler Data X X X X
DataFrameMaker Data X X X X
Decompressor Data X X X X
Enumerator Data X X X X
FlatMapper Data X X X X
FSSpecFileLister Data X X X X
FSSpecFileOpener Data X X X X
FSSpecSaver Data X X X X
GDriveReader Data X X X X
HashChecker Data X X X X
Header Data X X X X
HttpReader Data X X X X
InMemoryCacheHolder Data X X X X
IndexAdder Data X X X X
IoPathFileLister Data X X X X
IoPathFileOpener Data X X X X
IoPathSaver Data X X X X
IterKeyZipper Data X X X X
JsonParser Data X X X X
LineReader Data X X X X
MapKeyZipper Data X X X X
OnDiskCacheHolder Data X X X X
OnlineReader Data X X X X
ParagraphAggregator Data X X X X
ParquetDataFrameLoader Data X X X X
RarArchiveLoader Data X X X X
Rows2Columnar Data X X X X
SampleMultiplexer Data X X X X
Saver Data X X X X
TarArchiveLoader Data X X X X
UnZipper Data X X X X
XzFileLoader Data X X X X
ZipArchiveLoader Data X X X X

MapDataPipe Tracker

X - Done
NA - Not Applicable
Blank - Not Done/Unclear

Name Module Functional Test __len__ Serializable (Pickable) Graph Snapshot
Batcher Core X X
Concater Core X X
Mapper Core X X X
SequenceWrapper Core X X X
Shuffler Core X X
Zipper Core X X

cc: @ejguan @VitalyFedyunin @NivekT

@ejguan
Copy link
Contributor

ejguan commented Nov 10, 2021

This is awesome. One nit note: serializable should be same as picklable IMO.

NivekT added a commit that referenced this issue Mar 3, 2022
Related to this issue:
* #106

[ghstack-poisoned]
NivekT added a commit that referenced this issue Mar 3, 2022
Related to this issue:
* #106

[ghstack-poisoned]
NivekT added a commit that referenced this issue Mar 3, 2022
Related to this issue:
* #106

[ghstack-poisoned]
NivekT added a commit that referenced this issue Mar 4, 2022
…Pipes"


Related to this issue:
* #106

Differential Revision: [D34625707](https://our.internmc.facebook.com/intern/diff/D34625707)

[ghstack-poisoned]
NivekT added a commit that referenced this issue Mar 4, 2022
Related to this issue:
* #106

Differential Revision: [D34625707](https://our.internmc.facebook.com/intern/diff/D34625707)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this issue Mar 7, 2022
Summary:
Pull Request resolved: #275

Related to this issue:
* #106

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34625707

Pulled By: NivekT

fbshipit-source-id: 896761a9d7c01efc172e579930ae75489ac2d5a0
@ejguan
Copy link
Contributor

ejguan commented Apr 20, 2022

@NivekT
I am concerning about when and how we want to do graph testing. For a single DataPipe instance, the graph testing makes no sense. Then, we may want to construct a datapipe graph by ourselves. The problem is how we can guarantee the testing coverage for all use cases.

@VitalyFedyunin
Copy link
Contributor

We can require each DataPipe to introduce a simple usage example graph for this purpose.

@ejguan
Copy link
Contributor

ejguan commented Apr 28, 2022

When we have time, we might need to go over our DataPipes again to identify any missing test since there are a few DataPipe implemented recently.

Besides, for future reference, we might need to improve our testing framework to something similar to OpInfo in PyTorch Core to run the testing coverage automatically without we go over each test by ourselves.

@NivekT
Copy link
Contributor Author

NivekT commented Apr 28, 2022

When we have time, we might need to go over our DataPipes again to identify any missing test since there are a few DataPipe implemented recently.

Besides, for future reference, we might need to improve our testing framework to something similar to OpInfo in PyTorch Core to run the testing coverage automatically without we go over each test by ourselves.

Agreed that the OpInfo-like way is probably the best. I think our inputs and necessary setup for each test is a bit all over the place. Having tests split between two repos doesn't help either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants