Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Dataset.to_datapipe for converting PyG datasets into a torchdata DataPipe #6141

Merged
merged 7 commits into from
Jan 3, 2023

Conversation

hatemhelal
Copy link
Contributor

@hatemhelal hatemhelal commented Dec 5, 2022

There is some overlap in functionality in the torch DataPipe API with features that are already available in PyG's dataset implementations:

  • downloading from remote filesystems
  • on-disk and in-memory caching
  • handling extraction of data from compressed files
  • etc

This patch adds a @classmethod onto the PyG Dataset interface to support converting any PyG dataset into a DataPipe.

from torch_geometric.datasets import QM9

dp = QM9.to_datapipe(root='./data/QM9/')
dp = dp.batch_graphs(batch_size=2, drop_last=True)

for batch in dp:
    pass

Todo:

  • decide on the right base class for DatasetAdapter : should it be MapDataPipe or is IterDataPipe
  • add a lazy len method
  • unittests

@codecov
Copy link

codecov bot commented Dec 5, 2022

Codecov Report

Merging #6141 (962cb4b) into master (6325f6d) will increase coverage by 0.00%.
The diff coverage is 90.47%.

❗ Current head 962cb4b differs from pull request most recent head 079d9e9. Consider uploading reports for the commit 079d9e9 to get more accurate results

@@           Coverage Diff           @@
##           master    #6141   +/-   ##
=======================================
  Coverage   84.32%   84.33%           
=======================================
  Files         387      387           
  Lines       21364    21385   +21     
=======================================
+ Hits        18016    18035   +19     
- Misses       3348     3350    +2     
Impacted Files Coverage Δ
torch_geometric/data/datapipes.py 62.90% <88.88%> (+10.63%) ⬆️
torch_geometric/data/dataset.py 91.21% <100.00%> (+0.18%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@hatemhelal hatemhelal marked this pull request as ready for review December 5, 2022 10:28
@hatemhelal
Copy link
Contributor Author

Todo:

  • decide on the right base class for DatasetAdapter : should it be MapDataPipe or is IterDataPipe
  • IterDataPipe is preferred since there are more pipe implementations available for the iterable version (including PyG's batcher)
  • add a lazy len method
  • decided to capture the length of the dataset when constructing the pipe adapter and using that.
  • unittests
  • added a basic test in test_dataset.py

Copy link
Member

@rusty1s rusty1s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to merge this. I changed the to_datapipe method to an instance method - I think this is a bit more intuitive.

@rusty1s rusty1s merged commit f30a18c into pyg-team:master Jan 3, 2023
@hatemhelal
Copy link
Contributor Author

Thank you for merging this!

I changed the to_datapipe method to an instance method - I think this is a bit more intuitive.

I agree it is a bit more intuitive as a method but I will need to double check the behaviour with multiprocessing. I think I wanted to avoid having the dataset as a class property of the pipe since I recall this as being more efficient when using multiprocessing since the dataset can be lazily initialised within each worker....at least in theory. Will check the behaviour with the latest pyg-nightly.

@rusty1s
Copy link
Member

rusty1s commented Jan 9, 2023

What's the benefit of initializing the dataset in each worker rather than making use of shared memory?

Sorry I went ahead with this. If it does not fit your needs, please feel free to adjust this :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants