Use PyTorch Data and new DataPipes #201
Consider #167: given the possibilities provided by DataPipes, I was thinking about letting the user apply the transforms to the events externally, making them aware of the datapipe structure, instead of embedding the transforms in the dataset. The transform would be mapped to an `IterDataPipe` object through the `map` method. Here's some example code:

```python
from tonic.prototype.datasets import nmnist
from tonic.prototype.transforms import ToFrame

dp = nmnist.nmnist(root="./data")
t = ToFrame(sensor_size=nmnist.sensor_size, time_window=10000)
# Mapping the transform to the DataPipe externally.
dp = dp.map(t)
```

The following modifications are needed in our transforms. Right now, the DataPipe provides an `IterDataPipe` object (or something like that) as output which, when passed through a `Mapper` object, returns a tuple. So I changed the transform's `__call__`:

```python
def __call__(self, dp):
    return functional.to_frame_numpy(
        dp=dp,  # The datapipe is passed instead of the events.
        ...
    )
```

and I replaced the functional accordingly:

```python
def to_frame_numpy(
    dp,  # Datapipe passed.
    ...
):
    ...
    # Unpacking the datapipe.
    events, target = dp
    ...
    # Getting frames from events.
    frames = ...
    ...
    # Repacking the datapipe.
    return frames, target
```

Is this what you had in mind, @biphasic? It should be pretty straightforward to apply transforms externally thanks to the `map` method.
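For illustration, the external-transform idea above can be sketched without any torch dependency. Assuming samples are `(events, target)` tuples and the transform unpacks and repacks them as proposed, a minimal `map`-style pipeline might look like this (all names here are hypothetical, not Tonic or torchdata API):

```python
class TinyPipe:
    """Minimal stand-in for an IterDataPipe: wraps an iterable and
    lets callers chain transforms with .map(), as in dp = dp.map(t)."""

    def __init__(self, samples):
        self.samples = list(samples)

    def map(self, fn):
        # Return a new pipe whose samples are fn(sample).
        return TinyPipe(fn(s) for s in self.samples)

    def __iter__(self):
        return iter(self.samples)


def to_frame(sample):
    # Hypothetical transform: receives the whole (events, target) tuple,
    # unpacks it, transforms the events, and repacks.
    events, target = sample
    frames = [e * 2 for e in events]  # placeholder for real framing logic
    return frames, target


dp = TinyPipe([([1, 2, 3], 0), ([4, 5], 1)])
dp = dp.map(to_frame)  # transform applied externally, not inside the dataset
print(list(dp))  # [([2, 4, 6], 0), ([8, 10], 1)]
```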
I'm not really sure yet, to be honest. I still don't understand what torchvision's new API is going to look like for datasets and transforms combined. I think it'll be good if we let those guys iterate over a few solutions until they find something that works. There is little documentation available at this point, but I also found this one https://github.com/pytorch/vision/blob/main/torchvision/prototype/features/_image.py and this one https://github.com/pytorch/vision/blob/main/torchvision/prototype/features/_feature.py
I agree with you on waiting to see what they do with their transforms and then copying them :) I see that they are creating dataset classes from …
I don't know about you @biphasic, but I completely missed this torchvision folder in which they have already implemented many datasets with datapipes 😆
Good point, there is lots of stuff that they did already that I have yet to understand, like this one too: https://github.com/pytorch/vision/blob/main/torchvision/prototype/datasets/_api.py
I'll try to summarize here what I understood from studying the torchvision code. Their dataset class is the template to implement a generic dataset. Fundamentally, one needs to implement a `_datapipe` method that builds the pipeline and returns it:

```python
def _datapipe(self) -> IterDataPipe[Sample]:
    filename = self._TRAIN_FILENAME if self.train else self._TEST_FILENAME
    filepath = os.path.join(self._root, filename)
    dp = FileLister(str(filepath))
    dp = FileOpener(dp, mode="b")
    # Unzipping.
    dp = ZipArchiveLoader(dp)
    # Filtering the non-bin files.
    dp = Filter(dp, self._filter, input_col=0)
    # Reading data to structured NumPy array and integer target.
    dp = NMNISTFileReader(dp)
    # Filtering the first saccade.
    if self.first_saccade_only:
        dp = Mapper(dp, self._saccade_filter, input_col=0)
    # Applying transforms.
    if self.transforms:
        dp = Mapper(dp, self.transforms)
    else:
        if self.transform:
            dp = Mapper(dp, self.transform, input_col=0, output_col=0)
        if self.target_transform:
            dp = Mapper(dp, self.target_transform, input_col=1, output_col=1)
    return dp
```

A `__len__` method returning the number of samples:

```python
def __len__(self) -> int:
    return 60_000 if self.train else 10_000
```

Then, they require a … I hope that this clarifies the situation a little bit 😆 I also hope that I understood correctly what they are doing.
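The `input_col`/`output_col` arguments used throughout the snippet above are what let a transform touch only one element of each sample tuple. A rough pure-Python approximation of that `Mapper` behaviour (a hypothetical sketch, not the actual torchdata implementation):

```python
def mapper(samples, fn, input_col=None, output_col=None):
    """Rough sketch of Mapper semantics: if input_col is None, fn sees
    the whole sample; otherwise fn sees sample[input_col] and its result
    is written back at output_col (defaulting to input_col)."""
    if output_col is None:
        output_col = input_col
    for sample in samples:
        if input_col is None:
            yield fn(sample)
        else:
            out = list(sample)
            out[output_col] = fn(sample[input_col])
            yield tuple(out)


samples = [([1, 2], 7), ([3], 9)]
# Transform only the events (column 0), leave the target untouched.
framed = list(mapper(samples, lambda ev: sum(ev), input_col=0))
print(framed)  # [(3, 7), (3, 9)]
```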
keeping track of supported datasets here: https://github.com/neuromorphs/tonic/tree/develop/tonic/prototype#currently-supported-datasets |
@biphasic should I try to add some testing routines for the prototypes? |
That is a very good next step. If we are able to test the prototype datasets, we can start including them in the package. In general it would be good to aim for high test coverage (that is, the fraction of lines covered by tests) for the new datasets. Our coverage for the current versions can be seen here: https://app.codecov.io/gh/neuromorphs/tonic/tree/develop/tonic/datasets

Ideally we rely on artificial data injection/creation to test that everything works, i.e. we create some artificial samples for the tests. But it's not always clear how the custom files are created, so for some datasets I resorted to downloading single samples from a server that I had access to in the past, like here: https://github.com/neuromorphs/tonic/blob/develop/test/test_datasets.py#L116

That server is not going to be available forever, so where possible we should not rely on it.
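The artificial-data approach could look roughly like this: write a fake sample to a temporary directory, run it through the loader under test, and assert on the result instead of relying on any download (the file format and loader below are hypothetical stand-ins, not the real Tonic readers):

```python
import os
import tempfile

def save_fake_sample(path, events):
    # Write a fake "recording": one comma-separated (t, x, y, p) event per line.
    with open(path, "w") as f:
        for t, x, y, p in events:
            f.write(f"{t},{x},{y},{p}\n")

def load_sample(path):
    # Hypothetical loader under test: parse the file back into event tuples.
    with open(path) as f:
        return [tuple(int(v) for v in line.split(",")) for line in f]

# Inject artificial data instead of downloading anything.
events = [(0, 1, 2, 1), (1000, 3, 4, 0)]
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "sample0.txt")
    save_fake_sample(path, events)
    loaded = load_sample(path)

assert loaded == events  # round-trip succeeds
```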
OK, I will try to come up with something. In the meantime, I tried to use their web datapipes and they seem pretty nice:

```python
import os
from pathlib import Path
from typing import Union

from torchdata.datapipes.iter import HttpReader, IterableWrapper, Mapper, Saver


def download_url(url: str, root: Union[Path, str], filename: str) -> None:
    # Reading the data from the web.
    download_dp = HttpReader(IterableWrapper([url]))

    # Extracting the data with a function.
    def get_http_data(http_obj):
        return http_obj.data

    download_dp = Mapper(download_dp, get_http_data, input_col=1)

    # Saving to disk.
    def filepath_fn(fpath):
        return os.path.join(root, filename)

    download_dp = Saver(download_dp, mode="wb", filepath_fn=filepath_fn)
    _ = next(iter(download_dp))
```

And you can use it in the following way. What do you think @biphasic? Would it be worth using it as an experimental download routine?
Do you mean to use this as a drop-in replacement for the current `tonic.download_utils.download_url`? I think this would work, although some datasets call it with the `md5` argument as well, so that would need to be included (see the function signature here: https://github.com/neuromorphs/tonic/blob/develop/tonic/download_utils.py#L90-L96). Overall though, I'm not sure whether this is worth the work, because in the end we want to delete the whole `download_utils` anyway, right?

I think the next step would be to figure out how to write a test case for the prototype datasets, because for Tonic 2.0 with the new version of the datasets we really want everything tested, in contrast to now, where some datasets are not really tested and we have to rely on users telling us when things break. If for the moment we use `download_utils.download_url` in the tests to download some example event files, I'm not too worried about that; in the end we'll also get rid of any downloading during the tests, it's just much more reliable that way.

Please let me know if you have some preferences for what you'd like to tackle, then we can certainly work something out and I'll take on some of the things. Generally I see the following way forward, please complement where you see fit: …
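The `md5` argument mentioned above is straightforward to bolt onto a datapipe-style download: after saving, hash the file and compare against the expected digest. A hedged sketch using only the standard library (the file name and contents below are made up for the self-check):

```python
import hashlib
import os
import tempfile

def check_md5(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file at `path` has the expected MD5 digest.
    Reads in chunks so large archives don't have to fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5

# Quick self-check against a file written on the spot.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "archive.bin")
    with open(path, "wb") as f:
        f.write(b"fake dataset bytes")
    expected = hashlib.md5(b"fake dataset bytes").hexdigest()
    ok = check_md5(path, expected)
    bad = check_md5(path, "0" * 32)

print(ok, bad)  # True False
```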
I would have left it in the … I agree with you that a proper test routine is needed and I will work on it as a priority, at least for NMNIST and STMNIST. As for the Google Drive download, I think we can use the experimental API based on datapipes.

Concerning transforms, in torchvision they are using a new format for the output of the dataset: no longer a tuple but a dictionary. I did not like the idea because it would mess up our transforms. We'll see in the future what they do. They are also changing the API for some reasons that I did not understand, but I would stick with what we are doing now.

Edit: moreover, when it is not possible to generate fake data on the fly, could we keep some single-sample binary files inside the source code? They should not occupy much space, am I right?
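The tuple-vs-dictionary concern above could be softened with a thin adapter, so that existing tuple-based transforms keep working even if datasets start yielding dicts. A hypothetical sketch (the key names are assumptions, not torchvision's actual schema):

```python
def adapt_to_dict(transform, events_key="events", target_key="target"):
    """Wrap a tuple-based transform so it can run on dict samples:
    unpack the dict, call the old (events, target) transform, repack."""
    def wrapped(sample):
        events, target = transform((sample[events_key], sample[target_key]))
        return {**sample, events_key: events, target_key: target}
    return wrapped


def old_transform(sample):
    # Existing tuple-style transform: bumps every event by one.
    events, target = sample
    return [e + 1 for e in events], target


new_transform = adapt_to_dict(old_transform)
out = new_transform({"events": [1, 2], "target": 5})
print(out)  # {'events': [2, 3], 'target': 5}
```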
Sounds good 👍 thank you |
Closing this for the moment as we're already working on it.
We're currently using an untested copy of torchvision's download utilities, which is far from ideal.
PyTorch Data separates that functionality into another package and introduces datapipes. It is still in a beta stage, but it's maturing quickly.
Having seen this kind of functional interface in TensorFlow data, I think it looks very clean and is very customizable. As the number of datasets in Tonic grows, we will benefit from the different data-loading functionalities in PyTorch Data, although it will require a considerable rewrite in the beginning.