Add Sintel Dataset to the dataset prototype API #4895
base: main
Conversation
💊 CI failures summary and remediations

As of commit 527d1fa (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

prototype_test (1/1), Step: "Run tests" (full log | diagnosis details | 🔁 rerun)
Hey @krshrimali and thanks a lot for the PR! I left some comments inline. Overall this looks good.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Hi, @pmeier and @NicolasHug
Thank you for your comments and reviews. I'm sorry that the threads could not be continued from the review (I committed the changes, and then the reviews got outdated). Can you guys please review the PR again whenever you find the time?
Looking forward to your comments, and thoughts.
Note: I'll be able to address them in a day; I'm a little caught up with other stuff. Thanks!
```python
def _read_flo(self, file: io.IOBase) -> torch.Tensor:
    magic = file.read(4)
    if magic != b"PIEH":
        raise ValueError("Magic number incorrect. Invalid .flo file")
    w = int.from_bytes(file.read(4), "little")
    h = int.from_bytes(file.read(4), "little")
    data = file.read(2 * w * h * 4)
    data_arr = np.frombuffer(data, dtype=np.float32)
    # Creating a copy of the underlying array, to avoid UserWarning: "The given NumPy array
    # is not writeable, and PyTorch does not support non-writeable tensors."
    return torch.from_numpy(np.copy(data_arr.reshape(h, w, 2).transpose(2, 0, 1)))
```
The reason this function exists, and is not reused from the existing `read_flo` function (in the other API): `zipfile` objects don't work well with `np.fromfile`, which raises an `io.UnsupportedOperation: fileno` error. Note that earlier, while testing with @pmeier, we didn't see the error because a generator was returned, and a generator doesn't raise until you actually consume it. The error surfaced when I replaced `yield` with `return`.
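The failure mode can be reproduced with a small self-contained example (an illustrative in-memory archive with a made-up member name, not the actual Sintel data): `np.fromfile` needs an OS-level file descriptor, which zip members don't provide, while reading the raw bytes and decoding them with `np.frombuffer` works on any file-like object.

```python
import io
import zipfile

import numpy as np

# Build an in-memory zip archive holding raw float32 bytes (stand-in for a .flo member)
payload = np.arange(4, dtype=np.float32).tobytes()
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.bin", payload)

error = None
with zipfile.ZipFile(archive) as zf:
    # np.fromfile asks the file object for a fileno, which zip members cannot supply
    with zf.open("data.bin") as member:
        try:
            np.fromfile(member, dtype=np.float32)
        except io.UnsupportedOperation as exc:
            error = exc
    # Reading the bytes and using np.frombuffer works for any file-like object
    with zf.open("data.bin") as member:
        arr = np.frombuffer(member.read(), dtype=np.float32)

print(error)
print(arr)
```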
I also verified the results of the function used here against that function, and the results are the same for a single file (I didn't check all of them).
Suggestions are, as always, welcome. :)
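For reference, the `.flo` layout the helper parses (magic `b"PIEH"`, then width and height as little-endian 4-byte integers, then `2 * w * h` float32 values) can be exercised with a small in-memory round trip. This is a numpy-only sketch; the `torch.from_numpy` step from the helper above is dropped so it runs standalone:

```python
import io

import numpy as np

def read_flo_array(file):
    # Same parsing logic as the dataset helper above, but returning a numpy array
    if file.read(4) != b"PIEH":
        raise ValueError("Magic number incorrect. Invalid .flo file")
    w = int.from_bytes(file.read(4), "little")
    h = int.from_bytes(file.read(4), "little")
    data = np.frombuffer(file.read(2 * w * h * 4), dtype=np.float32)
    # (h, w, 2) -> (2, h, w); the copy makes the result own writable memory
    return np.copy(data.reshape(h, w, 2).transpose(2, 0, 1))

# Build a tiny w=3, h=2 flow field in memory and round-trip it
flow = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
header = b"PIEH" + (3).to_bytes(4, "little") + (2).to_bytes(4, "little")
out = read_flo_array(io.BytesIO(header + flow.tobytes()))
print(out.shape)  # (2, 2, 3)
```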
Oh well. @NicolasHug that is a strong argument in favor of #4882 since reading from archives will be the norm for the new datasets.
OK, if we really can't reuse the original code, then maybe having a helper makes sense. I would still advocate for avoiding fancy features / new parameter names as much as possible, and for making the wrapper as thin as possible.
I'd be curious to see speed comparisons between the current version (with unzipped data) and the new one, though; if you have time to run a quick benchmark, that would be awesome.
> I also verified the results from the function used here with this function, and the results are the same for a single file (didn't check for all of them).
Before merging anything we should try to run more robust tests to avoid surprises in the future :)
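A quick micro-benchmark along those lines could be sketched as follows (illustrative file names and payload size, not the actual Sintel data): time reading a float32 payload from an extracted file via `np.fromfile` against reading the same payload from inside a zip archive via `np.frombuffer`.

```python
import os
import tempfile
import timeit
import zipfile

import numpy as np

# One synthetic "frame" of float32 data, written both extracted and zipped
payload = np.random.rand(512, 512, 2).astype(np.float32).tobytes()
tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "frame.bin")
zip_path = os.path.join(tmpdir, "frames.zip")
with open(raw_path, "wb") as f:
    f.write(payload)
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("frame.bin", payload)

def from_disk():
    # Extracted file: np.fromfile works, since a real file has a fileno
    with open(raw_path, "rb") as f:
        return np.fromfile(f, dtype=np.float32)

def from_zip():
    # Zip member: read the bytes, then decode with np.frombuffer
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open("frame.bin") as member:
            return np.frombuffer(member.read(), dtype=np.float32)

print("disk:", timeit.timeit(from_disk, number=50))
print("zip: ", timeit.timeit(from_zip, number=50))
```

Both paths decode the same bytes, so the comparison isolates the archive-reading overhead (decompression plus the extra copy).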
> Before merging anything we should try to run more robust tests to avoid surprises in the future :)
We should not add this functionality here, but rather go for #4882 and depend on it here. Of course, that would involve writing tests for `read_flo` or any other wrapper.
```python
image1=(path1, decoder(buffer1)) if decoder else (path1, buffer1),
image2=(path2, decoder(buffer2)) if decoder else (path2, buffer2),
flow=(flo[0], flow_arr) if config.split == "train" else ("", None),
```
Question: What do we want to return here? Should `image1` be a tuple of path and buffer? Or should it be `image1_path` and `image1`? (Same question for `image2` and `flow`.)
The latter. The dictionary should contain the image, which is either a tensor or a buffer depending on `decoder`, as well as the image path.
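A minimal sketch of that sample shape (the helper name `make_sample` is hypothetical; the actual datapipe builds the dictionary inline, as in the hunk above):

```python
import io

def make_sample(path1, buffer1, path2, buffer2, decoder=None):
    # Each image gets its own key; the buffer is decoded only if a decoder is given
    return dict(
        image1_path=path1,
        image1=decoder(buffer1) if decoder else buffer1,
        image2_path=path2,
        image2=decoder(buffer2) if decoder else buffer2,
    )

sample = make_sample("frame_0001.png", io.BytesIO(b"\x89PNG"),
                     "frame_0002.png", io.BytesIO(b"\x89PNG"))
print(sorted(sample))  # ['image1', 'image1_path', 'image2', 'image2_path']
```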
Done! :) Should the label be `flow` here?
I did a second pass. Let me know if anything is not clear or you need help.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
A few minor things left. I think when all open comments are addressed, this is good to go from my side. After that, Nicolas will take a look in case I missed something, since he is more familiar with the dataset.
Two nits inline, but something else is off that I was not able to find without actually deep diving into it:
```python
>>> len(tuple(datasets.load("sintel", decoder=None, split="train")))
0
```
Could you please add some mock data, so this is tested by our test suite? You need to add a `sintel` mock-data function to https://github.com/pytorch/vision/blob/main/test/builtin_dataset_mocks.py and add `"sintel"` to this list in `vision/test/test_prototype_builtin_datasets.py` (lines 14 to 26 in 4b20ac5):
```python
# TODO: this can be replaced by torchvision.prototype.datasets.list() as soon as all builtin datasets are supported
TMP = [
    "mnist",
    "fashionmnist",
    "kmnist",
    "emnist",
    "qmnist",
    "cifar10",
    "cifar100",
    "caltech256",
    "caltech101",
    "imagenet",
]
```
With the exception of archive generation, you can probably copy a lot from `class SintelTestCase(datasets_utils.ImageDatasetTestCase):` (line 1874 in 4b20ac5).
Update: The code had a logical bug in the filter function, which has been fixed now. I remember @pmeier mentioning in the review to have the scene returned as well. I'm a little caught up today, but will try working on the test as suggested by @pmeier above. Thanks!
This reverts commit 3724869.
Don't shoot the messenger here, but I think we need to postpone this PR for a bit. While debugging, I realized that in order to be able to read from the same image handle, it has to be seekable. Unfortunately, file handles inside a zip file are not for Python <= 3.6:
```python
import zipfile

content = "foo"
archive = "foo.zip"
open(content, "w").close()

with zipfile.ZipFile(archive, "w") as file:
    file.write(content)

with zipfile.ZipFile(archive) as file:
    for info in file.infolist():
        with file.open(info) as internal_file:
            # This will fail on Python <= 3.6
            assert internal_file.seekable()
```
Since PyTorch's minimum Python requirement is 3.6, we cannot depend on that. I currently see three ways out, ordered by ascending preference:
- Wait it out. EOL for Python 3.6 is next month, but I don't know when PyTorch will stop supporting it officially. I'm not aware of any plans for the next release, which will be the one we plan on going public with the new datasets.
- Every time we read an image for the first time, store the data in a buffer and retrieve it from there when it is requested the second time. This has major memory implications, since in the worst case we need to store half of the dataset in memory. Even when storing only the raw bytes, i.e. postponing decoding until after the buffer, that is about 1GB.
- Extract the zip archive and work with the extracted files. They are seekable, so we would circumvent the problem entirely. I'm currently working on adding support for that.
Thoughts?
According to pytorch/pytorch#66462, the next release should be >= 3.7, but I'll ask for confirmation.
@malfet confirmed that the tentative plan is to drop 3.6 with the next release. Leaving that aside, I disregarded one fact in #4895 (review) that makes the solution straightforward: the images are ordered. Thus, we don't need to keep an arbitrarily sized buffer; we only need to store the bytes of a single image. I've pushed a commit adding this feature, and now we should be able to finalize this PR.
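The idea of buffering only the current item can be sketched as follows (the class name `OneItemCache` is illustrative, not the actual implementation in the commit): cache the raw bytes of one item so a non-seekable stream can be consumed twice, replacing the cached bytes when the next item arrives.

```python
import io

class OneItemCache:
    """Illustrative sketch: re-serve a single non-seekable stream from memory."""

    def __init__(self):
        self._bytes = None

    def fill(self, stream):
        # Store only the current item's raw bytes, replacing the previous item's
        self._bytes = stream.read()
        return io.BytesIO(self._bytes)

    def reopen(self):
        # A second read of the same item is served from memory
        return io.BytesIO(self._bytes)

cache = OneItemCache()
first = cache.fill(io.BytesIO(b"image-bytes"))  # e.g. a non-seekable zip member
second = cache.reopen()
print(first.read() == second.read())  # True
```

Because the images arrive in order, the cache never holds more than one image's worth of bytes at a time.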
This PR attempts to add the Sintel dataset to the prototype API.
There are a few TODOs:
These TODOs should not block the review process, though, and I'm looking forward to the first round of review on this PR. If there are things I missed, please feel free to point them out.
Special thanks to @pmeier for being very helpful with the bug fixing and with the PR.
cc: @pmeier @NicolasHug
cc @pmeier @bjuncek