
Add Sintel Dataset to the dataset prototype API #4895

Open · wants to merge 43 commits into main
Conversation

@krshrimali (Contributor) commented Nov 10, 2021

This PR attempts to add Sintel Dataset to the prototype API.

There are a few TODOs:

  • Add comments wherever required
  • Add sha256 value

These TODOs should not block the review process, though, and I'm looking forward to a first round of review on this PR. If there is anything I missed, please feel free to point it out.

Special thanks to @pmeier for being very helpful with the bug fixing and with the PR.

cc @pmeier @NicolasHug @bjuncek

@facebook-github-bot commented Nov 10, 2021

💊 CI failures summary and remediations

As of commit 527d1fa (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build prototype_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

________________ TestCommon.test_num_samples[sintel-test-both] _________________
Traceback (most recent call last):
  File "/home/circleci/project/test/test_prototype_builtin_datasets.py", line 71, in test_num_samples
    assert num_samples == mock_info["num_samples"]
AssertionError: assert 15 == 12
  +15
  -12
=============================== warnings summary ===============================
test/test_prototype_features.py::TestJit::test_bounding_box
  /home/circleci/project/test/test_prototype_features.py:150: TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
    new_height, new_width = size

test/test_prototype_features.py::TestJit::test_bounding_box
  /home/circleci/.local/lib/python3.7/site-packages/torchvision/prototype/features/_feature.py:78: TracerWarning: torch.as_tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
    return torch.as_tensor(data, dtype=dtype, device=device)

test/test_prototype_features.py::TestJit::test_bounding_box
  /home/circleci/project/test/test_prototype_features.py:164: TracerWarning: Converting a tensor to a Python list might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
    new_x1, new_y1, new_x2, new_y2, like=input, format="xyxy", image_size=tuple(size.tolist())


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@pmeier (Contributor) left a comment


Hey @krshrimali and thanks a lot for the PR! I left some comments inline. Overall this looks good.

torchvision/prototype/datasets/_builtin/sintel.py — 10 inline comments (outdated, resolved)
@krshrimali (Contributor, Author) left a comment


Hi, @pmeier and @NicolasHug

Thank you for your comments and reviews. I'm sorry the threads could not be continued from the review (I committed the changes, and the reviews then got marked outdated). Could you please review the PR again whenever you find the time?

Looking forward to your comments, and thoughts.

Note: I'll be able to address them in a day, a little caught with other stuff. Thanks!

torchvision/prototype/datasets/_builtin/sintel.py — inline comment (outdated, resolved)
Comment on lines 96 to 106
```python
def _read_flo(self, file: io.IOBase) -> torch.Tensor:
    magic = file.read(4)
    if magic != b"PIEH":
        raise ValueError("Magic number incorrect. Invalid .flo file")
    w = int.from_bytes(file.read(4), "little")
    h = int.from_bytes(file.read(4), "little")
    data = file.read(2 * w * h * 4)
    data_arr = np.frombuffer(data, dtype=np.float32)
    # Creating a copy of the underlying array, to avoid UserWarning: "The given NumPy array
    # is not writeable, and PyTorch does not support non-writeable tensors."
    return torch.from_numpy(np.copy(data_arr.reshape(h, w, 2).transpose(2, 0, 1)))
```
@krshrimali (Contributor, Author):


The reason this function exists and is not reused from the existing read_flo function (in the other API):

zipfile objects don't work well with np.fromfile, and an `io.UnsupportedOperation: fileno` error is raised. Note that earlier, while testing with @pmeier, we didn't get an error because a generator was returned, which doesn't raise until you actually consume it. The error showed up when I replaced yield with return.

I also compared the results of the function used here with the existing one, and the results are the same for a single file (I didn't check all of them).

Suggestions are, as always, welcome. :)
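To illustrate the point above, here is a minimal stdlib-only sketch of a `.flo` parser that only ever calls `read()`, so it works on non-seekable zip-member handles where `np.fromfile` fails with `io.UnsupportedOperation: fileno`. The function name and the pure-Python return type are illustrative, not the PR's actual implementation:

```python
import io
import struct

def read_flo_bytes(file):
    """Parse Middlebury .flo data from any readable byte stream.

    Uses only read(), so it also works on non-seekable handles such as
    zipfile members, unlike np.fromfile which needs a real fileno().
    """
    if file.read(4) != b"PIEH":
        raise ValueError("Magic number incorrect. Invalid .flo file")
    w = int.from_bytes(file.read(4), "little")
    h = int.from_bytes(file.read(4), "little")
    # 2 float32 channels (u, v) per pixel, interleaved row-major
    data = struct.unpack(f"<{2 * w * h}f", file.read(2 * w * h * 4))
    # reshape flat data to (h, w, 2) as nested lists
    return [
        [list(data[2 * (r * w + c): 2 * (r * w + c) + 2]) for c in range(w)]
        for r in range(h)
    ]

# Tiny synthetic flow field: width 1, height 2
payload = b"PIEH" + (1).to_bytes(4, "little") + (2).to_bytes(4, "little")
payload += struct.pack("<4f", 1.0, -1.0, 0.5, 2.0)
flow = read_flo_bytes(io.BytesIO(payload))
```

The real code would hand the flat buffer to `np.frombuffer` and reshape, but the header handling is the same.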

@pmeier (Contributor):


Oh well. @NicolasHug that is a strong argument in favor of #4882 since reading from archives will be the norm for the new datasets.

@NicolasHug (Member):


OK, if really we can't re-use the original code, then maybe having a helper makes sense. Although I would still advocate for avoiding fancy features / new parameter names as much as possible, and to make the wrapper as thin as possible.

I'd be curious to see speed comparisons between the current version (with unzipped data) and the new one, though; if you have time to run a quick benchmark, that would be awesome.

> I also verified the results from the function used here with this function, and the results are the same for a single file (didn't check for all of them).

Before merging anything we should try to run more robust tests to avoid surprises in the future :)

@pmeier (Contributor):


> Before merging anything we should try to run more robust tests to avoid surprises in the future :)

We should not add this functionality here, but rather go for #4882 and depend on that here. Of course that would involve writing tests for a read_flo or any other wrapper.

torchvision/prototype/datasets/_builtin/sintel.py — 2 inline comments (outdated, resolved)
Comment on lines 139 to 141
```python
image1=(path1, decoder(buffer1)) if decoder else (path1, buffer1),
image2=(path2, decoder(buffer2)) if decoder else (path2, buffer2),
flow=(flo[0], flow_arr) if config.split == "train" else ("", None),
```
@krshrimali (Contributor, Author):


Question: What do we want to return here? Should `image1` be a tuple of path and buffer? Or should there be separate `image1_path` and `image1` keys? (Same question for `image2` and `flow`.)

@pmeier (Contributor):


The latter. The dictionary should contain the image, which is either a tensor or a buffer depending on decoder, as well as the image path.

@krshrimali (Contributor, Author):


Done! :) Should the label be flow here?
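The per-sample layout agreed in this thread might look like the sketch below. The helper name and exact key names are assumptions for illustration, not the prototype API's actual signature:

```python
import io

def make_sample(path1, buffer1, path2, buffer2, flow_path, flow, decoder=None):
    """Assemble one sample dict: each image key holds a tensor (if a
    decoder is given) or the raw buffer, with its path in a sibling key.

    Hypothetical helper sketching the layout discussed in the review.
    """
    decode = decoder if decoder is not None else (lambda b: b)
    return {
        "image1": decode(buffer1),
        "image1_path": path1,
        "image2": decode(buffer2),
        "image2_path": path2,
        "flow": flow,            # None for the test split, which has no flow
        "flow_path": flow_path,
    }

# With decoder=None the buffers are passed through unchanged
sample = make_sample(
    "a.png", io.BytesIO(b"img1"),
    "b.png", io.BytesIO(b"img2"),
    "a.flo", None,
)
```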

@pmeier (Contributor) left a comment


I did a second pass. Let me know if anything is not clear or you need help.

torchvision/prototype/datasets/_builtin/sintel.py — 5 inline comments (outdated, resolved)

@pmeier (Contributor) left a comment


A few minor things left. I think once all open comments are addressed this is good to go from my side. After that, Nicolas will take a look in case I missed something, since he is more familiar with the dataset.

torchvision/prototype/datasets/_builtin/sintel.py — 4 inline comments (outdated, resolved)
@pmeier (Contributor) left a comment


Two nits inline, but something else is off that I was not able to find without actually deep diving into it:

```python
>>> len(tuple(datasets.load("sintel", decoder=None, split="train")))
0
```

Could you please add some mock data so this is tested by our test suite? You need to add a sintel function in https://github.com/pytorch/vision/blob/main/test/builtin_dataset_mocks.py and add "sintel" to

```python
# TODO: this can be replaced by torchvision.prototype.datasets.list()
# as soon as all builtin datasets are supported
TMP = [
    "mnist",
    "fashionmnist",
    "kmnist",
    "emnist",
    "qmnist",
    "cifar10",
    "cifar100",
    "caltech256",
    "caltech101",
    "imagenet",
]
```

With the exception of archive generation, you can probably copy a lot from

```python
class SintelTestCase(datasets_utils.ImageDatasetTestCase):
```
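A mock for this dataset mainly needs to generate minimal valid `.flo` files on disk. The following is a hedged sketch of such a generator; the exact mock API in builtin_dataset_mocks.py is an assumption, and `write_fake_flo` is a hypothetical helper name:

```python
import pathlib
import struct
import tempfile

def write_fake_flo(path, w=3, h=2):
    """Write a minimal valid .flo file filled with zero flow.

    Layout: 4-byte magic "PIEH", little-endian int32 width and height,
    then 2 * w * h little-endian float32 values (u, v per pixel).
    Returns the expected file size in bytes.
    """
    with open(path, "wb") as f:
        f.write(b"PIEH")
        f.write(struct.pack("<ii", w, h))
        f.write(struct.pack(f"<{2 * w * h}f", *([0.0] * (2 * w * h))))
    return 12 + 2 * w * h * 4  # header + payload

# Write one fake flow frame into a scratch directory
root = pathlib.Path(tempfile.mkdtemp())
size = write_fake_flo(root / "frame_0001.flo")
```

A real mock would additionally write matching PNG frame pairs and pack everything into the zip layout the dataset expects.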

torchvision/prototype/datasets/_builtin/sintel.py — 2 inline comments (outdated, resolved)
@krshrimali (Contributor, Author) commented:
Update:

The code had a logical bug in the filter function, which has been fixed now. I remember @pmeier mentioning in the review that the scene should be returned in the dict as the label, and that has been added in the most recent commit.

I'm a little caught up today but will try working on the test as suggested by @pmeier above. Thanks!

@pmeier (Contributor) left a comment


Don't shoot the messenger here, but I think we need to postpone this PR for a bit. While debugging, I realized that in order to be able to read from the same image handle twice, it has to be seekable. Unfortunately, file handles inside a zip file are not seekable for Python <= 3.6:

```python
import zipfile

content = "foo"
archive = "foo.zip"

open(content, "w").close()
with zipfile.ZipFile(archive, "w") as file:
    file.write(content)

with zipfile.ZipFile(archive) as file:
    for info in file.infolist():
        with file.open(info) as internal_file:
            # This will fail on Python <= 3.6
            assert internal_file.seekable()
```
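One version-independent workaround, sketched below under the assumption that holding one member's raw bytes in memory is acceptable: read the member fully and wrap it in `io.BytesIO`, which is always seekable regardless of Python version. The `open_seekable` helper name is hypothetical:

```python
import io
import os
import tempfile
import zipfile

def open_seekable(zf, info):
    """Return a seekable handle for a zip member by buffering its
    raw bytes in an io.BytesIO (always seekable, any Python version)."""
    with zf.open(info) as fh:
        return io.BytesIO(fh.read())

# Build a small demo archive in a scratch directory
path = os.path.join(tempfile.mkdtemp(), "demo.zip")
with zipfile.ZipFile(path, "w") as zf:
    zf.writestr("foo.txt", b"hello")

with zipfile.ZipFile(path) as zf:
    fh = open_seekable(zf, zf.infolist()[0])

ok = fh.seekable()   # BytesIO is seekable even where ZipExtFile is not
fh.seek(2)
chunk = fh.read()
```

The trade-off is memory: this keeps one member's uncompressed bytes resident, which matters for large images.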

Since PyTorch's minimum Python requirement is 3.6, we cannot depend on that. I currently see three ways out, ordered by ascending preference:

  1. Wait it out. EOL for Python 3.6 is next month, but I don't know when PyTorch will stop supporting it officially. I'm not aware of any plans for the next release, which will be the one we plan on going public with for the new datasets.
  2. Every time we read an image for the first time, store the data in a buffer and retrieve it from there when we request it a second time. This has major memory implications, since in the worst case we need to store half of the dataset in memory. When storing only the raw bytes, i.e. postponing decoding until after the buffer, that is about 1GB.
  3. Extract the zip archive and work with the extracted files. They are seekable, so we would circumvent the problem entirely. I'm currently working on adding support for that.

Thoughts?

@NicolasHug (Member) commented:

> EOL for Python 3.6 is next month, but I don't know when PyTorch will stop supporting it officially

According to pytorch/pytorch#66462 the next release should be >= 3.7, but I'll ask for confirmation

@pmeier (Contributor) commented Nov 22, 2021

@malfet confirmed that the tentative plan is to drop 3.6 with the next release.

Leaving that aside, I disregarded one fact in #4895 (review) that makes the solution straightforward: the images are ordered. Thus, we don't need to keep an arbitrarily sized buffer, but only need to store the bytes of a single image. I've pushed a commit adding this feature, and now we should be able to finalize this PR.
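The single-image buffer idea above can be sketched as follows: because frames arrive in order, pairing frame i with frame i+1 only requires remembering the previous item, never an arbitrarily sized cache. This is a minimal sketch, not the PR's actual datapipe code:

```python
def pair_consecutive(stream):
    """Yield (previous, current) pairs from an ordered stream of
    (name, raw_bytes) items, buffering exactly one item at a time."""
    prev = None
    for item in stream:
        if prev is not None:
            yield prev, item
        prev = item  # only the last item is ever held in memory

frames = [("f1", b"a"), ("f2", b"b"), ("f3", b"c")]
pairs = list(pair_consecutive(frames))
```

In the dataset, each pair would then be joined with its corresponding `.flo` flow file to form one sample.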

@pmeier pmeier linked an issue Feb 3, 2022 that may be closed by this pull request
Successfully merging this pull request may close these issues: Sintel

4 participants