Add Sintel Dataset to the dataset prototype API #4895
base: main
Conversation
💊 CI failures summary and remediations

As of commit 527d1fa (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:

prototype_test (1/1), Step: "Run tests" (full log | diagnosis details | 🔁 rerun)
Hey @krshrimali and thanks a lot for the PR! I left some comments inline. Overall this looks good.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Hi, @pmeier and @NicolasHug
Thank you for your comments and reviews. I'm sorry that the threads could not be continued from the review (I committed the changes, and then the reviews got outdated). Can you guys please review the PR again whenever you find the time?
Looking forward to your comments, and thoughts.
Note: I'll be able to address them in a day; I'm a little caught up with other stuff. Thanks!
```python
def _read_flo(self, file: io.IOBase) -> torch.Tensor:
    magic = file.read(4)
    if magic != b"PIEH":
        raise ValueError("Magic number incorrect. Invalid .flo file")
    w = int.from_bytes(file.read(4), "little")
    h = int.from_bytes(file.read(4), "little")
    data = file.read(2 * w * h * 4)
    data_arr = np.frombuffer(data, dtype=np.float32)
    # Creating a copy of the underlying array, to avoid UserWarning: "The given NumPy array
    # is not writeable, and PyTorch does not support non-writeable tensors."
    return torch.from_numpy(np.copy(data_arr.reshape(h, w, 2).transpose(2, 0, 1)))
```
The reason this function exists, and is not reused from the existing `read_flo` function (in the other API): `zipfile` objects don't work well with `np.fromfile`, which raises an `io.UnsupportedOperation: fileno` error. Note that earlier, while testing with @pmeier, we didn't see the error because a generator was returned, and a generator doesn't raise until you actually consume it. The error surfaced when I replaced `yield` with `return`.
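The failure mode can be reproduced with a small self-contained example (an illustrative in-memory archive with a made-up member name, not the actual Sintel data): `np.fromfile` needs an OS-level file descriptor, which zip members don't provide, while reading the raw bytes and decoding them with `np.frombuffer` works on any file-like object.

```python
import io
import zipfile

import numpy as np

# Build an in-memory zip archive holding raw float32 bytes (stand-in for a .flo member)
payload = np.arange(4, dtype=np.float32).tobytes()
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.bin", payload)

error = None
with zipfile.ZipFile(archive) as zf:
    # np.fromfile asks the file object for a fileno, which zip members cannot supply
    with zf.open("data.bin") as member:
        try:
            np.fromfile(member, dtype=np.float32)
        except io.UnsupportedOperation as exc:
            error = exc
    # Reading the bytes and using np.frombuffer works for any file-like object
    with zf.open("data.bin") as member:
        arr = np.frombuffer(member.read(), dtype=np.float32)

print(error)
print(arr)
```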
I also verified the results of the function used here against that function, and the results are the same for a single file (I didn't check all of them).
Suggestions are, as always, welcome. :)
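For reference, the `.flo` layout the helper parses (magic `b"PIEH"`, then width and height as little-endian 4-byte integers, then `2 * w * h` float32 values) can be exercised with a small in-memory round trip. This is a numpy-only sketch; the `torch.from_numpy` step from the helper above is dropped so it runs standalone:

```python
import io

import numpy as np

def read_flo_array(file):
    # Same parsing logic as the dataset helper above, but returning a numpy array
    if file.read(4) != b"PIEH":
        raise ValueError("Magic number incorrect. Invalid .flo file")
    w = int.from_bytes(file.read(4), "little")
    h = int.from_bytes(file.read(4), "little")
    data = np.frombuffer(file.read(2 * w * h * 4), dtype=np.float32)
    # (h, w, 2) -> (2, h, w); the copy makes the result own writable memory
    return np.copy(data.reshape(h, w, 2).transpose(2, 0, 1))

# Build a tiny w=3, h=2 flow field in memory and round-trip it
flow = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
header = b"PIEH" + (3).to_bytes(4, "little") + (2).to_bytes(4, "little")
out = read_flo_array(io.BytesIO(header + flow.tobytes()))
print(out.shape)  # (2, 2, 3)
```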
Oh well. @NicolasHug that is a strong argument in favor of #4882 since reading from archives will be the norm for the new datasets.
OK, if we really can't reuse the original code, then maybe having a helper makes sense. I would still advocate for avoiding fancy features / new parameter names as much as possible, and for making the wrapper as thin as possible.
I'd be curious to see speed comparisons between the current version (with unzipped data) and the new one, though; if you have time to run a quick benchmark, that would be awesome.
> I also verified the results from the function used here with this function, and the results are the same for a single file (didn't check for all of them).
Before merging anything we should try to run more robust tests to avoid surprises in the future :)
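A quick micro-benchmark along those lines could be sketched as follows (illustrative file names and payload size, not the actual Sintel data): time reading a float32 payload from an extracted file via `np.fromfile` against reading the same payload from inside a zip archive via `np.frombuffer`.

```python
import os
import tempfile
import timeit
import zipfile

import numpy as np

# One synthetic "frame" of float32 data, written both extracted and zipped
payload = np.random.rand(512, 512, 2).astype(np.float32).tobytes()
tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "frame.bin")
zip_path = os.path.join(tmpdir, "frames.zip")
with open(raw_path, "wb") as f:
    f.write(payload)
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("frame.bin", payload)

def from_disk():
    # Extracted file: np.fromfile works, since a real file has a fileno
    with open(raw_path, "rb") as f:
        return np.fromfile(f, dtype=np.float32)

def from_zip():
    # Zip member: read the bytes, then decode with np.frombuffer
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open("frame.bin") as member:
            return np.frombuffer(member.read(), dtype=np.float32)

print("disk:", timeit.timeit(from_disk, number=50))
print("zip: ", timeit.timeit(from_zip, number=50))
```

Both paths decode the same bytes, so the comparison isolates the archive-reading overhead (decompression plus the extra copy).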
> Before merging anything we should try to run more robust tests to avoid surprises in the future :)
We should not add this functionality here, but rather go for #4882 and depend on it here. Of course, that would involve writing tests for `read_flo` or any other wrapper.
```python
image1=(path1, decoder(buffer1)) if decoder else (path1, buffer1),
image2=(path2, decoder(buffer2)) if decoder else (path2, buffer2),
flow=(flo[0], flow_arr) if config.split == "train" else ("", None),
```
Question: What do we want to return here? Should `image1` be a tuple of path and buffer? Or should it be `image1_path` and `image1`? (Same question for `image2` and `flow`.)
The latter. The dictionary should contain the image, which is either a tensor or a buffer depending on `decoder`, as well as the image path.
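A minimal sketch of that sample shape (the helper name `make_sample` is hypothetical; the actual datapipe builds the dictionary inline, as in the hunk above):

```python
import io

def make_sample(path1, buffer1, path2, buffer2, decoder=None):
    # Each image gets its own key; the buffer is decoded only if a decoder is given
    return dict(
        image1_path=path1,
        image1=decoder(buffer1) if decoder else buffer1,
        image2_path=path2,
        image2=decoder(buffer2) if decoder else buffer2,
    )

sample = make_sample("frame_0001.png", io.BytesIO(b"\x89PNG"),
                     "frame_0002.png", io.BytesIO(b"\x89PNG"))
print(sorted(sample))  # ['image1', 'image1_path', 'image2', 'image2_path']
```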
Done! :) Should the label be `flow` here?
I did a second pass. Let me know if anything is not clear or you need help.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
A few minor things left. I think when all open comments are addressed, this is good to go from my side. After that, Nicolas will take a look in case I missed something, since he is more familiar with the dataset.
Two nits inline, but something else is off that I was not able to find without actually deep diving into it:
```python
>>> len(tuple(datasets.load("sintel", decoder=None, split="train")))
0
```
Could you please add some mock data, so this is tested by our test suite? You need to add a `sintel` mock-data function to https://github.com/pytorch/vision/blob/main/test/builtin_dataset_mocks.py and add `"sintel"` to this list in `vision/test/test_prototype_builtin_datasets.py` (lines 14 to 26 in 4b20ac5):
```python
# TODO: this can be replaced by torchvision.prototype.datasets.list() as soon as all builtin datasets are supported
TMP = [
    "mnist",
    "fashionmnist",
    "kmnist",
    "emnist",
    "qmnist",
    "cifar10",
    "cifar100",
    "caltech256",
    "caltech101",
    "imagenet",
]
```
With the exception of archive generation, you can probably copy a lot from `class SintelTestCase(datasets_utils.ImageDatasetTestCase):` (line 1874 in 4b20ac5).
Update: The code had a logical bug in the filter function, which has been fixed now. I remember @pmeier mentioning in the review to have the scene returned as well. I'm a little caught up today, but will try working on the test as suggested by @pmeier above. Thanks!
This reverts commit 3724869.
Don't shoot the messenger here, but I think we need to postpone this PR for a bit. While debugging, I realized that in order to be able to read from the same image handle, it has to be seekable. Unfortunately, file handles inside a zip file are not for Python <= 3.6:
```python
import zipfile

content = "foo"
archive = "foo.zip"
open(content, "w").close()

with zipfile.ZipFile(archive, "w") as file:
    file.write(content)

with zipfile.ZipFile(archive) as file:
    for info in file.infolist():
        with file.open(info) as internal_file:
            # This will fail on Python <= 3.6
            assert internal_file.seekable()
```
Since PyTorch's minimum Python requirement is 3.6, we cannot depend on that. I currently see three ways out, ordered by ascending preference:
- Wait it out. EOL for Python 3.6 is next month, but I don't know when PyTorch will stop supporting it officially. I'm not aware of any plans for the next release, which will be the one we plan on going public with the new datasets.
- Every time we read an image for the first time, store the data in a buffer and retrieve it from there when it is requested the second time. This has major memory implications, since in the worst case we need to store half of the dataset in memory. Even when storing only the raw bytes, i.e. postponing decoding until after the buffer, that is about 1GB.
- Extract the zip archive and work with the extracted files. They are seekable, so we would circumvent the problem entirely. I'm currently working on adding support for that.
Thoughts?
According to pytorch/pytorch#66462, the next release should be >= 3.7, but I'll ask for confirmation.
@malfet confirmed that the tentative plan is to drop 3.6 with the next release. Leaving that aside, I disregarded one fact in #4895 (review) that makes the solution straightforward: the images are ordered. Thus, we don't need to keep an arbitrarily sized buffer; we only need to store the bytes of a single image. I've pushed a commit adding this feature, and now we should be able to finalize this PR.
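The idea of buffering only the current item can be sketched as follows (the class name `OneItemCache` is illustrative, not the actual implementation in the commit): cache the raw bytes of one item so a non-seekable stream can be consumed twice, replacing the cached bytes when the next item arrives.

```python
import io

class OneItemCache:
    """Illustrative sketch: re-serve a single non-seekable stream from memory."""

    def __init__(self):
        self._bytes = None

    def fill(self, stream):
        # Store only the current item's raw bytes, replacing the previous item's
        self._bytes = stream.read()
        return io.BytesIO(self._bytes)

    def reopen(self):
        # A second read of the same item is served from memory
        return io.BytesIO(self._bytes)

cache = OneItemCache()
first = cache.fill(io.BytesIO(b"image-bytes"))  # e.g. a non-seekable zip member
second = cache.reopen()
print(first.read() == second.read())  # True
```

Because the images arrive in order, the cache never holds more than one image's worth of bytes at a time.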
This PR attempts to add the Sintel dataset to the prototype API.
There are a few TODOs:
These TODOs should not block the review process, though, and I'm looking forward to the first round of review on this PR. If there are things I missed, please feel free to point them out.
Special thanks to @pmeier for being very helpful with the bug fixing and with the PR.
cc: @pmeier @NicolasHug
cc @pmeier @bjuncek