add prototype utilities to read arbitrary numeric binary files #4882
Conversation
💊 CI failures summary and remediations: as of commit 0969cf9, there are no failures yet. (This comment was automatically generated by Dr. CI.)
After some more offline discussion with @NicolasHug, I've refactored the PR accordingly. Note that we cannot use `np.fromfile` with gzip-compressed files:

```python
import gzip
import tempfile

import numpy as np

data = np.array([0.0])
path = tempfile.mktemp()

with open(path, "wb") as file:
    file.write(gzip.compress(data.tobytes()))

with open(path, "rb") as compressed_file:
    with gzip.open(compressed_file, "rb") as file:
        # np.fromfile bypasses the gzip wrapper and reads the raw,
        # still-compressed bytes from the underlying file descriptor
        print(np.fromfile(file))
        file.seek(0)
        # np.frombuffer over file.read() sees the decompressed data
        print(np.frombuffer(file.read()))
```
If we go through with this design, we should revisit #4598. It should probably then be reverted for consistency. cc @datumbox
My main concern wasn't just about using numpy vs. pytorch; it was mostly about code complexity. The original flow decoding function is extremely simple, self-contained, and easy to understand: vision/torchvision/datasets/_optical_flow.py Lines 365 to 379 in bbd9ff8
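For reference, a decoder of that shape can be sketched as follows. This is a hypothetical re-creation based on the public Middlebury `.flo` layout (magic float `202021.25`, then int32 width and height, then `2 * w * h` little-endian float32 values), not the exact torchvision code linked above:

```python
import os
import tempfile

import numpy as np

FLO_MAGIC = 202021.25  # sanity-check constant from the Middlebury .flo spec


def read_flo_sketch(path):
    # Hypothetical re-creation of a simple, self-contained .flo reader;
    # not the actual torchvision implementation referenced above.
    with open(path, "rb") as f:
        magic = np.fromfile(f, dtype="<f4", count=1)[0]
        if magic != np.float32(FLO_MAGIC):
            raise ValueError("invalid .flo file: bad magic value")
        w, h = (int(v) for v in np.fromfile(f, dtype="<i4", count=2))
        data = np.fromfile(f, dtype="<f4", count=2 * w * h)
    return data.reshape(h, w, 2)


# Demo: round-trip a tiny synthetic 3x2 flow field through the assumed layout
flow_in = np.arange(12, dtype="<f4").reshape(2, 3, 2)
fd, tmp = tempfile.mkstemp()
os.close(fd)
with open(tmp, "wb") as f:
    f.write(np.array(FLO_MAGIC, dtype="<f4").tobytes())
    f.write(np.array([3, 2], dtype="<i4").tobytes())
    f.write(flow_in.tobytes())
flow_out = read_flo_sketch(tmp)
os.remove(tmp)
```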
I personally see no strong reason to create a new class or a new helper that will add maintenance overhead; we don't need a helper for anything. I would just copy-paste it and be done with it.
Not necessarily; I don't think we need to worry about consistency between the two.
That is true with two caveats:
As for simplicity, my implementation is even shorter and hides all of the complexity above from the user. Plus it is as self-contained as your version in the sense that both implementations let another function do the heavy lifting for them.
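One of the caveats around mutable memory is worth spelling out: `np.frombuffer` over an immutable `bytes` object yields a read-only array, so downstream consumers that need writable memory (e.g. `torch.from_numpy`) require a copy. A minimal numpy-only sketch of the trade-off:

```python
import numpy as np

buf = b"\x00\x00\x80?" * 3  # three little-endian float32 values of 1.0

arr = np.frombuffer(buf, dtype="<f4")
print(arr.flags.writeable)  # False: bytes is immutable, so the view is read-only

# .astype() with the default copy=True always copies, yielding a writable
# array at the cost of a memcopy
writable = arr.astype("<f4")
print(writable.flags.writeable)  # True

# Wrapping the payload in a mutable bytearray instead makes the view
# writable up front (bytearray(buf) itself copies once)
arr2 = np.frombuffer(bytearray(buf), dtype="<f4")
print(arr2.flags.writeable)  # True
```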
#4598 was specifically about using
Thanks @pmeier , I made some comments below.
Before merging, I think it would be nice to have some tests ensuring that the new implementation of `read_flo` is consistent with the current one.
I would also be curious if you could run a quick benchmark to see if there's a significant time difference between the two.
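Such a consistency test might look like the following sketch. Since the two flow readers aren't available in a self-contained example, `np.fromfile` and `np.frombuffer` stand in here for the current and new implementations; the point is the structure of the check, not the functions under test:

```python
import os
import tempfile

import numpy as np

# Hypothetical consistency check: decode the same file through two
# independent code paths and assert the results agree.
fd, path = tempfile.mkstemp()
os.close(fd)

expected = np.arange(16, dtype="<f4")
with open(path, "wb") as f:
    f.write(expected.tobytes())

with open(path, "rb") as f:
    via_fromfile = np.fromfile(f, dtype="<f4")
with open(path, "rb") as f:
    via_frombuffer = np.frombuffer(f.read(), dtype="<f4")

np.testing.assert_array_equal(via_fromfile, via_frombuffer)
np.testing.assert_array_equal(via_fromfile, expected)

os.remove(path)
```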
```python
import contextlib
import pathlib
import time

import torch

from torchvision.datasets._optical_flow import _read_flo as read_flo_baseline
from torchvision.prototype import datasets
from torchvision.prototype.datasets.utils._internal import read_flo as read_flo_new

files = [
    file
    for file in (pathlib.Path(datasets.home()) / "sintel" / "training" / "flow").glob("**/*")
    if file.suffix == ".flo"
]


@contextlib.contextmanager
def timeit(label):
    start = time.perf_counter()
    yield
    stop = time.perf_counter()
    print(label, f"{(stop - start) / len(files) * 1e3:.1f} milliseconds per file")


with timeit("baseline"):
    for file in files:
        torch.from_numpy(read_flo_baseline(file).astype("f4", copy=False))

with timeit("baseline+copy"):
    for file in files:
        torch.from_numpy(read_flo_baseline(file).astype("f4"))

with timeit("new"):
    for file in files:
        with open(file, "rb") as f:
            read_flo_new(f)
```
So roughly 8x slower than before, but in absolute terms this is probably insignificant within an image training loop. The memcopy is the offender: if we add it to the old implementation as well, the difference is negligible.
I've added support for dropping the memcopy if the file is opened in update mode (`r+b`). With

```python
with timeit("new+update"):
    for file in files:
        with open(file, "r+b") as f:
            read_flo_new(f)
```

added, the benchmark script now puts the new implementation in the regular case at roughly 4x slower than the original one.
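Why update mode helps can be illustrated with a numpy-only sketch (an assumption about the mechanism, not the actual PR code): a file opened `"r+b"` can be memory-mapped writable, so `np.frombuffer` yields a writable array without a defensive copy.

```python
import mmap
import os
import tempfile

import numpy as np

# Setup for the demo: write a small binary file to map
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(np.arange(4, dtype="<f4").tobytes())

# A writable memory map over a file opened in update mode gives numpy a
# mutable buffer, so no memcopy is needed to obtain a writable array
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    arr = np.frombuffer(mm, dtype="<f4")
    writeable = arr.flags.writeable  # True, unlike a view over bytes
    total = float(arr.sum())  # use the data while the mapping is alive
    del arr  # release the buffer export before the mapping is closed

os.remove(path)
```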
Thanks @pmeier, some last comments on the tests; looks good otherwise.
Thanks a lot for your work and patience @pmeier , approving now since the remaining comments are a bit trivial / not very important.
…es (#4882)

Summary:
* add FloReader datapipe
* add NumericBinaryReader
* revert unrelated change
* cleanup
* cleanup
* add comment for byte reversal
* use numpy after all
* appease mypy
* use .astype() with copy=False
* add docstring and cleanup
* reuse current _read_flo and revert MNIST changes
* cleanup
* revert demonstration
* refactor
* cleanup
* add support for mutable memory
* add test
* add comments
* catch more exceptions
* fix mypy
* fix variable names
* hardcode flow sizes in test
* add fix dtype docstring
* expand comment on different reading modes
* add comment about files in update mode
* add tests for fromfile
* cleanup
* cleanup

Reviewed By: NicolasHug

Differential Revision: D32694313

fbshipit-source-id: 53c7c9ed32a948ad4bddc0b219c01e291835206d
This is not needed right now, but helps @krshrimali who is porting the Sintel "legacy" dataset to prototypes. This has quite a few parallels to what we do for MNIST:
vision/torchvision/prototype/datasets/_builtin/mnist.py
Line 42 in 6d9a42c
We can probably merge these two in the future.
cc @pmeier @bjuncek