Add a video dataset reader/writer #1312
Conversation
Thank you for putting so much time and effort into this @daniel-falk 🙏 It's looking great!
Hi @daniel-falk, are you able to continue working on this PR?
Hi @MerelTheisenQB, I definitely intend to continue working on the commit - it just got shadowed by a lot of other work I needed to get done first. It will likely take until May before I can focus more on this. My plan is to write some unit tests next.
Okay great, thanks for the heads-up! And do let us know if you need any help at all completing this 🙂
Here is a first set of unit tests. These mostly test the interface; I intend to add more tests for the actual encoding and decoding of video. Once I have done that and verified that the current implementation is good, I would appreciate help from you to test it together with the rest of the integrations, such as other datasets (e.g. the versioned dataset) and the different infrastructures it should integrate with.
How should the OpenCV requirement be installed? I have added it to a
@noklam would you mind helping @daniel-falk get this awesome contribution over the line?
@daniel-falk Could you change the relative import to an absolute path and remove the suppression block for now? I suspect it hides the actual error from the CI log. You'll also need to include the necessary dependencies in
Force-pushed 33ebba9 to 0af433e
@noklam Do you have a Windows-based environment set up? In that case, could you help me get the CI through for Windows?
The container vs codec support on Windows and Linux looks like this: EDIT: This table is not correct; OpenCV silently defaults to a supported codec if the one specified is not compatible with the container format. This results in a warning in the log, but the test did not fail. The test is updated and a new table is on the way... (original table omitted)
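Since the fallback only shows up as a warning, a test has to read back the FOURCC tag of the written file to see which codec was actually used. Here is a minimal sketch of such a check (my assumption about how it could look, not code from the PR; the `fourcc_code`/`fourcc_tag` helpers are illustrative and mirror OpenCV's little-endian FOURCC packing):

```python
def fourcc_code(tag: str) -> int:
    """Pack a 4-character codec tag into the 32-bit FOURCC integer,
    using the same little-endian packing as cv2.VideoWriter_fourcc."""
    return sum(ord(c) << (8 * i) for i, c in enumerate(tag))

def fourcc_tag(code: int) -> str:
    """Unpack a FOURCC integer, e.g. the value read back with
    cv2.VideoCapture(path).get(cv2.CAP_PROP_FOURCC)."""
    return "".join(chr((int(code) >> (8 * i)) & 0xFF) for i in range(4))

# In a test one would open the written file and compare:
#   actual = fourcc_tag(cv2.VideoCapture(str(path)).get(cv2.CAP_PROP_FOURCC))
#   assert actual == "mp4v", f"OpenCV silently fell back to {actual}"
print(fourcc_tag(fourcc_code("mp4v")))  # mp4v
```

This turns the silent fallback into a hard test failure instead of a log line that is easy to miss.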
Don't worry about it for now, I have seen the same error in my other PR, which is also blocked by some Windows CI; once that gets fixed I am pretty sure this would pass. Thanks a lot for your effort! Really appreciate it.
Force-pushed ccf7bb7 to 8c6b699
Here is the codec-container compatibility matrix as produced by the CI pipeline. Some codecs can be installed separately (such as the h264 codec, which has a license that prevents it from being shipped together with libraries). (matrix table omitted)
I have added unit tests now to get 100% coverage. @noklam It seems like it is just one unit test that fails on Windows now. Did you have a Windows environment set up so that you could take a look at this? The problem is not related to S3; it fails to write the file to disk when using the temp dir, while the other tests pass when writing video with the same format to other paths.
Amazing work!
Trying to re-run the Windows tests; we have seen the same annoying PEP 517 error for a while. It's non-deterministic, let's check if the tests still fail.
Have you had time to look any further at this @noklam? Or is there anyone else on the team who could help me get it through @datajoely?
Hi @daniel-falk, apologies, we were waiting for this to get out of draft!
Amazing work both - thanks for pushing, @daniel-falk, this is an awesome contribution!
Awesome, I have converted this back to draft for now; feel free to make it "Ready for Review" to signal that and I will review it again!
Hi @daniel-falk, do you still want to complete this PR? We'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details). Otherwise we'll close this PR and ask you to re-open it on the new repo when it's ready.
Hi @merelcht. Sorry for being slow on this, some other projects took priority. It would be nice to get it merged before changes are made; when is the move planned to start? I just uploaded a new commit which is:
I still need to do a bit more testing to verify that it works as expected in all situations. I looked a bit into why the tests did not fail due to the missing credentials. As far as I can understand, that is not tested by:

```python
import os

import boto3
import pytest
from moto import mock_s3


@pytest.fixture(scope='function')
def aws_credentials():
    """Mocked AWS Credentials for moto."""
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'
    os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'


@pytest.fixture(scope='function')
def s3(aws_credentials):
    with mock_s3():
        yield boto3.client('s3', region_name='us-east-1')


def test_create_bucket_creds(s3):
    # A client created with entirely different credentials still sees the
    # same mocked bucket, so moto does not verify the credentials at all.
    s32 = boto3.client('s3', region_name='us-east-1',
                       aws_access_key_id="other", aws_secret_access_key="other")
    s32.create_bucket(Bucket="somebucket")
    result = s3.list_buckets()
    assert len(result['Buckets']) == 1
    assert result['Buckets'][0]['Name'] == 'somebucket'
```
@daniel-falk We aim to move it at the end of next week, with a new Kedro minor release coming. If it helps, I am fine with a more basic version without versioning at this point; you can add the versioning feature later in the Kedro-dataset package.
@noklam That's a good idea to reduce the size of the test matrix. I have removed the inheritance of the versioned dataset now. I have also tested the dataset in a pipeline, both with local storage and with S3 storage, as per this repo.
test_requirements.txt (Outdated)

```diff
@@ -13,6 +13,7 @@ dill~=0.3.1
 filelock>=3.4.0, <4.0
 gcsfs>=2021.4, <=2022.1
 geopandas>=0.6.0, <1.0
+h5py>3.5.0
```
@MerelTheisenQB Something looks weird here, it seems to be our develop code that is not supposed to be here.
Main: https://github.com/kedro-org/kedro/blob/main/test_requirements.txt#L16
Develop: https://github.com/kedro-org/kedro/blob/develop/test_requirements.txt#L16
```python
self._n_frames = length
self._gen = more_itertools.peekable(frames)
self._fourcc = fourcc
self._size = self._gen.peek().size
```
Is this the only place you need this? It would be great if we could remove the extra dependency on `more-itertools`, as it doesn't seem to be doing much here.
I am not familiar with this library; I guess the goal is to peek at the first element so you can know the size of the element before iterating the generator. However, it seems to skip the first element somehow? Is this expected behavior?
In your example `peek` will be chained to include the first element even after peeking. In my code this is `self._gen`, which is what is used later. `frames` will be missing the first element after peeking, but that is not used anymore. Your example should instead be:

```python
In [14]: def gen():
    ...:     for i in range(5):
    ...:         yield i
    ...:

In [15]: g = gen()

In [16]: p = more_itertools.peekable(g)

In [17]: p.peek()
Out[17]: 0

In [18]: list(p)
Out[18]: [0, 1, 2, 3, 4]
```
I changed the implementation to use regular itertools instead of more_itertools.
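A minimal sketch of that plain-itertools approach (an assumption about the shape of the change, not the PR's exact code): pull the first frame with `next()` so it can be inspected, then chain it back in front so downstream iteration still sees every frame.

```python
import itertools

def peek_first(frames):
    """Return (first_frame, restored_iterator): inspect the first frame
    without losing it for the later full iteration."""
    first = next(frames)
    return first, itertools.chain([first], frames)

# Stand-in generator for a stream of video frames:
first, frames = peek_first(f for f in range(5))
print(first, list(frames))  # 0 [0, 1, 2, 3, 4]
```

This gives the same "look at the first element to learn the frame size" behaviour without the extra dependency.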
Cool, thanks for explaining. As I don't see the `peek` function as useful here (it almost behaves the same as `next`), I think it's nice to remove the `more_itertools` dependency.
```python
f_target.write(tmp_file.read())

def _write_to_filepath(self, video: AbstractVideo, filepath: str) -> None:
    # TODO: This uses the codec specified in the VideoDataSet if it is not None, this is due
```
Is this something to be done?
I think this is good enough for now. According to the unit tests, mp4v (which this defaults to unless explicitly overridden) supports both avi and mp4 containers on both Windows and Linux. mp4v is also the only codec that is present in many installations. Some other formats might fail to encode with this, but then OpenCV will fall back to a codec that works and print a warning message in the log. In the future I do however think it is worth spending some thought on whether there is a better way to derive the best codec.
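As a purely hypothetical direction for deriving a better default, the codec could be looked up from the container suffix with mp4v as the final fallback. The `pick_fourcc` helper and its lookup table below are illustrative and not part of the PR:

```python
from pathlib import Path

# Illustrative entries only; real values would come from a measured
# codec-container compatibility matrix like the one from the CI pipeline.
DEFAULT_FOURCC = {".mp4": "mp4v", ".avi": "mp4v"}

def pick_fourcc(filepath, explicit=None):
    """Use the codec configured on the dataset if given; otherwise derive
    a default from the container suffix, falling back to mp4v."""
    if explicit is not None:
        return explicit
    return DEFAULT_FOURCC.get(Path(filepath).suffix, "mp4v")

print(pick_fourcc("out.avi"))          # mp4v
print(pick_fourcc("out.avi", "MJPG"))  # MJPG
```

The benefit over relying on OpenCV's silent fallback is that the chosen codec is known before writing, so it can be logged or validated up front.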
Thank you, that makes sense!
Thank you so much for this contribution @daniel-falk ! I'm super impressed with all the tests you've added for this as well ⭐ ⭐
I've left some comments in line and have some additional ones here:
- Don't forget to update the `RELEASE.md` to mention this new dataset
- One thing I don't fully understand is the purpose of all the different video classes. As a user do I need to specify if I want to use the `SequenceVideo` vs the `GeneratorVideo`?
@merelcht If you are only working with video files then you do not need to do that. When you open a video file from a path it will automatically become a
I'll add an example of this too in the docstring.
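To illustrate the split, here is a simplified sketch of the two classes as described above (an approximation, not the PR's exact implementation): `SequenceVideo` wraps an in-memory sequence of frames, while `GeneratorVideo` wraps a one-shot generator plus an explicit length, so a large video never has to fit in memory.

```python
class SequenceVideo:
    """Video backed by an in-memory sequence of frames (e.g. a list)."""
    def __init__(self, frames, fps=25):
        self._frames = list(frames)
        self.fps = fps
    def __len__(self):
        return len(self._frames)
    def __iter__(self):
        return iter(self._frames)

class GeneratorVideo:
    """Video backed by a one-shot generator; the length must be given
    explicitly since frames are only produced on demand."""
    def __init__(self, frames, length, fps=25):
        self._frames = frames
        self._length = length
        self.fps = fps
    def __len__(self):
        return self._length
    def __iter__(self):
        return self._frames

gv = GeneratorVideo((f for f in range(3)), length=3)
print(len(gv), list(gv))  # 3 [0, 1, 2]
```

The generator variant can only be iterated once, which is exactly the streaming read-transform-write use case the PR describes.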
Force-pushed b5e75d3 to 8acf485
RELEASE.md (Outdated)

```diff
@@ -16,6 +16,7 @@
 * You can configure config file patterns through `settings.py` without creating a custom config loader.

 ## Bug fixes and other changes
+* Added `VideoDataSet` to read and write video files and interact with their frames
```
I think we should move this to the "Major features and improvements" section above and format it in a table similar to the added datasets here: https://github.com/kedro-org/kedro/blob/main/RELEASE.md#major-features-and-improvements-7
Done.
Thank you for explaining and adding the examples! 👍
This is really really awesome work @daniel-falk !
Thank you so much for this epic contribution @daniel-falk ! ⭐ 🚀 I really appreciate how much work you've put into this!
This commit adds a video dataset that can read and write video files. The dataset is backed by OpenCV's video reader and writer and different buffer protocols such as PIL.Image and numpy.ndarray. There is one class for iterables (e.g. a list of frames) and one for generators. Since large videos might not fit in memory, this allows us to read one frame from the video file, transform the frame and write it to another video file, so only one frame at a time needs to be held in memory. The codecs that are supported depend on the codecs installed and on how OpenCV was compiled. Signed-off-by: Daniel Falk <daniel.falk.1@fixedit.ai>
Thanks for the positive feedback! Pushed a new commit with the updated RELEASE file.
@daniel-falk Thank you for your contribution! The release note looks good to me.
Description
This PR adds a video dataset to read and write video files. The idea was discussed in issue #1295.
When working with computer vision pipelines on video, it is in many cases much more effective to read the frames directly from the video. This removes the need for manual preprocessing to extract the frames, and frames encoded in a video file (with inter-frame compression) take much less disk space than separate frames stored as images. One of the few cases where this is not the best approach is when accessing the frames frequently, e.g. when training models, since it is then inefficient to decode the video each time. In this latter case we can however still use the video dataset as a preprocessing node in the pipeline to unpack the frames to an intermediate dataset.
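A hedged sketch of that preprocessing-node idea: a node that takes a video's frames and yields them for an intermediate image dataset. The `unpack_frames` name and the `stride` parameter are illustrative; the actual Kedro wiring is in the linked example repo.

```python
def unpack_frames(video, stride=1):
    """Yield every `stride`-th frame as an (index, frame) pair so a
    downstream dataset can store individual images."""
    for i, frame in enumerate(video):
        if i % stride == 0:
            yield i, frame

# With a stand-in "video" of 6 frames, keep every 2nd frame:
print([i for i, _ in unpack_frames(range(6), stride=2)])  # [0, 2, 4]
```

Decoding happens once in this node, and the model-training nodes then read the cheap-to-load intermediate frames instead of re-decoding the video on every epoch.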
Development notes
I would gladly accept help to finish the last details of this PR.
The current status is:
Here is a sample pipeline using the VideoDataSet both for reading and writing:
https://github.com/daniel-falk/kedro-video-example/tree/use-kedro-fork
Checklist
`RELEASE.md` file