
Add a video dataset reader/writer #1312

Merged
merged 2 commits into from Nov 15, 2022

Conversation

daniel-falk
Contributor

@daniel-falk daniel-falk commented Mar 6, 2022

Description

This PR adds a video dataset to read and write video files. The idea was discussed in issue #1295.

When working with computer vision pipelines on video, it is in many cases much more efficient to read the frames directly from the video. This removes the need for manual preprocessing to extract the frames, and frames encoded in a video file (with inter-frame compression) take much less space on disk than frames stored as separate images. One of the few cases where this is not the best approach is when accessing the frames frequently, e.g. when training models; then it is inefficient to decode the video each time. In this latter case we can, however, still use the video dataset as a preprocessing node in the pipeline to unpack the frames to an intermediate dataset.
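The "preprocessing node" idea above can be sketched in plain Python. This is a hypothetical illustration, not the PR's actual API: `video` stands in for any iterable of decoded frames, and the node materializes them once so downstream training nodes avoid re-decoding.

```python
# Hypothetical preprocessing node; `video` is a stand-in for an iterable of
# decoded frames (e.g. a loaded video dataset), not the PR's real classes.
def unpack_frames(video):
    """Materialize frames under stable keys so downstream nodes
    (e.g. model training) do not have to re-decode the video each run."""
    return {f"frame_{i:05d}": frame for i, frame in enumerate(video)}

frames = unpack_frames(iter([b"frame-0", b"frame-1"]))
```

The resulting dictionary could then be written to an intermediate image or partitioned dataset for fast repeated access.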

Development notes

I would gladly accept help to finish the last details of this PR.
The current status is:

  • Load video files
  • Save video files from other VideoDataSet
  • Save video files from list of frames
  • Save video files from generator yielding frames
  • Make it work with fsspec
  • Verify functionality with different containers and codecs
  • Write unittests
  • Test it with versioned dataset, partitioned dataset etc
  • Write documentation

Here is a sample pipeline using the VideoDataSet both for reading and writing:
https://github.com/daniel-falk/kedro-video-example/tree/use-kedro-fork

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@daniel-falk daniel-falk mentioned this pull request Mar 6, 2022
@datajoely
Contributor

Thank you for putting so much time and effort into this @daniel-falk 🙏 It's looking great!

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@merelcht merelcht linked an issue Mar 7, 2022 that may be closed by this pull request
@merelcht
Member

merelcht commented Apr 7, 2022

Hi @daniel-falk, are you able to continue working on this PR?

@daniel-falk
Contributor Author

Hi @MerelTheisenQB, I definitely intend to continue working on this - it just got overshadowed by a lot of other work I needed to get done first. It will likely take until May before I can focus more on this.

My plan is to write some unit tests next.

@merelcht
Member

merelcht commented Apr 7, 2022

> Hi @MerelTheisenQB, I definitely intend to continue working on this - it just got overshadowed by a lot of other work I needed to get done first. It will likely take until May before I can focus more on this.
>
> My plan is to write some unit tests next.

Okay great, thanks for the heads-up! And do let us know if you need any help at all completing this 🙂

@daniel-falk
Contributor Author

Here is a first set of unit tests. These mostly test the interface; I intend to add more tests for the actual encoding and decoding of video. Once I have done that and verified that the current implementation is good, I would appreciate help from you to test it together with the rest of the integrations, such as other datasets (e.g. the versioned dataset) and the different infrastructures it should integrate with.

@daniel-falk
Contributor Author

How should the OpenCV requirement be installed? I have added it to a video_requires in setup.py so that it is installed with the extra video or all, but it seems like it is not installed in the CI job? It looks like the CI only installs test_requirements.txt and requirements.txt.
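For reference, the extras wiring described here might look like the following setup.py fragment. This is a sketch: the exact OpenCV pin and the way the `all` extra is aggregated are assumptions, not the PR's actual code.

```python
# Sketch of a setup.py extras layout matching the description above;
# the opencv-python pin is an assumption, not the PR's actual requirement.
video_requires = ["opencv-python~=4.5.1"]

extras_require = {
    "video": video_requires,
}
# An "all" extra that aggregates every other extra, deduplicated and sorted.
extras_require["all"] = sorted(
    {req for reqs in extras_require.values() for req in reqs}
)
```

With such a layout, `pip install "kedro[video]"` would pull in OpenCV, but a CI job that only installs requirements.txt and test_requirements.txt would not.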

@datajoely
Contributor

@noklam would you mind helping @daniel-falk get this awesome contribution over the line?

@noklam
Contributor

noklam commented May 31, 2022

@daniel-falk Could you change the relative import to an absolute path and remove the suppression block for now? I suspect it hides the actual error from the CI log.

You'll also need to include the necessary dependencies in test_requirements.txt. Most of the CI jobs just install requirements.txt and test_requirements.txt.

@daniel-falk daniel-falk force-pushed the main branch 2 times, most recently from 33ebba9 to 0af433e Compare June 18, 2022 16:11
@daniel-falk
Contributor Author

daniel-falk commented Jun 18, 2022

@noklam Do you have a Windows-based environment set up? In that case, could you help me get the CI through for Windows?

  • The Windows unit test for saving to S3 fails; not sure why (the codec-container combination passes in the test_video_codecs test).
  • pip compile fails for Windows (not for Linux); not sure why.
  • The lint jobs fail, if I understand correctly due to files that I have not changed.
  • The Linux unit test job fails, as far as I understand just because I don't have 100% test coverage yet. I'm working on this.

The container vs codec support on Windows and Linux looks like this:

EDIT: This table is not correct; OpenCV silently defaults to a supported codec if the one specified is not compatible with the container format. This results in a warning in the log, but the test did not fail. The test is updated and a new table is on the way...

| Codec (fourcc) | Container | Linux | Windows |
|----------------|-----------|-------|---------|
| avc1 | mjpeg | - | - |
| avc1 | avi | - | - |
| avc1 | mp4 | - | OK |
| x264 | mjpeg | - | - |
| x264 | avi | - | - |
| x264 | mp4 | - | OK |
| mp4v | mjpeg | OK | OK |
| mp4v | avi | OK | OK |
| mp4v | mp4 | OK | OK |
| mjpg | mjpeg | OK | OK |
| mjpg | avi | OK | OK |
| mjpg | mp4 | OK | OK |
| xvid | mjpeg | OK | OK |
| xvid | avi | OK | OK |
| xvid | mp4 | OK | OK |
| divx | mjpeg | OK | OK |
| divx | avi | OK | OK |
| divx | mp4 | OK | OK |
| fmp4 | mjpeg | OK | OK |
| fmp4 | avi | OK | OK |
| fmp4 | mp4 | OK | OK |

@noklam
Contributor

noklam commented Jun 18, 2022

Don't worry about it for now. I have seen the same error in my other PR, which is also blocked by some Windows CI; once that gets fixed I am pretty sure this will pass.

Thanks a lot for your effort! Really appreciate it.

@daniel-falk daniel-falk force-pushed the main branch 2 times, most recently from ccf7bb7 to 8c6b699 Compare July 3, 2022 11:47
@daniel-falk
Contributor Author

Here is the codec-container compatibility matrix from the CI pipeline. Some codecs can be installed separately (such as the h264 codec, which has a restrictive license preventing it from being shipped together with libraries).

| Codec (fourcc) | Container | Linux | Windows |
|----------------|-----------|-------|---------|
| avc1 | mjpeg | - | - |
| avc1 | avi | - | - |
| avc1 | mp4 | - | - |
| x264 | mjpeg | - | - |
| x264 | avi | - | - |
| x264 | mp4 | - | - |
| mp4v | mjpeg | - | - |
| mp4v | avi | OK | OK |
| mp4v | mp4 | OK | OK |
| mjpg | mjpeg | - | - |
| mjpg | avi | - | - |
| mjpg | mp4 | - | - |
| xvid | mjpeg | - | - |
| xvid | avi | - | - |
| xvid | mp4 | - | - |
| divx | mjpeg | - | - |
| divx | avi | - | - |
| divx | mp4 | - | - |
| fmp4 | mjpeg | - | - |
| fmp4 | avi | - | - |
| fmp4 | mp4 | - | - |

@daniel-falk
Contributor Author

I have added unit tests now to get 100% coverage.

@noklam It seems like just one unit test fails on Windows now. Do you have a Windows environment set up so that you could take a look at this? The problem is not related to S3: it fails to write the file to disk when using the temp dir, while the other tests pass when writing video with the same format to other paths.

@datajoely
Contributor

Amazing work!

@noklam
Contributor

noklam commented Jul 4, 2022

Trying to re-run the Windows tests; we have seen the same annoying PEP 517 error for a while. It's non-deterministic, so let's check if the tests still fail.

@daniel-falk
Contributor Author

Have you had time to look any further at this @noklam? Or is there anyone else on the team who could help me get it through @datajoely?

@datajoely datajoely marked this pull request as ready for review July 21, 2022 18:15
@datajoely
Contributor

Hi @daniel-falk, apologies, we were waiting for this to get out of draft!

@datajoely datajoely requested a review from antonymilne July 21, 2022 18:19
@datajoely
Contributor

Amazing work both - thanks for pushing @daniel-falk this is an awesome contribution!

@noklam noklam marked this pull request as draft August 9, 2022 14:26
@noklam
Contributor

noklam commented Aug 9, 2022

Awesome. I have converted this back to draft for now; feel free to mark it "Ready for Review" to signal that and I will review it again!

@merelcht merelcht mentioned this pull request Nov 7, 2022
@merelcht
Member

merelcht commented Nov 7, 2022

Hi @daniel-falk, do you still want to complete this PR? We'd like to get all dataset-related PRs merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details).

Otherwise we'll close this PR and ask you to re-open it on the new repo when it's ready.

@daniel-falk
Contributor Author

Hi @merelcht. Sorry for being slow on this; some other projects took priority. It would be nice to get it merged before the changes are made - when is the move planned to start?

I just uploaded a new commit which is:

  • Rebased onto the latest main
  • Added the missing parameters to the init function
  • Squashed and cleaned up the commits
  • Addressed the failure on Windows that @noklam found
  • Inherited from the versioned dataset

I still need to do a bit of testing to verify that it works as expected in all situations.

I looked a bit into why the tests did not fail due to the missing credentials. As far as I can tell, that is not checked by moto. The same issue is true for other datasets: if I remove the credentials argument, the tests still pass. I can also replicate it with moto only; this test case passes with no issues:

```python
import os

import boto3
import pytest
from moto import mock_s3


@pytest.fixture(scope="function")
def aws_credentials():
    """Mocked AWS Credentials for moto."""
    os.environ["AWS_ACCESS_KEY_ID"] = "testing"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "testing"
    os.environ["AWS_SECURITY_TOKEN"] = "testing"
    os.environ["AWS_SESSION_TOKEN"] = "testing"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"


@pytest.fixture(scope="function")
def s3(aws_credentials):
    with mock_s3():
        yield boto3.client("s3", region_name="us-east-1")


def test_create_bucket_creds(s3):
    # A client with entirely different credentials can still create the bucket
    s32 = boto3.client(
        "s3",
        region_name="us-east-1",
        aws_access_key_id="other",
        aws_secret_access_key="other",
    )
    s32.create_bucket(Bucket="somebucket")

    result = s3.list_buckets()
    assert len(result["Buckets"]) == 1
    assert result["Buckets"][0]["Name"] == "somebucket"
```

@noklam
Contributor

noklam commented Nov 7, 2022

@daniel-falk We aim to move it at the end of next week, with a new Kedro minor release coming.

If it helps, I am fine with a more basic version without versioning at this point; you can add the versioning feature later in the kedro-datasets package.

@daniel-falk
Contributor Author

@noklam That's a good idea to reduce the size of the test matrix. I have removed the inheritance from the versioned dataset now. I have also tested the dataset in a pipeline, both with local storage and with S3 storage, as per this repo.

@daniel-falk daniel-falk marked this pull request as ready for review November 10, 2022 07:27
```
@@ -13,6 +13,7 @@ dill~=0.3.1
filelock>=3.4.0, <4.0
gcsfs>=2021.4, <=2022.1
geopandas>=0.6.0, <1.0
h5py>3.5.0
```
Contributor

@MerelTheisenQB Something looks weird here; this seems to be our develop code, which is not supposed to be here.

Main:
https://github.com/kedro-org/kedro/blob/main/test_requirements.txt#L16

Develop:
https://github.com/kedro-org/kedro/blob/develop/test_requirements.txt#L16

```python
self._n_frames = length
self._gen = more_itertools.peekable(frames)
self._fourcc = fourcc
self._size = self._gen.peek().size
```
Contributor

Is this the only place you need this? It would be great if we could remove the extra dependency on more-itertools, as it does not seem to do much here.

I am not familiar with this library; I guess the goal is to peek at the first element so you can know the size of the elements before iterating the generator. However, it seems to skip the first element somehow? Is this expected behavior?

[screenshot: IPython session where iterating the original generator after peek appears to skip the first element]

@daniel-falk

Contributor Author

In your example, peek will be chained to include the first element even after peeking. In my code this is self._gen, which is what is used later; frames will be missing the first element after peeking, but it is not used anymore. Your example should instead be:

```python
In [14]: def gen():
    ...:     for i in range(5):
    ...:         yield i
    ...:

In [15]: g = gen()
In [16]: p = more_itertools.peekable(g)
In [17]: p.peek()
Out[17]: 0
In [18]: list(p)
Out[18]: [0, 1, 2, 3, 4]
```

I changed the implementation to use regular itertools instead of more_itertools.
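The stdlib replacement can be sketched like this (a hypothetical helper, not the exact PR code): take the first frame to learn its size, then chain it back so the consumer still sees every frame.

```python
import itertools

def peek(frames):
    """Return the first item plus an iterator equivalent to the original."""
    first = next(frames)
    # Re-attach the consumed item in front of the remaining generator.
    return first, itertools.chain([first], frames)

gen = (i for i in range(5))
first, frames = peek(gen)
# `first` lets us inspect e.g. the frame size before writing;
# iterating `frames` still yields all five items.
```

This is equivalent to what `more_itertools.peekable` provides, without the extra dependency.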

Contributor

Cool, thanks for explaining. I don't see the peek function as very useful here, since it almost behaves the same as next, so I think it's nice to remove the more_itertools dependency.

```python
        f_target.write(tmp_file.read())

    def _write_to_filepath(self, video: AbstractVideo, filepath: str) -> None:
        # TODO: This uses the codec specified in the VideoDataSet if it is not None, this is due
```
Contributor

Is this something to be done?

Contributor Author

I think this is good enough for now. According to the unit tests, mp4v (which this defaults to unless explicitly overridden) supports both avi and mp4 containers on both Windows and Linux. mp4v is also the only codec that is available in many installations. Some other formats might fail to encode with this, but then OpenCV will fall back to a codec that works and print a warning message in the log. In the future I do think it is worth spending some thought on whether there is a better way to derive the best codec.

Contributor

Thank you, that makes sense!
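A future refinement of the default discussed above could make the codec choice explicit per container. This is a hypothetical helper seeded from the compatibility matrix earlier in the thread, not code from the PR:

```python
import os

# Hypothetical container-extension -> default fourcc table, based on the
# codec-container matrix posted earlier in this thread (mp4v worked for
# both avi and mp4 on Linux and Windows).
DEFAULT_FOURCC = {".mp4": "mp4v", ".avi": "mp4v"}

def fourcc_for(filepath, override=None):
    """Use the dataset's configured codec if given, else a known-good default."""
    if override is not None:
        return override
    ext = os.path.splitext(filepath)[1].lower()
    return DEFAULT_FOURCC.get(ext, "mp4v")
```

Such a lookup would surface the choice in one place instead of relying on OpenCV's silent fallback.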

Member

@merelcht merelcht left a comment

Thank you so much for this contribution @daniel-falk ! I'm super impressed with all the tests you've added for this as well ⭐ ⭐

I've left some comments in line and have some additional ones here:

  • Don't forget to update the RELEASE.md to mention this new dataset.
  • One thing I don't fully understand is the purpose of all the different video classes. As a user, do I need to specify if I want to use the SequenceVideo vs the GeneratorVideo?

kedro/extras/datasets/video/video_dataset.py Outdated Show resolved Hide resolved
kedro/extras/datasets/video/video_dataset.py Outdated Show resolved Hide resolved
kedro/extras/datasets/video/video_dataset.py Outdated Show resolved Hide resolved
@daniel-falk
Contributor Author

@merelcht
re: One thing I don't fully understand is the purpose of all the different video classes. As a user do I need to specify if I want to use the SequenceVideo vs the GeneratorVideo ?

If you are only working with video files then you do not need to do that. When you open a video file from a path it will automatically become a FileVideo. In some cases, however, you have a bunch of np.ndarrays that you want to create a video from; then you can wrap them in the SequenceVideo class and save them as a VideoDataSet. The same goes for GeneratorVideo: if you have a generator that yields frames, you can use that class to create a VideoDataSet from it.

I'll add an example on this too in the docstring.
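The streaming flow that GeneratorVideo enables can be sketched in plain Python (stand-in frame data and function names, not the PR's actual classes): a node reads one frame, transforms it, and yields it, so only one frame is held in memory at a time.

```python
def invert(frame):
    # Stand-in transform on a "frame" represented as a list of pixel values.
    return [255 - px for px in frame]

def transformed(frames):
    # Generator node: one frame is held in memory at a time.
    for frame in frames:
        yield invert(frame)

source = ([0, 128, 255] for _ in range(3))  # stand-in for decoded frames
result = list(transformed(source))
```

In a real pipeline the generator would be wrapped in a GeneratorVideo and saved through a VideoDataSet instead of being materialized with list().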

@daniel-falk daniel-falk force-pushed the main branch 3 times, most recently from b5e75d3 to 8acf485 Compare November 10, 2022 17:54
RELEASE.md Outdated
```
@@ -16,6 +16,7 @@
* You can configure config file patterns through `settings.py` without creating a custom config loader.

## Bug fixes and other changes
* Added `VideoDataSet` to read and write video files and interact with their frames
```
Member

I think we should move this to the Major features and improvements section above and format it in a table similar to the added datasets here https://github.com/kedro-org/kedro/blob/main/RELEASE.md#major-features-and-improvements-7

Contributor Author

done.

@merelcht
Member

> re: One thing I don't fully understand is the purpose of all the different video classes. As a user do I need to specify if I want to use the SequenceVideo vs the GeneratorVideo?
>
> If you are only working with video files then you do not need to do that. When you open a video file from a path it will automatically become a FileVideo. In some cases, however, you have a bunch of np.ndarrays that you want to create a video from; then you can wrap them in the SequenceVideo class and save them as a VideoDataSet. The same goes for GeneratorVideo: if you have a generator that yields frames, you can use that class to create a VideoDataSet from it.
>
> I'll add an example on this too in the docstring.

Thank you for explaining and adding the examples! 👍

@datajoely
Contributor

datajoely commented Nov 14, 2022

This is really really awesome work @daniel-falk !

Member

@merelcht merelcht left a comment

Thank you so much for this epic contribution @daniel-falk ! ⭐ 🚀 I really appreciate how much work you've put into this!

This commit adds a video dataset that can read and write video files. The
dataset is backed by OpenCV's video reader and writer and different
buffer protocols such as PIL.Image and numpy.ndarray. There is one class
for iterables (e.g. a list of frames) and one for generators. Since large
videos might not fit in memory, this ability allows us to read one frame
from the video file, transform the frame and write it to another video
file, so only one frame at a time needs to be in memory.

The codecs that are supported depend on the codecs installed
and on how OpenCV was compiled.

Signed-off-by: Daniel Falk <daniel.falk.1@fixedit.ai>
@daniel-falk
Contributor Author

Thanks for the positive feedback! Pushed a new commit with an updated RELEASE file.

@noklam
Contributor

noklam commented Nov 14, 2022

@daniel-falk Thank you for your contribution! The release note looks good to me.

@merelcht merelcht merged commit 13089ea into kedro-org:main Nov 15, 2022
luizvbo pushed a commit to luizvbo/kedro that referenced this pull request Nov 21, 2022
noklam added a commit that referenced this pull request Nov 30, 2022