[WIP] add 4MuLA #480
base: master
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master     #480      +/-   ##
==========================================
+ Coverage   96.67%   99.06%   +2.39%
==========================================
  Files          50       37      -13
  Lines        6160     3762    -2398
==========================================
- Hits         5955     3727    -2228
+ Misses        205       35     -170
```
Hey @AngeloMendes thanks so much for all the work and adding this great dataset :)
I made some comments, let me know if you want me to clarify anything further or if you need help with anything!
About the test files: I saw that you have a mini-index and that you reduced the size of the tsv files, that's great! Can you do the same thing with the spectrogram and the parquet files? You can trim the spectrogram to a few samples (i.e. a few seconds), for example.
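(Just as an illustration, not part of the PR: a rough sketch of how those test files could be trimmed. The file names and the spectrogram axis layout below are assumptions, not the actual paths in this repo.)

```python
import numpy as np
import pyarrow.parquet as pq

# Trim a spectrogram stored as .npy to a handful of frames
# (assuming shape (n_bins, n_frames); swap the axis if it's the other way around)
spec = np.load("spectrogram.npy")              # placeholder file name
np.save("spectrogram_mini.npy", spec[:, :20])  # keep only the first ~20 frames

# Keep only the first few rows of a parquet file
table = pq.read_table("tiny_4mula.parquet")    # placeholder file name
pq.write_table(table.slice(0, 5), "tiny_4mula_mini.parquet")
```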
```python
from mirdata.validate import md5
from mirdata.download_utils import RemoteFileMetadata, download_from_remote
from numpy import save
import pyarrow.parquet as pq
```
`pyarrow` is not a dependency in `mirdata`, is it possible to load the data with `csv` instead? Consider this comment low priority, because the `make_index` scripts are here for reproducibility and it's rare that users run them.
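(For illustration only: a minimal sketch of reading the metadata tsv with the standard-library `csv` module instead of `pyarrow`. The file name and column name are placeholders, not the actual ones in this PR.)

```python
import csv

# Read the tab-separated metadata without any extra dependency
with open("4mula_metadata.tsv", newline="", encoding="utf-8") as f:  # placeholder path
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        track_id = row["music_id"]  # placeholder column name
        # ... build the index entry for this track ...
```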
But it would be good practice :)
The `parquet` format compressed the data about 300% better than `csv`. I think parquet is an interesting format for reducing disk and memory use. I used `pyarrow` because it can read the dataset in batches, which reduces memory use.
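(For reference, a minimal sketch of that batch-reading pattern; the file name and batch size are placeholders.)

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("4mula.parquet")            # placeholder file name
for batch in pf.iter_batches(batch_size=1024):  # read in chunks of ~1024 rows
    rows = batch.to_pydict()                    # only one batch is in memory at a time
    # ... process this chunk ...
```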
Hey @magdalenafuentes! Thank you so much for your comments! The code review you made helped me a lot.
Hey @AngeloMendes! I'm stepping in for @magdalenafuentes.
It's looking good! I made a few small comments, and then I have a couple of questions:
- What is the data in the top-level `4mula/4mula` folders (the first two files seen here) used for? If they're unused, make sure to remove these files from the PR.
- I see there are two versions of the data - tiny_4mula and small_4mula. Am I right in understanding that "small_4mula" is a subset of "tiny_4mula"? We're working on how to better support multiple versions of datasets, and this seems like a perfect use case. For this PR, I'd suggest writing this loader for just "tiny_4mula", but calling it "4mula" with one index (`4mula_index.json`) and one `datasets/4mula.py` file, since the code for both is identical. Then in the datasets table (mirdata.rst/table.rst) it can have a single entry, but with multiple versions. As soon as Support multiple versions/samples #489 is addressed, we could use this dataset as a first use-case and extend it to support the small version. What do you think?
Hey again @AngeloMendes! Just a quick note to let you know that we now support more than one version of a dataset. If you want any help finishing this PR just let us know, we're happy to make some last changes to get this merged!
Hey @rabitt!
No problem at all!
Yes and no - there should only be one file. If anything isn't clear don't hesitate to ask!
Pull request to add the tiny version of the 4MuLA dataset to mirdata.
Issue #427
@magdalenafuentes adding the loader's checklist for quick review (@AngeloMendes feel free to check the boxes as you go):
Description
Please include the following information in the top-level docstring of the dataset's module `mydataset.py`:
Dataset loaders checklist:

- [ ] Create a script in `scripts/`, e.g. `make_my_dataset_index.py`, which generates an index file (a rough sketch is shown after this list).
- [ ] Run the script on the canonical version of the dataset and save the index in `mirdata/indexes/`, e.g. `my_dataset_index.json`.
- [ ] Create a module, e.g. `mirdata/my_dataset.py`.
- [ ] Create tests for your loader in `tests/datasets/`, e.g. `test_my_dataset.py`.
- [ ] Add your module to `docs/source/mirdata.rst` and `docs/source/quick_reference.rst`.
- [ ] Run `tests/test_full_dataset.py` on your dataset.

If your dataset is not fully downloadable there are two extra steps you should follow:

- [ ] Show that you have run `pytest -s tests/test_full_dataset.py --local --dataset my_dataset` once on your dataset locally and confirmed it passes.
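For reference, the index-generation step in the first checklist item could look roughly like the sketch below. This is only an illustration under assumptions: the directory layout, file extension, and track-dictionary keys are placeholders, and the exact index schema should follow mirdata's contributing guide.

```python
import json
import os

from mirdata.validate import md5

DATA_PATH = "/path/to/4mula"  # placeholder: local path to the downloaded dataset


def make_index(data_path):
    # One entry per track: relative path plus md5 checksum (keys are illustrative)
    tracks = {}
    for fname in sorted(os.listdir(data_path)):
        if not fname.endswith(".parquet"):  # placeholder file extension
            continue
        track_id = os.path.splitext(fname)[0]
        tracks[track_id] = {"data": (fname, md5(os.path.join(data_path, fname)))}

    index = {"version": "1.0", "tracks": tracks}
    with open("mirdata/indexes/4mula_index.json", "w") as f:
        json.dump(index, f, indent=2)


if __name__ == "__main__":
    make_index(DATA_PATH)
```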