
[WIP] add 4MuLA #480

Open · wants to merge 32 commits into master

Conversation

@AngeloMendes commented Mar 12, 2021

Pull request to add the tiny version of the 4MuLA dataset to mirdata.
Issue #427

@magdalenafuentes adding the loader's checklist for quick review (@AngeloMendes feel free to check the boxes as you go):

Description

Please include the following information in the top-level docstring of the dataset's module mydataset.py (a minimal skeleton follows this list):

  • Describe annotations included in the dataset
  • Indicate the size of the dataset (e.g. number of files and duration in hours)
  • Mention the origin of the dataset (e.g. creator, institution)
  • Describe the type of music included in the dataset
  • Indicate any relevant papers related to the dataset
  • Include a description of how the data can be accessed and the license it uses (if applicable)
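
For reference, a minimal skeleton of such a docstring; everything in angle brackets is placeholder text, not actual 4MuLA facts:

    """Example Dataset Loader

    <One-paragraph overview of the dataset.>

    Annotations: <types of annotations included>
    Size: <number of files and total duration, e.g. in hours>
    Origin: <creator, institution>
    Music: <type of music covered>
    Papers: <relevant publications>
    Access and license: <how to obtain the data; license, if applicable>
    """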

Dataset loaders checklist:

  • Create a script in scripts/, e.g. make_my_dataset_index.py, which generates an index file (a minimal sketch follows this checklist).
  • Run the script on the canonical version of the dataset and save the index in mirdata/indexes/ e.g. my_dataset_index.json.
  • Create a module in mirdata, e.g. mirdata/my_dataset.py
  • Create tests for your loader in tests/datasets/, e.g. test_my_dataset.py
  • Add your module to docs/source/mirdata.rst and docs/source/quick_reference.rst
  • Run tests/test_full_dataset.py on your dataset.
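
As a rough illustration of the first two checklist items, a minimal index script could look like the sketch below; the audio layout, file extension, and index filename are assumptions that will differ per dataset:

    import argparse
    import glob
    import json
    import os

    from mirdata.validate import md5  # checksum helper, as imported elsewhere in this PR

    def make_index(data_path):
        # One entry per track: path relative to the dataset root, plus an md5 checksum.
        index = {"version": "1.0", "tracks": {}}
        for path in sorted(glob.glob(os.path.join(data_path, "audio", "*.wav"))):
            track_id = os.path.splitext(os.path.basename(path))[0]
            index["tracks"][track_id] = {
                "audio": [os.path.relpath(path, data_path), md5(path)]
            }
        with open("mirdata/indexes/my_dataset_index.json", "w") as fhandle:
            json.dump(index, fhandle, indent=2)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Generate my_dataset index.")
        parser.add_argument("data_path", type=str, help="Path to the dataset root.")
        make_index(parser.parse_args().data_path)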

If your dataset is not fully downloadable, there are a few extra steps you should follow:

  • Contact the mirdata organizers by opening an issue or PR so we can discuss how to proceed with the closed dataset.
  • Show that the version used to create the checksum is the "canonical" one, either by getting the version from the dataset creator, or by verifying equivalence with several other copies of the dataset.
  • Make sure someone has run pytest -s tests/test_full_dataset.py --local --dataset my_dataset once on your dataset locally and confirmed it passes.

@AngeloMendes mentioned this pull request Mar 12, 2021

codecov bot commented Mar 12, 2021

Codecov Report

Merging #480 (6cd0fd7) into master (a5db106) will increase coverage by 2.39%.
The diff coverage is 100.00%.

❗ Current head 6cd0fd7 differs from the pull request's most recent head 744a171. Consider uploading reports for commit 744a171 to get more accurate results.

@@            Coverage Diff             @@
##           master     #480      +/-   ##
==========================================
+ Coverage   96.67%   99.06%   +2.39%     
==========================================
  Files          50       37      -13     
  Lines        6160     3762    -2398     
==========================================
- Hits         5955     3727    -2228     
+ Misses        205       35     -170     

@magdalenafuentes magdalenafuentes changed the title add 4MuLA [WIP] add 4MuLA Mar 13, 2021
Collaborator
@magdalenafuentes left a comment

Hey @AngeloMendes, thanks so much for all the work and for adding this great dataset :)

I made some comments, let me know if you want me to clarify anything further or if you need help with anything!

About the test files: I saw that you have a mini-index and that you reduced the size of the tsv files, that's great! Can you do the same thing with the spectrogram and the parquet files? You can trim the spectrogram to a few samples (i.e. a few seconds), for example.
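
For instance, a minimal sketch of how the fixtures could be trimmed; file names and slice sizes here are hypothetical:

    import numpy as np
    import pyarrow.parquet as pq

    # Keep only the first few rows of the parquet file for the test fixture.
    table = pq.read_table("tiny_4mula.parquet")  # hypothetical file name
    pq.write_table(table.slice(0, 4), "tests/resources/tiny_4mula_sample.parquet")

    # Trim the spectrogram to its first few frames (a few seconds of audio).
    spec = np.load("melspectrogram.npy")  # hypothetical file name
    np.save("tests/resources/melspectrogram_sample.npy", spec[:, :128])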

Review threads (all resolved):
  • docs/source/contributing.rst
  • docs/source/table.rst (outdated)
  • mirdata/datasets/tiny_4mula.py (five threads, outdated)
  • mirdata/download_utils.py (outdated)

Inline thread on the index script's imports:
from mirdata.validate import md5
from mirdata.download_utils import RemoteFileMetadata, download_from_remote
from numpy import save
import pyarrow.parquet as pq
Collaborator

pyarrow is not a dependency of mirdata; is it possible to load the data with csv instead?

Consider this comment low priority, because make_index scripts are here for reproducibility and it is rare that users run them.

Collaborator

But it would be good practice :)

Author

The parquet format was about 3x better than csv at compressing the data. I think parquet is an interesting format for reducing disk and memory use. I used pyarrow for its ability to read the dataset in batches, which reduces memory use.
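
For context, the batched-reading pattern referred to here looks roughly like the following; the file name and batch size are illustrative:

    import pyarrow.parquet as pq

    # Stream the parquet file in record batches instead of loading it whole.
    parquet_file = pq.ParquetFile("tiny_4mula.parquet")  # hypothetical file name
    for batch in parquet_file.iter_batches(batch_size=1024):
        frame = batch.to_pandas()  # process one chunk at a time
        print(frame.shape)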

Review thread: setup.py (outdated, resolved)
@AngeloMendes
Author

Hey @magdalenafuentes! Thank you so much for your comments! The code review that you made helped me a lot.
I followed the same idea and developed the module to download the small version of my dataset too.
I will wait for new comments :)

Collaborator
@rabitt left a comment


Hey @AngeloMendes! I'm stepping in for @magdalenafuentes.

It's looking good! I made a few small comments, and then I have a couple of questions:

  • What is the data in the top-level 4mula/4mula folders (the first two files seen here) used for? If it's unused, make sure to remove these files from the PR.
  • I see there are two versions of the data - tiny_4mula and small_4mula. Am I right in understanding that "small_4mula" is a subset of "tiny_4mula"? We're working on how to better support multiple versions of datasets, and this seems like a perfect use case. For this PR, I'd suggest writing this loader for just "tiny_4mula", but calling it "4mula" with one index (4mula_index.json), and one datasets/4mula.py file since the code for both is identical. Then in the datasets table (mirdata.rst/table.rst) it can have a single entry, but with multiple versions. As soon as Support multiple versions/samples #489 is addressed, we could use this dataset as a first use-case, and extend it to support the small version. What do you think?

Review threads (all resolved):
  • setup.py
  • mirdata/datasets/tiny_4mula.py (three threads)
Collaborator
@rabitt commented Apr 19, 2021

Hey again @AngeloMendes ! Just a quick note to let you know that we now support more than one version of a dataset.

If you want any help finishing this PR just let us know, we're happy to make some last changes to get this merged!

@AngeloMendes
Author

Hey @rabitt!
Sorry for the delay in responding...
So, can I keep the two dataset versions in separate instances/classes?

Collaborator
@rabitt commented Apr 21, 2021

> Sorry for the delay in responding...

No problem at all!

> So, can I keep the two dataset versions in separate instances/classes?

Yes and no: there should only be one file, mirdata/datasets/4mula.py, but now you can have two index files and different download data for the different versions. I wrote up some instructions for what to do here: https://mirdata.readthedocs.io/en/latest/source/contributing.html#multiple-versions

If anything isn't clear don't hesitate to ask!
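
For a rough picture of what that might look like, here is a hypothetical sketch following the multiple-versions pattern in the linked guide; the keys, filenames, and the core.Index signature should be checked against the current mirdata docs:

    from mirdata import core

    # One module (mirdata/datasets/4mula.py), one loader, several indexes:
    # the version key selects which index and download data are used.
    INDEXES = {
        "default": "tiny",
        "tiny": core.Index(filename="4mula_tiny_index.json"),
        "small": core.Index(filename="4mula_small_index.json"),
    }

    # A user would then pick a version at initialization time, e.g.:
    #     mirdata.initialize("4mula", version="small")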
