[WIP] add 4MuLA #480
base: master
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master     #480      +/-   ##
==========================================
+ Coverage   96.67%   99.06%   +2.39%
==========================================
  Files          50       37      -13
  Lines        6160     3762    -2398
==========================================
- Hits         5955     3727    -2228
+ Misses        205       35     -170
```
Hey @AngeloMendes thanks so much for all the work and adding this great dataset :)
I made some comments, let me know if you want me to clarify anything further or if you need help with anything!
About the test files: I saw that you have a mini-index and that you reduced the size of the tsv files, that's great! Can you do the same thing with the spectrogram and the parquet files? You can trim the spectrogram to a few samples (i.e. a few seconds), for example.
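(Just as an illustration, not part of the PR: a rough sketch of how those test files could be trimmed. The file names and the spectrogram axis layout below are assumptions, not the actual paths in this repo.)

```python
import numpy as np
import pyarrow.parquet as pq

# Trim a spectrogram stored as .npy to a handful of frames
# (assuming shape (n_bins, n_frames); swap the axis if it's the other way around)
spec = np.load("spectrogram.npy")              # placeholder file name
np.save("spectrogram_mini.npy", spec[:, :20])  # keep only the first ~20 frames

# Keep only the first few rows of a parquet file
table = pq.read_table("tiny_4mula.parquet")    # placeholder file name
pq.write_table(table.slice(0, 5), "tiny_4mula_mini.parquet")
```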
```python
from mirdata.validate import md5
from mirdata.download_utils import RemoteFileMetadata, download_from_remote
from numpy import save
import pyarrow.parquet as pq
```
`pyarrow` is not a dependency in `mirdata`, is it possible to load the data with `csv` instead? Consider this comment low priority, because the `make_index` scripts are here for reproducibility and it's rare that users run them.
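(For illustration only: a minimal sketch of reading the metadata tsv with the standard-library `csv` module instead of `pyarrow`. The file name and column name are placeholders, not the actual ones in this PR.)

```python
import csv

# Read the tab-separated metadata without any extra dependency
with open("4mula_metadata.tsv", newline="", encoding="utf-8") as f:  # placeholder path
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        track_id = row["music_id"]  # placeholder column name
        # ... build the index entry for this track ...
```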
But it would be good practice :)
The `parquet` format compressed the data about 300% better than `csv`. I think parquet is an interesting format for reducing disk and memory use. I used `pyarrow` because it can read the dataset in batches, which reduces memory use.
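(For reference, a minimal sketch of that batch-reading pattern; the file name and batch size are placeholders.)

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("4mula.parquet")            # placeholder file name
for batch in pf.iter_batches(batch_size=1024):  # read in chunks of ~1024 rows
    rows = batch.to_pydict()                    # only one batch is in memory at a time
    # ... process this chunk ...
```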
Hey @magdalenafuentes! Thank you so much for your comments! The code review you made helped me a lot.
Hey @AngeloMendes! I'm stepping in for @magdalenafuentes.
It's looking good! I made a few small comments, and then I have a couple of questions:
- What is the data in the top-level `4mula/4mula` folders (the first two files seen here) used for? If they're unused, make sure to remove these files from the PR.
- I see there are two versions of the data - tiny_4mula and small_4mula. Am I right in understanding that "small_4mula" is a subset of "tiny_4mula"? We're working on how to better support multiple versions of datasets, and this seems like a perfect use case. For this PR, I'd suggest writing this loader for just "tiny_4mula", but calling it "4mula" with one index (`4mula_index.json`) and one `datasets/4mula.py` file, since the code for both is identical. Then in the datasets table (mirdata.rst/table.rst) it can have a single entry, but with multiple versions. As soon as Support multiple versions/samples #489 is addressed, we could use this dataset as a first use-case and extend it to support the small version. What do you think?
Hey again @AngeloMendes! Just a quick note to let you know that we now support more than one version of a dataset. If you want any help finishing this PR just let us know, we're happy to make some last changes to get this merged!
Hey @rabitt!
No problem at all!
Yes and no - there should only be one file. If anything isn't clear don't hesitate to ask!
Pull request to add the tiny version of the 4MuLA dataset to mirdata.
Issue #427
@magdalenafuentes adding the loader's checklist for quick review (@AngeloMendes feel free to check the boxes as you go):
Description
Please include the following information in the top-level docstring of the dataset's module `mydataset.py`:
Dataset loaders checklist:

- [ ] Create a script in `scripts/`, e.g. `make_my_dataset_index.py`, which generates an index file (a rough sketch is shown after this list).
- [ ] Run the script on the canonical version of the dataset and save the index in `mirdata/indexes/`, e.g. `my_dataset_index.json`.
- [ ] Create a module, e.g. `mirdata/my_dataset.py`.
- [ ] Create tests for your loader in `tests/datasets/`, e.g. `test_my_dataset.py`.
- [ ] Add your module to `docs/source/mirdata.rst` and `docs/source/quick_reference.rst`.
- [ ] Run `tests/test_full_dataset.py` on your dataset.

If your dataset is not fully downloadable there are two extra steps you should follow:

- [ ] Show that you have run `pytest -s tests/test_full_dataset.py --local --dataset my_dataset` once on your dataset locally and confirmed it passes.
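For reference, the index-generation step in the first checklist item could look roughly like the sketch below. This is only an illustration under assumptions: the directory layout, file extension, and track-dictionary keys are placeholders, and the exact index schema should follow mirdata's contributing guide.

```python
import json
import os

from mirdata.validate import md5

DATA_PATH = "/path/to/4mula"  # placeholder: local path to the downloaded dataset


def make_index(data_path):
    # One entry per track: relative path plus md5 checksum (keys are illustrative)
    tracks = {}
    for fname in sorted(os.listdir(data_path)):
        if not fname.endswith(".parquet"):  # placeholder file extension
            continue
        track_id = os.path.splitext(fname)[0]
        tracks[track_id] = {"data": (fname, md5(os.path.join(data_path, fname)))}

    index = {"version": "1.0", "tracks": tracks}
    with open("mirdata/indexes/4mula_index.json", "w") as f:
        json.dump(index, f, indent=2)


if __name__ == "__main__":
    make_index(DATA_PATH)
```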