Add QUESST14 dataset #2290
Conversation
def __init__(
    self,
    root: Union[str, Path],
I am wondering if we can use a default path for the download location, similar to pretrained models. Sure, it will be different from the previous datasets, but I often find it's more convenient that way. Thoughts?
The default path will be tricky if we consider Windows (C:\) versus Linux and macOS (/usr/datasets/).
There are many OS abstractions for path manipulation already, so that should not be a problem.
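For example, a rough sketch of an OS-agnostic default using only the standard library (the environment variable name and the cache location below are illustrative assumptions, not a proposal for this PR):

```python
import os
from pathlib import Path

def default_dataset_root() -> Path:
    """Return a cross-platform default root for downloaded datasets."""
    # Hypothetical override hook; the variable name is made up for this sketch.
    override = os.environ.get("TORCHAUDIO_DATASETS_DIR")
    if override:
        return Path(override)
    # Path.home() resolves to C:\Users\<user> on Windows and to
    # /home/<user> or /Users/<user> on Linux/macOS, so no per-OS branching is needed.
    return Path.home() / ".cache" / "torchaudio" / "datasets"
```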
This seems out of scope for this PR, but I agree it could be good for managing consistent dataset paths and would let us better maintain cached dataset artifacts. It would not negatively affect users, since they can still pass their own root path if they have a preference. Do you have a suggestion on the path location?
Force-pushed from 7a8eb0e to bf81a63.
torchaudio/datasets/quesst14.py (outdated diff):
self.n_docs = len(doc_paths)
self.n_queries = len(query_paths)
self.data = query_paths + doc_paths
So `self.data` contains both docs and queries. I'm wondering, is there a case where only one subset is used?
The docs and dev subsets are used separately, according to s3prl: `docs` contains the audios for retrieval, while `dev` and `test` are query audios. It makes more sense to split `docs` out as a separate subset. We should also remove `n_docs` from the dataset, as it can be obtained via `len(dataset)` when the dataset is initialized with the `docs` subset. Same for `n_queries`.
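For illustration, usage with separate subsets might then look like this (a sketch; the import path and the exact constructor are assumptions based on this PR):

```python
from torchaudio.datasets import QUESST14

docs = QUESST14("path/to/quesst14", subset="docs")    # retrieval audios
queries = QUESST14("path/to/quesst14", subset="dev")  # query audios

n_docs = len(docs)        # replaces the n_docs attribute
n_queries = len(queries)  # replaces the n_queries attribute
```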
LGTM. Just some nits and we can merge it. Thanks!
language (str, optional): Language to get dataset for.
    Options: [None, ``albanian``, ``basque``, ``czech``, ``nnenglish``, ``romanian``, ``slovak``].
    (default: ``"nnenglish"``)
subset (str): subset of the dataset to use. Options: ["docs", "dev", "eval"].
Suggested change:
- subset (str): subset of the dataset to use. Options: ["docs", "dev", "eval"].
+ subset (str or None, optional): subset of the dataset to use. Options: ["docs", "dev", "eval"].
I'm thinking of keeping subset as a required parameter and not allowing `None`, as that is a bit vague given there is the `docs` subset. It seems `docs` would generally be extracted separately and used differently than the `dev` and `eval` sets, so supporting a dataset that could contain docs+dev only, or dev+eval only, etc. seems like it could get messy without additional class variables like len(docs). I think the user can just use `ConcatDataset` if they'd like to merge multiple subsets; thoughts?
Yeah, that's also what I'm thinking. Making it a required parameter is a good idea.
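A sketch of the user-side merge with `ConcatDataset` (assuming the constructor and import path discussed in this PR):

```python
from torch.utils.data import ConcatDataset
from torchaudio.datasets import QUESST14

# Merge query subsets on the user side rather than inside the dataset class.
dev = QUESST14("path/to/quesst14", subset="dev")
evl = QUESST14("path/to/quesst14", subset="eval")
all_queries = ConcatDataset([dev, evl])
```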
Co-authored-by: nateanl <zni@fb.com>
@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hey @carolineechen.
]

class QUESST14(Dataset):
@carolineechen Can we have `Quesst14` as the name of the dataset? I understand that the original name is all capitals, but there is no benefit in going against PEP-8's naming convention. Currently `torchaudio.datasets` has a couple of datasets named like that, but that's definitely not a standard and we should not make it one.
@mthrok I agree we shouldn't default to all caps for dataset naming as we have done with several datasets in the past, but as this dataset's original name is already all capitalized ("QUESST 2014 Multilingual..."), I think it makes sense for this to remain all caps, and to follow the dataset's conventional naming. What do you think?
I think this is where sticking with the PEP-8 naming convention matters. It is unfortunate that existing datasets violate it, and we are not going to change them just for the sake of changing them, but as we add more datasets we should stick with PEP-8. The reason is that the boundary between allowing all caps and not is somewhat arbitrary, which can lead to discussion and disagreement every time we add a dataset. It's like a formatting issue, where subjectivity causes disagreement and makes collaboration unproductive. Sticking with PEP-8 lets us reason more clearly about the why: if all caps were ever required for a technical reason, that would be clearer.
Summary: implementation adapted from [s3prl](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/quesst14_dtw/dataset.py). Modifying the s3prl downstream expert to [this](carolineechen/s3prl@adc91a5) and using this dataset implementation produces the same results as using the original s3prl pipeline.

Pull Request resolved: pytorch#2290
Reviewed By: nateanl
Differential Revision: D35692551
Pulled By: carolineechen
fbshipit-source-id: 035ad161d4cbbd2072411cfdf89984b73a89868c