
Add QUESST14 dataset #2290

Closed
wants to merge 6 commits into from

Conversation

carolineechen
Contributor

@carolineechen carolineechen commented Mar 25, 2022

Implementation adapted from s3prl.

Modifying the s3prl downstream expert to use this dataset implementation produces the same results as the original s3prl pipeline.


def __init__(
self,
root: Union[str, Path],
Collaborator

I am wondering if we can use a default path for the download location, similar to pretrained models. Sure, it will be different from the previous datasets, but I often find that it's more convenient that way. Thoughts?

Member

A default path will be tricky if we consider Windows (C:\) versus Linux and macOS (/usr/datasets/).

Collaborator

Many OS abstractions for path manipulation already exist, so that should not be a problem.
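For instance, a cross-platform default could be derived with the standard library alone. This is only a sketch: the directory layout (`torchaudio/datasets` under a per-user cache) and the environment variables consulted (`LOCALAPPDATA`, `XDG_CACHE_HOME`) are assumptions for illustration, not anything torchaudio actually does.

```python
import os
from pathlib import Path


def default_dataset_root(name: str) -> Path:
    """Return a per-user cache directory for a dataset, on any OS."""
    if os.name == "nt":
        # Windows: %LOCALAPPDATA%\torchaudio\datasets\<name>
        base = Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData" / "Local"))
    else:
        # Linux/macOS: $XDG_CACHE_HOME (or ~/.cache)/torchaudio/datasets/<name>
        base = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache"))
    return base / "torchaudio" / "datasets" / name


root = default_dataset_root("quesst14")
```

The user-facing `root` argument could then simply default to `None` and fall back to this path.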

Contributor Author

This seems out of scope for this PR, but I agree it could be good for managing consistent dataset paths and would let us better maintain cached dataset artifacts. It also would not negatively affect users, since they can still pass their own root path if they have a preference. Do you have a suggestion for the path location?


self.n_docs = len(doc_paths)
self.n_queries = len(query_paths)
self.data = query_paths + doc_paths
Member

So self.data contains both docs and queries. I'm wondering, is there a case where only one subset is used?

Member

The docs and dev subsets are used separately, according to s3prl: docs are the audio files to retrieve from, while dev and test are the query audio. It makes more sense to split docs into a separate subset. We should also remove n_docs from the dataset, since it can be obtained via len(dataset) when the dataset is initialized with the docs subset; same for n_queries.
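The split being suggested can be sketched with a toy stand-in. The class shape and file names below are illustrative only, not the PR's actual code; the point is that each subset owns just its own file list, so `len(dataset)` replaces the `n_docs`/`n_queries` attributes.

```python
class Quesst14Subset:
    """Toy stand-in: each subset holds only its own file list."""

    _FILES = {
        "docs": ["doc_0001.wav", "doc_0002.wav", "doc_0003.wav"],  # retrieval audio
        "dev": ["query_0001.wav", "query_0002.wav"],               # dev queries
        "eval": ["query_1001.wav"],                                # eval queries
    }

    def __init__(self, subset: str):
        if subset not in self._FILES:
            raise ValueError(f"subset must be one of {sorted(self._FILES)}")
        self.data = list(self._FILES[subset])

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, i: int) -> str:
        return self.data[i]
```

With this shape, `len(Quesst14Subset("docs"))` gives the document count directly, so no separate counter attribute is needed.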

Member

@nateanl nateanl left a comment

LGTM. Just some nits and we can merge it. Thanks!

language (str, optional): Language to get dataset for.
Options: [None, ``albanian``, ``basque``, ``czech``, ``nnenglish``, ``romanian``, ``slovak``].
(default: ``"nnenglish"``)
subset (str): subset of the dataset to use. Options: ["docs", "dev", "eval"].
Member

Suggested change
subset (str): subset of the dataset to use. Options: ["docs", "dev", "eval"].
subset (str or None, optional): subset of the dataset to use. Options: ["docs", "dev", "eval"].

Contributor Author

I'm thinking of keeping subset as a required parameter and not allowing None, since None is a bit vague given that there is a docs subset. docs would generally be extracted separately and used differently than the dev and eval sets, so a dataset that supports docs+dev only, or dev+eval only, etc., could get messy without additional class variables like len(docs). Users can just use ConcatDataset if they'd like to merge multiple subsets. Thoughts?
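A sketch of that suggestion using `torch.utils.data.ConcatDataset`; the two list-backed datasets below are stand-ins for QUESST14 subsets (real items would be waveform tuples, not strings).

```python
from torch.utils.data import ConcatDataset, Dataset


class _ListDataset(Dataset):
    """Stand-in for a QUESST14 subset, backed by a plain list."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        return self.items[i]


dev = _ListDataset(["dev_0.wav", "dev_1.wav"])
eval_set = _ListDataset(["eval_0.wav"])

# ConcatDataset presents the two subsets as one dataset, indexed end to end.
merged = ConcatDataset([dev, eval_set])
```

So a user who wants dev+eval together gets it in one line, without the dataset class having to support multi-subset loading.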

Member

Yeah, that's also what I'm thinking. Making it a required parameter is a good idea.

Co-authored-by: nateanl <zni@fb.com>
@carolineechen carolineechen marked this pull request as ready for review April 15, 2022 22:56
@facebook-github-bot
Contributor

@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@github-actions

Hey @carolineechen.
You merged this PR, but labels were not properly added. Please add a primary and secondary label (see https://github.com/pytorch/audio/blob/main/.github/process_commit.py)

]


class QUESST14(Dataset):
Collaborator

@carolineechen Can we have Quesst14 as the name of the dataset class?
I understand that the original name is all capitals, but there is no benefit in going against PEP-8's naming convention. torchaudio.datasets currently has a couple of datasets named like that, but that is definitely not a standard, and we should not make it one.

Contributor Author

@carolineechen carolineechen Apr 20, 2022

@mthrok I agree we shouldn't default to all caps for dataset naming as we have done with several datasets in the past, but as this dataset's original name is already all capitalized ("QUESST 2014 Multilingual..."), I think it makes sense for this to remain all caps, and to follow the dataset's conventional naming. What do you think?

Collaborator

I think this is where sticking with the PEP-8 naming convention matters. It is unfortunate that existing datasets violate it, and we are not going to change them just for the sake of changing them, but as we add more datasets we should stick with PEP-8.

The reason is that the boundary between allowing all caps and not is somewhat arbitrary. That can lead to discussion and disagreement every time we add a dataset; it's like a formatting issue, where subjectivity causes disagreement and makes collaboration unproductive. Sticking with PEP-8 lets us reason more clearly: if all caps were ever required for a technical reason, that would be clearer too.

xiaohui-zhang pushed a commit to xiaohui-zhang/audio that referenced this pull request May 4, 2022
Summary:
implementation adapted from [s3prl](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/quesst14_dtw/dataset.py)

modifying the s3prl downstream expert to [this](carolineechen/s3prl@adc91a5), which uses this dataset implementation, produces the same results as the original s3prl pipeline

Pull Request resolved: pytorch#2290

Reviewed By: nateanl

Differential Revision: D35692551

Pulled By: carolineechen

fbshipit-source-id: 035ad161d4cbbd2072411cfdf89984b73a89868c