Add QUESST14 dataset #2290
Conversation
def __init__(
    self,
    root: Union[str, Path],
I am wondering if we can use a default path for the download location, similar to pretrained models. Sure, it will be different from the previous datasets, but I often find it's more convenient that way. Thoughts?
The default path will be tricky if we consider Windows (C:\) versus Linux and macOS (/usr/datasets/).
There are many OS abstractions for path manipulation already, so that should not be a problem.
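For example, a rough sketch of an OS-agnostic default using only the standard library (the environment variable name and the cache location below are illustrative assumptions, not a proposal for this PR):

```python
import os
from pathlib import Path

def default_dataset_root() -> Path:
    """Return a cross-platform default root for downloaded datasets."""
    # Hypothetical override hook; the variable name is made up for this sketch.
    override = os.environ.get("TORCHAUDIO_DATASETS_DIR")
    if override:
        return Path(override)
    # Path.home() resolves to C:\Users\<user> on Windows and to
    # /home/<user> or /Users/<user> on Linux/macOS, so no per-OS branching is needed.
    return Path.home() / ".cache" / "torchaudio" / "datasets"
```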
This seems out of scope for this PR, but I agree it could be good for managing consistent dataset paths and would let us better maintain cached dataset artifacts. It would not negatively affect users, since they can still pass their own root path if they have a preference. Do you have a suggestion on the path location?
Force-pushed from 7a8eb0e to bf81a63.
torchaudio/datasets/quesst14.py (outdated diff):
self.n_docs = len(doc_paths)
self.n_queries = len(query_paths)
self.data = query_paths + doc_paths
So `self.data` contains both docs and queries. I'm wondering, is there a case where only one subset is used?
The docs and dev subsets are used separately, according to s3prl: `docs` contains the audios for retrieval, while `dev` and `test` are query audios. It makes more sense to split `docs` out as a separate subset. We should also remove `n_docs` from the dataset, as it can be obtained via `len(dataset)` when the dataset is initialized with the `docs` subset. Same for `n_queries`.
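For illustration, usage with separate subsets might then look like this (a sketch; the import path and the exact constructor are assumptions based on this PR):

```python
from torchaudio.datasets import QUESST14

docs = QUESST14("path/to/quesst14", subset="docs")    # retrieval audios
queries = QUESST14("path/to/quesst14", subset="dev")  # query audios

n_docs = len(docs)        # replaces the n_docs attribute
n_queries = len(queries)  # replaces the n_queries attribute
```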
LGTM. Just some nits and we can merge it. Thanks!
language (str, optional): Language to get dataset for.
    Options: [None, ``albanian``, ``basque``, ``czech``, ``nnenglish``, ``romanian``, ``slovak``].
    (default: ``"nnenglish"``)
subset (str): subset of the dataset to use. Options: ["docs", "dev", "eval"].
Suggested change:
- subset (str): subset of the dataset to use. Options: ["docs", "dev", "eval"].
+ subset (str or None, optional): subset of the dataset to use. Options: ["docs", "dev", "eval"].
I'm thinking of keeping subset as a required parameter and not allowing `None`, as that is a bit vague given there is the `docs` subset. It seems `docs` would generally be extracted separately and used differently than the `dev` and `eval` sets, so supporting a dataset that could contain docs+dev only, or dev+eval only, etc. seems like it could get messy without additional class variables like len(docs). I think the user can just use `ConcatDataset` if they'd like to merge multiple subsets; thoughts?
Yeah, that's also what I'm thinking. Making it a required parameter is a good idea.
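A sketch of the user-side merge with `ConcatDataset` (assuming the constructor and import path discussed in this PR):

```python
from torch.utils.data import ConcatDataset
from torchaudio.datasets import QUESST14

# Merge query subsets on the user side rather than inside the dataset class.
dev = QUESST14("path/to/quesst14", subset="dev")
evl = QUESST14("path/to/quesst14", subset="eval")
all_queries = ConcatDataset([dev, evl])
```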
Co-authored-by: nateanl <zni@fb.com>
@carolineechen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hey @carolineechen.
]

class QUESST14(Dataset):
@carolineechen Can we have `Quesst14` as the name of the dataset? I understand that the original name is all capitals, but there is no benefit in going against PEP-8's naming convention. Currently `torchaudio.datasets` has a couple of datasets named like that, but that's definitely not a standard and we should not make it one.
@mthrok I agree we shouldn't default to all caps for dataset naming as we have done with several datasets in the past, but as this dataset's original name is already all capitalized ("QUESST 2014 Multilingual..."), I think it makes sense for this to remain all caps, and to follow the dataset's conventional naming. What do you think?
I think this is where sticking with the PEP-8 naming convention matters. It is unfortunate that existing datasets violate it, and we are not going to change them just for the sake of changing them, but as we add more datasets we should stick with PEP-8. The reason is that the boundary between allowing all caps and not is somewhat arbitrary, which can lead to discussion and disagreement every time we add a dataset. It's like a formatting issue, where subjectivity causes disagreement and makes collaboration unproductive. Sticking with PEP-8 lets us reason more clearly about the why: if all caps were ever required for a technical reason, that would be clearer.
Summary: implementation adapted from [s3prl](https://github.com/s3prl/s3prl/blob/master/s3prl/downstream/quesst14_dtw/dataset.py). Modifying the s3prl downstream expert to [this](carolineechen/s3prl@adc91a5) and using this dataset implementation produces the same results as using the original s3prl pipeline.

Pull Request resolved: pytorch#2290
Reviewed By: nateanl
Differential Revision: D35692551
Pulled By: carolineechen
fbshipit-source-id: 035ad161d4cbbd2072411cfdf89984b73a89868c