All examples are under the `egs` directory, and each example is named after its dataset. Datasets whose names start with "mock" are used for testing.

DataSet | Supported Tasks | Description |
---|---|---|
ATIS | Sequence labeling / Text classification / NLU joint learning | Air Travel Information System (ATIS) pilot corpus. |
CoNLL2003 | Sequence labeling | The CoNLL 2003 NER task consists of newswire text from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC). |
MSRA_NER | Sequence labeling | The MSRA NER dataset is a named entity recognition dataset drawn from the news domain. |
SNLI | Sentence Matching | The Stanford Natural Language Inference (SNLI) corpus is a freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. |
Quora_QP | Sentence Matching | Data collected from the Quora platform, a place to gain and share knowledge about anything. |
Yahoo_Answer | Document Classification | The Yahoo Answers data is obtained from Zhang et al. (2015). This is a topic classification task with 10 classes. Each document includes the question title, question context, and best answer. |
Trec | Document Classification | This data collection contains all the data used in the learning question classification experiments, including the question class definitions. |

DataSet | Supported Tasks | Description |
---|---|---|
hkust | ASR | HKUST Mandarin Telephone Speech |
voxceleb | Speaker Verification | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
iemocap | Emotion | The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multispeaker database collected at the SAIL lab at USC. |
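
The naming convention described at the top (one directory per dataset under `egs`, with test datasets prefixed by "mock") can be checked with a short script. This is a minimal sketch, not part of the repository: the helper name `split_examples` is hypothetical, and it assumes only that every example is a direct subdirectory of the `egs` root.

```python
from pathlib import Path


def split_examples(egs_root):
    """Return (real, mock) lists of dataset directory names under egs_root.

    Hypothetical helper: it relies only on the convention that each example
    is a subdirectory named after its dataset, and that test datasets have
    names starting with "mock".
    """
    # Keep directories only; plain files such as a README are ignored.
    names = sorted(p.name for p in Path(egs_root).iterdir() if p.is_dir())
    mock = [n for n in names if n.startswith("mock")]
    real = [n for n in names if not n.startswith("mock")]
    return real, mock
```

Calling `split_examples("egs")` from the repository root would return the real dataset names and the mock dataset names as two sorted lists.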