TDI

Textual Difficulty Identification dataset

The TDI (Textual Difficulty Identification) dataset is intended for training neural network models to distinguish the textual difficulty of English sentences.

Each topic consists of 10 sentences: 5 are labeled 1 (Normal Wikipedia) and the other 5 are labeled 0 (Simple Wikipedia).
The file format is the same as in GLUE SST-2, except for the additional 'title' column.
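As a rough illustration of the format described above, the following sketch parses a TDI-style TSV stream and counts labels per topic. The header names `sentence` and `label` follow the GLUE SST-2 convention and `title` follows the column named above; the presence of a header row, the exact column names, and the sample rows are assumptions, not taken from the dataset files.

```python
import csv
import io
from collections import Counter, defaultdict


def read_tdi(handle):
    """Parse a TDI-style TSV stream into (sentence, label, title) tuples.

    Assumes a header row with the columns 'sentence', 'label' and 'title'.
    """
    # GLUE-style TSV files are unquoted, so quote characters are kept literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["sentence"], int(row["label"]), row["title"]) for row in reader]


def label_counts_by_topic(rows):
    """Count how many sentences carry each label within every topic."""
    counts = defaultdict(Counter)
    for _sentence, label, title in rows:
        counts[title][label] += 1
    return counts


# Made-up rows for illustration only; they are not part of the dataset.
sample = (
    "sentence\tlabel\ttitle\n"
    "An apple is a fruit that grows on trees.\t0\tApple\n"
    "The apple is a pome fruit of the species Malus domestica.\t1\tApple\n"
)

rows = read_tdi(io.StringIO(sample))
counts = label_counts_by_topic(rows)
print(counts["Apple"])
```

To read a real split, the same function could be applied to an open file handle, for example `open("train.tsv", encoding="utf-8", newline="")`, assuming the file has the header described above. In the full dataset, each topic would be expected to show five sentences per label.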

Note

Collected text data dumps from both sources.
Extracted the text, removed any control/XML characters, and normalized Unicode.
Kept the filtering conditions minimal to avoid human bias that could affect model performance:
- First, sentence length must be between 70 and 210.
- Second, a topic must have at least five sentences to be included in the dataset.
Randomly selected five sentences from each source.
Labeled the sentences as described above.
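The steps above can be sketched roughly as follows. This is a minimal illustration, not the original collection script. Several details are assumptions: length is measured in characters with inclusive bounds, NFKC is used as the Unicode normalization form, "control / xml characters" is read as control characters plus XML-style tags, and the five-sentence minimum is applied per source. The function names `clean`, `keep` and `build_topic` are hypothetical.

```python
import random
import re
import unicodedata

MIN_LEN, MAX_LEN = 70, 210  # bounds from the note; assumed to be characters, inclusive
PER_SOURCE = 5              # sentences drawn from each source per topic

_TAG = re.compile(r"<[^>]+>")                  # naive XML/HTML tag matcher
_CONTROL = re.compile(r"[\x00-\x1f\x7f-\x9f]")  # C0 controls, DEL, C1 controls


def clean(text):
    """Normalize Unicode, drop tags and control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # NFKC is an assumption
    text = _TAG.sub(" ", text)
    text = _CONTROL.sub(" ", text)
    return " ".join(text.split())


def keep(sentence):
    """First filter: sentence length within the allowed range."""
    return MIN_LEN <= len(sentence) <= MAX_LEN


def build_topic(normal_sents, simple_sents, rng):
    """Return 10 (sentence, label) pairs for one topic, or None if it is too small."""
    normal = [s for s in map(clean, normal_sents) if keep(s)]
    simple = [s for s in map(clean, simple_sents) if keep(s)]
    # Second filter: enough sentences must remain (assumed to apply per source).
    if len(normal) < PER_SOURCE or len(simple) < PER_SOURCE:
        return None
    rows = [(s, 1) for s in rng.sample(normal, PER_SOURCE)]   # 1 = Normal Wikipedia
    rows += [(s, 0) for s in rng.sample(simple, PER_SOURCE)]  # 0 = Simple Wikipedia
    return rows


# Made-up input sentences for illustration only.
normal = ["Normal article sentence %d, padded to a plausible length: %s." % (i, "x" * 50)
          for i in range(6)]
simple = ["Simple article sentence %d, padded to a plausible length: %s." % (i, "y" * 50)
          for i in range(6)] + ["Too short."]

rng = random.Random(0)  # fixed seed so the selection is repeatable
rows = build_topic(normal, simple, rng)
print(len(rows), [label for _, label in rows])
```

Passing an explicit `random.Random` instance keeps the random selection reproducible without touching the global random state.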

Citation

You can cite the paper as follows:

@unpublished{Park2019MTD,
author  = {Dongjun Park},
title   = {Transformer: Measuring the Textual Difficulty of English Sentences},
year    = {2019},
note    = {unpublished},
}
