Add IMDB data fetcher #410

stefan-it · 2019-01-22T00:57:55Z

Hi,

This PR adds a data fetcher that downloads and processes the IMDB dataset.

Additionally, if no development dataset is specified, the training corpus for a text classification task is downsampled and the held-out documents are used as the development set.
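To illustrate the idea, here is a minimal sketch of holding out part of a training set as a development set. It is not the code from this PR: `split_off_dev`, its `dev_fraction` and `seed` parameters, and the plain random (non-stratified) sampling are assumptions made for illustration. The 10% share is inferred from the IMDB statistics reported in this PR (2,500 dev documents next to 22,500 remaining training documents).

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_off_dev(
    train: Sequence[T], dev_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: hold out a random share of the training documents as a dev set."""
    indices = list(range(len(train)))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the split is reproducible

    dev_size = round(len(train) * dev_fraction)
    dev_indices = set(indices[:dev_size])

    dev = [train[i] for i in indices[:dev_size]]
    # Keep the remaining training documents in their original order.
    remaining = [doc for i, doc in enumerate(train) if i not in dev_indices]
    return remaining, dev


# Stand-in documents; the IMDB training set ships with 25,000 reviews.
documents = [f"review {i}" for i in range(25000)]
train, dev = split_off_dev(documents)
print(len(train), len(dev))  # 22500 2500
```

Because the sampling here is not stratified, the class balance of the two parts can drift slightly from the original 50/50 split.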

To test the new IMDB data fetcher, just use:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask, TaggedCorpus

# Download IMDB corpus
NLPTaskDataFetcher.download_dataset(NLPTask.IMDB)

corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)

print(corpus.obtain_statistics())

The output of the corpus statistics:

{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 22500,
        "number_of_documents_per_class": {
            "pos": 11191,
            "neg": 11309
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 6168036,
            "min": 10,
            "max": 2786,
            "avg": 274.1349333333333
        }
    },
    "TEST": {
        "dataset": "TEST",
        "total_number_of_documents": 25000,
        "number_of_documents_per_class": {
            "pos": 12500,
            "neg": 12500
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 6714408,
            "min": 7,
            "max": 2768,
            "avg": 268.57632
        }
    },
    "DEV": {
        "dataset": "DEV",
        "total_number_of_documents": 2500,
        "number_of_documents_per_class": {
            "pos": 1309,
            "neg": 1191
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 700278,
            "min": 17,
            "max": 2003,
            "avg": 280.1112
        }
    }
}
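
As a quick sanity check on these numbers, the sketch below recomputes the averages and the size of the development split from the figures above. The dictionary layout and variable names are only illustrative; the values are copied from the statistics output.

```python
import math

# Figures copied from the corpus statistics above.
stats = {
    "TRAIN": {"docs": 22500, "pos": 11191, "neg": 11309, "tokens": 6168036, "avg": 274.1349333333333},
    "TEST": {"docs": 25000, "pos": 12500, "neg": 12500, "tokens": 6714408, "avg": 268.57632},
    "DEV": {"docs": 2500, "pos": 1309, "neg": 1191, "tokens": 700278, "avg": 280.1112},
}

for name, s in stats.items():
    # Class counts add up to the document count of each split.
    assert s["pos"] + s["neg"] == s["docs"], name
    # The reported average is total tokens divided by number of documents.
    assert math.isclose(s["tokens"] / s["docs"], s["avg"]), name

# TRAIN and DEV together make up the original 25,000-document training set.
original_train_docs = stats["TRAIN"]["docs"] + stats["DEV"]["docs"]
dev_share = stats["DEV"]["docs"] / original_train_docs
pos_total = stats["TRAIN"]["pos"] + stats["DEV"]["pos"]
neg_total = stats["TRAIN"]["neg"] + stats["DEV"]["neg"]

print(original_train_docs, dev_share, pos_total, neg_total)  # 25000 0.1 12500 12500
```

So the development set corresponds to 10% of the original training data, and TRAIN plus DEV together recover the balanced 12,500/12,500 class distribution.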


alanakbik · 2019-01-22T10:41:35Z

Hi @stefan-it, this is great, thanks! This will make it much easier for people to get started with text classification experiments!

data_fetcher: add data fetcher for IMDB corpus. Apply downsampling of training corpus for text classification, when no development dataset is specified

fa11530

alanakbik merged commit fcdcc1c into flairNLP:master Jan 22, 2019

This was referenced Feb 6, 2019

loading corpus in flair: how to import NLPTask.IMDB? #461

Closed

GH-461: mention IMDB data fetcher in documentation #477

Merged
