GH-457: PyTorch DataLoader #735

alanakbik · 2019-05-17T15:11:42Z

This PR refactors the way datasets are loaded in Flair. Instead of loading every dataset into memory as a List of Sentence objects, we now use PyTorch's Dataset and DataLoader classes and let the user choose whether or not to load a dataset into memory. This allows training to scale to very large datasets that do not fit into memory.

This PR also changes the syntax of how to load datasets.

Old way:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

New way:

import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

To use streaming loading, i.e. to read the data from disk on demand instead of loading it into memory, pass in_memory=False:

import flair.datasets
corpus = flair.datasets.UD_ENGLISH(in_memory=False)
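
To illustrate what the in_memory switch trades off, here is a minimal sketch in plain Python. It is not Flair's implementation: LineDataset is a hypothetical class that only mimics the `__len__`/`__getitem__` protocol used by map-style PyTorch datasets, and it assumes a simple one-sentence-per-line text file rather than any of the corpus formats Flair reads.

```python
import os
import tempfile


class LineDataset:
    """Hypothetical map-style dataset: one sentence per line of a text file."""

    def __init__(self, path, in_memory=True):
        self.path = path
        self.in_memory = in_memory
        self.sentences = []  # filled only when in_memory=True
        self.offsets = []    # byte offset of each line, filled only when streaming
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                if in_memory:
                    # Keep the full text of every sentence in RAM.
                    self.sentences.append(line.decode("utf-8").rstrip("\n"))
                else:
                    # Keep only one integer per sentence.
                    self.offsets.append(offset)

    def __len__(self):
        return len(self.sentences) if self.in_memory else len(self.offsets)

    def __getitem__(self, index):
        if self.in_memory:
            return self.sentences[index]
        # Streaming mode: re-read the requested sentence from disk.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            return f.readline().decode("utf-8").rstrip("\n")


def demo():
    """Build a tiny file and read it back in both modes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.txt")
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            f.write("I love Berlin .\nFlair is fun .\n")
        eager = LineDataset(path, in_memory=True)
        lazy = LineDataset(path, in_memory=False)
        return (
            [eager[i] for i in range(len(eager))],
            [lazy[i] for i in range(len(lazy))],
        )
```

Both modes return the same sentences. The streaming variant holds one offset per sentence instead of the sentence itself, at the cost of a disk read on every access. Because the class follows the map-style protocol, an instance could in principle be wrapped in torch.utils.data.DataLoader, though that is not shown here.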

Closes #458, #457, and #426.

kashif · 2019-05-20T09:57:03Z

👍

aakbik and others added 25 commits May 10, 2019 21:35

GH-457: begin refactor of data loaders

2972243

GH-457: begin refactor of data loaders

fd557e7

GH-457: changed training example for new dataset loader

948cb0c

GH-457: English and German UD DataLoader

9f05704

GH-457: round sampling of dev data

558ad40

GH-457: classification data loader

6042d95

GH-457: streaming data loading for classification

7d6ae79

GH-457: adapt unit tests for new data loaders

cbb8502

GH-457: MultiCorpus for DataLoader

6745fcd

GH-457: MultiCorpus for DataLoader

2c5ab0b

GH-457: Refactor Corpus interface to base class

4402492

GH-457: debug regression model

0c876ac

GH-457: support for CoNLL-03 Dutch and Spanish

c81c20a

GH-457: comment out failing unit test

8a055de

GH-457: Streaming DataLoading for NER corpora

fc3b8c6

GH-457: Streaming DataLoading for NER corpora

f39287e

GH-457: Add support for UD corpora

bca40e8

GH-457: More UD languages

3de5818

GH-457: Large UD corpora

028c83b

GH-457: BIOES conversion in data loader

8a4d2a5

GH-457: add deprecation warning to NLPTaskDataLoader

73cdaed

GH-457: num_workers as settable parameter in Trainer

4576777

Merge branch 'master' into GH-457-data-loader

1fa4530

make urllib version explicit to avoid conflict

3336f8f

formatting

c84ab52

alanakbik merged commit b2230ec into master May 20, 2019

alanakbik deleted the GH-457-data-loader branch May 20, 2019 09:44

alanakbik pushed a commit that referenced this pull request May 20, 2019

GH-735: speed improvements for loading large classification files

2027c16

alanakbik pushed a commit that referenced this pull request May 23, 2019

GH-735: add TREC dataset

a3987dc

david-MYS mentioned this pull request Nov 1, 2019

Does v0.4.4 support multi GPU? #1258

Closed
