Text classification datasets with new torchtext dataset abstraction #701

Merged 48 commits, Apr 21, 2020
Changes from 45 commits
Commits
9d8aa61
new dataset design
Feb 27, 2020
1031efb
remove doc
Feb 28, 2020
371c23e
minor
Feb 28, 2020
29e5d0a
Merge remote-tracking branch 'upstream/master' into new_dataset_design
Mar 2, 2020
8187b84
revise build_vocab func in torchtext.experimental.datasets.new_text_c…
Mar 2, 2020
6272f57
flake8
Mar 2, 2020
15e2a94
docs
Mar 2, 2020
6611a09
switch transforms to torch.nn.Module
Mar 3, 2020
7c31d2e
add default None to tokenizer_name in TokenizerTransform
Mar 3, 2020
c5d487a
jit support for Dict[str, int] vocab in VocabTransform
Mar 3, 2020
7c9c969
remove F821
Mar 3, 2020
4689b29
add functional.py file
Mar 3, 2020
fe90c51
minor fix to have split tokenizer scriptable
Mar 3, 2020
d9ef2ee
add functional.py file
Mar 3, 2020
e98ae46
add a wrapper to support one-command data loading
Mar 12, 2020
8ce6779
add raw file
Mar 12, 2020
8291ebc
flake8
Mar 13, 2020
1864e7d
update raw text classification dataset docs
Mar 19, 2020
64cbde6
minor docs
Mar 19, 2020
51d1b8e
add ngrams
Mar 19, 2020
855e701
add label transform
Mar 19, 2020
a955579
combine imdb and text classification datasets
Mar 20, 2020
fa3565b
add more attributes to dataset API
Mar 20, 2020
94870df
update text classification datasets docs
Mar 20, 2020
74f50b6
remove two transforms
Mar 20, 2020
55e4848
add get_vocab in text_classification
Mar 20, 2020
db66774
minor fix
Mar 20, 2020
5a20115
Add TextSequential
Mar 20, 2020
650928a
switch text classification to TextSequential
Mar 20, 2020
2447837
fix flake8 error
Mar 23, 2020
be20884
add vocab to dataset
Mar 23, 2020
a6bc30a
add docs strings for transforms.
Mar 23, 2020
3821282
move raw datasets to a separate folder
Mar 23, 2020
9b97ac2
.flake8 file
Mar 23, 2020
b565565
move raw text folder
Mar 23, 2020
ebe87f7
move transforms to experimental.datasets.text_classification
Mar 23, 2020
e382503
Fix IMDB
Mar 23, 2020
c711c34
remove some transforms in experimental text classification
Apr 1, 2020
9a0c3ac
switch raw dataset to iterable style
Apr 9, 2020
c6f6a42
add sequential_transforms
Apr 9, 2020
f1d394c
Merge branch 'master' into new_dataset_design
Apr 9, 2020
7404519
add get_iterator func
Apr 9, 2020
bc2c83a
flake8
Apr 9, 2020
644b759
support partial cache for raw text classification dataset
Apr 9, 2020
2f93dec
Merge branch 'master' into new_dataset_design
Apr 13, 2020
aa15019
change None arguments
Apr 14, 2020
793349c
change import raw path
Apr 14, 2020
33053e8
Merge branch 'master' into new_dataset_design
zhangguanheng66 Apr 21, 2020
2 changes: 1 addition & 1 deletion .flake8
@@ -1,4 +1,4 @@
 [flake8]
-ignore = E402,E722,W503,W504
+ignore = E402,E722,W503,W504,F821
 max-line-length = 120
 exclude = docs/source
77 changes: 76 additions & 1 deletion docs/source/experimental_datasets.rst
@@ -35,7 +35,82 @@ IMDb

.. autoclass:: IMDB
    :members: __init__



Text Classification
^^^^^^^^^^^^^^^^^^^

TextClassificationDataset
~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: TextClassificationDataset
    :members: __init__

AG_NEWS
~~~~~~~

The AG_NEWS dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: AG_NEWS
    :members: __init__

SogouNews
~~~~~~~~~

The SogouNews dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: SogouNews
    :members: __init__

DBpedia
~~~~~~~

The DBpedia dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: DBpedia
    :members: __init__

YelpReviewPolarity
~~~~~~~~~~~~~~~~~~

The YelpReviewPolarity dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: YelpReviewPolarity
    :members: __init__

YelpReviewFull
~~~~~~~~~~~~~~

The YelpReviewFull dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: YelpReviewFull
    :members: __init__

YahooAnswers
~~~~~~~~~~~~

The YahooAnswers dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: YahooAnswers
    :members: __init__

AmazonReviewPolarity
~~~~~~~~~~~~~~~~~~~~

The AmazonReviewPolarity dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: AmazonReviewPolarity
    :members: __init__

AmazonReviewFull
~~~~~~~~~~~~~~~~

The AmazonReviewFull dataset is a subclass of the ``TextClassificationDataset`` class.

.. autoclass:: AmazonReviewFull
    :members: __init__


Language Modeling
^^^^^^^^^^^^^^^^^

Empty file modified examples/vocab/vocab.py
100644 → 100755
Empty file.
1 change: 1 addition & 0 deletions torchtext/data/utils.py
@@ -7,6 +7,7 @@


 def _split_tokenizer(x):
+    # type: (str) -> List[str]
     return x.split()
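
The added line is a PEP 484 type comment, a signature annotation written as a comment rather than inline. TorchScript can read this style of annotation, which fits the commit above titled "minor fix to have split tokenizer scriptable". Below is a minimal standalone sketch of the same function. The `typing` import, the function name without the leading underscore, and the sample call are assumptions added for illustration and are not part of this diff. Because `List` appears only inside a comment, some pyflakes versions report it as an undefined name (F821) when it is not imported, which may be why F821 was added to the `.flake8` ignore list here.

```python
from typing import List  # assumption: imported so the name in the type comment resolves


def split_tokenizer(x):
    # type: (str) -> List[str]
    # Split on any run of whitespace; leading and trailing whitespace is ignored.
    return x.split()


tokens = split_tokenizer("  new torchtext   dataset abstraction ")
print(tokens)
```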


14 changes: 12 additions & 2 deletions torchtext/experimental/datasets/__init__.py
@@ -1,8 +1,18 @@
 from .language_modeling import LanguageModelingDataset, WikiText2, WikiText103, PennTreebank  # NOQA
-from .text_classification import IMDB
+from .text_classification import AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, \
+    YelpReviewFull, YahooAnswers, \
+    AmazonReviewPolarity, AmazonReviewFull, IMDB

 __all__ = ['LanguageModelingDataset',
            'WikiText2',
            'WikiText103',
            'PennTreebank',
-           'IMDB']
+           'IMDB',
+           'AG_NEWS',
+           'SogouNews',
+           'DBpedia',
+           'YelpReviewPolarity',
+           'YelpReviewFull',
+           'YahooAnswers',
+           'AmazonReviewPolarity',
+           'AmazonReviewFull']
13 changes: 13 additions & 0 deletions torchtext/experimental/datasets/raw/__init__.py
@@ -0,0 +1,13 @@
+from .text_classification import AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, \
+    YelpReviewFull, YahooAnswers, \
+    AmazonReviewPolarity, AmazonReviewFull, IMDB
+
+__all__ = ['IMDB',
+           'AG_NEWS',
+           'SogouNews',
+           'DBpedia',
+           'YelpReviewPolarity',
+           'YelpReviewFull',
+           'YahooAnswers',
+           'AmazonReviewPolarity',
+           'AmazonReviewFull']
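
The commit messages describe a design where a raw, iterable-style dataset yields unprocessed examples and a chain of transforms (tokenizer, n-grams, vocab lookup) turns them into model inputs. The following is a rough pure-Python sketch of that idea, not the torchtext implementation. Every name in it (`sequential_transforms`, `ngrams`, `raw_examples`, `build_vocab`) and the toy data are hypothetical stand-ins chosen for illustration, and the real signatures in this PR may differ.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def sequential_transforms(*transforms: Callable) -> Callable:
    """Compose transforms left to right into one callable (hypothetical helper)."""
    def apply(value):
        for transform in transforms:
            value = transform(value)
        return value
    return apply


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the tokens followed by all 2..n-grams, each joined by a space."""
    out = list(tokens)
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def raw_examples() -> Iterator[Tuple[int, str]]:
    """Iterable-style raw dataset: yields (label, text) pairs one at a time (toy data)."""
    yield 0, "stocks fall as oil prices rise"
    yield 1, "team wins the final match"


def build_vocab(token_lists: Iterable[List[str]], unk: str = "<unk>") -> Dict[str, int]:
    """Map tokens to indices, most frequent first; index 0 is reserved for unk."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {unk: 0}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab


# Tokenize on whitespace, then add bigrams.
tokenize = sequential_transforms(str.split, lambda toks: ngrams(toks, 2))

# First pass over the raw data builds the vocab; second pass numericalizes.
vocab = build_vocab(tokenize(text) for _, text in raw_examples())
to_ids = sequential_transforms(
    tokenize,
    lambda toks: [vocab.get(t, vocab["<unk>"]) for t in toks],
)
processed = [(label, to_ids(text)) for label, text in raw_examples()]
```

Keeping the raw dataset separate from the transforms means the same raw iterator can be reused with a different tokenizer or vocab without touching the download and parsing code.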