Added new SST2 dataset class #1410

Merged 15 commits on Oct 18, 2021. Changes from 10 commits are shown below.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -497,7 +497,6 @@ jobs:
             - v1-windows-dataset-vector-{{ checksum ".cachekey" }}
             - v1-windows-dataset-{{ checksum ".cachekey" }}
 
-
       - run:
           name: Run tests
           # Downloading embedding vector takes long time.
@@ -545,6 +544,7 @@ jobs:
           command: |
             set -x
             conda install -y make python=${PYTHON_VERSION}
+            pip install git+https://github.com/pytorch/data.git@7772406#egg=torchdata
             pip install $(ls ~/workspace/torchtext*.whl) --pre -f "https://download.pytorch.org/whl/${UPLOAD_CHANNEL}/cpu/torch_${UPLOAD_CHANNEL}.html"
       - run:
           name: Build docs
2 changes: 1 addition & 1 deletion .circleci/config.yml.in
@@ -497,7 +497,6 @@ jobs:
             - v1-windows-dataset-vector-{{ checksum ".cachekey" }}
             - v1-windows-dataset-{{ checksum ".cachekey" }}
       {% endraw %}
-
       - run:
           name: Run tests
           # Downloading embedding vector takes long time.
@@ -545,6 +544,7 @@ jobs:
           command: |
             set -x
             conda install -y make python=${PYTHON_VERSION}
+            pip install git+https://github.com/pytorch/data.git@7772406#egg=torchdata
             pip install $(ls ~/workspace/torchtext*.whl) --pre -f "https://download.pytorch.org/whl/${UPLOAD_CHANNEL}/cpu/torch_${UPLOAD_CHANNEL}.html"
       - run:
           name: Build docs
3 changes: 3 additions & 0 deletions requirements.txt
@@ -4,6 +4,9 @@ tqdm
 # Downloading data and other files
 requests
 
+# Torchdata
+git+https://github.com/pytorch/data.git
+
 # Optional NLP tools
 nltk
 spacy
6 changes: 4 additions & 2 deletions setup.py
@@ -86,9 +86,11 @@ def run(self):
     description='Text utilities and datasets for PyTorch',
     long_description=read('README.rst'),
     license='BSD',
-
+    dependency_links=[
+        "git+https://github.com/pytorch/data.git@7772406#egg=torchdata-0.1.0a0+7772406",
+    ],
     install_requires=[
-        'tqdm', 'requests', pytorch_package_dep, 'numpy'
+        'tqdm', 'requests', pytorch_package_dep, 'numpy', 'torchdata==0.1.0a0+7772406'
     ],
     python_requires='>=3.5',
     classifiers=[
32 changes: 32 additions & 0 deletions test/experimental/test_datasets.py
@@ -0,0 +1,32 @@
import hashlib
import json

from torchtext.experimental.datasets import sst2

from ..common.torchtext_test_case import TorchtextTestCase


class TestDataset(TorchtextTestCase):
    def test_sst2_dataset(self):
        split = ("train", "dev", "test")
        train_dp, dev_dp, test_dp = sst2.SST2(split=split)

        # verify hashes of first line in dataset
        self.assertEqual(
            hashlib.md5(
                json.dumps(next(iter(train_dp)), sort_keys=True).encode("utf-8")
            ).hexdigest(),
            sst2._FIRST_LINE_MD5["train"],
        )
        self.assertEqual(
            hashlib.md5(
                json.dumps(next(iter(dev_dp)), sort_keys=True).encode("utf-8")
            ).hexdigest(),
            sst2._FIRST_LINE_MD5["dev"],
        )
        self.assertEqual(
            hashlib.md5(
                json.dumps(next(iter(test_dp)), sort_keys=True).encode("utf-8")
            ).hexdigest(),
            sst2._FIRST_LINE_MD5["test"],
        )
3 changes: 2 additions & 1 deletion torchtext/experimental/datasets/__init__.py
@@ -1,3 +1,4 @@
 from . import raw
+from . import sst2
 
-__all__ = ['raw']
+__all__ = ["raw", "sst2"]
88 changes: 88 additions & 0 deletions torchtext/experimental/datasets/sst2.py
@@ -0,0 +1,88 @@
# Copyright (c) Facebook, Inc. and its affiliates.
import logging
import os


try:
    from torchdata.datapipes.iter import (
        HttpReader,
        IterableWrapper,
    )
except ImportError:
    logging.error(

Contributor:
This should be a warning (or info). Otherwise, users who do not intend to use this dataset would be puzzled.

Also, to use the logging module, define a logger object per module:

    _LG = logging.getLogger(__name__)
    _LG.info("`torchdata` is not installed. To use SST2Dataset, please install `torchdata`. https://github.com/pytorch/data")

Contributor Author:
Updated this to use a per-module logger and log a warning message.
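
A minimal sketch of the pattern the reviewer is suggesting (the message wording here is illustrative, not the PR's final text):

    import logging

    # one logger per module, named after the module itself
    _LG = logging.getLogger(__name__)

    try:
        from torchdata.datapipes.iter import HttpReader, IterableWrapper
    except ImportError:
        # a warning rather than an error, so users who never touch this
        # dataset are not alarmed by an import-time failure message
        _LG.warning(
            "Package `torchdata` is not installed. To use the SST2 dataset, "
            "please refer to https://github.com/pytorch/data for installation "
            "instructions."
        )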

"Package `torchdata` is required to be installed to use this dataset."
"Please use `pip install git+https://github.com/pytorch/data.git'"

Contributor:
Oftentimes, the proper way to install a package varies depending on the environment, so it's better to either avoid telling users to run exactly this `pip install` command, or to say that it is just a best guess.

Contributor:
I think it's fine, as this is just a temporary solution. The wheel will be uploaded to PyPI and conda shortly, when the package is released.

Contributor Author:
I've modified the message to tell the user to refer to the instructions in https://github.com/pytorch/data for how to install the package.

"to install the package."
)

from torchtext.data.datasets_utils import (
    _add_docstring_header,
    _create_dataset_directory,
    _wrap_split_argument,
)


NUM_LINES = {
    "train": 67349,
    "dev": 872,
    "test": 1821,
}

MD5 = "9f81648d4199384278b86e315dac217c"
URL = "https://dl.fbaipublicfiles.com/glue/data/SST-2.zip"

_EXTRACTED_FILES = {
    "train": f"{os.sep}".join(["SST-2", "train.tsv"]),
    "dev": f"{os.sep}".join(["SST-2", "dev.tsv"]),
    "test": f"{os.sep}".join(["SST-2", "test.tsv"]),
}

_EXTRACTED_FILES_MD5 = {
    "train": "da409a0a939379ed32a470bc0f7fe99a",
    "dev": "268856b487b2a31a28c0a93daaff7288",
    "test": "3230e4efec76488b87877a56ae49675a",
}

_FIRST_LINE_MD5 = {
    "train": "2552b8cecd57b2e022ef23411c688fa8",
    "dev": "1b0ffd6aa5f2bf0fd9840a5f6f1a9f07",
    "test": "f838c81fe40bfcd7e42e9ffc4dd004f7",
}

DATASET_NAME = "SST2"


@_add_docstring_header(num_lines=NUM_LINES, num_classes=2)
@_create_dataset_directory(dataset_name=DATASET_NAME)
@_wrap_split_argument(("train", "dev", "test"))
def SST2(root, split):
    return SST2Dataset(root, split).get_datapipe()
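
# Note: a best-effort reading of the decorator stack above, inferred from the
# helper names in torchtext.data.datasets_utils (their behavior is not spelled
# out in this PR):
#   - _add_docstring_header prepends a standard docstring header reporting
#     num_lines and num_classes for the dataset.
#   - _create_dataset_directory resolves `root` and creates the dataset
#     directory before the wrapped function runs.
#   - _wrap_split_argument normalizes `split`, accepting either a single split
#     name or a tuple of names drawn from ("train", "dev", "test").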


class SST2Dataset:
    """The SST2 dataset uses torchdata datapipes end-to-end.
    To avoid a download at every epoch, we cache the data on disk.
    We do a sanity check on the downloaded and extracted data.
    """

    def __init__(self, root, split):
        self.root = root
        self.split = split

    def get_datapipe(self):
        # cache the data on disk
        cache_dp = IterableWrapper([URL]).on_disk_cache(
            HttpReader,
            op_map=lambda x: (x[0], x[1].read()),
            filepath_fn=lambda x: os.path.join(self.root, os.path.basename(x)),
        )

Comment on lines +76 to +80 (the on_disk_cache call above):

Contributor:
I am adding a PR to improve the interface of on_disk_cache. I will let you know when it's landed.

Contributor Author:
Sure, that sounds great!


        # extract data from the zip archive
        extracted_files = cache_dp.read_from_zip()

        # parse the CSV file and yield data samples
        return extracted_files.filter(
            lambda x: self.split in x[0]
        ).parse_csv(skip_lines=1, delimiter="\t").map(
            lambda x: (x[0], x[1])
        )
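
For reference, a minimal usage sketch of the new dataset (assuming `torchdata` is installed; the split handling follows the decorator convention noted above, and note that the GLUE test split is unlabeled, so its rows differ from train/dev):

    from torchtext.experimental.datasets.sst2 import SST2

    # request all three splits; each is an iterable datapipe over TSV rows
    train_dp, dev_dp, test_dp = SST2(split=("train", "dev", "test"))

    # peek at the first training sample, a (sentence, label) row
    print(next(iter(train_dp)))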