
Add Sentencepiece binding to torchtext, plus example to build torchtext dataset with sentencepiece #597

Merged (71 commits) on Sep 26, 2019

Conversation

@zhangguanheng66 (Contributor) commented Sep 9, 2019

Add sentencepiece binding to torchtext.
Show an example that builds a torchtext dataset with sentencepiece, and reproduce fastText's results in the example with the sentencepiece binding.
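The workflow the example follows can be sketched roughly as below, assuming the standard sentencepiece Python API; the helper names, file names, and defaults are illustrative, not the PR's exact code:

```python
def sp_training_args(corpus_path, model_prefix, vocab_size, model_type='unigram'):
    # Build the flag string consumed by spm.SentencePieceTrainer.Train
    return (f'--input={corpus_path} --model_prefix={model_prefix} '
            f'--vocab_size={vocab_size} --model_type={model_type}')


def train_and_load(corpus_path, model_prefix='spm_user', vocab_size=20000):
    # Hypothetical wrapper: Train writes {model_prefix}.model and
    # {model_prefix}.vocab to disk and returns nothing, so the caller
    # reloads the model from the written file.
    import sentencepiece as spm  # deferred so the sketch imports without it
    spm.SentencePieceTrainer.Train(
        sp_training_args(corpus_path, model_prefix, vocab_size))
    sp = spm.SentencePieceProcessor()
    sp.Load(model_prefix + '.model')
    return sp
```

The loaded processor's `EncodeAsIds` / `EncodeAsPieces` methods then serve as the tokenizer when building the dataset.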

@zhangguanheng66 zhangguanheng66 changed the title Example to build torchtext dataset with sentencepiece [WIP] Example to build torchtext dataset with sentencepiece Sep 9, 2019
@zhangguanheng66 zhangguanheng66 changed the title [WIP] Example to build torchtext dataset with sentencepiece [WIP] Add Sentencepiece binding to torchtext, plus example to build torchtext dataset with sentencepiece Sep 12, 2019
@vincentqb (Contributor) left a comment:

I'm noticing the dataset class does not use an iterator. Is this meant to represent a consensus we agreed on?

>>> sp_user.load('spm_user.model')
"""

import sentencepiece as spm
Contributor:

nit: should this be at the beginning?

@@ -0,0 +1,13 @@
# This is an example to create a dataset with SentencePiece binding.
Contributor:

nit: title should probably just be "# Create a dataset with SentencePiece binding" since the text already refers to the content as an example, and it's in the example folder :)


class TestUtils(TorchtextTestCase):
def test_sentencepiece_encode_as_ids(self):
from torchtext.data.transforms import sentencepiece_encode_as_ids
Contributor:

nit: same about import

ref_results)

def test_sentencepiece_encode_as_pieces(self):
import sys
Contributor:

nit: same about imports

>>> list(sp_tokens_generator(list_a))
[['_sentence', 'piece', '_en', 'co', 'de', '_as', '_pieces'],
['_example', 's', '_to', '_try', '!']]

Contributor:

nit: extra space?

>>> list(sp_id_generator(list_a))
[[9858, 9249, 1629, 1305, 1809, 53, 842],
[2347, 13, 9, 150, 37]]

Contributor:

nit: extra space?

torchtext/data/transforms.py (thread outdated, resolved)
@zhangguanheng66 (Contributor, Author):

> I'm noticing the dataset class does not use an iterator. Is this meant to represent a consensus we agreed on?

Not at the dataset level. At the current stage, I would like to implement a few transforms (two in this case) as generators and apply them as tokenizers. As we have more use cases, we can sit together and make a final decision across all the domains. :)

@@ -137,6 +137,10 @@ def test(data_):
help='device (default=cpu)')
parser.add_argument('--data', default='.data',
help='data directory (default=.data)')
parser.add_argument('--use-sp-tokenizer', type=bool, default=False,
help='use sentencepiece tokenizer (default=False)')
parser.add_argument('--vocab-size', type=int, default=20000,
Contributor:

I'd change this to say "--sp-vocab-size" just so it's less confusing

model_prefix,
model_type)
spm.SentencePieceTrainer.train(spm_training_string)
return None
Contributor:

Why return None?

Contributor:

Ah, because the training also involves writing to disk. Is it possible to just return the object and have the user save it herself (if necessary)? That way we don't add any constraints on where this is supposed to be saved, or whether it's supposed to be saved at all.

@zhangguanheng66 (Contributor, Author) commented Sep 26, 2019:

I have checked the APIs in sentencepiece, and it seems they don't have the option to return an object.

"""

sp_model = spm.SentencePieceProcessor()
sp_model.Load(spm_path)
Contributor:

Instead of loading the model here, can this just accept a sentence piece model and nothing else? That way we don't add any constraints on where the model is supposed to be loaded from. It could however be good to have a "load_sentencepiece_model_from_path" or something that wraps the first two lines (just because static constructors are more idiomatic than a method that overwrites an entire state).
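A minimal sketch of this suggestion, assuming the standard sentencepiece Python API; `load_sp_model` and the transform signature below are hypothetical names illustrating the split, not the merged interface:

```python
def load_sp_model(spm_path):
    """Static-constructor-style loader: path in, loaded model out."""
    import sentencepiece as spm  # deferred so the sketch imports without it
    sp_model = spm.SentencePieceProcessor()
    sp_model.Load(spm_path)
    return sp_model


def sentencepiece_encode_as_ids(sp_model):
    """Build a tokenizer from an already-loaded model, so the caller
    decides where (and whether) the model is loaded from disk.

    sp_model is any object exposing EncodeAsIds, e.g. a loaded
    sentencepiece.SentencePieceProcessor.
    """
    def _internal_func(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsIds(line)
    return _internal_func
```

Usage would then compose the two: `sp_id_generator = sentencepiece_encode_as_ids(load_sp_model('spm_user.model'))`.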

"""

sp_model = spm.SentencePieceProcessor()
sp_model.Load(spm_path)
Contributor:

Same thing as above: the "load...from_path" function can be used to construct the model passed into this. Also, if you can load it independently, you can run this multiple times.

Further, this doesn't need to return a function that then yields; it can yield itself.
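The "yield itself" variant the reviewer describes could look like this (an illustrative sketch, not the merged code):

```python
def sentencepiece_ids(sp_model, txt_iter):
    """A plain generator: no inner function, just yield per sentence.

    sp_model is any object exposing EncodeAsIds, e.g. a loaded
    sentencepiece.SentencePieceProcessor.
    """
    for line in txt_iter:
        yield sp_model.EncodeAsIds(line)
```

The trade-off is that the closure form matches the tokenizer interface (call once with the model, reuse on many text iterators), while this form is simpler when the model is already in scope.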


Arguments:
spm_path: the file path saving the sentencepiece model.
txt_iter: input sentence text generator.
Contributor:

This doesn't accept a "txt_iter" argument.

return _internal_func


def sentencepiece_encode_as_pieces(spm_path):
Contributor:

If "pieces" means "tokens" in our language, I suggest we use "tokens" instead of "pieces" here, because we need to standardize on one term within our project even if other projects use a different one.
