This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Language Modeling Datasets and Sampler #9514

Merged
merged 8 commits into from
Jan 30, 2018

Conversation

Member

@szha szha commented Jan 22, 2018

Description

Add the language modeling datasets wikitext-2 and wikitext-103. Add an interval sampler suitable for batched language model training. Update the word-language-model example.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add wikitext-2, wikitext-103, and tests.
  • Add Interval Sampler, and test.
  • Update example to use new dataset.

Parameters
----------
length : int
    Length of the sequence.
Member

interval : int
    Sampling interval.

Member Author

Added docstring.

"""
def __init__(self, length, interval):
self._length = length
self._interval = interval
Member

add range check?
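A minimal sketch of the suggested range check, folded into the `__init__` quoted above (hypothetical; not the code in this PR):

```python
class IntervalSampler:
    """Samples elements at fixed intervals, rolling over at the end."""

    def __init__(self, length, interval):
        # Suggested range check: the interval must be positive and
        # cannot exceed the sequence length.
        if not 0 < interval <= length:
            raise ValueError(
                "interval must be in (0, length], got interval=%d "
                "for length=%d" % (interval, length))
        self._length = length
        self._interval = interval
```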

    Path to temp folder for storing data.
segment : str, default 'train'
    Dataset segment. Options are 'train', 'validation', 'test'.
indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`, default None
Contributor

I wouldn't expose this to users.

  1. Indexer is not the standard term for this.
  2. This is a contrib API and subject to change. Gluon Dataset should use a separate vocabulary API.

Member Author

Thanks.

  1. I think it's safe to say that indexer is a clear enough term to reflect what it does.
  2. If I understand correctly, I believe the indexer class is intended to serve the same purpose as what you call 'vocabulary'.

There is a reason this needs to be exposed. Suppose we have a training dataset whose vocabulary is {a, b, c} plus unknown tokens, and a test dataset whose vocabulary is {a, b, d}. As standard practice, the token 'd' in the test dataset should be indexed as unknown. This means the indexing of the test dataset depends on the index from the training dataset.
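A toy illustration of the point above (hypothetical token data, not code from the PR): the index is built from the training set only, so the test-only token 'd' falls back to the unknown index.

```python
from collections import Counter

# Hypothetical toy data; 'd' appears only in the test set, so it must
# map to the unknown index.
train_tokens = ['a', 'a', 'a', 'b', 'b', 'c']
test_tokens = ['a', 'b', 'd']

# Build the index from the training data only; reserve 0 for '<unk>'.
vocab = {'<unk>': 0}
for token, _ in Counter(train_tokens).most_common():
    vocab[token] = len(vocab)

# Index the test set with the training vocabulary: 'd' becomes '<unk>'.
indexed = [vocab.get(token, vocab['<unk>']) for token in test_tokens]
print(indexed)  # [1, 2, 0] -- 'd' maps to the unknown index 0
```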

Contributor

  1. I don't think so. Do you have a reference for it being used somewhere?
  2. It is, but it is a contrib API. If you want to use it directly, then gluon.data.text needs to be in gluon.contrib too.

We do need to expose something like this. But it can't be TokenIndexer.

Member Author

Do you have something else in mind?

Also, what should I provide here in place of TokenIndexer? If you could help me understand the reasoning behind "it can't be TokenIndexer", I can help propose alternatives too.

Member Author

I think it's probably a safer bet to move the dataset to contrib first.

License: Creative Commons Attribution-ShareAlike

Each sample is a vector of length equal to the specified sequence length.
At the end of each sentence, an end-of-sentence token '<eos>' is added.
Contributor

If seq_len doesn't respect the sentence boundary, why should it end with eos?

Member Author

Even though the sentence boundary is not considered when providing sample chunks, it's still necessary for a language model to be able to predict where a sentence ends. In that sense, these concepts are orthogonal.
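A toy sketch of how the two concepts interact (hypothetical data, not code from the PR): each sentence is terminated with '<eos>', and fixed-length chunks are then cut without regard to those boundaries.

```python
sentences = [['the', 'cat', 'sat'], ['dogs', 'bark'],
             ['birds', 'fly', 'south']]

# Terminate each sentence with '<eos>', then flatten into one stream.
tokens = []
for sentence in sentences:
    tokens.extend(sentence + ['<eos>'])

# Cut fixed-length samples; a chunk may cross a sentence boundary, but
# the model can still learn to predict '<eos>' wherever it occurs.
seq_len = 4
samples = [tokens[i:i + seq_len]
           for i in range(0, len(tokens) - seq_len + 1, seq_len)]
print(samples)
# The second chunk straddles a sentence boundary yet contains '<eos>'.
```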

    The indexer to use for indexing the text dataset. If None, a default
    indexer is created.
seq_len : int, default 35
    The sequence length of each sample, regardless of the sentence boundary.
transform : function, default None
Contributor

Dataset now has a transform API. Use that instead of adding a transform callback to every dataset.

Member Author

Sure, I can take a look at that. What about vision? Are the existing transform options dropped? Where can I find relevant discussion?

--------
>>> sampler = gluon.data.IntervalSampler(13, interval=3)
>>> list(sampler)
[0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
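One possible implementation consistent with the doctest above (a sketch, not necessarily the PR's exact code): iterate over each starting offset in turn, rolling over so every index is visited exactly once.

```python
class IntervalSampler:
    """Samples elements at a fixed interval, rolling over at the end
    so that every index in [0, length) is yielded exactly once."""

    def __init__(self, length, interval):
        self._length = length
        self._interval = interval

    def __iter__(self):
        # Walk the sequence at the given stride, starting from each
        # offset 0, 1, ..., interval-1 in turn (the "roll-over").
        for start in range(self._interval):
            for i in range(start, self._length, self._interval):
                yield i

    def __len__(self):
        return self._length

print(list(IntervalSampler(13, interval=3)))
# [0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
```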
Contributor

why should it roll over at the end?

Member Author

Is there a reason that you think it shouldn't? I think this sampler should exhaust every sample in a dataset. If for some reason it needs to drop some samples, for the purpose of mini-batching for example, then a wrapper sampler should take care of that.

Contributor

@piiswrong piiswrong Jan 23, 2018

The name interval sampler suggests it should behave like [begin:end:step]

Member Author

I see the confusion now. Should I add an option to specify whether to roll over?

Contributor

This doesn't seem very generic anyway. I would put it in examples

Member Author

@szha szha Jan 23, 2018

@piiswrong This sampler is needed for any long-form text processing that requires passing hidden state from sample to sample. I'd expect repeated use for this, which is why I chose to put it here. Do you prefer to update its interface for handling roll-over, to move this to contrib, or do you still prefer that it be dropped?

Member Author

@piiswrong ping

@piiswrong
Contributor

This Dataset.indexer design doesn't work for the use case where you want to combine (or take the intersection of) the vocab of two datasets (like train and val).

@szha
Member Author

szha commented Jan 22, 2018

Indeed, I wasn't considering such a case because it isn't good practice to index using anything other than the training set. That said, providing an option to update the input indexer should be sufficient to cover this case. Would that be OK?

@piiswrong
Contributor

Then you would have a problem when you want only the top 2000 tokens.

Since this is not a very common use case, I think the current version is fine for contrib

@zhreshold
Member

I am more concerned with the name Indexer and the related behavior (manually extracting the Indexer and synchronizing it between the train/valid datasets). The others LGTM now.

@szha
Member Author

szha commented Jan 23, 2018

@zhreshold thanks. @astonzhang shall we consider a different name for indexer, like the aforementioned "vocabulary"?

@astonzhang
Member

I would like to propose the following change for class names:
TokenIndexer -> Vocabulary
Glossary -> VocabularyEmbedding
TokenEmbedding -> PretrainedEmbedding

Otherwise, having both Vocabulary and Glossary is likely confusing. Having both VocabularyEmbedding and TokenEmbedding is also likely confusing.

Is the proposed change OK?

@szha
Member Author

szha commented Jan 23, 2018

VocabularyEmbedding and PretrainedEmbedding don't sound like they would inherit from each other, and the concerns are unclear just based on the names. I probably won't remember which is which after a couple of weeks. Let's consider other names for those two.

@astonzhang
Member

astonzhang commented Jan 23, 2018

How about:
TokenIndexer -> Vocabulary
Glossary -> VocabularyEmbedding
TokenEmbedding -> Embedding
OR
TokenEmbedding (no change)

@szha
Member Author

szha commented Jan 23, 2018

CompositeEmbedding sounds more like what Glossary does.

@szha
Member Author

szha commented Jan 25, 2018

I updated this PR based on the latest change in the text API naming. Also, I made the vocabulary a property of the dataset for exchanging the index. Feel free to comment; I'd like to get this merged once the 1.1 release is cut.

@szha
Member Author

szha commented Jan 26, 2018

To address the concern of merging datasets based on frequencies, I made the frequencies (word counts) a property of the dataset too. This way, the user has control over how the vocabulary is made.

Currently the tokenization is naive, and the next step should be to add a proper tokenizer class. Once that's available, the datasets should expose an option for specifying tokenizers.
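A sketch of what exposing frequencies enables (hypothetical counts; the plain dicts below stand in for the dataset's frequencies property): the user merges counts across datasets and truncates to the most frequent tokens, covering the top-2000-tokens case raised earlier.

```python
from collections import Counter

# Hypothetical per-dataset word counts, standing in for the
# frequencies property added in this PR.
train_counts = Counter({'the': 50, 'cat': 3, 'sat': 2, 'rare': 1})
valid_counts = Counter({'the': 10, 'dog': 4})

# Merge frequencies across datasets, then keep only the top-k tokens;
# everything else will fall back to the unknown index.
merged = train_counts + valid_counts
top_k = 3
vocab = {'<unk>': 0}
for token, _ in merged.most_common(top_k):
    vocab[token] = len(vocab)
print(vocab)
```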

@szha
Member Author

szha commented Jan 28, 2018

@zhreshold @piiswrong pinging for another pass of review.

Member

@zhreshold zhreshold left a comment

LGTM now

@szha
Member Author

szha commented Jan 29, 2018

Thanks. I will wait another day before merging, in case @piiswrong has additional feedback.

@szha
Member Author

szha commented Jan 30, 2018

Connected offline with @piiswrong; the current design is OK to check in to the contrib package.

@szha szha merged commit 8bdc806 into apache:master Jan 30, 2018
larroy pushed a commit to larroy/mxnet that referenced this pull request Jan 31, 2018
* refactor dataset

* add interval sampler

* wikitext-2/-103

* update word language model

* address comments

* move interval sampler to contrib

* update

* add frequencies property
@szha szha deleted the lm_dataset branch April 26, 2018 18:20
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018