Trainer improvements #519
Conversation
Force-pushed from 73415a2 to a1f490b
I really like the inversion of control!
Mostly seems fine, but we should probably have a discussion about the available syntaxes for training if we're changing it.
Options:
1. `tokenizer.train(trainer, files)` (current)
2. `tokenizer.train(files, trainer=trainer)` / `tokenizer.train(files)` (proposed)
3. `tokenizer.train(files, vocab_size=X, ...)`
4. `trainer.train(tokenizer, files)`

I like 2. because it can drop the concept of trainer completely in the default case, but I think it makes the case where you need it a bit awkward. 3. fixes that, but then knowing the exact list of options is going to be tricky (as it depends on `tokenizer.model` now). Option 4. does not remove the concept of trainer, and the inversion of flow makes it closer to the Rust version, but it does feel as simple as it could be.
Right now, I'm leaning towards 4., which would get called automatically by a syntax like 2. in the default case, and we would drop the non-default case. So `tokenizer.train(files)` works and is equivalent to `trainer = tokenizer.default_trainer(); trainer.train(tokenizer, files)`.
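For comparison, here is a rough sketch of how option 4 could underpin option 2's default syntax. The trait shapes and the `default_trainer` name are purely illustrative, not the actual `tk` API:

```rust
// Hypothetical shapes only, to illustrate the flow being discussed;
// these are not the actual `tk` traits or method names.
trait Trainer {
    fn train(&self, model: &mut dyn Model, files: &[String]);
}

trait Model {
    // Option 4 keeps the trainer concept, but each model knows how to build
    // a sensible default trainer for itself.
    fn default_trainer(&self) -> Box<dyn Trainer>;
}

struct Tokenizer {
    model: Box<dyn Model>,
}

impl Tokenizer {
    // Option 2's default syntax (`tokenizer.train(files)`) then simply
    // delegates to the model's default trainer, i.e. option 4 under the hood:
    // `trainer = tokenizer.default_trainer(); trainer.train(tokenizer, files)`.
    fn train(&mut self, files: &[String]) {
        let trainer = self.model.default_trainer();
        trainer.train(&mut *self.model, files);
    }
}
```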
```diff
 use tempfile::NamedTempFile;
 use tk::normalizers::{Lowercase, NFKC};

 #[test]
 fn serialize() {
-    let mut tokenizer = Tokenizer::new(PyModel::new(Arc::new(
+    let mut tokenizer = Tokenizer::new(PyModel::new(Arc::new(RwLock::new(
```
Could you explain why all those `RwLock` are needed? I'm not sure why we would need them.
Sure, we need them because anything in an `Arc` is immutable. These `RwLock` provide us with a way to actually mutate the `Model` here.
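For context, a minimal standalone example (not taken from the bindings code) of why the lock is needed: the contents of a bare `Arc` are read-only, while `Arc<RwLock<...>>` keeps the shared ownership but allows mutation through a write guard.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug)]
struct Model {
    vocab_size: usize,
}

fn main() {
    // With a bare Arc, the shared value is effectively read-only:
    // `Arc::get_mut` returns None as soon as there is more than one owner.
    let shared = Arc::new(Model { vocab_size: 0 });
    let _other_owner = Arc::clone(&shared);
    // shared.vocab_size = 30_000; // does not compile: cannot mutate through an Arc

    // Arc<RwLock<_>> keeps the shared ownership but restores mutability
    // through a write guard, which is what mutating the model requires.
    let shared = Arc::new(RwLock::new(Model { vocab_size: 0 }));
    let other_owner = Arc::clone(&shared);
    other_owner.write().unwrap().vocab_size = 30_000;
    println!("{:?}", shared.read().unwrap()); // the update is visible to every owner
}
```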
```diff
@@ -115,4 +115,4 @@ def train(
         )
         if isinstance(files, str):
             files = [files]
-        self._tokenizer.train(trainer, files)
+        self._tokenizer.train(files, trainer=trainer)
```
As long as we're breaking the signature, I would argue we should have a different signature like `train(files, options=bpe_train_options)`, or `train(files, vocab_size=X, ...)`. What do you think?
I like the second version better; the only trouble is that the exact description of those options is going to get fuzzy pretty fast, and error handling a bit hard. But it "feels" more Pythonic. What do you think?
Either that, or if we keep the trainer concept, we should stick to something closer to Rust, with `trainer.train(tokenizer, files)`.
I actually like that last version better at this moment; the control flow feels more natural.
```diff
     fn tokenize(&self, tokens: &str) -> tk::Result<Vec<Token>> {
-        self.model.tokenize(tokens)
+        self.model.read().unwrap().tokenize(tokens)
```
I'm always a bit scared by adding that many `unwrap` everywhere... Do you think there's a way we could avoid them?
The `unwrap` here is for the `std::sync::LockResult` that is returned by the `RwLock` when you try to access its content. The `Err` case happens when a thread that was holding the lock panicked. Since this shouldn't happen, and we don't want to recover from it, I think the `unwrap` should be ok.
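As a small standalone illustration of that point: `read()`/`write()` return a `std::sync::LockResult`, and its `Err` variant (`PoisonError`) only appears once a thread has panicked while holding the lock.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let lock = Arc::new(RwLock::new(vec!["hello", "world"]));

    // Normal case: read() returns Ok(guard), so unwrap just hands us the guard.
    let tokens = lock.read().unwrap();
    println!("{} tokens", tokens.len());
    drop(tokens);

    // Poisoned case: a thread panics while holding the write lock...
    let writer = Arc::clone(&lock);
    let _ = thread::spawn(move || {
        let _guard = writer.write().unwrap();
        panic!("something went wrong while holding the lock");
    })
    .join();

    // ...and every subsequent access returns Err(PoisonError). This is the
    // only situation in which the `unwrap` above would itself panic.
    match lock.read() {
        Ok(_) => println!("lock is healthy"),
        Err(poisoned) => println!("lock was poisoned: {}", poisoned),
    }
}
```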
Sure! Happy to discuss the possible options. With this PR:

```python
# Default training, for example when you just want to train a pre-trained tokenizer on your own dataset
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # Not available as we speak, but soon
tokenizer.train(files=[ ... ])

# Same case but with some custom parameters
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
trainer = tokenizer.model.get_trainer()
print(trainer)  # inspect the current training params
trainer.vocab_size = 32000
trainer.special_tokens = [ "[UNK]", "[PAD]" ]
tokenizer.train(files=[ ... ], trainer=trainer)

# Or building your own
trainer = BpeTrainer(...)
tokenizer.train(files=[ ... ], trainer=trainer)
```

I think we should avoid […]. I'm not entirely sure about what you have in mind with […].
Hi, I'm using the `WordLevelTrainer` in this branch. I'm able to import the package and initialize a trainer. However, when I tried to use it in […] I got an error. Wondering if there is any way to fix this? Thanks!
In this PR we make the trainer optional when calling `train`:

```python
trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)
```
I think this shows there's a problem in the naming of stuff. This example is using […]

```python
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
train_options = tokenizer.model.get_train_options()
print(train_options)  # it's actually a raw Python dict
train_options["vocab_size"] = 32000
train_options["special_tokens"] = [ "[UNK]", "[PAD]" ]
tokenizer.train(files=[ ... ], train_options=train_options)
# or
tokenizer.train(files=[...], **train_options)
```

Small nit:

```python
train_options = tokenizer.model.get_train_options()
tokenizer.train(..., train_options)
```

feels strange, why not:

```python
train_options = tokenizer.get_train_options()  # No need to call the underlying model.
tokenizer.train(..., train_options)
```

The main issue I have with that API is for training a tokenizer from scratch:

```python
tokenizer = Tokenizer(BPE())
train_options = tokenizer.get_train_options()
```

should not work, as there's no way this code can know what vocabulary size I want. This code should raise an Exception IMO, saying the user should define the `vocab_size`.

`BpeTrainer` with a `Unigram` model does not make sense, we should avoid at all cost enabling the mixup.
Force-pushed from b818d10 to f1a9742
I think you might have the wrong idea about the way things currently work. The […]

Agreed, and we don't allow it. It makes me realize that I treated this path as […]
This lets us keep everything that was set on the model except the vocabulary when trained. For example, this lets us keep the configured `unk_token` of BPE when it's trained.
Force-pushed from 7e47862 to 2410452
- `WordLevelTrainer` (Fix "Add the WordLevelTrainer" #265)
- `Trainer::process_tokens` […]
- The `Model` can now return its associated trainer with `get_trainer`. This removes the need to provide a trainer in Python (could be optional on Rust too) (Fix "Make the `trainer` optional when training" #527)
- The `Model` on the `Tokenizer` now gets trained in-place (see the sketch after this list). This allows keeping its customization (dropout, unk_token, ...) while just training it. That's also more in line with the role of the trainer, which is to train a Model, not to build one. (Fix "BPE dropout not working as expected" #201 & "Trainer trains the Model in-place" #526)
- The `Model` is wrapped in a `RwLock` on the Python side. This is something that we'll have to do anyway very soon to provide the ability to customize the components' properties in Python.

Missing before merge:
- […] `Model` after training
- […] `Tokenizer.train` is used
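To make the "trained in-place" bullet concrete, here is a rough, self-contained Rust sketch. The types and method names are illustrative, not the actual `tk` traits: the trainer only replaces the learned vocabulary on an existing model, so user-configured fields such as `unk_token` or `dropout` survive training.

```rust
// Illustrative sketch only, not the actual tokenizers traits.
#[derive(Default)]
struct Bpe {
    vocab: Vec<String>,
    unk_token: Option<String>, // configured by the user, must survive training
    dropout: Option<f32>,
}

struct BpeTrainer {
    vocab_size: usize,
}

impl BpeTrainer {
    // Training mutates the model in place: only the learned part (the vocab)
    // is replaced, every other setting on the model is left untouched.
    fn train(&self, model: &mut Bpe, words: &[&str]) {
        model.vocab = words
            .iter()
            .take(self.vocab_size)
            .map(|w| w.to_string())
            .collect();
    }
}

fn main() {
    let mut model = Bpe {
        unk_token: Some("[UNK]".into()),
        dropout: Some(0.1),
        ..Default::default()
    };
    let trainer = BpeTrainer { vocab_size: 2 };
    trainer.train(&mut model, &["hello", "world", "again"]);
    // The configured unk_token / dropout are still there after training.
    assert_eq!(model.unk_token.as_deref(), Some("[UNK]"));
    assert_eq!(model.dropout, Some(0.1));
    println!("vocab: {:?}", model.vocab);
}
```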