BERT finetuning (classification + language model) #637
Conversation
TODO before merge
Then
Example: Finetune a classification model
Force-pushed from 4d2767d to a92c4a6
Example: Finetune language model
The fix on txtinputconnector is temporary; vocab generation should be fixed in a more robust way.
Force-pushed from 6ff3a42 to 0f233a8
Is the vocabulary matched against the wildcard tokens of BERT (the '##' tokens)? From line
What would be needed to train a model from scratch with a new vocabulary?
Please replace with
The txt input connector already defines
This fails for me, with error:
and server log:
I believe this is due to the attention head of the traced model. @BynaryCobweb can you make sure that
Then I get this error:

```
[2019-09-25 07:24:03.593] [api] [error] {"code":500,"msg":"InternalError","dd_code":500,"dd_msg":"Libtorch error:isTuple() ASSERT FAILED at /home/beniz/projects/deepdetect/dev/deepdetect/build_bert/pytorch/src/pytorch/torch/include/ATen/core/ivalue.h:246, please report a bug to PyTorch. (toTuple at /home/beniz/projects/deepdetect/dev/deepdetect/build_bert/pytorch/src/pytorch/torch/include/ATen/core/ivalue.h:246)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x66 (0x7f2d32266cd6 in /home/beniz/projects/deepdetect/dev/deepdetect/build_bert/pytorch/src/pytorch/torch/lib/libc10.so)
frame #1: c10::IValue::toTuple() const & + 0x222 (0x76f122 in ./dede)
frame #2: dd::TorchModule::forward(std::vector<c10::IValue, std::allocatorc10::IValue >) + 0xa1d (0x75977d in ./dede)
frame #3: dd::TorchLib<dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train(dd::APIData const&, dd::APIData&) + 0xd40 (0x776790 in ./dede)
frame #4: std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, std::_Bind_simple<dd::MLService<dd::TorchLib, dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train_job(dd::APIData const&, dd::APIData&)::{lambda()#1} ()>, int> >::_M_invoke(std::_Any_data const&) + 0xbd (0x69fbad in ./dede)
frame #5: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) + 0x29 (0x686719 in ./dede)
frame #6: + 0xea99 (0x7f2d39b4ca99 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) + 0x95 (0x6878b5 in ./dede)
frame #8: std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<dd::MLService<dd::TorchLib, dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train_job(dd::APIData const&, dd::APIData&)::{lambda()#1} ()>, int>::_Async_state_impl(dd::MLService<dd::TorchLib, dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train_job(dd::APIData const&, dd::APIData&)::{lambda()#1} (&&)())::{lambda()#1} ()> >::_M_run() + 0x5a (0x68abaa in ./dede)
frame #9: + 0xb8c80 (0x7f2d2de1cc80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #10: + 0x76ba (0x7f2d39b456ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #11: clone + 0x6d (0x7f2d2d36041d in /lib/x86_64-linux-gnu/libc.so.6)
"}
```
Tokenization (including wordpiece) is done in
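To illustrate the '##' continuation tokens asked about above, here is a minimal sketch using the transformers `BertTokenizer` (an assumption for illustration; not necessarily the code path DeepDetect calls):

```python
# Minimal sketch: inspect how BERT's WordPiece vocabulary splits an
# out-of-vocabulary word into '##' continuation pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Every sub-word piece after the first one is prefixed with '##'.
print(tokenizer.tokenize("unaffable"))  # e.g. ['una', '##ffa', '##ble']
```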
The model has to be traced without initial weights:
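A possible sketch of that step, assuming the transformers API (the model is built from a config only, so its weights stay randomly initialized; sizes are placeholders and the actual trace script in this PR may differ):

```python
# Sketch: trace a BERT classification model without pretrained weights.
# Building the model from a config alone leaves the parameters randomly
# initialized; vocab size, sequence length and label count are placeholders.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(vocab_size=30000, num_labels=2, torchscript=True)
model = BertForSequenceClassification(config)
model.eval()

# Dummy inputs fix the sequence length baked into the traced graph.
dummy_ids = torch.zeros(1, 512, dtype=torch.long)
dummy_mask = torch.ones(1, 512, dtype=torch.long)

traced = torch.jit.trace(model, (dummy_ids, dummy_mask))
traced.save("bert_classif_untrained.pt")
```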
When using a traced model, the size of the vocabulary is fixed (as well as the sequence length), so I may add an option to the script to choose the size of the vocabulary. Then the vocabulary should be generated; the best option is to use Byte Pair Encoding, I think (RoBERTa uses BPE for its vocabulary). A BPE tokenizer may be required; see the sketch below. Besides this, the training process should be similar to finetuning, with different hyperparameters.
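For the vocabulary-generation step, one possible sketch with the Hugging Face `tokenizers` library (the library choice, corpus path, vocabulary size and special tokens are assumptions, not what this PR implements):

```python
# Sketch: build a BPE vocabulary from a raw text corpus.
# Paths, vocab_size and special tokens below are placeholders; vocab_size
# has to match the vocabulary size the model was traced with.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt
```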
Force-pushed from ba54cba to 7061257
Force-pushed from 7920fb7 to 48f74ad
<!> When tracing models, use PyTorch 1.3.1 and the latest transformers (formerly pytorch-transformers).
Added parameters