BERT finetuning (classification + language model) #637

Merged: 14 commits, Sep 27, 2019

Conversation


@Bycob Bycob commented Sep 10, 2019

  • Add training with pytorch backend
  • Finetune and test BERT-based text classifiers
  • Finetune language models using word masking

<!> When tracing models, use PyTorch 1.3.1 and the latest transformers (formerly pytorch-transformers):

pip3 install torch==1.3.1 transformers

Added parameters

  • Text Input Connector

| Parameter | Type | Optional | Default | Description |
| --- | --- | --- | --- | --- |
| ordered_words | bool | yes | false | Word-based processing with positional information |
| wordpiece_tokens | bool | yes | false | If true, the vocabulary may contain partial words, so a word can be split into multiple tokens |
| punctuation_tokens | bool | yes | false | Treat each punctuation sign as a token (if false, punctuation is stripped from the input) |
  • Torch MLLib

| Parameter | Type | Optional | Default | Description |
| --- | --- | --- | --- | --- |
| self_supervised | string | yes | "" | Self-supervised mode: "mask" for masked language model [TODO add option: "next" = next token prediction for GPT-2?] |
| embedding_size | int | yes | 768 | Embedding size for NLP models |
| freeze_traced | bool | yes | false | Freeze the traced part of the net during finetuning (e.g. for classification) |
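
For reference, here is a sketch of where these new parameters sit in a service-creation call, in the spirit of the curl examples further down in this PR. It uses Python with requests purely as an illustration; the host, repository path and nclasses value are placeholders, and not every combination of parameters shown here is necessarily valid.

import requests

# Illustrative only: the new input-connector and mllib parameters in context.
service_body = {
    "description": "BERT classifier with the new parameters",
    "mllib": "torch",
    "model": {"repository": "./classif_training/"},   # placeholder path
    "parameters": {
        "input": {
            "connector": "txt",
            "ordered_words": True,       # word-based processing with positional information
            "wordpiece_tokens": True,    # vocabulary may contain partial words ("##" pieces)
            "punctuation_tokens": True,  # keep punctuation signs as tokens
            "sequence": 512
        },
        "mllib": {
            "template": "bert",
            "self_supervised": "",       # "" = supervised, "mask" = masked language model
            "embedding_size": 768,       # embedding size of the NLP model
            "freeze_traced": False,      # freeze the traced part of the net during finetuning
            "nclasses": 20,
            "finetuning": True,
            "gpu": True
        }
    },
    "type": "supervised"
}

requests.put("http://localhost:8080/services/torch_bert_training", json=service_body)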


Bycob commented Sep 10, 2019

TODO before merge

  • Improve TxtInputConnector support for unsupervised/self-supervised datasets (i.e. without classes)
  • Fix TxtInputConnector regenerating the vocab during finetuning
  • Option to choose whether or not to freeze the BERT weights
  • Add models to the trace_pytorch_transformers script
  • Add unit tests for classification training and masked lm training

Then

  • Interrupt training from the platform
  • Block predict while training
  • Best model / Best metrics
  • Resume training with correct data (iteration count, remaining iterations)
  • Save template and model parameters, so that we don't have to specify them again when recreating the service
  • Have "template" use a model from an external source instead of the model already in the repository
  • f1 sparse for masked LM testing
  • Export DistilBERT and other models (once the pytorch-transformers bug is solved)
  • Add vocabulary generation with BPE if finetuning = false
  • Implement perplexity as a measure for language models (BERT & GPT-2); see the sketch below
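
For the perplexity item, a minimal sketch of the usual definition, perplexity = exp(mean per-token cross-entropy), assuming per-token negative log-likelihoods are already available; the function name and inputs are illustrative only, not part of the backend:

import math

# Minimal sketch: perplexity as exp of the mean per-token negative log-likelihood.
def perplexity(token_nlls):
    """token_nlls: per-token negative log-likelihoods, in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.0, 1.5, 3.0]))  # mean NLL ~2.17 nats -> perplexity ~8.7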


Bycob commented Sep 10, 2019

Example: Finetune a classification model

  • Trace the pretrained PyTorch BERT model
pip3 install --user transformers
mkdir classif_training
./trace_pytorch_transformers.py bert --output-dir classif_training --vocab --verbose
  • Run dede
  • Start the service
curl -X PUT "http://localhost:8080/services/torch_bert_training" -d '{
    "description": "News20 classification service using BERT",
    "mllib": "torch",
    "model": {
        "repository": "./classif_training/"
    },
    "parameters": {
        "input": {
            "connector": "txt",
            "ordered_words": true,
            "wordpiece_tokens": true,
            "punctuation_tokens": true,
            "sequence": 512
        },
        "mllib": {
            "template":"bert",
            "nclasses": 20,
            "finetuning":true,
            "gpu": true
        }
    },
    "type": "supervised"
}
'
  • Train the model on the News20 dataset
curl -X POST "http://localhost:8080/train" -d '{
    "service": "torch_bert_training", 
    "parameters": { 
         "mllib": {
            "solver": {
              "iterations":3000,
              "test_interval":250,
              "base_lr":1e-5,
              "iter_size":4,
              "snapshot":250,
              "solver_type":"ADAM"
            },
            "net": {
              "batch_size":8,
              "test_batch_size":4
            }
        },
        "input": {
            "shuffle":true
        },
        "output": {
            "measure":["f1", "mcll", "acc", "cmdiag", "cmfull"]
        }
    }, 
    "data": ["/opt/data/news20/train/", "/opt/data/news20/test/"]
}
'
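
Since the training call above is asynchronous, the job is usually monitored by polling the training endpoint. Below is a minimal Python sketch that sends the same body as the curl call and then polls, assuming the standard DeepDetect training API (POST /train to launch, GET /train with service and job to poll); the response field names used here (head.job, head.status, body.measure) are my assumption of the usual API and may need adjusting.

import time
import requests

DEDE = "http://localhost:8080"

train_body = {
    "service": "torch_bert_training",
    "parameters": {
        "mllib": {
            "solver": {"iterations": 3000, "test_interval": 250, "base_lr": 1e-5,
                       "iter_size": 4, "snapshot": 250, "solver_type": "ADAM"},
            "net": {"batch_size": 8, "test_batch_size": 4}
        },
        "input": {"shuffle": True},
        "output": {"measure": ["f1", "mcll", "acc", "cmdiag", "cmfull"]}
    },
    "data": ["/opt/data/news20/train/", "/opt/data/news20/test/"]
}

# Launch the asynchronous training job (same body as the curl call above).
job = requests.post(DEDE + "/train", json=train_body).json()
job_id = job["head"]["job"]  # assumed location of the job id in the response

# Poll until the job is no longer running.
while True:
    status = requests.get(DEDE + "/train",
                          params={"service": "torch_bert_training", "job": job_id}).json()
    print(status["head"].get("status"), status.get("body", {}).get("measure"))
    if status["head"].get("status") != "running":
        break
    time.sleep(30)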


Bycob commented Sep 17, 2019

Example: Finetune language model

  • Trace the pretrained PyTorch BERT model
pip3 install --user transformers
mkdir lm_training
./trace_pytorch_transformers.py bert -vo lm_training --vocab
  • Run dede
  • Start the service
curl -X PUT "http://localhost:8080/services/torch_bert_lm" -d '{
    "description": "BERT language model finetuning on News20 ",
    "mllib": "torch",
    "model": {
        "repository": "./lm_training/"
    },
    "parameters": {
        "input": {
            "connector": "txt",
            "ordered_words": true,
            "wordpiece_tokens": true,
            "punctuation_tokens": true,
            "sequence": 512
        },
        "mllib": {
            "template":"bert",
            "self_supervised":"mask",
            "finetuning": true,
            "gpu": true
        }
    },
    "type": "supervised"
}
'
  • Train the model on the News20 dataset
curl -X POST "http://localhost:8080/train" -d '{
    "service": "torch_bert_lm", 
    "parameters": { 
         "mllib": {
            "solver": {
              "iterations":3000,
              "test_interval":250,
              "base_lr":1e-5,
              "iter_size":8,
              "snapshot":250,
              "solver_type":"ADAM"
            },
            "net": {
              "batch_size":4,
              "test_batch_size":4
            }
        },
        "input": {
            "shuffle":true,
            "test_split":0.03
        },
        "output": {
            "measure":["acc", "acc-5"]
        }
    }, 
    "data": ["/opt/data/news20/train/"]
}
'
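
For context, the "mask" self-supervised mode follows the masked language model idea from BERT. The sketch below shows the standard BERT masking recipe (mask 15% of the tokens: 80% [MASK], 10% random token, 10% unchanged) in plain Python; it is only an illustration of the idea and not necessarily identical to what the torch backend implements.

import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (masked inputs, labels); label -1 means 'not masked, ignore in the loss'."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
        else:
            labels.append(-1)
    return inputs, labels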

@beniz beniz requested a review from fantes September 18, 2019 14:17

beniz commented Sep 25, 2019

Is the vocabulary matched against the wildcard tokens of BERT (the '##' tokens)? From the line

if ((it = _vocab.find(word)) != _vocab.end())

it seems that the vocab is matched only exactly; is my understanding correct?


beniz commented Sep 25, 2019

* Add vocabulary generation with BPE if finetuning = false

What would be needed to train a model from scratch with a new vocabulary?


beniz commented Sep 25, 2019

"save_period":250,

Please replace this with snapshot_interval, which is the default name in the DD API.


beniz commented Sep 25, 2019

"width": 512

The txt input connector already defines sequence for the max character sequence size; maybe we'd like to use the same name here.


beniz commented Sep 25, 2019

Example: Finetune a classification model

This fails for me, with error:

{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"\narguments for call are not valid:\n  \n  for operator aten::mean(Tensor self) -> Tensor:\n  expected at most 1 arguments but found 3 positional arguments.\n 

and server log:

[2019-09-25 06:50:06.470] [api] [info] Running DeepDetect HTTP server on 10.10.77.61:8501
[2019-09-25 06:50:09.859] [torch_bert_training] [info] loaded vocabulary of size=30522
[2019-09-25 06:50:11.417] [torch_bert_training] [info] Loading ml model from file /data1/beniz/torch_models/news20/bert-pretrained.pt.
[2019-09-25 06:50:12.942] [torch_bert_training] [error] service creation call failed
[2019-09-25 06:50:12.944] [api] [error] 10.10.77.61 "PUT /services/torch_bert_training" 500 3120

I believe this is due to the attention head of the traced model.

@BynaryCobweb can you make sure that ./trace_pytorch_transformers.py bert --output-dir classif_training --vocab --verbose is the right command? If I copy the bert-pretrained.pt from one of your trained news20 models, it seems to work. Your model has a different size (510MB) than the one I obtain by tracing with your command (514MB), and the attention head input size differs.

Then I get this error:

[2019-09-25 07:24:03.593] [api] [error] {"code":500,"msg":"InternalError","dd_code":500,"dd_msg":"Libtorch error:isTuple() ASSERT FAILED at /home/beniz/projects/deepdetect/dev/deepdetect/build_bert/pytorch/src/pytorch/torch/include/ATen/core/ivalue.h:246, please report a bug to PyTorch. (toTuple at /home/beniz/projects/deepdetect/dev/deepdetect/build_bert/pytorch/src/pytorch/torch/include/ATen/core/ivalue.h:246)\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x66 (0x7f2d32266cd6 in /home/beniz/projects/deepdetect/dev/deepdetect/build_bert/pytorch/src/pytorch/torch/lib/libc10.so)\nframe #1: c10::IValue::toTuple() const & + 0x222 (0x76f122 in ./dede)\nframe #2: dd::TorchModule::forward(std::vector<c10::IValue, std::allocatorc10::IValue >) + 0xa1d (0x75977d in ./dede)\nframe #3: dd::TorchLib<dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train(dd::APIData const&, dd::APIData&) + 0xd40 (0x776790 in ./dede)\nframe #4: std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result, std::__future_base::_Result_base::_Deleter>, std::_Bind_simple<dd::MLService<dd::TorchLib, dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train_job(dd::APIData const&, dd::APIData&)::{lambda()#1} ()>, int> >::_M_invoke(std::_Any_data const&) + 0xbd (0x69fbad in ./dede)\nframe #5: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) + 0x29 (0x686719 in ./dede)\nframe #6: + 0xea99 (0x7f2d39b4ca99 in /lib/x86_64-linux-gnu/libpthread.so.0)\nframe #7: std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) + 0x95 (0x6878b5 in ./dede)\nframe #8: std::thread::_Impl<std::_Bind_simple<std::__future_base::_Async_state_impl<std::_Bind_simple<dd::MLService<dd::TorchLib, dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train_job(dd::APIData const&, dd::APIData&)::{lambda()#1} ()>, int>::_Async_state_impl(dd::MLService<dd::TorchLib, dd::TxtTorchInputFileConn, dd::SupervisedOutput, dd::TorchModel>::train_job(dd::APIData const&, dd::APIData&)::{lambda()#1} (&&)())::{lambda()#1} ()> >::_M_run() + 0x5a (0x68abaa in ./dede)\nframe #9: + 0xb8c80 (0x7f2d2de1cc80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)\nframe #10: + 0x76ba (0x7f2d39b456ba in /lib/x86_64-linux-gnu/libpthread.so.0)\nframe #11: clone + 0x6d (0x7f2d2d36041d in /lib/x86_64-linux-gnu/libc.so.6)\n"}


Bycob commented Sep 25, 2019

Is the vocabulary matched against the wildcard tokens of BERT (the '##' tokens)? From the line

if ((it = _vocab.find(word)) != _vocab.end())

it seems that the vocab is matched only exactly; is my understanding correct?

Tokenization (including wordpiece) is done in src/txtinputfileconn.cc:349 (see also: WordpieceTokenizer::append_input()). Here we just convert the previously found tokens to ids.
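
To make the '##' handling concrete, here is a minimal sketch of greedy longest-match wordpiece tokenization against a vocabulary containing '##' continuation pieces. It is only an illustration of the usual algorithm, not the exact logic of WordpieceTokenizer::append_input().

# Minimal sketch of greedy longest-match wordpiece tokenization.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no sub-piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Example with a toy vocabulary:
vocab = {"finet", "##uning", "fine", "##tun", "##ing"}
print(wordpiece_tokenize("finetuning", vocab))  # ['finet', '##uning']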


Bycob commented Sep 25, 2019

* Add vocabulary generation with BPE if finetuning = false

What would be needed to train a model from scratch with a new vocabulary?

The model has to be traced without initial weights:

./trace_pytorch_transformers.py bert --vocab --verbose --not-pretrained

When using a traced model, the size of the vocabulary is fixed (as well as the sequence length), so I may add an option to the script to choose the vocabulary size.

Then the vocabulary should be generated; I think the best option is to use Byte Pair Encoding (RoBERTa uses BPE for its vocabulary). A BPE tokenizer may be required.

Apart from this, the training process should be similar to finetuning, with different hyperparameters.
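
For reference, a minimal sketch of learning BPE merges from word frequencies, in the spirit of Sennrich et al.; the input frequencies and the number of merges are placeholders, and this is only an illustration of what a vocabulary-generation step could look like.

from collections import Counter

# Minimal sketch of learning BPE merges from word frequencies.
def learn_bpe(word_freqs, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge to every word.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

# Toy example: frequent pairs get merged first.
print(learn_bpe({"lower": 5, "low": 7, "newer": 3}, num_merges=4))

The learned merges would then be applied to the training corpus to produce the fixed-size vocabulary expected by the traced model.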

@Bycob Bycob force-pushed the bert_training branch 3 times, most recently from ba54cba to 7061257 on September 25, 2019 16:00
@beniz beniz changed the base branch from master to bert_training September 26, 2019 15:09