This is a repository for those who are trying KorQuAD for the first time. I got an F1 of 88.468 and EM (Exact Match) of 68.947 with just the BERT-Multilingual (single) model, using only google-research's BERT github.
While it was not difficult to achieve these results by simply fine-tuning on the KorQuAD data, using CodaLab for the first time was very difficult. So this repository also explains in detail how to use CodaLab and how to upload your result to the KorQuAD leaderboard.
Fine-tuning is very easy if you use google-research's BERT github: everything is possible with the single command below. However, if you use google-research's BERT as-is, Korean subtokens are processed as UNK because of `unicodedata.normalize("NFD", text)`. Please see here and the pull request code for more detail. I have already applied the fix for this Korean issue in my code. I also used BERT-Base, Multilingual Cased (New, recommended) for the pre-trained weights (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters).
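To see why NFD normalization breaks Korean, note that it decomposes each precomposed Hangul syllable into its constituent jamo, which the multilingual vocab does not cover, so they end up as `[UNK]`. A minimal illustration (the actual tokenizer logic lives in BERT's `tokenization.py`):

```python
import unicodedata

text = "한국어"  # 3 precomposed Hangul syllables

# NFD splits each syllable into conjoining jamo (e.g. 한 -> ㅎ + ㅏ + ㄴ),
# so the string grows and each jamo later falls out of the vocab as [UNK].
nfd = unicodedata.normalize("NFD", text)
print(len(text), len(nfd))  # 3 8

# NFC recomposes the jamo back into whole syllables.
assert unicodedata.normalize("NFC", nfd) == text
```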
$ python run_squad.py \
--vocab_file=multi_cased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=multi_cased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=multi_cased_L-12_H-768_A-12/bert_model.ckpt \
--do_train=True \
--train_file=config/KorQuAD_v1.0_train.json \
--do_predict=True \
--predict_file=config/KorQuAD_v1.0_dev.json \
--train_batch_size=4 \
--num_train_epochs=3.0 \
--max_seq_length=384 \
--output_dir=./output \
--do_lower_case=False
I trained on a GeForce RTX 2080 with 11GB of memory, and it took about three to four hours.
Then you can evaluate the F1 and EM scores.
$ python evaluate-v1.0.py config/KorQuAD_v1.0_dev.json output/predictions.json
{"f1": 88.57661204608746, "exact_match": 69.05091790786284}
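For intuition, the scores above are SQuAD-style metrics: EM checks exact string equality between prediction and gold answer, and F1 is a bag-of-tokens overlap. A minimal sketch of the idea (the official `evaluate-v1.0.py` additionally normalizes answers and takes the max over multiple gold answers):

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> bool:
    # EM: the prediction must equal the gold answer exactly.
    return prediction == ground_truth

def f1_score(prediction: str, ground_truth: str) -> float:
    # F1: harmonic mean of token-level precision and recall.
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("바람과 함께 사라지다", "바람과 함께"))  # 0.8
```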
Put your checkpoint files in the config folder.
Sign up for CodaLab Worksheets, not CodaLab Competitions.
The CodaLab CLI only works with Python 2.7 right now, so if Python 3 is your default, I recommend setting up an Anaconda environment rather than a virtualenv.
$ conda create -n py27 python=2.7 anaconda
$ conda activate py27
(py27) $ pip install codalab -U --user
Click the New Worksheet button in the upper right corner and name the worksheet.
If you don't want to see the evaluation on the dev set, just skip this part.
- Click the Upload button and upload `output/predictions.json`, which was created during fine-tuning.
- In the web interface terminal at the top of the page, type the following command, where `<name-of-your-uploaded-prediction-bundle>` is uuid[0:8] of the uploaded `output/predictions.json` bundle:

# web interface terminal
cl macro korquad-utils/dev-evaluate-v1.0 <name-of-your-uploaded-prediction-bundle>
The reason for this split is that uploading the config folder (checkpoint files) every time takes a long time.
- src folder: only the source files to run on CodaLab, e.g. `run_squad.py`
- config folder: `bert_config.json`, `KorQuAD_v1.0_dev.json`, `vocab.txt`, and the checkpoint files (fine-tuned checkpoint)
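Assuming the split above, the local layout before uploading can be sketched like this (placeholder files only; the checkpoint step number 45000 matches the `--init_checkpoint` used later, but it will vary with your training run):

```python
from pathlib import Path

# Recreate the two-folder layout described above with empty placeholder files.
files = [
    "src/run_squad.py",
    "config/bert_config.json",
    "config/vocab.txt",
    "config/KorQuAD_v1.0_dev.json",
    # fine-tuned checkpoint files (step number will vary)
    "config/model.ckpt-45000.index",
    "config/model.ckpt-45000.meta",
    "config/model.ckpt-45000.data-00000-of-00001",
]
for f in files:
    p = Path(f)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()

print(sorted(p.name for p in Path("config").iterdir()))
```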
Upload the folders using the CodaLab CLI. You can also upload them through the Web UI, but it's slow for folders, so use the CLI.
# command on Anaconda-Python2.7
# cl upload <folder-name> -n <bundle-name>
$ cl upload Leaderboard -n src
$ cl upload config -n config
# web interface terminal
cl add bundle korquad-data//KorQuAD_v1.0_dev.json .
You may see `bundle spec . doesn't match any bundles`, but it's not a problem to proceed.
Then, run the model on the dev set! `:src` and `:config` follow the `:<bundle-name>` form.
# command on Anaconda-Python2.7
$ cl run :KorQuAD_v1.0_dev.json :src :config "python src/run_KorQuAD.py --bert_config_file=config/bert_config.json --vocab_file=config/vocab.txt --init_checkpoint=config/model.ckpt-45000 --do_predict=True --output_dir=output config/KorQuAD_v1.0_dev.json predictions.json" -n run-predictions --request-docker-image tensorflow/tensorflow:1.12.0-gpu-py3 --request-memory 11g --request-gpus 1
- About arguments: the original KorQuAD tutorial recommends passing python arguments in this form: `python src/<path-to-prediction-program> <input-data-json-file> <output-prediction-json-path>`. The command above matches this pattern, with paths of the form `<bundle-name>/file-name`: `python src/run_KorQuAD.py --bert_config_file=config/bert_config.json --vocab_file=config/vocab.txt --init_checkpoint=config/model.ckpt-45000 --do_predict=True --output_dir=output config/KorQuAD_v1.0_dev.json predictions.json`
- About the `ImportError: libcuda.so.1` error: use the `tensorflow 1.12.0` docker image with `python3`, and add `--request-gpus 1` to explicitly request a GPU.
- About out of memory: add `--request-memory 11g`.
Add the prediction file to a bundle. `{MODELNAME}` can't contain spaces or special characters.
# web interface terminal
cl make run-predictions/predictions.json -n predictions-{MODELNAME}
Let's see if we can evaluate the prediction file for the dev set.
# web interface terminal
cl macro korquad-utils/dev-evaluate-v1.0 predictions-{MODELNAME}
After this, to submit your result to the leaderboard, see part 3 of the original KorQuAD tutorial.
This repository is released under the same license as the google-research BERT source code (Apache 2.0).
- Tae Hwan Jung (Jeff Jung) @graykode, Kyung Hee Univ. CE (Undergraduate)
- Author Email : nlkey2022@gmail.com
- Reference : Original KorQuAD tutorial, The Land Of Galaxy Blog