We present LANCE (Log stAtemeNt reCommEnder), a DL-based approach for supporting the task of log statement generation and injection in the context of Java. LANCE is built on the recently proposed Text-To-Text Transfer Transformer (T5) architecture.
-
How to train a new SentencePiece Model
Before training the T5 small model, namely the core of LANCE, it is important to also train a new tokenizer (SentencePiece model) to accommodate the expanded vocabulary introduced by the Java programming language. To do so, we used the raw pre-training instances (Java corpus) plus English sentences from the well-known C4 dataset.
Pythonic way

```shell
pip install sentencepiece==0.1.96
```

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    '--input=all_sp.txt --model_prefix=LOG_SP --vocab_size=32000 '
    '--bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0 '
    '--shuffle_input_sentence=true --character_coverage=1.0 '
    '--user_defined_symbols=<LOG_STMT>'
)
```
We also provide our trained tokenizer under the following path: https://github.com/antonio-mastropaolo/LANCE/tree/main/Code
-
To set up a new GCS bucket for training and fine-tuning a T5 model, please follow the original guide provided by Google: https://cloud.google.com/storage/docs/quickstart-console
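Alternatively, the bucket can be created from the command line with `gsutil` (part of the Google Cloud SDK). The bucket name and region below are placeholders, not values used in the study:

```shell
# Create a regional bucket (replace name and region with your own values).
gsutil mb -l us-central1 gs://your-lance-bucket/

# Copy the tokenizer training corpus into the bucket.
gsutil cp all_sp.txt gs://your-lance-bucket/data/
```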
-
The datasets for pre-training, fine-tuning, validating, and testing LANCE can be found at this link: https://drive.google.com/drive/folders/1D12y-CIJTYLxMeSmGQjxEXjTEzQImgaH?usp=sharing
-
To pre-train and then fine-tune LANCE, please use the following:
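The repository's own scripts are the reference for the exact training invocation. Purely as an illustrative sketch of how fine-tuning typically looks with the `t5` library's `MtfModel` API: the bucket paths, TPU name, task name, and step count below are placeholders, not LANCE's actual configuration, and the snippet assumes the task has already been registered via `t5.data.TaskRegistry`.

```python
import t5.models

# Placeholder paths/names -- substitute your own bucket, TPU, and task.
MODEL_DIR = 'gs://your-lance-bucket/model'
PRETRAINED_DIR = 'gs://your-lance-bucket/pretrained'

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu='your-tpu-name',
    tpu_topology='2x2',
    model_parallelism=1,
    batch_size=128,
    sequence_length={'inputs': 512, 'targets': 512},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=None,
    iterations_per_loop=100,
)

# Fine-tune from the pre-trained checkpoint on the (hypothetical) task name.
model.finetune(
    mixture_or_task_name='log_injection_task',
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=100000,
)
```

This is a configuration sketch: it requires a Cloud TPU and a GCS bucket, so it cannot be run as-is.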
-
Under Miscellaneous, you can find the additional scripts used for the data analysis, as well as the exact hyper-parameter configuration we employed in the study.