Our research aims to develop a log anomaly detector that is effective on new logs immediately, without retraining. We propose UnifiedLog, a log anomaly detection framework with two parts: a transformer-based encoder that builds a general log representation using only the semantic information in logs, and a transformer-based detector that finds anomalies in sequences of these representations. Because this approach maps all types of logs into a single representation space, one anomaly detection model can detect anomalies across multiple, different datasets. We train the encoder with masked language modelling (MLM), a self-supervised method, to learn a unified representation for any log line; a general log representation approach must exploit unlabeled data, since labeled datasets remain scarce in this field. The detector part of UnifiedLog is inspired by NeuralLog (https://github.com/LogIntelligence/NeuralLog), a transformer-based anomaly detector designed for classification over a sequence of representations. Because the representations are unified, the detector can be trained on multiple datasets simultaneously, and the cross-domain information improves its predictive performance. We vary the training datasets of both the encoder and the detector to show that they generalize to unseen datasets. To our knowledge, this is the first comprehensive model to detect anomalies on datasets it has never been trained on.
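A minimal sketch of this two-stage design, assuming plain PyTorch (module names, mean pooling, and sizes here are illustrative simplifications, not the released implementation):

```python
import torch
import torch.nn as nn

class LineEncoder(nn.Module):
    """Stage 1: token-level transformer mapping one log line to a vector."""
    def __init__(self, num_tokens=1004, dim=64, depth=4, heads=8, max_seq_len=128):
        super().__init__()
        self.tok = nn.Embedding(num_tokens, dim)
        self.pos = nn.Embedding(max_seq_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids):  # (lines, seq_len) of token ids
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.body(self.tok(token_ids) + self.pos(positions))
        return hidden.mean(dim=1)  # one unified vector per log line

class SequenceDetector(nn.Module):
    """Stage 2: transformer classifier over a window of line representations."""
    def __init__(self, embed_dim=64, num_heads=8, ff_dim=256, dropout=0.5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=ff_dim, dropout=dropout,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(embed_dim, 2)  # normal vs. anomalous

    def forward(self, line_vecs):  # (batch, window_len, embed_dim)
        return self.head(self.body(line_vecs).mean(dim=1))

# A window of 20 log lines, each tokenized to 128 token ids.
lines = torch.randint(0, 1000, (20, 128))
window = LineEncoder()(lines).unsqueeze(0)   # (1, 20, 64)
logits = SequenceDetector()(window)          # anomaly logits for the window
```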
To summarize, our main contributions are as follows:
- We propose UnifiedLog, a framework capable of log anomaly detection on multiple datasets simultaneously.
- To our knowledge, UnifiedLog is the first model that aims to predict anomalies on datasets not seen during training.
- We confirm that a single unified language model represents raw log messages better than log parsing.
- We suggest a new approach to evaluating log anomaly detection systems by combining performance metrics on multiple datasets.
In this training configuration, all 17 datasets from the loghub project are used to train both the encoder and the detector parts of the model. The first 5 million lines of each dataset are used, with an 80-10-10 train/validation/test split. Each row represents a unique run. This experiment sets the baseline performance.
| Embed Dim | NUM: BGL | NUM: Hadoop | NUM: HDFS | NUM: OpenStack | NUM: Thunderbird | 0-9: BGL | 0-9: Hadoop | 0-9: HDFS | 0-9: OpenStack | 0-9: Thunderbird |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 0.945 | 0.994 | 0.430 | 0.210 | 0.999 | 0.999 | 0.205 | 0.053 | 0.887 | 0.999 |
| 32 | 0.945 | 0.993 | 0.470 | 0.211 | 0.999 | 0.995 | 0.123 | 0.087 | 0.810 | 0.999 |
| 64 | 0.999 | 0.996 | 0.643 | 0.211 | 0.999 | 0.999 | 0.845 | 0.560 | 0.733 | 0.999 |
| 128 | 0.999 | 0.996 | 0.921 | 0.211 | 0.999 | 0.999 | 0.842 | 0.633 | 0.725 | 0.999 |
In this training configuration, one dataset is omitted from the training sets of both the encoder and the detector, and the model is then evaluated on the omitted dataset. Each value represents a unique run. A sketch of this leave-one-dataset-out protocol follows the table.
| Embed Dim | NUM: BGL | NUM: Hadoop | NUM: HDFS | NUM: OpenStack | NUM: Thunderbird | 0-9: BGL | 0-9: Hadoop | 0-9: HDFS | 0-9: OpenStack | 0-9: Thunderbird |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 0.921 | 0.956 | 0.083 | 0.087 | 0.548 | 0.098 | 0.918 | 0.134 | 0.105 | 0.529 |
| 32 | 0.736 | 0.937 | 0.109 | 0.117 | 0.952 | 0.931 | 0.863 | 0.134 | 0.167 | 0.446 |
| 64 | 0.943 | 0.948 | 0.082 | 0.163 | 0.940 | 0.302 | 0.955 | 0.134 | 0.163 | 0.868 |
| 128 | 0.597 | 0.958 | 0.082 | 0.176 | 0.954 | 0.616 | 0.920 | 0.134 | 0.271 | 0.968 |
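The protocol can be summarized with the following placeholder sketch; `train_unifiedlog` and `evaluate` are hypothetical stand-ins, not functions from this repository:

```python
# Leave-one-dataset-out: both model parts are trained without ever seeing
# the held-out dataset, which is then used for evaluation.
EVAL_DATASETS = ["BGL", "Hadoop", "HDFS", "OpenStack", "Thunderbird"]
ALL_DATASETS = [...]  # the 17 loghub datasets (names omitted here)

def train_unifiedlog(train_sets):
    """Placeholder: train encoder + detector on every dataset in train_sets."""
    ...

def evaluate(model, dataset):
    """Placeholder: score the trained model on the held-out labeled dataset."""
    ...

scores = {}
for held_out in EVAL_DATASETS:
    model = train_unifiedlog([d for d in ALL_DATASETS if d != held_out])
    scores[held_out] = evaluate(model, held_out)
```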
To replicate our results using the UnifiedLog framework, we suggest the following hardware specifications:
- GPU: NVIDIA A100 with 80GB of VRAM
- Storage: At least 200GB of available disk space
- RAM: 50GB or more
Create the conda environment:

```bash
conda env create -f environment.yml
```
Download the loghub datasets:

```bash
python3 loghub_downloader.py -s <save-folder>
```
Preprocess and tokenize the downloaded logs:

```bash
python3 data_preprocess.py -d <path-to-downloaded-logs> -s <save-folder> -l <maximum-lines-per-dataset> -v <num-of-tokens> -a <ASCII-policy> -n <number-policy>
```
Based on the ASCII-policy, the script removes non-ASCII characters or replaces them with a special token. Based on the number-policy, it then either replaces every run of numeric characters with a combined [NUM] token or registers the characters 0-9 as special tokens. Finally, it trains a WordPiece tokenizer on all the datasets and saves their tokenized versions. An illustrative sketch of the two number policies follows the folder list below.
Three folders are created by this script:

- tokenized
- tokenized_for_detector
- labels
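For illustration, here is a hedged sketch of the two number policies and the WordPiece training step, using the Hugging Face `tokenizers` library; the actual script's token strings, regexes, and file layout may differ:

```python
# Hedged sketch of the two number policies (-n) and WordPiece training.
# Token names, regexes, and the training file path are assumptions.
import re
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

def apply_number_policy(line: str, policy: str) -> str:
    if policy == "NUM":
        # Combine every run of digits into a single [NUM] token.
        return re.sub(r"[0-9]+", "[NUM]", line)
    if policy == "0-9":
        # Keep digits; they are registered below as special tokens, so the
        # tokenizer never merges them into larger subwords.
        return line
    raise ValueError(f"unknown number policy: {policy}")

print(apply_number_policy("worker-42 failed after 3 retries", "NUM"))
# -> worker-[NUM] failed after [NUM] retries

# Train a WordPiece tokenizer over the preprocessed log files.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=1004,  # matches num_tokens in the example config below
    special_tokens=["[UNK]", "[MASK]", "[PAD]", "[NUM]"] + list("0123456789"),
)
tokenizer.train(["path_to/preprocessed_logs.txt"], trainer)  # hypothetical path
```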
Start a run with a configuration file:

```bash
python3 run.py -c <conf-file> -t <cpu-threads>
```
An example configuration file:

```yaml
name: example # Name of the run in Neptune
neptune_logging: false # Export NEPTUNE_API_KEY as an environment variable if set to true
transformer_encoder:
  train_paths: "path_to/tokenized/" # Folder containing data tokenized for the encoder (also accepts a list of files)
  load_path: "path_to/saved_model" # Load encoder from a previous save
  save_path: "path_to/model_name" # Save path of the encoder
  save_every_epoch: true # If true, a model named save_path + _epoch_n.pkl is saved every epoch
  train_val_test_split: [0.8, 0.9]
  mask_prob: 0.15
  replace_prob: 0.9
  num_tokens: 1004
  max_seq_len: 128
  attn_layers:
    dim: 16 # This also affects the detector part's embedding
    depth: 4
    heads: 6
  batch_size: 4096
  lr: 0.00003
  epochs: 5
  mask_token_id: 1002
  pad_token_id: 1003
  max_train_data_size: 10000000 # Cap on the lines used from one dataset for training
anomaly_detector:
  train_paths: "path_to/tokenized_for_detector" # Folder created by data_preprocess.py
  label_paths: "path_to/labels" # Folder created by data_preprocess.py
  test_data_paths: "path_to/tokenized_for_detector"
  test_labels: "path_to/labels"
  load_path: null # Load detector from a previous save
  save_path: null # Save path of the detector
  train_val_test_split: [0.8, 0.9]
  lr_decay_step_size: 25
  lr_decay_gamma: 0.9
  early_stop_tolerance: 3
  early_stop_min_delta: 0
  batch_size: 64
  epochs: 200
  embed_dim: 64
  ff_dim: 256
  max_len: 20
  num_heads: 8
  dropout: 0.5
  lr: 0.00003
  balancing_ratio: 1
```
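The encoder's MLM settings above (`mask_prob`, `replace_prob`, `mask_token_id`, `pad_token_id`) follow the usual BERT-style masking recipe. Below is a hedged sketch of how such masking is commonly implemented; it illustrates how these four values interact and is not necessarily the repository's exact code:

```python
import torch

def mlm_mask(tokens, mask_prob=0.15, replace_prob=0.9,
             mask_token_id=1002, pad_token_id=1003):
    """BERT-style masking with the hyper-parameters from the config above."""
    # Pick mask_prob of the non-padding positions as prediction targets.
    targets = (torch.rand(tokens.shape) < mask_prob) & (tokens != pad_token_id)
    # Labels keep the original ids at target positions; -100 is the usual
    # cross-entropy ignore_index, so all other positions add no loss.
    labels = torch.where(targets, tokens, torch.full_like(tokens, -100))
    # Overwrite replace_prob of the targets with [MASK]; the remainder keep
    # their original token so the model cannot rely on seeing [MASK] alone.
    replaced = targets & (torch.rand(tokens.shape) < replace_prob)
    corrupted = torch.where(replaced, torch.full_like(tokens, mask_token_id), tokens)
    return corrupted, labels

batch = torch.randint(0, 1000, (4, 128))  # 4 tokenized log lines
corrupted, labels = mlm_mask(batch)
```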
If you use this code in your research, please cite the corresponding paper:
Insert citation here
- Lajos Muzsai (muzsailajos@protonmail.com)
- András Lukács (andras.lukacs@ttk.elte.hu)
This project is licensed under the MIT License - see the LICENSE file for details.