
UnifiedLog


Project description

Our research aims to develop a log anomaly detector that can be used effectively on new logs right away. We propose UnifiedLog, a log anomaly detection framework that consists of two parts: a transformer-based encoder that produces a general log representation using only the semantic information in the logs, and a transformer-based detector that performs anomaly detection on sequences of these representations. Because this approach maps all types of logs into a single representation space, one anomaly detection model can be trained to detect anomalies on multiple, different datasets. The encoder part of UnifiedLog is trained with masked language modelling (MLM), a self-supervised objective, to learn a unified representation for any log line. A general log representation approach must be able to exploit unlabeled data, since labeled datasets are scarce in this field. The detector part of UnifiedLog is inspired by NeuralLog (https://github.com/LogIntelligence/NeuralLog), a transformer-based anomaly detector designed to classify a sequence of representations. Thanks to the unified representations, the detector can be trained on multiple datasets simultaneously, and the cross-domain information improves its predictive capabilities. We experiment with varying the training datasets of both the encoder and the detector to show that they can generalize to unseen datasets. This is the first comprehensive model that detects anomalies on datasets it has never been trained on.

To summarize, our main contributions are as follows:

  • We propose UnifiedLog, a framework capable of log anomaly detection on multiple datasets simultaneously.
  • UnifiedLog is the first model that aims to predict anomalies on datasets not used in training.
  • We confirm that it is better to represent raw log messages with a single unified language model than with log parsing.
  • We suggest a new approach to evaluating log anomaly detection systems by combining performance metrics on multiple datasets.
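
To make the two-part design described above concrete, the following minimal sketch shows how a detector of this kind can classify a window of per-line encoder embeddings as normal or anomalous. It is illustrative only, written in plain PyTorch with placeholder sizes that mirror the example config further below; it is not the implementation in this repository.

```python
# Illustrative sketch of the two-stage idea (NOT the repository implementation).
# A line encoder turns each log line into a vector; a small transformer classifier
# then labels a window of consecutive line vectors as normal or anomalous.
import torch
import torch.nn as nn

class SequenceAnomalyDetector(nn.Module):
    def __init__(self, embed_dim=64, ff_dim=256, num_heads=8, num_layers=1, dropout=0.5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(embed_dim, 2)   # normal vs. anomalous

    def forward(self, line_embeddings):             # (batch, window_len, embed_dim)
        h = self.encoder(line_embeddings)           # contextualize the window
        return self.classifier(h.mean(dim=1))       # pool over the window, then classify

# Example: a batch of 4 windows, each with 20 log-line embeddings of size 64.
detector = SequenceAnomalyDetector()
logits = detector(torch.randn(4, 20, 64))
print(logits.shape)  # torch.Size([4, 2])
```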

Benchmark performance

In this training configuration, all 17 datasets from the Loghub project are used to train both the encoder and the detector parts of the model. The first 5 million lines of each dataset were used, in an 80-10-10 train/val/test split. Each row represents a unique run. This experiment sets the baseline performance.

| Embed Dim | BGL (NUM) | Hadoop (NUM) | HDFS (NUM) | OpenStack (NUM) | Thunderbird (NUM) | BGL (0-9) | Hadoop (0-9) | HDFS (0-9) | OpenStack (0-9) | Thunderbird (0-9) |
|-----------|-----------|--------------|------------|-----------------|-------------------|-----------|--------------|------------|-----------------|--------------------|
| 16        | 0.945     | 0.994        | 0.430      | 0.210           | 0.999             | 0.999     | 0.205        | 0.053      | 0.887           | 0.999              |
| 32        | 0.945     | 0.993        | 0.470      | 0.211           | 0.999             | 0.995     | 0.123        | 0.087      | 0.810           | 0.999              |
| 64        | 0.999     | 0.996        | 0.643      | 0.211           | 0.999             | 0.999     | 0.845        | 0.560      | 0.733           | 0.999              |
| 128       | 0.999     | 0.996        | 0.921      | 0.211           | 0.999             | 0.999     | 0.842        | 0.633      | 0.725           | 0.999              |

(NUM) columns use the NUM tokenization strategy; (0-9) columns use the 0-9 tokenization strategy.

Performance on unseen datasets

In this training configuration, one dataset is omitted from the training sets of both the encoder and the detector. Each value represents a unique run.

| Embed Dim | BGL (NUM) | Hadoop (NUM) | HDFS (NUM) | OpenStack (NUM) | Thunderbird (NUM) | BGL (0-9) | Hadoop (0-9) | HDFS (0-9) | OpenStack (0-9) | Thunderbird (0-9) |
|-----------|-----------|--------------|------------|-----------------|-------------------|-----------|--------------|------------|-----------------|--------------------|
| 16        | 0.921     | 0.956        | 0.083      | 0.087           | 0.548             | 0.098     | 0.918        | 0.134      | 0.105           | 0.529              |
| 32        | 0.736     | 0.937        | 0.109      | 0.117           | 0.952             | 0.931     | 0.863        | 0.134      | 0.167           | 0.446              |
| 64        | 0.943     | 0.948        | 0.082      | 0.163           | 0.940             | 0.302     | 0.955        | 0.134      | 0.163           | 0.868              |
| 128       | 0.597     | 0.958        | 0.082      | 0.176           | 0.954             | 0.616     | 0.920        | 0.134      | 0.271           | 0.968              |

(NUM) columns use the NUM tokenization strategy; (0-9) columns use the 0-9 tokenization strategy.

Requirements

To replicate our results using the UnifiedLog framework, we suggest the following hardware specifications:

  • GPU: NVIDIA A100 with 80GB of VRAM
  • Storage: At least 200GB of available disk space
  • RAM: 50GB or more

Installation

conda env create -f environment.yml

Download the datasets available on Loghub

python3 loghub_downloader.py -s <save-folder>

Preprocess datasets

python3 data_preprocess.py -d <path-to-downloaded-logs> -s <save-folder> -l <maximum-lines-per-dataset> -v <num-of-tokens> -a <ASCII-policy> -n <number-policy>

The script removes non-ASCII characters or replaces them with a special token, depending on the ASCII policy, and then either replaces all numeric characters with a combined [NUM] token or marks the characters 0-9 as special tokens. It then trains a WordPiece tokenizer on all the datasets and saves their tokenized versions.
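
As an illustration of the two number-handling strategies and the WordPiece training step, here is a hedged sketch that assumes the Hugging Face tokenizers package and an invented [NONASCII] placeholder token; the actual script's policies, vocabulary size, and special tokens may differ.

```python
# Illustrative sketch of the preprocessing strategies described above
# (assumes the Hugging Face `tokenizers` package; the real data_preprocess.py may differ).
import re
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

def normalize_line(line, ascii_policy="remove", number_policy="NUM"):
    if ascii_policy == "remove":
        line = line.encode("ascii", errors="ignore").decode()   # drop non-ASCII characters
    else:
        line = re.sub(r"[^\x00-\x7F]", "[NONASCII]", line)       # replace them with a token
    if number_policy == "NUM":
        line = re.sub(r"[0-9]+", "[NUM]", line)                  # combined number token
    return line                                                  # "0-9" policy: keep the digits

def train_wordpiece(files, vocab_size=1000, digits_as_special=False):
    specials = ["[UNK]", "[NUM]", "[MASK]", "[PAD]"]
    if digits_as_special:                    # 0-9 strategy: register digits as special tokens
        specials += list("0123456789")
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=specials)
    tokenizer.train(files, trainer)          # files: list of preprocessed log files
    return tokenizer
```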

Three folders are created by this script:

  • tokenized
  • tokenized_for_detector
  • labels

Train and evaluate UnifiedLog

python3 run.py -c <conf-file> -t <cpu-threads>

Example config file:

name: example # Name of the run in neptune
neptune_logging: false # Export NEPTUNE_API_KEY as environment variable if set to true
transformer_encoder:
  train_paths: "path_to/tokenized/" # Folder containing data tokenized for the encoder (also accepts a list directly containing files)
  load_path: "path_to/saved_model" # Load encoder from previous save
  save_path: "path_to/model_name" # Save path of the encoder
  save_every_epoch: True # If True, a model named save_path + _epoch_n.pkl is saved after every epoch
  train_val_test_split: [0.8, 0.9] # Cumulative split boundaries (80% train, 10% val, 10% test)
  mask_prob: 0.15
  replace_prob: 0.9
  num_tokens: 1004
  max_seq_len: 128
  attn_layers:
    dim: 16 # This also affects the embedding used by the detector part
    depth: 4
    heads: 6
  batch_size: 4096
  lr: 0.00003
  epochs: 5
  mask_token_id: 1002
  pad_token_id: 1003
  max_train_data_size: 10000000      # Cap maximum lines used from one dataset for training
anomaly_detector:
  train_paths: "path_to/tokenized_for_detector" # folder created by data_preprocess.py
  label_paths: "path_to/labels" # folder created by data_preprocess.py
  test_data_paths: "path_to/tokenized_for_detector"
  test_labels: "path_to/labels" 
  load_path: null # Load detector from previous save
  save_path: null # Save path of the detector
  train_val_test_split: [0.8, 0.9] 
  lr_decay_step_size: 25
  lr_decay_gamma: 0.9
  early_stop_tolerance: 3
  early_stop_min_delta: 0
  batch_size: 64
  epochs: 200
  embed_dim: 64
  ff_dim: 256
  max_len: 20
  num_heads: 8
  dropout: 0.5
  lr: 0.00003
  balancing_ratio: 1
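
The transformer_encoder options above (num_tokens, max_seq_len, attn_layers with dim/depth/heads, mask_prob, replace_prob, mask_token_id, pad_token_id) resemble the hyperparameters of the x-transformers and mlm-pytorch packages. Assuming that correspondence, which is only a guess and not something this README states, the encoder's MLM pre-training could be wired up roughly as in the sketch below; it is not the repository's run.py.

```python
# Rough sketch of how the transformer_encoder options could drive MLM pre-training.
# Assumes the x-transformers and mlm-pytorch packages; hyperparameter names mirror the
# example config above, but the setup itself is an assumption, not the repository code.
import torch
from x_transformers import TransformerWrapper, Encoder
from mlm_pytorch import MLM

encoder = TransformerWrapper(
    num_tokens=1004,                 # vocabulary size, including special tokens
    max_seq_len=128,                 # maximum tokens per log line
    attn_layers=Encoder(dim=16, depth=4, heads=6))

mlm = MLM(
    encoder,
    mask_prob=0.15,                  # fraction of tokens selected for masking
    replace_prob=0.9,                # fraction of selected tokens replaced by the mask token
    mask_token_id=1002,
    pad_token_id=1003,
    num_tokens=1004)

opt = torch.optim.Adam(mlm.parameters(), lr=3e-5)
batch = torch.randint(0, 1000, (8, 128))   # a toy batch of tokenized log lines
loss = mlm(batch)                          # masked-language-modelling loss
loss.backward()
opt.step()
```

The detector part then consumes the per-line embeddings produced by such an encoder, as sketched earlier in this README.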

Citation

If you use this code in your research, please cite the corresponding paper:

Insert citation here

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.
