
Commit

Add ULP from ICSME'22
zhujiem committed Sep 7, 2023
1 parent fcb019d commit a90854d
Showing 13 changed files with 460 additions and 7 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -44,6 +44,7 @@ Logparser provides a machine learning toolkit and benchmarks for automated log p
| ICWS'17 | [Drain](https://github.com/logpai/logparser/tree/main/logparser/Drain#drain) | [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.|
| ICPC'18 | [MoLFI](https://github.com/logpai/logparser/tree/main/logparser/MoLFI#molfi) | [A Search-based Approach for Accurate Identification of Log Message Formats](http://publications.uni.lu/bitstream/10993/35286/1/ICPC-2018.pdf), by Salma Messaoudi, Annibale Panichella, Domenico Bianculli, Lionel Briand, Raimondas Sasnauskas. |
| TSE'20 | [Logram](https://github.com/logpai/logparser/tree/main/logparser/Logram#logram) | [Logram: Efficient Log Parsing Using n-Gram Dictionaries](https://arxiv.org/pdf/2001.03038.pdf), by Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun (Peter) Chen. |
| ICSME'22 | [ULP](https://github.com/logpai/logparser/tree/main/logparser/ULP#ULP) | [An Effective Approach for Parsing Large Log Files](https://users.encs.concordia.ca/~abdelw/papers/ICSME2022_ULP.pdf), by Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, Mohammed A. Shehab. |

:bulb: Welcome to submit a PR to push your parser code to logparser and add your paper to the table.

1 change: 1 addition & 0 deletions THIRD_PARTIES.md
@@ -8,3 +8,4 @@ The logparser package is built on top of the following third-party libraries:
| MoLFI | https://github.com/SalmaMessaoudi/MoLFI | Apache-2.0 |
| alignment (LogMine) | https://gist.github.com/aziele/6192a38862ce569fe1b9cbe377339fbe | GPL |
| Logram | https://github.com/BlueLionLogram/Logram | NA |
| ULP | https://github.com/SRT-Lab/ULP | MIT |
2 changes: 0 additions & 2 deletions docs/tools/Drain.md
@@ -7,8 +7,6 @@ Drain is one of the representative algorithms for log parsing. It can parse logs

Drain first preprocesses logs according to user-defined domain knowledge, i.e., regular expressions. Second, Drain starts from the root node of the parse tree with the preprocessed log message. The 1st-layer nodes in the parse tree represent log groups whose log messages have different lengths. Third, Drain traverses from a 1st-layer node to a leaf node, selecting the next internal node by the tokens in the beginning positions of the log message. Then Drain calculates the similarity between the log message and the log event of each log group to decide whether to put the log message into an existing log group. Finally, Drain updates the parse tree by scanning the tokens in the same positions of the log message and the log event.
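
Below is a minimal sketch of the similarity step described above (illustrative only, not the logparser implementation; the example template, message, and threshold are made up):

```python
def seq_similarity(template_tokens, log_tokens):
    """Fraction of positions where the group template and the new log message agree.
    For illustration, a wildcard position in the template is counted as a match."""
    if len(template_tokens) != len(log_tokens):
        return 0.0  # Drain only compares messages of the same token length
    same = sum(1 for t, w in zip(template_tokens, log_tokens) if t == w or t == "<*>")
    return same / len(template_tokens)

template = "Receiving block <*> src: <*> dest: <*>".split()
message = "Receiving block blk_3587 src: /10.0.0.1 dest: /10.0.0.2".split()
# The message joins this group if the similarity exceeds a threshold (e.g. 0.5);
# otherwise a new group is created with the message itself as the template.
print(seq_similarity(template, message))  # 1.0
```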



Read more information about Drain from the following paper:

+ Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), *IEEE International Conference on Web Services (ICWS)*, 2017.
4 changes: 2 additions & 2 deletions docs/tools/LKE.md
@@ -1,13 +1,13 @@
LKE
===

LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messsages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.
LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messages. After further group splitting with fine-tuning, log keys are generated from the resulting clusters.

**Step 1**: Log clustering. Weighted edit distance is designed to evaluate the similarity between two logs, $WED=\sum_{i=1}^{n}\frac{1}{1+e^{x_{i}-v}}$, where $n$ is the number of edit operations needed to make the two logs the same, $x_{i}$ is the column index of the word edited by the $i$-th operation, and $v$ is a parameter to control the weight. LKE links two logs if the WED between them is less than a threshold $\sigma$. After going through all pairs of logs, each connected component is regarded as a cluster. The threshold $\sigma$ is calculated automatically by using K-means clustering to separate the WEDs of all pairs of logs into 2 groups; the largest distance in the group containing the smaller WEDs is selected as the value of $\sigma$.
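
For illustration, a rough sketch of evaluating the WED between two logs (a simplified column-by-column version; the K-means threshold selection is omitted, and the value of v is arbitrary here):

```python
import math

def weighted_edit_distance(log1, log2, v=10):
    # Each column whose words differ is treated as one edit operation;
    # x_i is the (1-based) column index of the edited word.
    t1, t2 = log1.split(), log2.split()
    wed = 0.0
    for i in range(max(len(t1), len(t2))):
        w1 = t1[i] if i < len(t1) else None
        w2 = t2[i] if i < len(t2) else None
        if w1 != w2:
            wed += 1.0 / (1.0 + math.exp((i + 1) - v))
    return wed

print(weighted_edit_distance(
    "Receiving block blk_1 src: /10.251.43.210:55700",
    "Receiving block blk_2 src: /10.251.43.211:55700"))
```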

**Step 2**: Cluster splitting. In this step, some clusters are further partitioned. LKE first finds the longest common sequence (LCS) of all the logs in the same cluster. The remaining parts of the logs are dynamic parts separated by the common words, such as “/10.251.43.210:55700” or “blk_904791815409399662”. The number of unique words in each dynamic-part column, denoted as |DP|, is counted. For example, |DP|=2 for the dynamic-part column between “src:” and “dest:” in log 2 and log 3. If the smallest |DP| is less than a threshold $\phi$, LKE will use this dynamic-part column to partition the cluster.
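
A small sketch of the |DP| counting described above (the toy logs below are illustrative, and columns are assumed to be position-aligned, which is a simplification):

```python
cluster = [
    "Receiving block blk_904791815409399662 src: /10.251.43.210:55700 dest: /10.251.43.210:50010",
    "Receiving block blk_904791815409399662 src: /10.251.43.211:55700 dest: /10.251.43.210:50010",
    "Receiving block blk_229544609093458123 src: /10.251.43.210:55700 dest: /10.251.43.210:50010",
]

# Columns whose word is not shared by every log form the dynamic parts;
# |DP| is the number of unique words observed in such a column.
columns = zip(*(log.split() for log in cluster))
dp_sizes = {i: len(set(col)) for i, col in enumerate(columns) if len(set(col)) > 1}
print(dp_sizes)  # {2: 2, 4: 2} -> the smallest |DP| below the threshold phi decides the split
```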

**Step 3**: Log template extraction. This step is similar to the step 4 of IPLoM. The only difference is that LKE removes all variables when they generate log templates, instead of representing them by wildcards.
**Step 3**: Log template extraction. This step is similar to step 4 of IPLoM. The only difference is that LKE removes all variables when they generate log templates, instead of representing them by wildcards.

Read more information about LKE from the following paper:

2 changes: 1 addition & 1 deletion logparser/LKE/README.md
@@ -1,6 +1,6 @@
# LKE

LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messsages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.
LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.

Read more information about LKE from the following paper:

59 changes: 59 additions & 0 deletions logparser/ULP/README.md
@@ -0,0 +1,59 @@
# ULP

ULP (Universal Log Parsing) is a highly accurate log parsing tool with the ability to extract templates from unstructured log data. ULP learns from sample log data to recognize future log events. It combines pattern matching and frequency analysis techniques. First, log events are organized into groups using a text processing method. Frequency analysis is then applied locally to instances of the same group to identify the static and dynamic content of log events. When applied to 10 log datasets of the Loghub benchmark, ULP achieves an average accuracy of 89.2%, outperforming four leading log parsing tools, namely Drain, Logram, Spell, and AEL. Additionally, ULP can parse up to four million log events in less than 3 minutes. ULP can be readily used by practitioners and researchers to parse large log files effectively and efficiently in support of log analysis tasks.
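
The frequency-analysis idea can be illustrated with a toy sketch (not the ULP implementation in `ULP.py`): within a group of similarly structured events, tokens that appear in every event are kept as static text, and the remaining tokens are replaced by the `<*>` wildcard.

```python
from collections import Counter

group = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.7 closed",
]

counts = Counter(token for line in group for token in line.split())
max_count = max(counts.values())  # frequency of tokens present in every event
template = " ".join(
    tok if counts[tok] == max_count else "<*>" for tok in group[0].split()
)
print(template)  # Connection from <*> closed
```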

Read more information about ULP from the following paper:

+ Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, Mohammed A. Shehab. [An Effective Approach for Parsing Large Log Files](https://users.encs.concordia.ca/~abdelw/papers/ICSME2022_ULP.pdf), *Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME)*, 2022.

### Running

The code has been tested in the following environment:
+ python 3.7.6
+ regex 2022.3.2
+ pandas 1.0.1
+ numpy 1.18.1
+ scipy 1.4.1

Run the following script to start the demo:

```
python demo.py
```
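
The demo wires the `LogParser` class from `ULP.py` to one of the Loghub_2k datasets. A minimal sketch of such an invocation is shown below (the dataset path and log format are illustrative and may differ from what `demo.py` actually uses):

```python
from logparser.ULP import LogParser

input_dir  = "../../data/loghub_2k/HDFS/"  # assumed location of the raw log file
output_dir = "demo_result/"                # where <log_file>_structured.csv is written
log_file   = "HDFS_2k.log"
log_format = "<Date> <Time> <Pid> <Level> <Component>: <Content>"  # assumed HDFS header layout

parser = LogParser(log_format, indir=input_dir, outdir=output_dir)
parser.parse(log_file)
```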

Run the following script to execute the benchmark:

```
python benchmark.py
```

### Benchmark

Running the benchmark script on the Loghub_2k datasets, you could obtain the following results.

| Dataset | F1_measure | Accuracy |
|:-----------:|:----------|:--------|
| HDFS | 0.999984 | 0.9975 |
| Hadoop | 0.999923 | 0.9895 |
| Spark | 0.994593 | 0.922 |
| Zookeeper | 0.999876 | 0.9925 |
| BGL | 0.999453 | 0.93 |
| HPC | 0.994433 | 0.9505 |
| Thunderbird | 0.998665 | 0.6755 |
| Windows | 0.989051 | 0.41 |
| Linux | 0.476099 | 0.3635 |
| Android | 0.971417 | 0.838 |
| HealthApp | 0.993431 | 0.9015 |
| Apache | 1 | 1 |
| Proxifier | 0.739766 | 0.024 |
| OpenSSH | 0.939796 | 0.434 |
| OpenStack | 0.834337 | 0.4915 |
| Mac | 0.981294 | 0.814 |


### Citation

:telescope: If you use our logparser tools or benchmarking results in your publication, please kindly cite the following papers.

+ [**ICSE'19**] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. [Tools and Benchmarks for Automated Log Parsing](https://arxiv.org/pdf/1811.03509.pdf). *International Conference on Software Engineering (ICSE)*, 2019.
+ [**DSN'16**] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](https://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf). *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)*, 2016.
223 changes: 223 additions & 0 deletions logparser/ULP/ULP.py
@@ -0,0 +1,223 @@
# =========================================================================
# This file is modified from https://github.com/SRT-Lab/ULP
#
# MIT License
# Copyright (c) 2022 Universal Log Parser
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# =========================================================================

import os
import pandas as pd
import regex as re
import time
import warnings
from collections import Counter
from string import punctuation

warnings.filterwarnings("ignore")


class LogParser:
def __init__(self, log_format, indir="./", outdir="./result/", rex=[]):
"""
Attributes
----------
rex : regular expressions used in preprocessing (step1)
path : the input path stores the input log file name
logName : the name of the input file containing raw log messages
savePath : the output path stores the file containing structured logs
"""
self.path = indir
self.indir = indir
self.outdir = outdir
self.logName = None
self.savePath = outdir
self.df_log = None
self.log_format = log_format
self.rex = rex

def tokenize(self):
event_label = []
# print("\n============================Removing obvious dynamic variables======================\n\n")
for idx, log in self.df_log["Content"].iteritems():
tokens = log.split()
tokens = re.sub(r"\\", "", str(tokens))
tokens = re.sub(r"\'", "", str(tokens))
tokens = tokens.translate({ord(c): "" for c in "!@#$%^&*{}<>?\|`~"})

re_list = [
"([\da-fA-F]{2}:){5}[\da-fA-F]{2}",
"\d{4}-\d{2}-\d{2}",
"\d{4}\/\d{2}\/\d{2}",
"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:[.,][0-9]{3})?",
"[0-9]{2}:[0-9]{2}:[0-9]{2}",
"[0-9]{2}:[0-9]{2}",
"0[xX][0-9a-fA-F]+",
"([\(]?[0-9a-fA-F]*:){8,}[\)]?",
"^(?:[0-9]{4}-[0-9]{2}-[0-9]{2})(?:[ ][0-9]{2}:[0-9]{2}:[0-9]{2})?(?:[.,][0-9]{3})?",
"(\/|)([a-zA-Z0-9-]+\.){2,}([a-zA-Z0-9-]+)?(:[a-zA-Z0-9-]+|)(:|)",
]

pat = r"\b(?:{})\b".format("|".join(str(v) for v in re_list))
tokens = re.sub(pat, "<*>", str(tokens))
tokens = tokens.replace("=", " = ")
tokens = tokens.replace(")", " ) ")
tokens = tokens.replace("(", " ( ")
tokens = tokens.replace("]", " ] ")
tokens = tokens.replace("[", " [ ")
event_label.append(str(tokens).lstrip().replace(",", " "))

self.df_log["event_label"] = event_label

return 0

def getDynamicVars2(self, petit_group):
petit_group["event_label"] = petit_group["event_label"].map(
lambda x: " ".join(dict.fromkeys(x.split()))
)
petit_group["event_label"] = petit_group["event_label"].map(
lambda x: " ".join(
filter(None, (word.strip(punctuation) for word in x.split()))
)
)

lst = petit_group["event_label"].values.tolist()

vec = []
big_lst = " ".join(v for v in lst)
this_count = Counter(big_lst.split())

if this_count:
max_val = max(this_count, key=this_count.get)
for word in this_count:
if this_count[word] < this_count[max_val]:
vec.append(word)

return vec

def remove_word_with_special(self, sentence):
sentence = sentence.translate(
{ord(c): "" for c in "!@#$%^&*()[]{};:,/<>?\|`~-=+"}
)
length = len(sentence.split())

finale = ""
for word in sentence.split():
if (
not any(ch.isdigit() for ch in word)
and not any(not c.isalnum() for c in word)
and len(word) > 1
):
finale += word

finale = finale + str(length)
return finale

def outputResult(self):
self.df_log.to_csv(
os.path.join(self.savePath, self.logName + "_structured.csv"), index=False
)

def load_data(self):
headers, regex = self.generate_logformat_regex(self.log_format)

self.df_log = self.log_to_dataframe(
os.path.join(self.path, self.logname), regex, headers, self.log_format
)

def generate_logformat_regex(self, logformat):
"""Function to generate regular expression to split log messages"""
headers = []
splitters = re.split(r"(<[^<>]+>)", logformat)
regex = ""
for k in range(len(splitters)):
if k % 2 == 0:
splitter = re.sub(" +", "\\\s+", splitters[k])
regex += splitter
else:
header = splitters[k].strip("<").strip(">")
regex += "(?P<%s>.*?)" % header
headers.append(header)
regex = re.compile("^" + regex + "$")
return headers, regex

def log_to_dataframe(self, log_file, regex, headers, logformat):
"""Function to transform log file to dataframe"""
log_messages = []
linecount = 0
with open(log_file, "r") as fin:
for line in fin.readlines():
try:
match = regex.search(line.strip())
message = [match.group(header) for header in headers]
log_messages.append(message)
linecount += 1
except Exception as e:
print("[Warning] Skip line: " + line)
logdf = pd.DataFrame(log_messages, columns=headers)
logdf.insert(0, "LineId", None)
logdf["LineId"] = [i + 1 for i in range(linecount)]
return logdf

def parse(self, logname):
start_timeBig = time.time()
print("Parsing file: " + os.path.join(self.path, logname))

self.logname = logname

regex = [r"blk_-?\d+", r"(\d+\.){3}\d+(:\d+)?"]

self.load_data()
self.df_log = self.df_log.sample(n=2000)
self.tokenize()
self.df_log["EventId"] = self.df_log["event_label"].map(
lambda x: self.remove_word_with_special(str(x))
)
groups = self.df_log.groupby("EventId")
keys = groups.groups.keys()
stock = pd.DataFrame()
count = 0

re_list2 = ["[ ]{1,}[-]*[0-9]+[ ]{1,}", ' "\d+" ']

generic_re = re.compile("|".join(re_list2))

for i in keys:
l = []
slc = groups.get_group(i)

template = slc["event_label"][0:1].to_list()[0]
count += 1
if slc.size > 1:
l = self.getDynamicVars2(slc.head(10))
pat = r"\b(?:{})\b".format("|".join(str(v) for v in l))
if len(l) > 0:
template = template.lower()
template = re.sub(pat, "<*>", template)

template = re.sub(generic_re, " <*> ", template)
slc["event_label"] = [template] * len(slc["event_label"].to_list())

stock = stock.append(slc)
stock = stock.sort_index()

self.df_log = stock

self.df_log["EventTemplate"] = self.df_log["event_label"]
if not os.path.exists(self.savePath):
os.makedirs(self.savePath)
self.df_log.to_csv(
os.path.join(self.savePath, logname + "_structured.csv"), index=False
)
elapsed_timeBig = time.time() - start_timeBig
print(f"Parsing done in {elapsed_timeBig} sec")
return 0
1 change: 1 addition & 0 deletions logparser/ULP/__init__.py
@@ -0,0 +1 @@
from .ULP import *