-
Notifications
You must be signed in to change notification settings - Fork 1
Tutorial
Marc-Olivier Buob edited this page Jun 24, 2022
·
14 revisions
The tutorial considers the /var/log/Xorg.0.log
as an input file (example).
The presented methodology holds for any text file (e.g., a file storing a system command output).
- The following command shows how to run the clustering:
pattern-clustering -i /var/log/Xorg.0.log
- To save the results (in
results.json
) and get human readable results (inresults.html
), run:
pattern-clustering -i /var/log/Xorg.0.log -o results.json -H results.html
- To get more details about the supported options, run:
pattern-clustering --help
- Load the input file:
LOG_FILENAME = "/var/log/Xorg.0.log" # Or any arbitrary log file
with open(LOG_FILENAME) as f:
LINES = [line.strip() for line in f.readlines()]
- Compute the pattern-based clustering. Note you could tune parameters to customize the patterns considered by the pattern-based distance and the threshold used by the clustering algorithm.
from pattern_clustering import pattern_clustering
CLUSTERS = pattern_clustering(LINES)
print(CLUSTERS)
- For a more readable output, run in Jupyter:
from pattern_clustering import pattern_clustering_to_html
from pybgl.html import html
html(pattern_clustering_to_html(LINES, CLUSTERS))
It is possible to define a custom patterns and specify the threshold value used during the clustering.
- The threshold is a value between 0.0 and 1.0 that defines the maximum distance between the cluster representative (the first line joining the cluster) and the other ones. The lower the threshold, the smaller the clusters
- The patterns indicates how input line but be decomposed at the pattern scale. The library uses by default some default patterns (see details below).
- To set the
threshold
value (that defaults to 0.6), use the--threshold
(in short,-t
) option:
pattern-clustering -i /var/log/Xorg.0.log -t 0.1
pattern-clustering -i /var/log/Xorg.0.log -t 0.5
pattern-clustering -i /var/log/Xorg.0.log -t 0.9
-
To specify custom patterns, create a JSON file containing the algorithm's parameters. Note that in the example below, if a key is omitted, the program uses the default parameters.
- You could start with a more complete configuration file (here, arbitrarily named
/tmp/conf.json
) as follows:
- You could start with a more complete configuration file (here, arbitrarily named
pattern-clustering-mkconf > /tmp/conf.json
- You may obviously discard, add or modify some patterns from
/tmp/conf.json
. The example below defines the"error level"
pattern and only keep relevant defaults patterns for/var/log/Xorg.0.log
.
{
"threshold": 0.5,
"patterns": {
"int": "(-|[+])?[0-9]+",
"float": "(-|[+])?[0-9]+([.][0-9]+)?",
"spaces": "\\s+",
"path": "(/[-/:._a-zA-Z0-9]+)",
"word": "\\S+",
"error level": "\\((EE|--|==|\\*\\*)\\)"
}
}
- Then, run the pattern clustering algorithm by passing the
/tmp/conf.json
configuration file:
pattern-clustering -i /var/log/Xorg.0.log -c /tmp/conf.json
The snippet below reproduces what we have done in the previous section:
from pattern_clustering import *
from pybgl.html import html
# Loads the input file
FILENAME = "/var/log/Xorg.0.log" # Or any arbitrary log file
with open(FILENAME) as f:
LINES = [line.strip() for line in f.readlines()]
# Reuses some patterns defined in the module
map_name_re = {
name : regexp
for (name, regexp) in MAP_NAME_RE.items()
if name in {"int", "float", "spaces", "path", "word"}
}
# Adds a custom pattern
map_name_re["error level"] = "\((EE|--|==|\*\*)\)"
# Updates the defaults parameters
env = PatternClusteringEnv()
env.set_patterns(map_name_re)
# Runs the clustering and displays the results
CLUSTERS = pattern_clustering(LINES, max_dist=0.5)
html(pattern_clustering_to_html(LINES, CLUSTERS))