Skip to content

Tutorial

Marc-Olivier Buob edited this page Jun 24, 2022 · 14 revisions

Preliminaries

The tutorial considers the /var/log/Xorg.0.log as an input file (example).

The presented methodology holds for any text file (e.g., a file storing a system command output).

Basic usage

In command line

  1. The following command shows how to run the clustering:
pattern-clustering -i /var/log/Xorg.0.log
  1. To save the results (in results.json) and get human readable results (in results.html), run:
pattern-clustering -i /var/log/Xorg.0.log -o results.json -H results.html
  1. To get more details about the supported options, run:
pattern-clustering --help

In python

  1. Load the input file:
LOG_FILENAME = "/var/log/Xorg.0.log" # Or any arbitrary log file
with open(LOG_FILENAME) as f:
    LINES = [line.strip() for line in f.readlines()]
  1. Compute the pattern-based clustering. Note you could tune parameters to customize the patterns considered by the pattern-based distance and the threshold used by the clustering algorithm.
from pattern_clustering import pattern_clustering

CLUSTERS = pattern_clustering(LINES)
print(CLUSTERS)
  1. For a more readable output, run in Jupyter:
from pattern_clustering import pattern_clustering_to_html
from pybgl.html         import html

html(pattern_clustering_to_html(LINES, CLUSTERS))

Advanced usage

It is possible to define a custom patterns and specify the threshold value used during the clustering.

  • The threshold is a value between 0.0 and 1.0 that defines the maximum distance between the cluster representative (the first line joining the cluster) and the other ones. The lower the threshold, the smaller the clusters
  • The patterns indicates how input line but be decomposed at the pattern scale. The library uses by default some default patterns (see details below).

In command line

  • To set the threshold value (that defaults to 0.6), use the --threshold (in short, -t) option:
pattern-clustering -i /var/log/Xorg.0.log -t 0.1
pattern-clustering -i /var/log/Xorg.0.log -t 0.5
pattern-clustering -i /var/log/Xorg.0.log -t 0.9
  • To specify custom patterns, create a JSON file containing the algorithm's parameters. Note that in the example below, if a key is omitted, the program uses the default parameters.

    • You could start with a more complete configuration file (here, arbitrarily named /tmp/conf.json) as follows:
pattern-clustering-mkconf > /tmp/conf.json
  • You may obviously discard, add or modify some patterns from /tmp/conf.json. The example below defines the "error level" pattern and only keep relevant defaults patterns for /var/log/Xorg.0.log.
{
    "threshold": 0.5,
    "patterns": {
        "int": "(-|[+])?[0-9]+",
        "float": "(-|[+])?[0-9]+([.][0-9]+)?",
        "spaces": "\\s+",
        "path": "(/[-/:._a-zA-Z0-9]+)",
        "word": "\\S+",
        "error level": "\\((EE|--|==|\\*\\*)\\)"
    }
}
  • Then, run the pattern clustering algorithm by passing the /tmp/conf.json configuration file:
pattern-clustering -i /var/log/Xorg.0.log -c /tmp/conf.json

In python

The snippet below reproduces what we have done in the previous section:

from pattern_clustering import *
from pybgl.html         import html

# Loads the input file
FILENAME = "/var/log/Xorg.0.log" # Or any arbitrary log file
with open(FILENAME) as f:
    LINES = [line.strip() for line in f.readlines()]
    
# Reuses some patterns defined in the module
map_name_re = {
    name : regexp
    for (name, regexp) in MAP_NAME_RE.items()
    if name in {"int", "float", "spaces", "path", "word"}
}

# Adds a custom pattern
map_name_re["error level"] = "\((EE|--|==|\*\*)\)"

# Updates the defaults parameters
env = PatternClusteringEnv()
env.set_patterns(map_name_re)

# Runs the clustering and displays the results
CLUSTERS = pattern_clustering(LINES, max_dist=0.5)
html(pattern_clustering_to_html(LINES, CLUSTERS))