String Resemblance Grouping (SRG) is designed to find a subset of representative strings within a large collection of messages. These representative strings form groups with which to categorize the messages for further exploration or triage.
Version 1.0
SRG requires an environment set up to use RAPIDS.
- Problem Background
- Use Case
- Technique Overview
- Model Overview
- Training Input
- Inference Input
- Inference Output
- Future Work
- References
## Problem Background

When categorizing computer logs into groups with an assigned representative, there are two major considerations: the run time of the algorithm and hyperparameter selection. With millions of log entries, a large fraction of them unique, the primary approach to many of these data sets is reactive analysis: a problem emerges in the network, and the data is searched for information relevant to resolving it. What is proposed here is a way to proactively approach the data for situational awareness, potentially uncovering problems in the network that current heuristics and approaches have not discovered. The large volume of these logs necessitates a run time complexity less than quadratic in the number of messages.
The second consideration is hyperparameter selection. Most clustering approaches require the number of clusters, *k*, to be chosen before training, which is difficult when little is known about the data in advance.
These two considerations drive many of the design decisions of the SRG approach.
## Use Case

SRG is agnostic to log type. It can be trained over a single log source or over multiple log sources combined in a single data set. A model can be trained and fit over the same set to provide immediate insight into that set, or alternatively a model can be trained and saved to categorize an ongoing flow of log messages.
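The two workflows described above (fit-and-inspect versus train-once-then-categorize) can be sketched with a toy stand-in. Note that `SRGModel`, `fit`, and `assign` are hypothetical names for illustration only, not the library's actual API, and the length-bucket logic is a deliberately minimal placeholder for the real grouping technique:

```python
class SRGModel:
    """Hypothetical, minimal stand-in for an SRG model: it keeps one
    representative (the first message seen) per message-length bucket."""

    def __init__(self, bin_width=20):
        self.bin_width = bin_width
        self.reps = {}  # bucket id -> representative string

    def fit(self, logs):
        """Learn representatives from a collection of log messages."""
        for log in logs:
            self.reps.setdefault(len(log) // self.bin_width, log)
        return self

    def assign(self, log):
        """Return (numeric label, representative) for a new message,
        using the nearest learned length bucket."""
        target = len(log) // self.bin_width
        key = min(self.reps, key=lambda b: abs(b - target))
        return key, self.reps[key]
```

A trained model of this shape could be pickled after `fit` and later applied to an ongoing stream of messages via `assign`.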
## Technique Overview

The breadth of literature on string resemblance provides a good starting point to solve the problem at hand, so the primary focuses become time complexity and hyperparameter selection, as discussed in the Problem Background. The approach explored here therefore tries to balance time complexity with data-driven hyperparameter selection. For a large number of clustering algorithms, the number of clusters, *k*, must be supplied up front.
To keep the time complexity low while selecting the number of clusters, SRG works by subsetting the logs based on different parameters. The number of resulting disjoint subsets becomes the number of representatives, *k*.
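As one concrete illustration of this idea, the logs can be partitioned by message length, so that the number of non-empty buckets, rather than a user-supplied *k*, determines how many representatives are produced. This is a sketch under assumed parameters (the `bin_width` of 20 characters is illustrative), not the library's exact procedure:

```python
from collections import defaultdict

def subset_by_length(logs, bin_width=20):
    """Partition logs into disjoint buckets by message length.
    Each pass over the data is linear, and the number of non-empty
    buckets becomes the number of representative groups."""
    buckets = defaultdict(list)
    for log in logs:
        buckets[len(log) // bin_width].append(log)
    return dict(buckets)
```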
Next, the strings are shingled by either words or overlapping character sequences (n-grams), and resemblance between their shingle sets is used to group them further.
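Shingling and the resemblance between two shingle sets can be sketched as follows. This is an illustrative implementation of the standard definitions (Jaccard resemblance over shingle sets), not necessarily the exact variant SRG uses:

```python
def char_shingles(s, k=4):
    """Set of overlapping k-character substrings (k-shingles) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}

def word_shingles(s, k=2):
    """Set of overlapping k-word sequences of s."""
    words = s.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)} or {tuple(words)}

def resemblance(a, b):
    """Jaccard resemblance between two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)
```

Two messages sharing a common prefix or template score close to 1, while unrelated messages score near 0, which is what makes resemblance useful for grouping log templates.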
The last piece of subsetting can apply domain knowledge about the specific network logs being analyzed, such as pre-grouping HTTP URLs by their returned status codes. Metadata associated with a log is typically correlated with its content, and logs with similar metadata can be more similar to each other than to logs with different metadata. When domain knowledge can be leveraged in this way, it provides more focused situational awareness and better clustering.
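For the HTTP example above, a domain-knowledge pre-grouping might split requests by status-code class before any string comparison happens. The field names (`status`, `url`) and the class-of-code grouping are illustrative assumptions:

```python
def pregroup_by_status(http_logs):
    """Split HTTP log entries into subsets by status-code class
    (2xx, 4xx, 5xx, ...) so SRG only compares URLs within a class."""
    groups = {}
    for entry in http_logs:
        key = str(entry["status"])[0] + "xx"  # e.g. 404 -> "4xx"
        groups.setdefault(key, []).append(entry["url"])
    return groups
```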
A benefit of this approach is that, instead of fixing the number of groups in advance, the number of groups emerges from the data itself.
## Model Overview

The model stores the representatives and the means for assigning new messages to a representative group.
## Training Input

A collection of logs. These can be in a text file to be loaded into a RAPIDS cuDF DataFrame, or an already loaded collection.
## Inference Input

A single log instance, a collection of logs, or text files containing logs.
## Inference Output

For a single log instance, the representative and a corresponding numeric label; for a collection, a dataframe containing the original logs with their assigned representatives and numeric groups.
See this notebook for an example of building an SRG model and running inference with it.
## Future Work

Currently, SRG outputs the representatives and groups as the final result. Future work will instead leverage these representatives as initial "centroids" in a subsequent clustering step to refine the groups.
Further work will also explore bootstrapping the length and FastMap 1-D groupings using ensemble clustering, an approach that finds a meta-cluster label for data that has multiple clustering labels assigned to each point. The benefit, especially for the FastMap grouping, is smoothing out the variance in the 1-D groupings. The impact on run time is offset by the fact that all of the 1-D groupings can be run simultaneously, so the largest addition to run time comes from the ensemble clustering algorithm itself.
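The FastMap 1-D grouping mentioned above projects each string onto a single coordinate using only pairwise distances. The sketch below uses the standard FastMap projection formula with the usual pair-of-pivots heuristic; the choice of Jaccard distance over character shingles is an illustrative assumption, not necessarily SRG's distance function:

```python
def _shingles(s, k=3):
    """Set of overlapping k-character shingles of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}

def _dist(a, b):
    """Jaccard distance between the character-shingle sets of two strings."""
    sa, sb = _shingles(a), _shingles(b)
    return 1.0 - len(sa & sb) / len(sa | sb)

def fastmap_1d(strings):
    """Project strings onto one dimension via the FastMap formula:
    pick a point, find its farthest neighbour b, then b's farthest
    neighbour a; project every object o relative to pivots (a, b)."""
    a = strings[0]
    b = max(strings, key=lambda s: _dist(a, s))
    a = max(strings, key=lambda s: _dist(b, s))
    d_ab = _dist(a, b) or 1.0  # guard against identical pivots
    return [(_dist(a, o) ** 2 + d_ab ** 2 - _dist(b, o) ** 2) / (2 * d_ab)
            for o in strings]
```

Similar strings land at nearby coordinates, so grouping the 1-D values (and smoothing their variance across several randomized runs, as the ensemble idea above suggests) yields the FastMap-based subsets.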