# Clustering
Because of the way most programs work, slightly different backtraces (function call paths) can lead to the same end result - a crash caused by the same bug. Naturally, it makes sense to identify these and group them together. The purpose of the `create-problems` action is just that - to create these groups, called `Problem`s, from backtraces associated with `Report`s.
uReports sent by the ABRT client to FAF contain backtraces from multiple threads, the only one of concern here being the crash thread. A `Report` object should contain exactly one backtrace (`ReportBacktrace`) associated with a crash thread (`ReportBtThread` where `crashthread == True`).
Further, `ReportBacktrace`s have multiple `ReportBtFrame`s which constitute the backtrace itself. The frames are ordered by `ReportBtFrame.order`.
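To make these relationships concrete, here is a minimal sketch of the shape of this data using plain dataclasses. The real pyfaf models are SQLAlchemy classes; field names other than `crashthread` and `order` are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReportBtFrame:
    order: int
    function_name: str  # assumed field; real frames reference symbols indirectly

@dataclass
class ReportBtThread:
    crashthread: bool
    frames: List[ReportBtFrame] = field(default_factory=list)

@dataclass
class ReportBacktrace:
    threads: List[ReportBtThread] = field(default_factory=list)

def crash_thread_frames(backtrace: ReportBacktrace) -> List[ReportBtFrame]:
    # Exactly one thread per backtrace should have crashthread == True.
    (thread,) = [t for t in backtrace.threads if t.crashthread]
    return sorted(thread.frames, key=lambda frame: frame.order)
```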
`Report` objects have a foreign key to a `Problem` object. This foreign key should point to the same `Problem` iff the underlying cause (the bug) of the `Report`s is the same. Unfortunately, this characteristic cannot be objectively verified, so we need to take a best-effort approach (dendrogram clusters with a backtrace edit distance metric).
The core of the clustering is implemented in the satyr library. The `create-problems` action goes through the following steps:
- Load backtraces of the specified problem type (core, kerneloops etc.) from the database.
- Convert the backtraces to satyr backtraces.
- Quickly create buckets of at most 2000 backtraces.
- Call satyr clustering routines on each bucket separately.
- Retrieve the resulting clusters.
- Save clusters to the database, keeping the changes to a minimum.
See the `pyfaf.actions.create_problems.CreateProblems._create_clusters` method for details. Note that the terminology is a little confusing: buckets are called clusters, and backtraces are called threads. A sketch of the bucketing loop follows below.
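As a minimal sketch of that bucketing loop, assuming a generic `cluster_bucket` callable in place of the satyr routines (whose Python API is not reproduced here; the real implementation may also form buckets differently):

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
BUCKET_SIZE = 2000  # the per-bucket limit described above

def cluster_in_buckets(
    backtraces: Sequence[T],
    cluster_bucket: Callable[[Sequence[T]], List[List[T]]],
) -> List[List[T]]:
    """Split the backtraces into buckets and cluster each bucket separately."""
    clusters: List[List[T]] = []
    for start in range(0, len(backtraces), BUCKET_SIZE):
        clusters.extend(cluster_bucket(backtraces[start:start + BUCKET_SIZE]))
    return clusters
```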
The clustering itself happens separately on each bucket of at most 2000 backtraces.
First, satyr is used to compute distances between each pair of backtraces in the bucket. The metric used is the Levenshtein edit distance. This process has O(n^2) complexity, which is why the bucketing is necessary.
Then satyr is used to create a dendrogram from the distances. The resulting clusters are formed by cutting off the dendrogram at 0.3.
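satyr implements the distance computation and dendrogram in C; the sketch below is a pure-Python approximation. It assumes a Levenshtein distance over frame (function) names, normalized to [0, 1] so a fixed cut level like 0.3 is meaningful, and a single-linkage cut, which reduces to taking connected components of pairs closer than the cut level. satyr's actual normalization and linkage may differ.

```python
from typing import Dict, List, Sequence

def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences of frame (function) names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalize the edit distance to [0, 1]."""
    return levenshtein(a, b) / (max(len(a), len(b)) or 1)

def cluster_bucket(threads: List[Sequence[str]], cut: float = 0.3) -> List[List[Sequence[str]]]:
    """Single-linkage clustering via union-find: merging every pair closer
    than `cut` is equivalent to cutting a single-linkage dendrogram there."""
    parent = list(range(len(threads)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(threads)):
        for j in range(i + 1, len(threads)):   # O(n^2) pairs - why buckets exist
            if distance(threads[i], threads[j]) < cut:
                parent[find(i)] = find(j)

    groups: Dict[int, List[Sequence[str]]] = {}
    for i, thread in enumerate(threads):
        groups.setdefault(find(i), []).append(thread)
    return list(groups.values())
```

This `cluster_bucket` has exactly the shape the bucketing sketch above expects.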
The result of clustering is a list of lists of backtraces, e.g. `[[bt1, bt2], [bt3], [bt4, bt5]]`, forming three problems. Imagine this is saved in the DB. Now a new backtrace `bt6` appears that is similar to `bt4` and `bt5`. The result of clustering should be `[[bt1, bt2], [bt3], [bt4, bt5, bt6]]`. We need to save this result to the DB while changing the data as little as possible (and the changes should really be very small; I'd say 95% of problems should stay exactly the same). We need to keep the first two `Problem`s intact and add the new backtrace to the third problem. (Note there is no direct connection between problems resulting from the clustering and `Problem`s in the DB, i.e. the lists don't carry the DB `Problem`'s id.)
The `_iter_problems` method takes care of this task. For it to work efficiently, we need to keep the state of `Problem`s before the clustering in the `reuse_problems` dict. The key of the dict is a tuple of sorted backtrace ids; the value is the DB `Problem` id. This way, perfect matches (the first two problems in our example) can be taken care of in constant time by looking up the key. If the key is not found, the best match is used, but to get it, all problems in the previous DB state must be iterated, which is slower. Refer to the `_iter_problems` method code and comments for details.
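As a rough sketch of that matching logic (names here are illustrative, not the actual `_iter_problems` code, and the real best-match criterion may differ):

```python
from typing import Dict, List, Optional, Tuple

def match_clusters(
    clusters: List[List[int]],                   # new clustering, as backtrace ids
    reuse_problems: Dict[Tuple[int, ...], int],  # old state: sorted bt ids -> Problem id
) -> List[Tuple[List[int], Optional[int]]]:
    """Pair each new cluster with an existing Problem id (None => new Problem)."""
    results: List[Tuple[List[int], Optional[int]]] = []
    for cluster in clusters:
        key = tuple(sorted(cluster))
        problem_id: Optional[int] = reuse_problems.get(key)  # perfect match: O(1)
        if problem_id is None:
            # Fall back to the old problem sharing the most backtraces; this
            # requires iterating the whole previous state, which is slower.
            best_overlap = 0
            members = set(cluster)
            for old_key, old_id in reuse_problems.items():
                overlap = len(members.intersection(old_key))
                if overlap > best_overlap:
                    best_overlap, problem_id = overlap, old_id
        results.append((cluster, problem_id))
    return results
```

Running it on the example above, with `reuse_problems = {(1, 2): 10, (3,): 11, (4, 5): 12}` and new clusters `[[1, 2], [3], [4, 5, 6]]`, the first two clusters hit the dict directly and the third falls back to the overlap scan, matching problem 12.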