This repository supplies additional material for the Malware Similarity paper.
This work aims to:
- Study malware similarity techniques and their limitations.
- Provide insights into how some of these challenges can be overcome.
This work was developed by Marcus Botacin, under the supervision of Prof. Dr. Paulo Lício de Geus and Prof. Dr. André Ricardo Abed Grégio.
The functions mentioned here were extracted from dynamic, transparent execution traces collected with our BranchMonitor solution.
We tackle the similarity matching problem from two perspectives: i) the features used, and ii) the matching metrics used.
In particular, we are interested in approaches that use functions as features, as shown below:
LdrGetProcedureAddress -> LdrLoadDll
LdrGetDllHandle -> LdrLoadDll
NtOpenMutant -> ZwMapViewOfSection
NtCreateMutant -> ZwMapViewOfSection
This kind of approach has a drawback: a function can be replaced by another one with the same behavior, as shown in the figures below:
Function-Based 1 | Function-Based 2 |
---|---|
Despite having the same behavior, these samples would be classified as non-similar by a function-based approach.
To solve this, we adopt a behavior-based approach, under which the samples above are considered similar, as shown below:
Function-Based 1 | Function-Based 2 | Our Approach |
---|---|---|
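The idea can be sketched in a few lines of Python. This is only an illustration, not the repository's `Function.to.behavior` script: the behavior class names (`LIBRARY`, `SYNCHRONIZATION`, `MEMORY`) are hypothetical, and splitting the edge list shown earlier into two samples is an assumption made for the example.

```python
# Hypothetical function-to-behavior mapping. The class names are illustrative
# and are NOT taken from the repository's Classes directory.
BEHAVIOR_OF = {
    "LdrGetProcedureAddress": "LIBRARY",
    "LdrGetDllHandle": "LIBRARY",
    "LdrLoadDll": "LIBRARY",
    "NtOpenMutant": "SYNCHRONIZATION",
    "NtCreateMutant": "SYNCHRONIZATION",
    "ZwMapViewOfSection": "MEMORY",
}


def parse_edges(text):
    """Parse lines of the form 'A -> B' into a set of (A, B) edges."""
    edges = set()
    for line in text.strip().splitlines():
        src, dst = (part.strip() for part in line.split("->"))
        edges.add((src, dst))
    return edges


def to_behavior_edges(edges):
    """Replace each function name by its (assumed) behavior class."""
    return {(BEHAVIOR_OF[src], BEHAVIOR_OF[dst]) for src, dst in edges}


# Assumed split of the edge list into two samples that differ only by
# same-behavior function replacement.
sample_1 = parse_edges("""
LdrGetProcedureAddress -> LdrLoadDll
NtOpenMutant -> ZwMapViewOfSection
""")
sample_2 = parse_edges("""
LdrGetDllHandle -> LdrLoadDll
NtCreateMutant -> ZwMapViewOfSection
""")

function_graphs_match = sample_1 == sample_2
behavior_graphs_match = to_behavior_edges(sample_1) == to_behavior_edges(sample_2)
print(function_graphs_match, behavior_graphs_match)
```

Under these assumptions, the function-based edge sets differ while the behavior-based edge sets are identical.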
The usual metric for similarity measurement is the following:
In this metric, the score is minimum (0.0) when the inputs are totally distinct and maximum (1.0) when they are exactly the same.
This metric also has a drawback: it penalizes the case in which one sample is embedded inside another, as in the example shown below:
Original Sample | Embedded Sample |
---|---|
In this example, the similarity score is 50%, even though sample 1 is completely embedded in sample 2. We therefore need a similarity metric that provides more information about the quality of the match.
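A small sketch reproduces this effect. It assumes the usual metric is the Jaccard index over graph edges (intersection size divided by union size), which matches the 0.0 to 1.0 behavior described above but is not confirmed by this README. The node names are placeholders.

```python
def jaccard(a, b):
    """Jaccard index of two edge sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty graphs are identical
    return len(a & b) / len(a | b)


# Placeholder graphs: every edge of `original` also appears in `embedded`,
# which contains as many extra edges again.
original = {("A", "B"), ("B", "C")}
embedded = original | {("C", "D"), ("D", "E")}

score = jaccard(original, embedded)
print(score)  # 0.5
```

The union counts the extra edges of the larger sample, so the score drops to 0.5 although the smaller sample is fully contained in it.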
Our proposal is to adopt the following metric:
In this metric, the similarity is maximum not only when the two samples are equal but also when one is contained in the other, as desired.
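One metric with this property is the overlap coefficient (also known as the Szymkiewicz-Simpson coefficient), which divides the intersection size by the size of the smaller set. Whether this is exactly the metric adopted in the paper is an assumption. The sketch below only illustrates the property described above, using placeholder node names.

```python
def overlap(a, b):
    """Overlap coefficient of two edge sets: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        # Convention chosen here: only two empty graphs count as identical.
        return 1.0 if a == b else 0.0
    return len(a & b) / min(len(a), len(b))


# Placeholder graphs: `original` is fully contained in `embedded`.
original = {("A", "B"), ("B", "C")}
embedded = original | {("C", "D"), ("D", "E")}
unrelated = {("X", "Y")}

print(overlap(original, embedded))   # 1.0
print(overlap(original, unrelated))  # 0.0
```

Because the denominator is the size of the smaller graph, extra edges in the larger sample no longer lower the score.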
The repository is organized as follows:
- Classes: Behavior classes associated with DLL functions.
- Examples: Graphs exemplifying the aforementioned approaches.
  - Behaviors: Behavior-based graphs.
  - Functions: Function-based graphs.
- Code: Python scripts to handle graphs and trace data.
  - Function.to.behavior: Given a function, return its behavior class.
  - Generate.graph: Given an edge list, draw the graph.
  - Graph.Match: Given two edge lists, compare the resulting graphs.
- Data: Data used in our experiments, so you can reproduce them.
  - Functions: Function traces for selected samples.
  - Results: Graph similarity results for selected samples.
- Papers: Research written material.
The graphs below exemplify the differences between the original approach and ours.
Function-Based | Behavior-Based |
---|---|
An important task enabled by our approach is sample clustering. The figures below show the clustering scores for the following datasets: Mimail, Klez, and a mix of both.
Small thresholds fail to properly cluster the mixed dataset; proper clustering is only achieved for thresholds higher than 80%. These thresholds also provide good clustering results for the same-family datasets.
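The role of the threshold can be illustrated with a minimal sketch. It assumes a simple scheme in which two samples are merged into the same cluster whenever their similarity score reaches the threshold (connected components via union-find); the README does not state which clustering procedure was actually used. The sample names and scores are invented for illustration and are not experimental results.

```python
from itertools import combinations


def cluster(samples, score, threshold):
    """Group samples whose pairwise score reaches the threshold (union-find)."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the cluster representative, compressing paths.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(group) for group in groups.values())


# Invented scores: high within a family, moderate across families.
SCORES = {
    frozenset(("mimail_a", "mimail_b")): 0.9,
    frozenset(("klez_a", "klez_b")): 0.9,
}


def score(a, b):
    return SCORES.get(frozenset((a, b)), 0.6)


samples = ["mimail_a", "mimail_b", "klez_a", "klez_b"]
low = cluster(samples, score, 0.5)
high = cluster(samples, score, 0.8)
print(low)
print(high)
```

With these invented scores, the low threshold merges both families into a single cluster, while the higher one keeps them apart.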
This work was published at SBSEG 2019.