Using Artificial Intelligence Algorithms
dike (pronounced /ˈdaɪkiː/) is an open-source platform that combines malware analysis with artificial intelligence, more precisely its machine learning subfield.
At the moment, dike can only analyze files in the Portable Executable (PE) and Object Linking and Embedding (OLE) formats. Within this limitation, it has three main objectives:
- Regression of malice
- Classification in malware families
- Similarity analysis.
The software enables the creation of analysis pipelines (called models in the context of the platform), which deal with the specific steps of malware analysis and data engineering:
- Dataset management, where it uses three main sources of labeled PE and OLE files:
  - The open-source dataset DikeDataset
  - Accurate results of analyses performed by the analysts of the organization in which the platform is set up
  - Results of automatic VirusTotal scans
- Feature extraction, in which extractors are used to obtain relevant information such as:
  - Strings
  - Characteristics of the file format
  - Opcodes
  - Windows API calls
  - Macros
- Feature preprocessing, where preprocessors are used to transform the features into a format better suited to machine learning algorithms:
  - Transformations
  - Binarization
  - Discretization
  - Counting (and in a special approach, for categories of opcodes and API calls)
  - Vectorization
  - NGrams
  - Scaling
  - Dimensionality reduction
  - Transformations
- Training of machine learning models, with included cross-validation and evaluation (regression-wise and classification-wise).
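To make the preprocessing step more concrete, the following is a minimal sketch of how counting over opcode n-grams, followed by binarization, could work. It is not taken from the platform's code: the function names, the opcode sequence, and the vocabulary are all hypothetical, and only the Python standard library is used.

```python
from collections import Counter
from typing import Iterator, List, Sequence, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Gram]:
    """Yield every run of n consecutive tokens."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_vector(tokens: Sequence[str], vocabulary: Sequence[Gram], n: int) -> List[int]:
    """Count how often each n-gram of the vocabulary occurs in the tokens."""
    counts = Counter(ngrams(tokens, n))
    return [counts[gram] for gram in vocabulary]


def binarize(vector: Sequence[int], threshold: int = 0) -> List[int]:
    """Map each value to 1 if it exceeds the threshold, otherwise to 0."""
    return [1 if value > threshold else 0 for value in vector]


# Hypothetical opcode sequence, as an extractor might produce for one file.
opcodes = ["push", "mov", "call", "push", "mov", "ret"]
# Hypothetical vocabulary of 2-grams chosen as features.
vocabulary = [("push", "mov"), ("mov", "call"), ("call", "ret")]

counts = count_vector(opcodes, vocabulary, n=2)  # [2, 1, 0]
flags = binarize(counts)                         # [1, 1, 0]
```

A fixed vocabulary keeps the vector length identical for every file, which is what a downstream machine learning algorithm generally expects.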
dike is part of my Bachelor's thesis, which aims to demonstrate that artificial intelligence techniques can improve malware analysis. The document and the presentation (in Romanian 🇷🇴 only) can be found in a separate repository.
At the moment, the thesis is the only place where the following information can be found:
- Software requirements
- Architecture (more detailed than the description above)
- Testing
- Evaluation
- Further development.
- Download the script `manage.sh` from the folder `infrastructure`.
- Obtain a VirusTotal API key.
- Create and host (on a server which the platform can access) a TGZ archive containing two folders, `ghidra` (with a Ghidra project) and `qiling` (with the dynamically linked libraries needed by Qiling).
- Run the script and follow the instructions.
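The TGZ archive from the steps above could be built, for instance, with Python's standard `tarfile` module. This is only a sketch: the function name and the archive name are hypothetical, and it assumes that the two folders are expected at the top level of the archive, which should be checked against what `manage.sh` actually requires.

```python
import tarfile
from pathlib import Path
from typing import List

# Folder names taken from the setup steps.
REQUIRED_FOLDERS = ("ghidra", "qiling")


def build_archive(source_dir: Path, archive_path: Path) -> List[str]:
    """Pack the required folders from source_dir into a gzip-compressed tar.

    Returns the names of the archive members, so the result can be checked.
    """
    # Fail early, before writing anything, if a folder is missing.
    for folder in REQUIRED_FOLDERS:
        if not (source_dir / folder).is_dir():
            raise FileNotFoundError(f"Missing folder: {source_dir / folder}")

    with tarfile.open(archive_path, "w:gz") as archive:
        for folder in REQUIRED_FOLDERS:
            # arcname places the folder at the top level of the archive
            # (an assumption about the expected layout).
            archive.add(source_dir / folder, arcname=folder)

    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Example call (hypothetical paths):
# build_archive(Path("/srv/dike-data"), Path("/srv/www/dike-data.tgz"))
```

The resulting file still has to be hosted on a server that the platform can reach.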
If the repository hosting the platform is private, two steps need to be performed beforehand:
- Generate an asymmetric key pair via `ssh-keygen -t ed25519 -C "EMAIL_ADDRESS"`, where `EMAIL_ADDRESS` needs to be replaced with your email address.
- Add the public key to the repository's deploy keys section on GitHub.
Administrators can use a powerful command line interface by running the `dike` command on a leader server. Some available commands are demonstrated in the recording below.
Administrators also edit YAML files manually, following a schema that depends on the context in which the file is used. Some existing files (one per type, for example purposes only) have comments that document these schemas.
Other systems of the organization can use the scan services of the platform by sending HTTP or HTTPS requests (depending on the configuration) to the following API endpoints.
Route | Action |
---|---|
/get_malware_families | Retrieves the used malware families. |
/get_evaluation/MODEL_NAME | Retrieves the evaluation of a model. |
/get_configuration/MODEL_NAME | Retrieves the configuration. |
/get_features/MODEL_NAME/FILE_HASH | Retrieves the features of a file from the platform's dataset. |
/create_ticket/MODEL_NAME | Creates a prediction ticket. |
/get_ticket/TICKET_NAME | Retrieves the content of a prediction ticket. |
/publish/MODEL_NAME | Publishes for a specific model the results of a scan. |
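As an illustration of how another system might address these routes, here is a minimal client sketch using only the Python standard library. The base URL, the model name, and the helper names are hypothetical. The table does not state the HTTP method or the response format of each route, so the sketch assumes plain GET requests for the `get_*` routes and returns the raw response body. `/create_ticket` and `/publish` probably need a request payload that is not documented here, so they are left out.

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_url(base_url: str, route: str, *parameters: str) -> str:
    """Join the base URL, a route, and its URL-encoded path parameters."""
    segments = [route.strip("/")] + [quote(p, safe="") for p in parameters]
    return base_url.rstrip("/") + "/" + "/".join(segments)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Send a GET request (assumed method) and return the raw response body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Hypothetical deployment address and model name.
BASE_URL = "https://dike.example.org"
MODEL_NAME = "my_model"

families_url = build_url(BASE_URL, "get_malware_families")
evaluation_url = build_url(BASE_URL, "get_evaluation", MODEL_NAME)

# Against a running instance, the evaluation could then be requested with:
# body = fetch(evaluation_url)
```

Encoding each path parameter separately prevents a model name or file hash containing `/` from being read as an extra path segment.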
The most important resources used by the platform are listed in the table below.
Name | Description | Link |
---|---|---|
Ghidra | Software reverse engineering framework | repository |
VirusTotal API | Scanning API that aggregates multiple antivirus engines | website |
Qiling | Python 3 emulation framework | repository |
Pandas | Python 3 data analysis and manipulation library | repository |
scikit-learn | Python 3 machine learning library | repository |
Python 3 | General-purpose programming language | website |
Docker | Software product for OS-level virtualization | website |
Docker Compose | Tool for running multi-container applications on Docker | repository |
GitHub | Git repository hosting service | website |
YAML | Data-serialization language | website |