This repository contains the code for the Malwords experiments.
From the abstract:
In this work we present a novel approach to malicious software behavioral modeling. Employing an emulation-based dynamic analysis platform, we isolate all the textual content found in the memory areas accessed by the processes of malware samples during execution. The frequency of these natural language words is counted, thus generating a bag-of-words model for each analyzed sample. This modeling allows us to adapt techniques derived from the domain of text mining and document classification to the extraction and weighting of relevant features. Finally, we test our new models on a dataset composed of more than 60,000 samples gathered over a two-year period. We experiment with several supervised and unsupervised Machine Learning algorithms, and show that our textual models can be used to obtain remarkably accurate fine-grained classification using a Neural Network-based classifier.
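
The bag-of-words modeling and feature weighting described in the abstract can be illustrated with a minimal sketch. This is not the code from this repository. The function names (`bag_of_words`, `tf_idf`), the word pattern, and the choice of plain TF-IDF as the weighting scheme are assumptions made for illustration, since the abstract does not say which weighting is used. The input is assumed to be a list of strings already extracted from the memory areas accessed by one sample.

```python
import math
import re
from collections import Counter

# Assumption: a "word" is a run of three or more ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def bag_of_words(memory_strings):
    """Count lowercase word frequencies over the strings of one sample."""
    counts = Counter()
    for text in memory_strings:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts


def tf_idf(bags):
    """Weight each word by term frequency times inverse document frequency.

    `bags` is a list of Counters, one per sample. Words that occur in
    every sample get weight 0, because log(n / n) == 0.
    """
    n = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # each sample counts a word once

    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            word: (count / total) * math.log(n / doc_freq[word])
            for word, count in bag.items()
        })
    return weighted


# Hypothetical strings standing in for text found in sample memory.
sample_a = ["kernel32 CreateFileA", "kernel32 connect"]
sample_b = ["kernel32 RegSetValue"]

bags = [bag_of_words(sample_a), bag_of_words(sample_b)]
weights = tf_idf(bags)
```

In this sketch, words shared by all samples (such as `kernel`) carry no weight, while words specific to one sample are emphasized. The resulting dictionaries could then be converted to feature vectors for a classifier.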