Malicious software, or malware, is a program or code designed to harm, damage, or disable computers, applications, systems, or mobile phones. The task here is to classify running software as benign or malware and, if it is malware, to identify which type of malware it is.
The dataset was created to represent a real-world situation as closely as possible, using malware that is prevalent in the real world. Made up of Spyware, Ransomware, and Trojan Horse malware, it provides a balanced dataset that can be used to test obfuscated malware detection systems.
CIC-MalMem-2022: https://www.unb.ca/cic/datasets/malmem-2022.html
The dataset features, grouped by their prefix, are:
pslist.nproc pslist.nppid pslist.avg_threads pslist.nprocs64bit pslist.avg_handlers
dlllist.ndlls dlllist.avg_dlls_per_proc
handles.nhandles handles.avg_handles_per_proc handles.nport handles.nfile handles.nevent handles.ndesktop handles.nkey handles.nthread handles.ndirectory handles.nsemaphore handles.ntimer handles.nsection handles.nmutant
ldrmodules.not_in_load ldrmodules.not_in_init ldrmodules.not_in_mem ldrmodules.not_in_load_avg ldrmodules.not_in_init_avg ldrmodules.not_in_mem_avg
malfind.ninjections malfind.commitCharge malfind.protection malfind.uniqueInjections
psxview.not_in_pslist psxview.not_in_eprocess_pool psxview.not_in_ethread_pool psxview.not_in_pspcid_list psxview.not_in_csrss_handles psxview.not_in_session psxview.not_in_deskthrd
psxview.not_in_pslist_false_avg psxview.not_in_eprocess_pool_false_avg psxview.not_in_ethread_pool_false_avg psxview.not_in_pspcid_list_false_avg psxview.not_in_csrss_handles_false_avg psxview.not_in_session_false_avg psxview.not_in_deskthrd_false_avg
modules.nmodules
svcscan.nservices svcscan.kernel_drivers svcscan.fs_drivers svcscan.process_services svcscan.shared_process_services svcscan.interactive_process_services svcscan.nactive
callbacks.ncallbacks callbacks.nanonymous callbacks.ngeneric
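import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix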
# Function to load the dataset and print a short summary
def importdata():
    MalMem2022 = pd.read_csv('/content/Obfuscated-MalMem2022.csv')
    # Printing the number of rows and the dataset shape
    print ("Dataset of Malware_memory_Analysis_2022 Length is: ", len(MalMem2022))
    print ("Dataset of Malware_memory_Analysis_2022 Shape is: ", MalMem2022.shape)
    # Printing the first five observations
    print ("Dataset of Malware_memory_Analysis_2022 is: ",MalMem2022.head())
    return MalMem2022
Dataset of Malware_memory_Analysis_2022 Length is: 58596
Dataset of Malware_memory_Analysis_2022 Shape is: (58596, 58)
Dataset of Malware_memory_Analysis_2022 is: pslist.nproc pslist.nppid pslist.avg_threads pslist.nprocs64bit
0 45 17 10.555556 0
1 47 19 11.531915 0
2 40 14 14.725000 0
3 32 13 13.500000 0
4 42 16 11.452381 0
pslist.avg_handlers dlllist.ndlls dlllist.avg_dlls_per_proc
0 202.844444 1694 38.500000
1 242.234043 2074 44.127660
2 288.225000 1932 48.300000
3 264.281250 1445 45.156250
4 281.333333 2067 49.214286
handles.nhandles handles.avg_handles_per_proc handles.nport ...
0 9129 212.302326 0 ...
1 11385 242.234043 0 ...
2 11529 288.225000 0 ...
3 8457 264.281250 0 ...
4 11816 281.333333 0 ...
svcscan.process_services svcscan.shared_process_services
0 24 116
1 24 118
2 27 118
3 27 118
4 24 118
svcscan.interactive_process_services svcscan.nactive
0 0 121
1 0 122
2 0 120
3 0 120
4 0 124
callbacks.ncallbacks callbacks.nanonymous callbacks.ngeneric Class
0 87 0 8 Benign
1 87 0 8 Benign
2 88 0 8 Benign
3 88 0 8 Benign
4 87 0 8 Benign
Category SubCategory
0 Benign Benign
1 Benign Benign
2 Benign Benign
3 Benign Benign
4 Benign Benign
[5 rows x 58 columns]
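The preview shows three label columns (Class, Category, SubCategory) after the features. Before training, it can help to check how the rows are distributed across those labels. A minimal sketch, assuming a small hypothetical frame `toy` in place of the real CSV; the label values and counts below are illustrative and not taken from the dataset:

```python
import pandas as pd

# Hypothetical stand-in for the label columns of the real dataset
toy = pd.DataFrame({
    "Class": ["Benign", "Benign", "Malware", "Malware", "Malware", "Malware"],
    "Category": ["Benign", "Benign", "Spyware", "Spyware", "Ransomware", "Trojan"],
})

# Rows per binary label and per malware category
class_counts = toy["Class"].value_counts()
category_counts = toy["Category"].value_counts()

# Cross-tabulation: which categories fall under which class
label_table = pd.crosstab(toy["Class"], toy["Category"])

print(class_counts)
print(category_counts)
print(label_table)
```

Running the same three calls on the loaded frame would show whether the benign and malware rows are as balanced as the dataset description states.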
# Function to split the dataset
def splitdataset(MalMem2022):
    # Separating the target variable:
    # columns 0-54 hold the 55 features, column 56 is the Category label
    # (column 55 is Class and column 57 is SubCategory)
    X = MalMem2022.values[:, 0:55]
    Y = MalMem2022.values[:, 56]
    # Splitting the dataset into 75% training and 25% testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size = 0.25, random_state = 0)
    return X, Y, X_train, X_test, y_train, y_test
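The positional slices in `splitdataset` depend on the exact column order. As an alternative sketch, the features and target can be selected by column name, and the `stratify` argument of `train_test_split` can keep the class proportions the same in both splits. The frame `toy`, its two feature columns, and its SubCategory values are hypothetical stand-ins; only the label column names come from the preview output.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame with the same layout: features first, labels last
toy = pd.DataFrame({
    "pslist.nproc": [45, 47, 40, 32, 42, 41, 39, 44],
    "pslist.nppid": [17, 19, 14, 13, 16, 15, 14, 18],
    "Class": ["Benign"] * 4 + ["Malware"] * 4,
    "Category": ["Benign"] * 4 + ["Spyware"] * 4,
    "SubCategory": ["Benign"] * 4 + ["Spyware-A"] * 4,  # made-up values
})

label_cols = ["Class", "Category", "SubCategory"]
X = toy.drop(columns=label_cols).to_numpy()  # every non-label column is a feature
y = toy["Category"].to_numpy()               # multi-class target

# 75% / 25% split that preserves the share of each category
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
```

Selecting by name keeps the split correct even if columns are reordered or a label column is added later.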
# Function to perform training with the Gini index (original)
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion = "gini",
                                      random_state = 0)
    # Performing training (X_test is accepted but not used here)
    clf_gini.fit(X_train, y_train)
    return clf_gini
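The "gini" criterion scores a candidate split by Gini impurity, which for a node with class proportions p_k is 1 - sum(p_k^2): zero for a pure node and larger as the classes mix. A minimal illustration of that formula in plain Python; the helper name `gini_impurity` is hypothetical and this is not how scikit-learn implements it internally:

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure_node = ["Benign"] * 4
mixed_node = ["Benign", "Benign", "Spyware", "Spyware"]

print(gini_impurity(pure_node))   # a node with a single class
print(gini_impurity(mixed_node))  # a node split evenly between two classes
```

The tree prefers the split whose child nodes have the lowest weighted impurity.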
# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set with the trained classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred
# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print ("Accuracy : ",
           accuracy_score(y_test,y_pred)*100)
    print("Report : ",
          classification_report(y_test, y_pred))
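The functions above are defined, but the code that calls them in sequence is not shown. A minimal end-to-end sketch of the same steps (split, fit with Gini, predict, evaluate), assuming a synthetic and deliberately separable toy dataset instead of the CSV. The data, the feature layout, and the category names used here are hypothetical, so the resulting score says nothing about performance on CIC-MalMem-2022:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical): one informative feature, one noise feature
rng = np.random.default_rng(0)
labels = np.array(["Benign", "Ransomware", "Spyware", "Trojan"] * 10)
centers = {"Benign": 0.0, "Ransomware": 10.0, "Spyware": 20.0, "Trojan": 30.0}
informative = np.array([centers[c] for c in labels]) + rng.random(labels.size)
noise = rng.random(labels.size)
X = np.column_stack([informative, noise])

# Same steps as the functions above: split, fit with Gini, predict, evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred) * 100
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print("Accuracy :", acc)
print(cm)
```

With the real data, the equivalent driver would call `importdata`, `splitdataset`, `train_using_gini`, `prediction`, and `cal_accuracy` in that order.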