Datasets

This research uses two datasets for its evaluation:

  1. Lastline dataset.
  2. HDFS dataset.

Lastline dataset

The real-world Lastline dataset consists of 20 international organizations that use 395 detectors to monitor 388K devices*. This resulted in 10.5M security events of 291 unique types, collected over a 5-month period. Events include policy violations (e.g., use of deprecated Samba versions, remote desktop protocols, and the Tor browser), signature hits (e.g., Mirai, Ursnif, and Zeus), as well as heuristics on suspicious and malicious activity (e.g., beaconing activity, SQL injection, Shellshock exploit attempts, and various CVEs).

Of the 10.5M security events, a triaging system selected 2.7M events that were likely to be part of an attack. Of these 2.7M likely malicious events, 45.1K were confirmed by security operators to be part of an attack and labeled as ATTACKS. These attacks include known malware, such as the XMRig crypto miner, and remote access Trojans, such as NanoCore. Another 46.4K events were classified as a HIGH security risk (e.g., successful web attacks and exploitation of known vulnerabilities such as CVE-2019-19781), 184.9K events as a MEDIUM risk (e.g., attempted binary downloads or less exploited vulnerabilities such as CVE-2020-0601), and 2.4M events as a LOW risk (e.g., the use of BitTorrent or gaming clients). The remaining 7.8M events were not related to security risks, but were used to give security operators additional information about device activity, and are therefore labeled as INFO.

*These include devices in a bring-your-own-device setting, which were monitored for only a small part of the 5 months. Therefore, the average of 10.5M/388K ≈ 27.06 events per device over the entire collection period is significantly lower than the previously reported 170 events per device per day.
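As a quick sanity check on the figures above, the following sketch sums the reported per-label counts and recomputes the per-device average. It uses only the rounded numbers stated in the text, so the totals match only after rounding; the variable and function names are illustrative and not part of DeepCASE.

```python
# Rounded event counts per label, exactly as reported in the text above.
breakdown = {
    "ATTACKS": 45.1e3,
    "HIGH": 46.4e3,
    "MEDIUM": 184.9e3,
    "LOW": 2.4e6,
    "INFO": 7.8e6,
}

DEVICES = 388e3          # monitored devices, as reported
REPORTED_TOTAL = 10.5e6  # total security events, as reported


def likely_malicious(counts):
    """Sum every label except INFO, i.e. the events selected by triaging."""
    return sum(n for label, n in counts.items() if label != "INFO")


def events_per_device(total_events, devices):
    """Average number of events per device over the whole collection period."""
    return total_events / devices


selected = likely_malicious(breakdown)          # about 2.7M
total = sum(breakdown.values())                 # about 10.5M
per_device = events_per_device(REPORTED_TOTAL, DEVICES)

print(f"likely malicious: {selected / 1e6:.1f}M")
print(f"all events:       {total / 1e6:.1f}M")
print(f"events/device:    {per_device:.2f}")
```

The small gaps between the summed and reported totals come from rounding in the published figures.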

Download


NOTE

The Lastline dataset was obtained under an NDA, so unfortunately we cannot share it.


HDFS dataset

We also evaluate DeepCASE on the HDFS dataset [1] used in the evaluation of the related security log analysis tool DeepLog [2]. This dataset consists of 11.2M system log entries generated by over 200 Amazon EC2 nodes. The dataset was labeled by experts into normal and anomalous events, where 2.9% of events were labeled as anomalous. Unfortunately, this dataset lacks metadata about the risk level of security events, so we evaluate DeepCASE on it in terms of workload reduction but not in terms of accuracy. Although it contains less information, we use the HDFS dataset to provide a reproducible comparison with state-of-the-art systems.

Download

The HDFS dataset as we used it in our research can be downloaded from https://github.com/wuyifan18/DeepLog/tree/master/data.
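Once downloaded, the files can be read with a few lines of Python. The sketch below is a minimal example under one assumption that is not stated in this README: each line of a data file holds one sequence of whitespace-separated integer event keys. Please verify this against the files you download. The helper names `parse_sequences` and `load_sequences` are hypothetical and not part of DeepCASE or DeepLog.

```python
from typing import Iterable, List


def parse_sequences(lines: Iterable[str]) -> List[List[int]]:
    """Parse lines of whitespace-separated integers into event-key sequences.

    Assumed format (not confirmed by this README): one sequence per line.
    Blank lines are skipped.
    """
    sequences = []
    for line in lines:
        tokens = line.split()
        if tokens:
            sequences.append([int(token) for token in tokens])
    return sequences


def load_sequences(path: str) -> List[List[int]]:
    """Read a downloaded data file from `path` and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_sequences(handle)


# Demonstration on inline sample lines instead of a real file.
sample = ["5 5 22 11 9", "", "5 22 5 11"]
for sequence in parse_sequences(sample):
    print(len(sequence), sequence)
```

To use it on real data, call `load_sequences` with the local path of one of the downloaded files.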

References

[1] Xu, W., Huang, L., Fox, A., Patterson, D., & Jordan, M. I. (2009). Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP) (pp. 117–132).

[2] Du, M., Li, F., Zheng, G., & Srikumar, V. (2017). DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS) (pp. 1285–1298).