CLX has been deprecated in favor of Morpheus, a new highly optimized AI framework that includes pre-trained AI capabilities for cybersecurity. This repo will therefore no longer be updated. Many of the CLX use cases and examples have already been migrated to Morpheus. The Morpheus framework also allows you to build your own pipelines for cybersecurity and information security use cases.
- Morpheus GitHub repo: https://github.com/nv-morpheus/Morpheus
- Full documentation for the latest official Morpheus release: https://docs.nvidia.com/morpheus
NOTE: For the latest stable README.md, ensure you are on the main branch.
CLX ("clicks") provides a collection of RAPIDS examples for security analysts, data scientists, and engineers to quickly get started applying RAPIDS and GPU acceleration to real-world cybersecurity use cases.
The goal of CLX is to:
- Allow cyber data scientists and SecOps teams to generate workflows, using cyber-specific GPU-accelerated primitives and methods, that let them interact with code using security language,
- Make available pre-built use cases that demonstrate CLX and RAPIDS functionality that are ready to use in a Security Operations Center (SOC),
- Accelerate log parsing with a flexible, non-regex method, and
- Provide SIEM integration with GPU compute environments via RAPIDS and effectively extend the SIEM environment.
Python API documentation can be found here or generated from the docs directory.
There are 4 ways to get started with CLX:
Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready-to-run Docker container with CLX and its dependencies already installed.
Pull image:
docker pull rapidsai/rapidsai-clx:22.04-cuda11.5-runtime-ubuntu18.04-py3.8
Nightly images for the current development version can be pulled from https://hub.docker.com/r/rapidsai/rapidsai-clx-nightly.
# Docker 19.03+ with the NVIDIA Container Toolkit (--gpus flag)
docker run -it --gpus '"device=0"' \
  --rm -d \
  -p 8888:8888 \
  -p 8787:8787 \
  -p 8786:8786 \
  rapidsai/rapidsai-clx:22.04-cuda11.5-runtime-ubuntu18.04-py3.8
# Older Docker versions with nvidia-docker2 (--runtime flag)
docker run -it --runtime=nvidia \
  --rm -d \
  -p 8888:8888 \
  -p 8787:8787 \
  -p 8786:8786 \
  rapidsai/rapidsai-clx:22.04-cuda11.5-runtime-ubuntu18.04-py3.8
The following ports are used by the runtime containers only (not base containers):
- 8888 - exposes a JupyterLab notebook server
- 8786 - exposes a Dask scheduler
- 8787 - exposes a Dask diagnostic web server
Prerequisites
- NVIDIA Pascal™ GPU architecture or better
- CUDA 11.5+ compatible NVIDIA driver
- Ubuntu 18.04/20.04 or CentOS 7
- Docker CE v18+
- nvidia-docker v2+
Pull the RAPIDS image that matches your environment and build the CLX image. Please see the rapidsai-dev or rapidsai-dev-nightly Docker repositories, choosing a tag based on the NVIDIA CUDA version you’re running. More information on getting started with RAPIDS can be found here.
docker pull rapidsai/rapidsai-dev:22.04-cuda11.5-devel-ubuntu18.04-py3.8
docker build -t clx:latest .
Start the container and the notebook server. There are multiple ways to do this, depending on what version of Docker you have.
# Docker 19.03+ with the NVIDIA Container Toolkit (--gpus flag)
docker run -it --gpus '"device=0"' \
  --rm -d \
  -p 8888:8888 \
  -p 8787:8787 \
  -p 8786:8786 \
  clx:latest
# Older Docker versions with nvidia-docker2 (--runtime flag)
docker run -it --runtime=nvidia \
  --rm -d \
  -p 8888:8888 \
  -p 8787:8787 \
  -p 8786:8786 \
  clx:latest
The container will include scripts for your convenience to start and stop JupyterLab.
# Start JupyterLab
/rapids/utils/start_jupyter.sh
# Stop JupyterLab
/rapids/utils/stop_jupyter.sh
The following steps show how to use docker-compose to create a CLX environment ready for SIEM integration. We will be using docker-compose to start multiple containers running CLX, Kafka and Zookeeper.
First, make sure to have the following installed:
Add the following to /etc/docker/daemon.json if not already there:
"runtimes": {
    "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
    }
}
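The fragment above has to sit inside the top-level JSON object of daemon.json, next to any settings already there. A minimal sketch of that merge, using only the Python standard library, might look like the following. The function name `ensure_nvidia_runtime` and the sample `log-driver` setting are hypothetical and used only for illustration; this is not a script shipped with CLX.

```python
import json


def ensure_nvidia_runtime(config_text):
    """Return daemon.json text that contains the nvidia runtime entry.

    Existing keys, and any existing "nvidia" entry, are left untouched.
    An empty input is treated as an empty configuration.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    runtimes = config.setdefault("runtimes", {})
    # Values copied from the fragment shown above.
    runtimes.setdefault(
        "nvidia",
        {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []},
    )
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    # Hypothetical existing settings, used only to show that they are preserved.
    existing = '{"log-driver": "json-file"}'
    print(ensure_nvidia_runtime(existing))
```

The sketch works on text rather than on the file itself, so the result can be reviewed before it is written to /etc/docker/daemon.json. The Docker daemon needs a restart before it picks up a changed daemon.json.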
Run the following to start your containers. Modify port mappings in docker-compose.yml if there are port conflicts.
docker-compose up
By default, all GPUs in your system will be visible to your CLX container. To choose which GPUs you want visible, you can add the following to the clx section of your docker-compose.yml:
environment:
  - NVIDIA_VISIBLE_DEVICES=0,1
It is easy to install CLX using conda. You can get a minimal conda installation with Miniconda or get the full installation with Anaconda.
Install and update CLX using the conda command:
# Stable
conda install -c rapidsai -c nvidia -c pytorch -c conda-forge clx
# Nightly
conda install -c rapidsai-nightly -c nvidia -c pytorch -c conda-forge clx
See build instructions.
CLX is targeted towards cybersecurity data scientists, senior security analysts, threat hunters, and forensic investigators. Data scientists can use CLX in traditional Python files and Jupyter notebooks. The notebooks folder contains example use cases and workflow instantiations.
It is also easy to get started using CLX and RAPIDS from Python. The code below reads cyber alerts, aggregates them by day, and calculates a rolling z-score across multiple days to look for outliers in alert volume. Expanded code is available in the alert analysis notebook.
import cudf
import requests
from os import path
# download data
if not path.exists('./splunk_faker_raw4'):
    url = 'https://data.rapids.ai/cyber/clx/splunk_faker_raw4'
    r = requests.get(url)
    r.raise_for_status()  # stop here rather than saving an error page
    with open('./splunk_faker_raw4', 'wb') as f:
        f.write(r.content)
# read in alert data
gdf = cudf.read_csv('./splunk_faker_raw4')
gdf.columns = ['raw']
# parse the alert data using CLX built-in parsers
from clx.parsers.splunk_notable_parser import SplunkNotableParser
snp = SplunkNotableParser()
parsed_gdf = cudf.DataFrame()
parsed_gdf = snp.parse(gdf, 'raw')
# define function to round time to the day
def round2day(epoch_time):
    return int(epoch_time/86400)*86400
# aggregate alerts by day
parsed_gdf['time'] = parsed_gdf['time'].astype(int)
parsed_gdf['day'] = parsed_gdf.time.applymap(round2day)
day_rule_gdf = parsed_gdf[['search_name', 'day', 'time']].groupby(['search_name', 'day']).count().reset_index()
day_rule_gdf.columns = ['rule', 'day', 'count']
# import the rolling z-score function from CLX statistics
from clx.analytics.stats import rzscore
# pivot the alert data so each rule is a column
def pivot_table(gdf, index_col, piv_col, v_col):
    index_list = gdf[index_col].unique()
    piv_gdf = cudf.DataFrame()
    piv_gdf[index_col] = index_list
    piv_groups = gdf[piv_col].unique().to_pandas()
    for group in piv_groups:
        temp_df = gdf[gdf[piv_col] == group]
        temp_df = temp_df[[index_col, v_col]]
        temp_df.columns = [index_col, group]
        piv_gdf = piv_gdf.merge(temp_df, on=[index_col], how='left')
    piv_gdf = piv_gdf.set_index(index_col)
    return piv_gdf.sort_index()
alerts_per_day_piv = pivot_table(day_rule_gdf, 'day', 'rule', 'count').fillna(0)
# create a new cuDF with the rolling z-score values calculated
r_zscores = cudf.DataFrame()
for rule in alerts_per_day_piv.columns:
    x = alerts_per_day_piv[rule]
    r_zscores[rule] = rzscore(x, 7)  # 7 day window
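To make the rolling z-score idea concrete without a GPU, here is a small CPU-only sketch using the Python standard library. It assumes the score for each day is the day's count minus the mean of the trailing window, divided by that window's standard deviation. The function name `rolling_zscore`, the choice of population standard deviation, and the sample counts are assumptions for illustration; CLX's `rzscore` may differ in those details.

```python
from statistics import mean, pstdev


def rolling_zscore(values, window):
    """Return one z-score per position, computed over the trailing `window` values.

    Positions without a full window, or whose window has zero spread,
    yield None instead of a score.
    """
    scores = []
    for i, x in enumerate(values):
        if i + 1 < window:
            scores.append(None)  # not enough history yet
            continue
        recent = values[i + 1 - window : i + 1]  # window ends at, and includes, x
        mu = mean(recent)
        sigma = pstdev(recent)  # population std dev (an assumption, see above)
        scores.append((x - mu) / sigma if sigma > 0 else None)
    return scores


if __name__ == "__main__":
    # Made-up daily alert counts for one rule; the last day is a spike.
    alerts_per_day = [10, 12, 11, 9, 10, 11, 40]
    for day, score in enumerate(rolling_zscore(alerts_per_day, window=7)):
        print(day, score)
```

A day whose score is far from zero had an alert volume that is unusual relative to the preceding days in its window, which is the outlier signal the example above looks for.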
In addition to traditional Python files and Jupyter notebooks, CLX also includes structure in the form of a workflow. A workflow is a series of data transformations performed on a GPU dataframe that contains raw cyber data, with the goal of surfacing meaningful cyber analytical output. Multiple I/O methods are available, including Kafka and on-disk file stores.
Example netflow workflow reading from and writing to a file:
from clx.workflow import netflow_workflow
source = {
    "type": "fs",
    "input_format": "csv",
    "input_path": "/path/to/input",
    "schema": ["firstname","lastname","gender"],
    "delimiter": ",",
    "required_cols": ["firstname","lastname","gender"],
    "dtype": ["str","str","str"],
    "header": "0"
}
dest = {
    "type": "fs",
    "output_format": "csv",
    "output_path": "/path/to/output"
}
wf = netflow_workflow.NetflowWorkflow(source=source, destination=dest, name="my-netflow-workflow")
wf.run_workflow()
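The workflow pattern itself (read from a source, apply a transformation, write to a destination) can be sketched on the CPU with only the standard library. The code below is an illustration of the idea, not CLX's implementation: `run_file_workflow` and `add_full_name` are hypothetical names, the config keys are borrowed from the example above, and the input file is assumed to have no header row.

```python
import csv
import os
import tempfile


def run_file_workflow(source, dest, transform):
    """Read CSV rows, apply `transform` to each row, and write the results.

    Returns the number of rows written.
    """
    with open(source["input_path"], newline="") as f:
        reader = csv.DictReader(
            f, fieldnames=source["schema"], delimiter=source["delimiter"]
        )
        # Keep only the columns declared as required.
        rows = [{c: row[c] for c in source["required_cols"]} for row in reader]

    results = [transform(row) for row in rows]

    fieldnames = list(results[0].keys()) if results else source["required_cols"]
    with open(dest["output_path"], "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    return len(results)


def add_full_name(row):
    """Hypothetical transformation: derive one new column per row."""
    return {**row, "fullname": f'{row["firstname"]} {row["lastname"]}'}


def demo():
    """Run the sketch on a two-row temporary file and return (count, output rows)."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "input.csv")
        out_path = os.path.join(tmp, "output.csv")
        with open(in_path, "w", newline="") as f:
            f.write("Ada,Lovelace,F\nAlan,Turing,M\n")
        source = {
            "input_path": in_path,
            "schema": ["firstname", "lastname", "gender"],
            "delimiter": ",",
            "required_cols": ["firstname", "lastname", "gender"],
        }
        dest = {"output_path": out_path}
        count = run_file_workflow(source, dest, add_full_name)
        with open(out_path, newline="") as f:
            return count, list(csv.DictReader(f))


if __name__ == "__main__":
    print(demo())
```

Keeping the transformation as a separate function mirrors the description above: the I/O configuration stays declarative, and only the transformation step changes from one workflow to the next.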
For additional examples, browse our complete API documentation, or check out our more detailed notebooks.
Please see our guide for contributing to CLX.
Find out more details on the RAPIDS site.
The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.