FCParser User Manual

1. OVERVIEW

Raw data collected from a network is often not suitable for statistical analysis. Parsing the data is therefore a necessary step to obtain observations for further analysis. FCParser is a tool that provides convenient, general and highly configurable parsing of network data originating from different sources. It has been designed to transform large amounts of heterogeneous network data into a single matrix of observations suitable for multivariate analysis, without losing relevant information in the process. It also provides the opposite process of parsing: deparsing. Using both detection and diagnosis information, the original raw data records related to an anomaly or trend found in the analysis are identified and presented to the analyst.

2. PARSER

The parser transforms heterogeneous data into arrays of observations. The program takes text files as input, alongside configuration files appropriately customized for each data source. Data sources, such as logs from network and security-related programs, come in uneven formats: they can be structured (e.g. CSV files) or unstructured (e.g. one log entry per paragraph). With the help of regular expressions, both can be handled by this parsing approach.

Feature as a Counter (FaaC) is the algorithm chosen to generate observations. Each feature contains the number of times a given event (e.g. the presence of a word in a log file) takes place. This general definition makes it possible to handle most data sources [1].

To provide flexibility, observations can be grouped according to specific criteria. This grouping is called aggregation (e.g. aggregating observations by source IP) and it is defined in the general configuration file.

Figure 1: Block diagram of the parser

Temporal sampling is a process included in the parsing program. It is performed by splitting input files into smaller files, each of which contains log entries from a specific time window. This way, output files (counters) are grouped according to the temporal sampling configuration. The time window and other splitting parameters are defined in the general configuration file as well.

The program then generates one file for each observation following the naming scheme output-yyyymmddhhmm.dat, according to the specified time window. These files contain the observation array and the aggregation keys, if used. The program also generates a header file with the list of feature names and a stats file. The directory where these files are saved is defined in the general configuration file.

2.1. GENERAL CONFIGURATION FILE

The program is fully configurable using only configuration files. These files are in YAML (YAML Ain't Markup Language) format, a human-friendly data serialization standard available for many programming languages. In Python, YAML files are loaded as nested dictionaries according to their indentation.
In FCParser, there are two types of configuration files: the general configuration file and the data source configuration files.

The general configuration file contains the main information for the parsing process: data sources, aggregation keys, output directories and split configuration, along with some processing parameters. An empty general configuration file looks like this (a sketch of a filled-in file is given at the end of this section):

Figure 2: General configuration file [configuration.yaml]

DataSources: In this field, the different data sources must be specified. For each data source, its name, its specific configuration file and its data path have to be given. Input data can be in CSV format, text-based log files or nfcapd binary data. Take into account that, in offline mode, FCParser stores all the parsed data in RAM during computation, so results are only written to disk at the end. In several tests, attempts to parse groups of files totaling several GB made the parser crash. Therefore, for large data sets it is highly recommended to split the data and apply FCParser independently to each split. Legacy computers may require smaller splits, while multiprocessing may allow handling larger ones.

Keys: In this field, none, one or more aggregation keys are defined. These keys are the variables chosen to aggregate observations: for each unique value of the keys, observations are grouped (e.g. with source IP as key, one observation of features is generated for each unique source IP). Aggregation keys must be variables from the data sources; if the chosen aggregation key is not a variable of a data source, that data source will not be parsed. If the field is empty, no aggregation occurs and the data is analyzed by timestamp by default.
If you define two data sources in the configuration file, note that the key variables must be present in both data source configuration files for aggregation keys to work. If that is not the case, you have to parse each data source independently; this is the recommended way to use the Keys parameter in order to avoid possible errors.
You can see how the data is parsed using keys in the Example chapter.

Online: Boolean variable that selects online or offline mode. Online mode is intended for real-time applications (single process), while offline mode is used for processing already stored data sets (multiprocess).

Processes: Number of processes used by the program. Use a number between 1 and the number of cores of your system. You can obtain this number by running the lscpu command on Linux and checking the CPU(s) field, or by checking the logical processors in Task Manager > Performance on Windows. If this parameter is not specified, 80% of the available cores is used by default.

Split: In this field, the temporal sampling parameters are specified: the time window in minutes, as well as the start and end time of the sampling interval. Time parameters must be in YYYY-MM-DD hh:mm:ss format. If no time window is defined, 5 minutes is used by default.

-Parsing parameters-

Parsing_Output: In this field, the output directory for the parsed data, the stats file (which contains lines, records and matches) and the weights file are defined. If no names are specified, headers.dat (containing the list of feature names) and weights.dat are generated by default.

Incremental_output: Boolean parameter for incremental features. If true and the output files already exist, new counters are added to the old ones. The default value is False if the parameter is not defined.

All: Boolean variable that controls whether all possible matches of a variable are considered or only the first one. It is set to False (only the first match) by default. This parameter matters for certain unstructured sources in which the regular expression of a variable might match more than one entity in the data; in those scenarios, All should be set to True in the configuration file.

Max_chunck: Maximum chunk size in megabytes. Each data file is split into chunks for parallel processing; the size of each chunk is usually the Max_chunck value divided by the number of cores used (e.g. with Max_chunck set to 100 MB and 4 processes, each chunk is roughly 25 MB). If the parameter is not defined, chunks of 100 MB are used by default. Note that smaller chunks can slow down the parsing process, while larger chunks increase the processing speed but might overload your memory. Therefore, if the data size is not too large with respect to your free memory, it is highly recommended to set Max_chunck as high as possible. Figure 3 shows an example of how performance changes with Max_chunck.

Figure 3: Time performance for an example of parsing one 1GB file

-Deparsing parameters-

Deparsing_output: In this field, the output directory for the deparsed raw data and the stats file are defined. This configuration is only used by the deparsing process.

Threshold: The upper limit on the number of log entries per data source that will appear in the output file.

-Learning parameters-

Lperc/Endlperc: Proportion (percentage) of data used for learning.
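
Putting the fields above together, a filled-in general configuration file might look like the following sketch. It is only an illustration built from the descriptions in this section: the exact capitalization and nesting of the keys (in particular inside DataSources, Split and the output sections) should be checked against the configuration.yaml provided in the example directory.

```yaml
# Illustrative sketch only; verify key names and nesting against
# example/config/configuration.yaml before using it.
DataSources:
  netflow:                                # name chosen for this data source
    config: example/config/netflow.yaml   # data source configuration file
    data: example/Examples_data/nf.csv    # path to the input data
  ids:
    config: example/config/ids.yaml
    data: example/Examples_data/ids.log

Keys:                                     # empty: no aggregation, observations grouped by timestamp

Online: False                             # offline mode (multiprocess)
Processes: 4                              # between 1 and the number of cores of the system

Split:
  Time:
    window: 1                             # time window in minutes
    start: 2012-12-12 10:00:00            # YYYY-MM-DD hh:mm:ss
    end: 2012-12-12 11:00:00

Parsing_Output:
  dir: example/output/
  stats: stats.log

Incremental_output: False
All: False                                # consider only the first match per variable
Max_chunck: 100                           # maximum chunk size in MB

Deparsing_output:
  dir: example/deparsing_output/
  stats: deparsing_stats.log
Threshold: 100                            # maximum number of log entries per data source when deparsing
```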

2.2. DATA SOURCES CONFIGURATION FILES

Data source configuration files are designed to easily adapt different data sources to the program without knowledge of its internal programming. They are located in the same directory as the general configuration file.

The structure of these configuration files is shown in Figure 4:

Figure 4: Data Source configuration file

Attributes:

  • Tag: Key that identifies the data source; simple tags are recommended (e.g. fw for firewall).
  • Structured: Boolean variable that indicates whether the source is structured or unstructured.
  • Timestamp_format: Timestamp format of the logs in the data source files, in Python datetime format [2].
  • Separator: Character(s) that delimit the log entries of the source, i.e. the separator between records. It is mandatory for unstructured sources, while for structured sources \n (newline) is assumed if none is specified.
  • Timearg: Name of the timestamp variable. If the timearg attribute is not defined, the timestamp is assumed to be defined in a variable named 'timestamp'.

Variables: Variables are fields that are extracted from a log entry (e.g. source IP, destination port, etc.). Variable attributes differ depending on whether the data source is structured or not, but the skeleton remains the same. For structured sources, raw data is extracted from a given position (e.g. the fourth field of a CSV entry); for unstructured sources, fields are extracted with regular expressions. A configuration sketch is given after the list below.

  • Name: Variable name chosen by the user.
  • Matchtype: Type of the extracted field (string, number, ip, time or duration). For structured sources, it is the type of the raw data extracted; for unstructured sources, it is the type of the match of the regular expression.
  • Where: Key to find the field. For structured sources it is the position of the field in the line (the first position corresponds to zero), while for unstructured sources it is the regular expression that matches the wanted field.

A clear explanation of these variable fields can be found in the Example section.

Note: There must be a variable (whose name is defined in the timearg attribute) that collects the timestamps, and it must be defined with the time matchtype.
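
As an illustration of the attributes and variables described above, the top of a data source configuration file for a structured (CSV) source could look like the sketch below. The key names follow this section, but their exact capitalization (e.g. whether the variable list is called VARIABLES) and the field positions used here are assumptions; the configuration files in the example config directory are the reference.

```yaml
# Sketch of the attributes and variables of a structured source;
# field positions (where) are hypothetical.
tag: fw                                   # short identifier for the data source
structured: True
timestamp_format: '%Y-%m-%d %H:%M:%S'     # Python datetime format [2]
separator: "\n"                           # one record per line (default for structured sources)
timearg: timestamp                        # name of the variable that holds the timestamp

VARIABLES:
- name: timestamp
  matchtype: time
  where: 0                                # first field of each CSV line
- name: src_port
  matchtype: number
  where: 5                                # hypothetical position of the source port field
```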

Features: Based on the defined variables, different features are specified to be considered later during multivariate analysis. For instance, if a variable src_port is extracted from the logs, multiple features can be created depending on the value of that variable. Thus, we might have features such as port_http, port_ssh or port_smtp if the value of src_port matches 80, 22 or 25, respectively. The following fields must be specified (a sketch of the corresponding feature definitions follows the list):

  • Name: Feature name chosen by the user.
  • Variable: Name of the variable from which the feature is calculated.
  • Matchtype: Kind of association between the variable and the feature:
    • Single: The feature is defined for a specific value of the variable.
    • Multiple: Multiple values of the variable match the same feature.
    • Range: The feature is represented by all the values within a specific range.
    • Default: Useful for counting values other than those previously defined. The value field is left empty in this case.
    • Regexp: Used when dealing with unstructured sources; the value field matches a certain string.
  • Value: Depending on the matchtype field, one or more values are defined. The value is empty if the matchtype is default.
  • Weight: Optional parameter to assign different weights to the features. It can be useful for the deparsing process. If no weight is specified for a feature, it is automatically set to one.
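
Continuing the sketch above, the features derived from the src_port variable (the port_http / port_ssh / port_smtp example at the beginning of this subsection) could be declared as follows. Again, the exact key names, the capitalization of the matchtype values and the notation used for ranges are assumptions to be checked against the example configuration files.

```yaml
FEATURES:
- name: port_http
  variable: src_port
  matchtype: single
  value: 80
  weight: 2                  # optional; set to one automatically if omitted
- name: port_ssh
  variable: src_port
  matchtype: single
  value: 22
- name: port_smtp
  variable: src_port
  matchtype: single
  value: 25
- name: port_private
  variable: src_port
  matchtype: range
  value: [49152, 65535]      # range notation is an assumption; check the example configs
- name: port_other
  variable: src_port
  matchtype: default         # counts any value not covered by the features above
  value:
```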

Note: A script creatconf.py is also provided in the config directory to build an FCParser configuration file for a data source out of a JSON file with counts. This is useful to automatically decide the features of the system. See also the fclearning script in the bin directory.

3. DEPARSER

Using both detection and diagnosis information, the original raw data records related to an anomaly are identified and presented to the analyst. This process, called deparsing, is a straightforward procedure that reduces the challenge of searching through logs by surgically extracting the data related to the anomalies.

The program reverses the parsing criteria. It takes as input the same configuration files used by the parser, along with an input file where a list of timestamps and a list of features are specified. For each data source, it outputs a file with the log entries that contain those features and occurred at those timestamps.

To limit the total number of log entries extracted for each data source, the threshold parameter defined in the general configuration file is used. Log entries that contain more of the selected features are prioritized. However, the threshold is not absolute: log entries with the same number of feature appearances should not be dismissed, so the threshold is only checked after processing an entire block of log entries with the same number of feature appearances.

The input file format is adapted to the output of the MEDA-Toolbox [3], a toolbox that can be used to analyze the parsed data. The format of the deparsing input file looks like this:

Figure 5: Format of deparsing input file

4. DEBUGGER

The debugger execution mode is invoked by running fcparser with the -d (--debug) option. This mode processes the data source line by line, showing how each log is parsed into records (according to the defined variables) and how the observation vector is generated according to the defined features. This way, if there is a mismatch between a log and the configuration file (e.g. if the regular expression for a variable is not correct), the mistake can be detected and corrected. Incorrectly parsed variables are set to None and highlighted by the debugger.
It is therefore recommended to run fcparser in debugging mode to check that the configuration files are correct before running the parsing process for the first time.

When running the debugger, the data source is loaded and the number of logs it contains is displayed.

$ python bin/fcparser.py -d example/config/configuration.yaml

Figure 6: Example - Debugging mode. Initialization message

Then, a simple terminal is given to the user with two possibilities:

  • To process the next log entry by pressing ENTER. The log entry, the parsed variables (records) and the observation vector are then printed, together with the names of the features with non-zero counters. The next picture shows a parsing example of an unstructured log entry.

Figure 7: Example - Debugger output

At the top of the picture we can see the raw log entry and how it is parsed into variables. In the lower part of the figure, the observation counters for that log entry are shown, ordered according to the defined features, along with the names of the non-zero counters. In this example, all non-zero counters are set to 1, but counters can take higher values if the event they represent occurs more than once in a log entry.
  • To process a specific log entry. In this case, we can "jump" to a certain log by specifying the log entry number with the go command, e.g. "go 34" shows log entry number 34 (which corresponds to line 34 for structured sources).

Figure 8: Example - Debugger output. Unstructured source

In this example, all the data is parsed correctly except for the dst_port variable, which is set to None. This happened because the regular expression used to define this variable was not properly written. Online tools such as [7] can help you understand why a regular expression is not correct.

We can also type a string with the search command, and the debugger will return the next log entry containing that string, e.g. "search 10:15:45".

Figure 9: Example - Debugger output. Structured source

This picture shows an example of a structured source (comma-separated values), in which the search command is used to load directly the log entry with a certain timestamp. For structured sources, the variables are defined in order according to the different data fields, e.g. the timestamp is the first field, then the duration, the source IP, etc.
Sometimes we can find malformed logs in which some of the fields are empty, as happens in this example for the src_ip variable. That variable is then set to None.

This command is really useful when errors are detected during the parsing process and we want to know why they happened.

The program allows us to navigate through all the log entries by combining the presented functionalities. When the end of the file is reached, it is loaded again. If you define two data sources in the configuration file, note that only one data source can be processed in debug mode. The execution can be ended at any time by pressing the letter 'q'.

5. EXAMPLE

5.1. PARSING

An example with structured and unstructured sources can be found in the FCParser example directory. In this example, the data sources are netflow (structured data, in a CSV file) and IDS logs (unstructured data, where each log entry consists of a paragraph).

Figure 10 shows the general configuration file for this example.

Figure 10: Example - General configuration file

In the upper part, the different data sources are specified (netflow and IDS), including the data directory and the path to each data source configuration file. The Keys field is unused, so aggregation is not performed and observations are grouped by timestamp. At the end of the file, we can find the configuration for temporal sampling; in this case, the sampling rate is one minute.

Let's now see what the data looks like so that we can write the configuration files for these data sources. For the netflow data, the data file (nf.csv) is located in the Examples_data directory and the netflow configuration file (netflow.yaml) in the config directory. If we take a look at the data, it consists of comma-separated values where every line is an entry.

An example of the structure of some entries (lines) is depicted in Figure 11:

Figure 11: Example - Netflow datasource

Based on this structure, we can create our data source configuration file, where the attributes, variables and features will be defined according to our goals:

Figure 12: Example - Netflow configuration file

In Figure 12, we can see a fragment of the configuration file for the _netflow_ data source. It shows the mandatory attributes for structured sources. As mentioned before, the timearg attribute indicates which variable holds the instant at which an event occurred; here it is the timestamp variable. The _where_ value of the timestamp variable is its position in the entry line (zero, because the first field of every log entry is the time instant). The timestamp format is also specified according to the data file.
For more information about the timestamp format, check: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior


In the case of the IDS data source, the data file in the Examples_data directory is named ids.log and the IDS configuration file in the config directory is named ids.yaml.

In Figure 13, we can see an example of unstructured data, where each log entry consists of a paragraph.

Figure 13: Example - IDS log entries

In Figure 14, an unstructured source configuration file is depicted, where variables and features are defined using regular expressions. When building the IDS configuration file, notice that it differs from the netflow one in that regular expressions are now needed to extract the data. These regular expressions are used to search for the variables in the log entries, including the timestamp variable. You can check whether the regular expressions are properly defined by running fcparser in debugging mode and verifying that the data is parsed correctly into the defined variables (None values appear when a regular expression does not match the data, as shown in chapter 4).

Figure 14: Example - IDS configuration file

For this case, it is necessary to define the separator between entries (two blank lines). We can also see the definition of the variables and of some features in the figure, to which different weights are assigned. These weight values will be useful for the deparsing process.
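
As a rough, simplified sketch of what such an unstructured configuration looks like (the regular expressions, separator, field names and weights below are illustrative assumptions; the actual ids.yaml in the config directory is the reference):

```yaml
# Simplified sketch of an unstructured source; regexes and field names are hypothetical.
tag: ids
structured: False
timestamp_format: '%m/%d/%y-%H:%M:%S'
separator: "\n\n"                                  # blank line(s) between entries; check ids.yaml for the exact value
timearg: timestamp

VARIABLES:
- name: timestamp
  matchtype: time
  where: '\d{2}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}'     # regex capturing the timestamp
- name: src_ip
  matchtype: ip
  where: '(\d{1,3}\.){3}\d{1,3}'                   # first IP address found in the entry
- name: priority
  matchtype: number
  where: 'Priority: (\d+)'                         # hypothetical field in the log text

FEATURES:
- name: ids_prio_high
  variable: priority
  matchtype: single
  value: 1
  weight: 3                                        # weighted higher, e.g. to prioritize it when deparsing
- name: ids_prio_other
  variable: priority
  matchtype: default
  value:
```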

In order to run the example and parse the data, just run the following command in the FCParser directory:

$ python bin/fcparser.py example/config/configuration.yaml

Figure 15: Example output - Parsing data

The parser output consists of:

  • The stats file, which includes the number of variables, features, the data size, and the number of logs and processed logs:

Figure 16: Example output - Stats file

In this case all the lines have been processed, but sometimes the number of logs and processed logs will differ (e.g. if there are logs with an invalid timestamp or empty lines).
  • The header file, with a list of the features from all data sources:

Figure 17: Example output - Headers file

In it, the 143 IDS feature names are concatenated after the 142 netflow feature names.

  • The weights file, which includes the weight associated with each feature. If a feature has no weight defined in the configuration file, its weight is automatically set to one, as depicted in the next figure:

Figure 18: Example output - Weights file

  • The parsed data files, with the timestamp in the file name (output-yyyymmddhhmm.dat). As the time window was set to one minute, the number of files generated matches the number of minutes for which there are records.

These files are structured as follows:

Figure 19: Example output - Parsed data (feature counters)

Here, the number of times each event (feature) occurs is recorded as comma-separated values, one value for each of the 285 features.

If we define some keys in the general configuration file, e.g. 'Keys: src_ip', the counters are broken down according to the src_ip variable for each defined time interval. The next picture shows an example fragment for timestamp 201212121030:

Figure 20: Example output - Parsed data using keys

5.2. DEPARSING

The deparsing process uses the same configuration files as before. Using both detection and diagnosis information, the original raw data records related to an anomaly are identified and presented to the analyst. The detection and diagnosis information is specified in the deparsing input file.

An example of a deparsing input file is depicted in Figure 21:

Figure 21: Example - Deparsing input file

To run the program, use the following command:

$ python bin/fcdeparser.py example/config/configuration.yaml example/deparsing_input

The deparsing program generates one file per data source with the extracted logs related to the detected anomalies, as well as a stats file with the number of structured and unstructured logs deparsed.

Figure 22: Example - Deparsing output

6. INSTALLATION REQUIREMENTS

FCParser is designed to work on UNIX systems with Python installed. Almost every Linux distribution comes with Python pre-installed. For library compatibility, Python 3 is required. If you are using an older version of Python, you may update your Python libraries or create a Python 3 virtual environment using virtualenv, for example:

$ virtualenv --python=python3 FCParser/env
$ source FCParser/env/bin/activate

The program requires some python modules to work properly. Before using this tool, install the following packages:

  • IPy – Python module for handling IPv4 and IPv6 addresses and networks [4]
    $ pip install IPy
    
  • PyYAML – YAML parser and emitter for Python [5].
    $ pip install PyYAML
    

Nfdump [6]: In order to work with netflow data in nfcapd format, the netflow processing tool nfdump is required.

REFERENCES

[1] Camacho, José & Pérez-Villegas, Alejandro & García-Teodoro, Pedro & Maciá-Fernández, Gabriel (2016). PCA-based multivariate statistical network monitoring for anomaly detection. Computers & Security, 59. doi:10.1016/j.cose.2016.02.008.

[2] Python datetime module for manipulating dates and times. Available at: https://docs.python.org/2/library/datetime.html

[3] Camacho, J., Pérez, A., Rodríguez, R., Jiménez-Mañas, E. Multivariate Exploratory Data Analysis (MEDA) Toolbox. Chemometrics and Intelligent Laboratory Systems, 2015, 143: 49-57, available at: https://github.com/josecamachop/MEDA-Toolbox

[4] IPy, Python module for handling of IPv4 and IPv6 addresses and networks. Available at: https://pypi.python.org/pypi/IPy/

[5] PyYAML, the next generation YAML parser and emitter for Python. Available at: https://pypi.python.org/pypi/PyYAML

[6] NFDUMP, tools to collect and process netflow data on the command line. Available at: http://nfdump.sourceforge.net/

[7] regex101: build, test, and debug regex. Available at: https://regex101.com/