Added a feature extraction analyzer #785

kiddinn · 2019-01-14T14:32:44Z

This is the first version of a feature extraction analyzer.

The PR does the following things:

Adds the /etc/timesketch folder to the deployment and stores the feature extraction configuration file there.
Adds a features.conf YAML file where all feature extractions can be defined.
Adds an analyzer that reads the config file and extracts the features defined there.
Adds a single feature extraction that extracts email addresses from browser history.
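
To make the last two items concrete, here is a minimal Python 3 sketch of applying one feature definition to an event. It is not the code from this PR: the `extract_feature` helper, the `re` key, the regular expression, and the `email_address` value are all assumptions made for illustration. Only the `attribute` and `store_as` keys mirror the config under review.

```python
import re

# Hypothetical feature definition shaped like one config entry.
# The 're' key and its pattern are assumptions made for this sketch.
FEATURE = {
    'attribute': 'message',
    'store_as': 'email_address',
    're': r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+',
}


def extract_feature(event, feature_config):
    """Return a copy of the event with extracted values added, if any."""
    text = event.get(feature_config['attribute'], '')
    matches = re.findall(feature_config['re'], text)
    if not matches:
        # Nothing to extract, hand back the event unchanged.
        return event
    enriched = dict(event)
    # De-duplicate while keeping first-seen order.
    enriched[feature_config['store_as']] = list(dict.fromkeys(matches))
    return enriched
```

The helper returns a new dictionary rather than mutating its input, which keeps the sketch easy to test. How the real analyzer writes attributes back to the event store is not shown here.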

berggren

Looks really good. One comment about returning the resulting object instead of just the config path.

berggren · 2019-01-14T15:13:36Z

timesketch/lib/analyzers/feature_extraction.py

+        if not os.path.isfile(self.CONFIG_FILE):
+            return 'Unable to read config file, no features extracted.'
+
+        with open(self.CONFIG_FILE, 'r') as fh:


Can't we do this in the interface function instead, having it return the config dictionary rather than a path? Once we move other analyzers' config over, this will get repetitive in every analyzer that needs to load config. WDYT?

berggren · 2019-01-14T15:20:42Z

timesketch/lib/analyzers/interface.py

@@ -33,6 +35,27 @@ def wrapper(self, *args, **kwargs):
    return wrapper


+def get_config(file_name):


See earlier comment, but consider returning the resulting object after yaml.safe_load here instead?

Agree, done.
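
For readers following along, the suggestion amounts to something like the sketch below: the helper does the existence check and the parsing once, and callers receive the parsed object. This is an illustration under assumptions, not the merged code. The `parser` argument is a hypothetical addition that lets the sketch be exercised without PyYAML installed; by default it falls back to `yaml.safe_load`.

```python
import os


def get_config(file_name, parser=None):
    """Load a config file and return the parsed object.

    Args:
        file_name: Path to the config file.
        parser: Callable that takes an open file handle. Defaults to
            yaml.safe_load (PyYAML). This argument is an assumption of
            the sketch, added so tests can avoid the dependency.

    Returns:
        The parsed config, or None if the file does not exist.
        Parse errors are left to propagate to the caller.
    """
    if not os.path.isfile(file_name):
        return None
    if parser is None:
        import yaml  # PyYAML, imported lazily so the sketch loads without it.
        parser = yaml.safe_load
    with open(file_name, 'r') as fh:
        return parser(fh)
```

An analyzer would then call `get_config(path)` and bail out early when it gets `None`, instead of repeating the `os.path.isfile` and `open` logic itself.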

berggren · 2019-01-14T15:26:51Z

timesketch/lib/analyzers/feature_extraction.py

+                return 'No results, unable to parse config file.'
+
+        return_strings = []
+        for name, feature_config in config.iteritems():


Note: You can override the get_kwargs() method from the interface to have the system create individual tasks for you here. Example from the similarity scorer:

    def get_kwargs(cls):
        """Keyword arguments needed to instantiate the class.

        In addition to the index_name passed to the constructor by default we
        need the data_type name as well. Furthermore we want to instantiate one
        task per data_type in order to run the analyzer in parallel. To achieve
        this we override this method and return a list of keyword argument
        dictionaries.

        Returns:
            List of keyword arguments (dict), one per data_type.
        """
        kwargs_list = []
        try:
            data_types = current_app.config['SIMILARITY_DATA_TYPES']
            if data_types:
                for data_type in data_types:
                    kwargs_list.append({'data_type': data_type})
        except KeyError:
            return None
        return kwargs_list

No need to do that here if you don't want to, but good to know it exists.
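
As a rough picture of the mechanism described in this comment, the sketch below shows how a list of keyword-argument dictionaries could fan out into one analyzer instance per entry. The `BaseAnalyzer`, `PerDataTypeAnalyzer`, and `build_tasks` names and the data type values are hypothetical. How Timesketch actually schedules the tasks is not shown in this thread, so treat the dispatch logic as an assumption.

```python
class BaseAnalyzer:
    """Hypothetical stand-in for the analyzer interface."""

    def __init__(self, index_name, **kwargs):
        self.index_name = index_name
        self.kwargs = kwargs

    @classmethod
    def get_kwargs(cls):
        """Return None for a single task, or a list of kwargs dicts."""
        return None


class PerDataTypeAnalyzer(BaseAnalyzer):
    """Hypothetical analyzer that wants one task per data type."""

    # Illustrative values only.
    DATA_TYPES = ['example:type:one', 'example:type:two']

    @classmethod
    def get_kwargs(cls):
        return [{'data_type': data_type} for data_type in cls.DATA_TYPES]


def build_tasks(analyzer_class, index_name):
    """Instantiate one analyzer per kwargs dict.

    Falls back to a single instance when get_kwargs() returns None
    or an empty list.
    """
    kwargs_list = analyzer_class.get_kwargs() or [{}]
    return [analyzer_class(index_name, **kwargs) for kwargs in kwargs_list]
```

With this shape, an analyzer that does not override `get_kwargs()` still yields exactly one task, which matches the "no need to do that here" option.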

Yes, I don't think we necessarily need that here.

berggren

Looks good, just a last small comment/nit then this is ready to merge.

berggren · 2019-01-14T16:24:52Z

config/features.yaml

+browser_usernames:
+        query_string: 'source_short:"WEBHIST"'
+        attribute: 'message'
+        store_as: 'username'


Isn't this an email address that is extracted? A username is a bit misleading I think.

berggren

LGTM

kiddinn added 8 commits January 12, 2019 12:09

Adding a new feature extraction analyzer, scaffolder output.

77c97d5

Adding a yaml example file.

5cd0414

Some updates.

e693f67

Making changes to the feature extraction.

94803a8

Updating deployment.

f164266

Minor adjustments.

e45f97b

Added aggregation.

0c6ca02

updating a docstring

7959f35

kiddinn added New feature Analyzers labels Jan 14, 2019

kiddinn assigned berggren and kiddinn Jan 14, 2019

kiddinn requested a review from berggren January 14, 2019 14:32

berggren requested changes Jan 14, 2019

Responding to code review.

9708953

kiddinn requested a review from berggren January 14, 2019 16:18

berggren requested changes Jan 14, 2019

Minor adjustment after code review.

81486de

kiddinn requested a review from berggren January 14, 2019 16:48

Merge branch 'master' into feature_extraction

808b087

berggren approved these changes Jan 14, 2019

berggren merged commit 2a74597 into google:master Jan 14, 2019

kiddinn mentioned this pull request Jan 14, 2019

Extract usernames or account email addresses from browser history #745

Closed

kiddinn deleted the feature_extraction branch January 14, 2019 20:33
