Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a feature extraction analyzer #785

Merged
merged 11 commits into from
Jan 14, 2019
Merged

Conversation

kiddinn
Copy link
Contributor

@kiddinn kiddinn commented Jan 14, 2019

This is the first version of a feature extraction analyzer.

The PR does the following things:

  • Adds the /etc/timesketch folder into the deployment and stores the configuration file for the feature extraction there.
  • Add a features.conf YAML file where all feature extractions can be defined
  • Adds an analyzer that reads the config file and extracts the features added there
  • Adds a single feature extraction for extracting email addresses from browser history.

Copy link
Contributor

@berggren berggren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. One comment about returning the resulting object instead of just the config path.

timesketch/lib/analyzers/feature_extraction.py Outdated Show resolved Hide resolved
if not os.path.isfile(self.CONFIG_FILE):
return 'Unable to read config file, no features extracted.'

with open(self.CONFIG_FILE, 'r') as fh:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we do this in the interface function instead, having it return the config dictionary instead of a path? When we move other analyzers config this will get a bit repetitive to do in each one that needs to load config. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@@ -33,6 +35,27 @@ def wrapper(self, *args, **kwargs):
return wrapper


def get_config(file_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier comment, but consider returning the resulting object after yaml.safe_load here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, dome

return 'No results, unable to parse config file.'

return_strings = []
for name, feature_config in config.iteritems():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: You can override the get_kwargs() method from the interface to have the system create individual tasks for you here. Example from the similarity scorer:

    def get_kwargs(cls):
        """Keyword arguments needed to instantiate the class.
        In addition to the index_name passed to the constructor by default we
        need the data_type name as well. Furthermore we want to instantiate
        one task per data_type in order to run the analyzer in parallel. To
        achieve this we override this method and return a list of keyword
        argument dictionaries.
        Returns:
            List of keyword arguments (dict), one per data_type.
        """
        kwargs_list = []
        try:
            data_types = current_app.config['SIMILARITY_DATA_TYPES']
            if data_types:
                for data_type in data_types:
                    kwargs_list.append({'data_type': data_type})
        except KeyError:
            return None
        return kwargs_list

No need to do that here if you don't want to, but good to know it exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I don't think we need that here... not necessarily

@kiddinn kiddinn requested a review from berggren January 14, 2019 16:18
Copy link
Contributor

@berggren berggren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a last small comment/nit then this is ready to merge.

browser_usernames:
query_string: 'source_short:"WEBHIST"'
attribute: 'message'
store_as: 'username'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this an email address that is extracted? A username is a bit misleading I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@kiddinn kiddinn requested a review from berggren January 14, 2019 16:48
Copy link
Contributor

@berggren berggren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@berggren berggren merged commit 2a74597 into google:master Jan 14, 2019
@kiddinn kiddinn deleted the feature_extraction branch January 14, 2019 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants