-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added a feature extraction analyzer #785
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. One comment about returning the resulting object instead of just the config path.
if not os.path.isfile(self.CONFIG_FILE): | ||
return 'Unable to read config file, no features extracted.' | ||
|
||
with open(self.CONFIG_FILE, 'r') as fh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we do this in the interface function instead, having it return the config dictionary instead of a path? When we move other analyzers config this will get a bit repetitive to do in each one that needs to load config. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
@@ -33,6 +35,27 @@ def wrapper(self, *args, **kwargs): | |||
return wrapper | |||
|
|||
|
|||
def get_config(file_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier comment, but consider returning the resulting object after yaml.safe_load here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, dome
return 'No results, unable to parse config file.' | ||
|
||
return_strings = [] | ||
for name, feature_config in config.iteritems(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: You can override the get_kwargs() method from the interface to have the system create individual tasks for you here. Example from the similarity scorer:
def get_kwargs(cls):
"""Keyword arguments needed to instantiate the class.
In addition to the index_name passed to the constructor by default we
need the data_type name as well. Furthermore we want to instantiate
one task per data_type in order to run the analyzer in parallel. To
achieve this we override this method and return a list of keyword
argument dictionaries.
Returns:
List of keyword arguments (dict), one per data_type.
"""
kwargs_list = []
try:
data_types = current_app.config['SIMILARITY_DATA_TYPES']
if data_types:
for data_type in data_types:
kwargs_list.append({'data_type': data_type})
except KeyError:
return None
return kwargs_list
No need to do that here if you don't want to, but good to know it exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I don't think we need that here... not necessarily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a last small comment/nit then this is ready to merge.
config/features.yaml
Outdated
browser_usernames: | ||
query_string: 'source_short:"WEBHIST"' | ||
attribute: 'message' | ||
store_as: 'username' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this an email address that is extracted? A username is a bit misleading I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This is the first version of a feature extraction analyzer.
The PR does the following things: