[ML] Warn if ML categorization job is using data that does not categorize well #50749

Closed
sophiec20 opened this issue Jan 8, 2020 · 2 comments · Fixed by #52195
Labels
>enhancement :ml Machine learning


@sophiec20
Contributor

What
If an ML categorization job creates a very large number of categories, the data is probably not worth categorizing. To be defensive, we should audit a warning message for jobs where the number of categories is high. This warning would be visible in the job messages in the UI, but it would not stop the job from continuing.

It is difficult to define what "high" means because it is data dependent. It could be a ratio of categories to records_processed once a useful learning period has elapsed, a hard upper limit on the total number of categories (taking into account multiple partitions if they are configured), or both.
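The two candidate definitions of "high" can be combined in one check. The following is only a minimal sketch: the function name and every threshold value in it (min_records, max_ratio, max_categories) are assumptions made up for illustration, not values proposed in this issue.

```python
def categorization_looks_poor(total_categories: int,
                              records_processed: int,
                              min_records: int = 10_000,
                              max_ratio: float = 0.1,
                              max_categories: int = 1000) -> bool:
    """Return True if the category count looks too high for the data seen so far.

    All default thresholds are hypothetical placeholders.
    """
    # Do not judge the job before a useful learning period has elapsed.
    if records_processed < min_records:
        return False
    # Ratio of categories to records processed.
    ratio = total_categories / records_processed
    # Either the ratio or the hard upper limit can trigger the warning.
    return ratio > max_ratio or total_categories > max_categories
```

With these placeholder values, 20,000 categories from 100,000 records would trigger the warning through the ratio, while 1,500 categories from 1,000,000 records would trigger it through the hard limit.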

Ideally this check can be performed in the early stages of the job, after it has had a chance to analyze a useful amount of data. This could be at the end of a lookback (before starting real-time) or, say, after 100 buckets or 1 day (whichever comes sooner) for real-time-only jobs.
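The timing rule could be expressed roughly as follows. This is a sketch under assumptions: the function and parameter names are hypothetical, and elapsed time is approximated as the number of buckets seen multiplied by the bucket span.

```python
from datetime import timedelta


def initial_check_due(lookback_finished: bool,
                      buckets_seen: int,
                      bucket_span: timedelta,
                      min_buckets: int = 100,
                      max_wait: timedelta = timedelta(days=1)) -> bool:
    """Decide whether the first categorization quality check should run."""
    # A job that has just finished its lookback is checked before going real-time.
    if lookback_finished:
        return True
    # Real-time-only jobs: 100 buckets or 1 day, whichever comes sooner.
    elapsed = buckets_seen * bucket_span
    return buckets_seen >= min_buckets or elapsed >= max_wait
```

For a job with a 1 hour bucket span the 1 day limit is reached first (24 buckets), whereas for a 5 minute bucket span the 100 bucket limit is reached first.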

Re-assessing this warning during the lifetime of a real-time job would also have some value in cases where the input data changes. However, this could become annoying if done too frequently.

Why
Log categorization groups unstructured log messages into categories. For example, "Fred accessed file bananas.txt" and "Wilma accessed file apples.txt" would be considered the same message category. From here, you can use the existing anomaly detection to model and identify unusual counts of log message categories and/or rare log message categories.
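To make the idea of a shared category concrete, here is a deliberately simplified toy comparison. It is not the algorithm ML categorization uses; the function name, the whitespace tokenization and the 0.5 threshold are all assumptions chosen only to illustrate why the two example messages belong together.

```python
def same_category(message_a: str, message_b: str, threshold: float = 0.5) -> bool:
    """Toy check: do two messages share enough tokens in the same positions?"""
    tokens_a = message_a.split()
    tokens_b = message_b.split()
    # The toy rule only compares messages with the same number of tokens.
    if len(tokens_a) != len(tokens_b):
        return False
    # Count positions where both messages have an identical token.
    matches = sum(a == b for a, b in zip(tokens_a, tokens_b))
    return matches / len(tokens_a) >= threshold
```

The two example messages share "accessed" and "file" in the same positions, so the toy rule groups them, while a free-text sentence of a different shape would not match.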

An ML categorization job requires a timestamp and a message field. Categorization works best on machine-written log messages, typically logging written by a developer for the purpose of system troubleshooting. For example, we would get very poor results trying to categorize each sentence in the complete works of Shakespeare, because the sentences are all different and do not share a similar structure. However, we would generally get good results categorizing application logs with repeated messages (where only certain fields change in each doc, e.g. hostname, IP address, username).

Consequently, an ML categorization job is worth using provided the data it is analyzing is suitable for categorizing. This is not necessarily obvious to all potential users of the system, so we should attempt to warn users if the job is not categorizing well.

When
Log categorization has been part of ML anomaly detection for a long time, but has been a bit of a hidden feature. This is now changing.

In 7.6 (tbc) we are working on a new ML UI wizard (elastic/kibana#53009) which will make it easier to create categorization jobs. The Logs UI Observability team is also working on integrating with ML (elastic/kibana#53004).

With more visibility of the categorization feature, we should look at how we can enhance its usability so users get a better experience of the functionality.

sophiec20 added the :ml Machine learning label Jan 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

droberts195 commented Jan 20, 2020

#51146 added a rudimentary check into 7.6. An audit message is created if 1000 or more categories exist for a job before 100 buckets of results have been created.
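The condition described for the 7.6 check can be sketched as below. This is an illustration of the stated rule only, not the Java code from #51146; the function and constant names are hypothetical.

```python
CATEGORY_THRESHOLD = 1000  # "1000 or more categories"
BUCKET_LIMIT = 100         # "before 100 buckets of results have been created"


def should_audit_warning(category_count: int, buckets_created: int) -> bool:
    """Return True when the rudimentary poor-categorization rule fires."""
    return buckets_created < BUCKET_LIMIT and category_count >= CATEGORY_THRESHOLD
```

Under this rule a job that reaches 1000 categories only after its 100th bucket would not produce the message.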

For 7.7 the intention is to add extra fields into the model_size_stats that can be used to determine whether categorization is doing a good job. One field will be a high level categorization_status enum with values ok and warn (similar to how we have memory_status for model memory). Other fields will report the numbers that underlie that overall status, such as categorized_doc_count, total_category_count, frequent_category_count, rare_category_count and dead_category_count.
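Once these fields exist, a client could look for jobs in the warn state roughly as follows. This is a sketch under assumptions: it takes an already-parsed job stats response and assumes that model_size_stats is nested under each entry of a jobs array. The job IDs and all numbers in the example data are invented for illustration.

```python
def jobs_with_categorization_warning(stats_response: dict) -> list:
    """Collect the IDs of jobs whose categorization_status is "warn"."""
    warned = []
    for job in stats_response.get("jobs", []):
        size_stats = job.get("model_size_stats", {})
        # categorization_status is expected to be either "ok" or "warn".
        if size_stats.get("categorization_status") == "warn":
            warned.append(job["job_id"])
    return warned


# Invented example data using the field names listed above.
example_response = {
    "jobs": [
        {"job_id": "app-logs",
         "model_size_stats": {"categorized_doc_count": 50000,
                              "total_category_count": 40,
                              "frequent_category_count": 12,
                              "rare_category_count": 5,
                              "dead_category_count": 1,
                              "categorization_status": "ok"}},
        {"job_id": "shakespeare",
         "model_size_stats": {"categorized_doc_count": 50000,
                              "total_category_count": 30000,
                              "frequent_category_count": 0,
                              "rare_category_count": 29000,
                              "dead_category_count": 0,
                              "categorization_status": "warn"}},
    ]
}

print(jobs_with_categorization_warning(example_response))
```

Keeping the overall status separate from the underlying counts lets a UI show a simple warning while still exposing the numbers that explain it.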

droberts195 added a commit that referenced this issue Feb 6, 2020
This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Relates #50749
droberts195 added a commit to elastic/ml-cpp that referenced this issue Feb 7, 2020
This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Relates elastic/elasticsearch#50749
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 11, 2020
In elastic#51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by elastic#51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes elastic#50749
droberts195 added a commit that referenced this issue Feb 11, 2020
…2195)

In #51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes #50749