[ML] Warn if ML categorization job is using data that does not categorize well #50749

sophiec20 · 2020-01-08T15:59:01Z

What
If an ML categorization job creates a very large number of categories, the data is probably not worth categorizing. To be defensive, we should audit a warning message for jobs where the number of categories is high. This warning would be visible in job messages in the UI but would not be intended to stop the job from continuing.

It is difficult to figure out what "high" is because this is data dependent. This could be a ratio of categories to records_processed once a useful learning period has elapsed. Or it could be a hard upper limit on total number of categories (taking into account multiple partitions if they are configured). Or both.
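As a rough illustration of combining the two ideas above (a ratio after a learning period, plus a hard upper limit), here is a minimal sketch. The function name and every threshold value in it are hypothetical placeholders chosen for the example, not proposed defaults.

```python
def is_category_count_high(
    records_processed: int,
    total_categories: int,
    *,
    min_records: int = 10_000,    # hypothetical "useful learning period"
    max_ratio: float = 0.1,       # hypothetical categories-per-record limit
    max_categories: int = 1_000,  # hypothetical hard upper limit
) -> bool:
    """Return True if the category count looks too high for useful results."""
    # The hard limit applies regardless of how much data has been seen.
    if total_categories >= max_categories:
        return True
    # The ratio is only meaningful once enough records have been processed.
    if records_processed < min_records:
        return False
    return total_categories / records_processed > max_ratio
```

With multiple partitions configured, `total_categories` would need to be the sum across partitions for the hard limit to make sense.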

Ideally this check can be performed in the early stages of the job, after it has had a chance to analyze a useful amount of data. This could be at the end of a lookback (before starting real-time) or, say, after 100 buckets or 1 day (whichever is sooner) for real-time-only jobs.

Re-assessing this warning during the lifetime of a real-time job would also have some value in cases where the input data changes - however this could get annoying if done too frequently.

Why
Log categorization will group unstructured log messages into categories. For example, "Fred accessed file bananas.txt" and "Wilma accessed file apples.txt" would be considered the same message category. From here, you can use current anomaly detection to model and identify unusual counts of categories of log message and/or rare log message categories.

Creating an ML categorization job requires a timestamp and a message field. Categorization works best on machine-written log messages, typically logging written by a developer for the purpose of system troubleshooting. For example, we would get very poor results trying to categorize each sentence in the complete works of Shakespeare because the sentences are all different and do not share a similar structure. However, we would generally get good results when categorizing application logs with repeated messages (where only certain fields change in each doc, e.g. hostname, IP addr, username).

Consequently, an ML categorization job is worth using provided the data it is analyzing is suitable for categorizing. This is not necessarily immediately obvious to all potential users of the system, so we should attempt to warn users if the job is not categorizing well.

When
Log categorization has been part of ML anomaly detection for a long time, but has been a bit of a hidden feature. This is now changing.

In 7.6 (tbc) we are working on a new ML UI Wizard (elastic/kibana#53009) which will make it easier to create categorization jobs. The Logs UI Observability team is also working on integrating with ML (elastic/kibana#53004).

With more visibility of the categorization feature, we should look at seeing how we can enhance its usability so users get a better experience of the functionality.


elasticmachine · 2020-01-08T15:59:04Z

Pinging @elastic/ml-core (:ml)

droberts195 · 2020-01-20T11:52:15Z

#51146 added a rudimentary check into 7.6. An audit message is created if 1000 or more categories exist for a job before 100 buckets of results have been created.
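The condition described above can be sketched as follows. This is only an illustration of the rule as stated in this comment, not the Java code from #51146; the function and constant names are hypothetical.

```python
# Thresholds as stated in the comment above; names are hypothetical.
EARLY_BUCKET_LIMIT = 100
CATEGORY_WARNING_THRESHOLD = 1000


def should_audit_early_category_warning(
    category_count: int, buckets_with_results: int
) -> bool:
    """True if 1000 or more categories exist before 100 buckets of results."""
    still_early = buckets_with_results < EARLY_BUCKET_LIMIT
    too_many = category_count >= CATEGORY_WARNING_THRESHOLD
    return still_early and too_many
```

Because the rule only looks at the early part of the job, a job whose data degrades later would not trigger it.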

For 7.7 the intention is to add extra fields into the model_size_stats that can be used to determine whether categorization is doing a good job. One field will be a high level categorization_status enum with values ok and warn (similar to how we have memory_status for model memory). Other fields will report the numbers that underlie that overall status, such as categorized_doc_count, total_category_count, frequent_category_count, rare_category_count and dead_category_count.
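To show how a consumer might use these fields once they exist, here is a small sketch that summarizes a `model_size_stats` object represented as a plain dictionary. The field names come from the list above; the helper name, the `"unknown"` fallback, and the sample values are assumptions made up for this example.

```python
def describe_categorization(model_size_stats: dict) -> str:
    """Build a one-line summary from the categorization fields, if present."""
    # "unknown" is a fallback invented for this sketch, not a real enum value.
    status = model_size_stats.get("categorization_status", "unknown")
    docs = model_size_stats.get("categorized_doc_count", 0)
    total = model_size_stats.get("total_category_count", 0)
    frequent = model_size_stats.get("frequent_category_count", 0)
    rare = model_size_stats.get("rare_category_count", 0)
    dead = model_size_stats.get("dead_category_count", 0)
    return (
        f"{status}: {total} categories across {docs} documents "
        f"({frequent} frequent, {rare} rare, {dead} dead)"
    )


# Made-up sample values: almost every document lands in its own rare category.
sample_stats = {
    "categorization_status": "warn",
    "categorized_doc_count": 5000,
    "total_category_count": 4200,
    "frequent_category_count": 3,
    "rare_category_count": 4100,
    "dead_category_count": 0,
}
print(describe_categorization(sample_stats))
```

Exposing the underlying counts alongside the status lets a user see why a job is in the warn state rather than only that it is.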

This change adds support for the following new model_size_stats fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Relates #50749

In elastic#51146 a rudimentary check for poor categorization was added to 7.6. This change replaces that warning based on a Java-side check with a new one based on the categorization_status field that the ML C++ sets. categorization_status was added in 7.7 and above by elastic#51879, so this new warning based on more advanced conditions will also be in 7.7 and above. Closes elastic#50749

sophiec20 added the :ml Machine learning label Jan 8, 2020

droberts195 added the >enhancement label Jan 10, 2020

droberts195 self-assigned this Jan 20, 2020

droberts195 mentioned this issue Jan 20, 2020

[ML] Add audit warning for 1000 categories found early in job #51146

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

This was referenced Feb 4, 2020

[ML] Add new categorization stats to model_size_stats #51879

Merged

[ML] Add new categorization stats to model_size_stats elastic/ml-cpp#989

Merged

droberts195 mentioned this issue Feb 11, 2020

[ML] Switch poor categorization audit warning to use status field #52195

Merged

droberts195 closed this as completed in #52195 Feb 11, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed
