add root cause localization transformer #4925

suxi-ms · 2020-03-10T01:30:25Z

The goal of this pull request is to provide a decision-tree-based algorithm that localizes the root cause of an incident in multi-dimensional time series at a specified timestamp.

Fixes Add root cause localization algorithm #4960.

dnfclas · 2020-03-10T01:30:38Z

All CLA requirements met. #Resolved

mstfbl

Hi @suxi-ms , thank you for your PR! Can you please briefly describe the changes you've made according to our contribution guide?

frank-dong-ms-zz · 2020-03-10T23:46:02Z

@suxi-ms, please do the following so that we can start the code review:

1. Briefly describe what you are going to do in the PR comments.
2. State how many stages/PRs you plan to submit.
3. State the purpose of each PR.
4. Add the reviewers whose sign-off you need and who know the changed code best (you can find them from the code history). You can always add harishsk as a reviewer, as Harish is the manager of ML.NET.
5. Always add the necessary tests. #Resolved

yaeldekel · 2020-03-24T10:04:04Z

src/Microsoft.ML.TimeSeries/DTRootCauseLocalization.cs

+        private static IRowMapper Create(IHostEnvironment env, ModelLoadContext ctx, DataViewSchema inputSchema)
+            => Create(env, ctx).MakeRowMapper(inputSchema);
+
+        private protected override void SaveModel(ModelSaveContext ctx)


SaveModel [](start = 40, length = 9)

Don't you need to save (and load) _beta? #Resolved

Thank you for the suggestion; I have added changes to save _beta. #Resolved

yaeldekel · 2020-03-24T10:06:36Z

test/Microsoft.ML.TimeSeries.Tests/TimeSeriesDirectApi.cs

+        }
+
+        [Fact]
+        public void RootCauseLocalizationWithDT()


RootCauseLocalizationWithDT [](start = 20, length = 27)

Please also test serialization/deserialization. #Resolved

Could you explain in more detail? I didn't quite understand the suggestion. Which part needs serialization/deserialization? #Resolved

Part of the DTRootCauseLocalizationTransformer class is methods that save the model to a file, and that load the model back from a file. This test should exercise these methods, by using the ml.Model.Save and ml.Model.Load APIs, and checking that the values returned by the deserialized model are the same as the ones returned by the original model.

In reply to: 399045379 [](ancestors = 399045379)

Do I need to save the model? From the documentation, it looks like the save method is used for training-related data. #Resolved

Sorry, I was unclear. You should save the model, not the data. Using this API:
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs#L112
You then need to load the model using this API:
https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs#L211
and then run the loaded model on the data, and verify that it gives the same outputs as the model before serialization.

Here is a test that contains an example of how to save and load the model:
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Tests/Transformers/NormalizerTests.cs#L961

(the saving and loading is done in lines 969 and 970).

In reply to: 399795392 [](ancestors = 399795392)

I have tried to use this logic to save and load model,

    var modelPath = "temp.zip";
    // Save model to a file
    ml.Model.Save(model, data.Schema, modelPath);

    // Load model from a file
    ITransformer serializedModel;
    using (var file = File.OpenRead(modelPath))
    {
        serializedModel = ml.Model.Load(file, out var serializedSchema);
        TestCommon.CheckSameSchemas(data.Schema, serializedSchema);
    }

However, the Save() method throws an exception because Save.IsColumnSavable(type) returns false. The root cause is that we have a custom DataViewType that is neither a VectorDataViewType nor a PrimitiveDataViewType, so the check fails and no columns are added for saving. Do you have any guidance on this? #Resolved

I have updated this according to your suggestion in the email discussion. #Resolved

yaeldekel · 2020-03-24T10:37:53Z

src/Microsoft.ML.TimeSeries/DTRootCauseLocalization.cs

+                var src = default(RootCauseLocalizationInput);
+                var getSrc = input.GetGetter<RootCauseLocalizationInput>(input.Schema[ColMapNewToOld[iinfo]]);
+
+                disposer =


disposer [](start = 16, length = 8)

I think this is not necessary. This is used by components that use resources that are IDisposable, for example, ImageResizer that uses a Bitmap for its computations. #Resolved

I have changed it; please help review whether the change is right. #Resolved

You can also assign null instead.

In reply to: 399045482 [](ancestors = 399045482)

Have updated to null #Resolved

yaeldekel · 2020-03-24T10:54:31Z

src/Microsoft.ML.TimeSeries/DTRootCauseLocalization.cs

+        /// <param name="columns">The name of the columns (first item of the tuple), and the name of the resulting output column (second item of the tuple).</param>
+        /// <param name="beta">The weight for generating score in output result.</param>
+        [BestFriend]
+        internal DTRootCauseLocalizationEstimator(IHostEnvironment env, double beta = Defaults.Beta, params (string outputColumnName, string inputColumnName)[] columns)


[] [](start = 157, length = 2)

Since this transformer is non-trainable (meaning, the estimator doesn't need to look at the data in order to create a transformer), there is no advantage to supporting multiple input/output columns here - if there is ever a scenario where there are multiple input columns, the user can create an estimator chain instead. #Resolved

I found that ImageLoading and TextNormalizing are also non-trainable, yet they define this columns parameter. Could you explain when to use it and when not to? #Resolved

I have updated the columns parameter to outputColumn and inputColumn; could you check whether they are right? #Resolved

yaeldekel · 2020-03-24T10:56:59Z

src/Microsoft.ML.TimeSeries/DTRootCauseLocalization.cs

+            _beta = beta;
+        }
+
+        // Factory method for SignatureDataTransform.


SignatureDataTransform [](start = 30, length = 22)

This (and the matching attribute) is only needed if you want to use this component from the command line. If this is the intention, please add a test for running this using maml. #Resolved

I am not familiar with maml; could you provide more information? #Resolved

maml is not needed and I have removed this method #Resolved

yaeldekel · 2020-03-24T11:02:30Z

src/Microsoft.ML.TimeSeries/DTRootCauseLocalization.cs

+                if (!(col.ItemType is RootCauseLocalizationInputDataViewType) || col.Kind != SchemaShape.Column.VectorKind.Scalar)
+                    throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.inputColumnName, new RootCauseLocalizationInputDataViewType().ToString(), col.GetTypeString());
+
+                result[colInfo.outputColumnName] = new SchemaShape.Column(colInfo.outputColumnName, col.Kind, col.ItemType, col.IsKey, col.Annotations);


col.Kind, col.ItemType, col.IsKey, col.Annotations [](start = 100, length = 50)

These should be the values for the output column. The Kind should be SchemaShape.Column.VectorKind.Scalar (this should match col.Kind since you are checking above that it is a scalar), but what should the ItemType be? Also, seems like col.IsKey is always false, and are there any annotations that need to be passed from the input to the output? #Resolved

I made some updates; could you review whether they are right? #Resolved

Why was this check removed - col.Kind != SchemaShape.Column.VectorKind.Scalar? Isn't it correct?

In reply to: 399045854 [](ancestors = 399045854)

Have updated. #Resolved

ganik · 2020-03-24T23:13:31Z

src/Microsoft.ML.TimeSeries/ExtensionsCatalog.cs

@@ -146,6 +146,23 @@ public static class TimeSeriesCatalog
            int windowSize=64, int backAddWindowSize=5, int lookaheadWindowSize=5, int averageingWindowSize=3, int judgementWindowSize=21, double threshold=0.3)
            => new SrCnnAnomalyEstimator(CatalogUtils.GetEnvironment(catalog), outputColumnName, windowSize, backAddWindowSize, lookaheadWindowSize, averageingWindowSize, judgementWindowSize, threshold, inputColumnName);

+        /// <summary>
+        /// Create <see cref="DTRootCauseLocalizationEstimator"/>, which localizes root causess using decision tree algorithm.


causess [](start = 88, length = 7)

typo #Resolved

Thank you for pointing it out; I have fixed it. #Resolved

harishsk · 2020-05-08T00:18:54Z

src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs

+            return subDim;
+        }
+
+        protected List<RootCauseItem> LocalizeRootCauseByDimension(PointTree anomalyTree, PointTree pointTree, Dictionary<string, Object> anomalyDimension, List<string> aggDims)


suggestion: Please consider something along the lines of the following for the GetSubDim function.

return new Dictionary<string, object>(keyList.Select(dim => new KeyValuePair<string, object>(dim, dimension[dim])));

#Resolved

updated #Resolved
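
A minimal sketch of the GetSubDim suggestion above, for reference. The method signature and the class name `SubDimSketch` are assumptions, since the original function is not shown in this thread. It uses `ToDictionary` rather than the `Dictionary` constructor that takes a sequence of key/value pairs, because that constructor is, as far as I know, not available on every target framework (for example .NET Standard 2.0).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SubDimSketch
{
    // Hypothetical stand-in for GetSubDim: keeps only the entries of
    // `dimension` whose keys appear in `keyList`.
    // Throws KeyNotFoundException if a requested key is missing.
    public static Dictionary<string, object> GetSubDim(
        Dictionary<string, object> dimension, IEnumerable<string> keyList)
    {
        return keyList.ToDictionary(dim => dim, dim => dimension[dim]);
    }
}
```

Either form replaces an explicit loop that fills a new dictionary; which one fits depends on the frameworks the project targets.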

harishsk · 2020-05-08T00:22:15Z

src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs

+        }
+
+        private double Log2(double val)
+        {


Please consider using the following attribute for the function:

[MethodImplAttribute(MethodImplOptions.AggressiveInlining)] #Resolved

updated #Resolved
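
For illustration, a sketch of how the attribute would sit on a helper like Log2. The method body here is an assumption (base-2 logarithm via change of base), because the diff above cuts off before the body; `Log2Sketch` is a hypothetical class name. `[MethodImpl(...)]` is the short form of `[MethodImplAttribute(...)]`.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class Log2Sketch
{
    // AggressiveInlining is a hint to the JIT, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Log2(double val)
    {
        // Assumed body: log2(x) = ln(x) / ln(2).
        return Math.Log(val) / Math.Log(2);
    }
}
```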

harishsk · 2020-05-08T00:40:55Z

src/Microsoft.ML.TimeSeries/RootCauseAnalyzer.cs

+                    if (dimension.Key.AnomalyDis.Count > 1)
+                    {
+                        if (best == null || (!Double.IsNaN(valueRatioMap[best]) && (best.AnomalyDis.Count != 1 && (isLeavesLevel ? valueRatioMap[best].CompareTo(dimension.Value) <= 0 : valueRatioMap[best].CompareTo(dimension.Value) >= 0))))
+                        {


suggestion: For readability, can you please avoid the long line and break up the conditions on separate lines? Same comment for line 487. #Resolved

made some updates #Resolved
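
One possible way to break up the long condition from the diff above, shown as a sketch. `BestDimensionStub`, `ConditionSketch`, `ShouldReplaceBest`, and `candidateRatio` (standing in for `dimension.Value`) are hypothetical names, and the type of `AnomalyDis` is an assumption; only the boolean logic is taken from the quoted line.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the type of `best`; only AnomalyDis.Count is used.
public sealed class BestDimensionStub
{
    public Dictionary<string, int> AnomalyDis = new Dictionary<string, int>();
}

public static class ConditionSketch
{
    // Same truth table as the single-line condition, one clause per step.
    public static bool ShouldReplaceBest(
        BestDimensionStub best,
        Dictionary<BestDimensionStub, double> valueRatioMap,
        double candidateRatio,
        bool isLeavesLevel)
    {
        if (best == null)
            return true;

        double bestRatio = valueRatioMap[best];
        if (double.IsNaN(bestRatio))
            return false;

        if (best.AnomalyDis.Count == 1)
            return false;

        int cmp = bestRatio.CompareTo(candidateRatio);
        return isLeavesLevel ? cmp <= 0 : cmp >= 0;
    }
}
```

Early returns keep each clause on its own line and give the comparison result a name, which also makes the leaves-level and non-leaves-level cases easier to test separately.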

harishsk · 2020-05-08T16:28:28Z

src/Microsoft.ML.TimeSeries/ExtensionsCatalog.cs

+
+            //check beta
+            if (beta < 0 || beta > 1) {
+                host.CheckUserArg(beta >= 0 && beta <= 1, nameof(beta), "Must be in [0,1]");


You don't need this if check. The CheckUserArg is performing the check. #Resolved

will update #Resolved
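
A small sketch of the point above, using a local stand-in for `host.CheckUserArg` so that it runs without ML.NET. The stand-in, the class name `BetaCheckSketch`, and the exception type it throws are assumptions, not the library's behavior. One side effect worth noting: the outer `if (beta < 0 || beta > 1)` is false for `double.NaN`, so with the guard in place a NaN beta would skip the check entirely, while the bare check rejects it.

```csharp
using System;

public static class BetaCheckSketch
{
    // Hypothetical stand-in for host.CheckUserArg: throws when the condition is false.
    private static void CheckUserArg(bool condition, string name, string message)
    {
        if (!condition)
            throw new ArgumentOutOfRangeException(name, message);
    }

    public static double ValidateBeta(double beta)
    {
        // No surrounding if: the check itself decides whether to throw.
        CheckUserArg(beta >= 0 && beta <= 1, nameof(beta), "Must be in [0,1]");
        return beta;
    }
}
```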

gvashishtha · 2020-05-22T04:46:25Z

docs/api-reference/time-series-root-cause-localization.md

@@ -0,0 +1,49 @@
+At Mircosoft, we develop a decision tree based root cause localization method which helps to find out the root causes for an anomaly incident at a specific timestamp incrementally. 


Typo, "Microsoft." Also, it's a bit nonstandard to use present tense for "we develop." I would expect "we have developed" if the work is completed or "we maintain" if the work is ongoing.

gvashishtha · 2020-05-22T04:47:12Z

docs/api-reference/time-series-root-cause-localization.md

+At Mircosoft, we develop a decision tree based root cause localization method which helps to find out the root causes for an anomaly incident at a specific timestamp incrementally. 
+
+## Multi-Dimensional Root Cause Localization
+It's a common case that one measure is collected with many dimensions (*e.g.*, Province, ISP) whose values are categorical(*e.g.*, Beijing or Shanghai for dimension Province). When a measure's value deviates from its expected value, this measure encounters anomalies. In such case, operators would like to localize the root cause dimension combinations rapidly and accurately. Multi-dimensional root cause localization is critical to troubleshoot and mitigate such case.


Let's use "users" instead of "operators."

gvashishtha · 2020-05-22T04:48:38Z

docs/api-reference/time-series-root-cause-localization.md

+
+### Decision Tree
+
+[Decision tree](https://en.wikipedia.org/wiki/Decision_tree) algorithm chooses the highest information gain to split or construct a decision tree.  We use it to choose the dimension which contributes the most to the anomaly. Following are some concepts used in decision tree.


It's non-standard to omit articles here. Try something like "The Decision Tree algorithm chooses..." and "Below are some concepts used in decision trees"

gvashishtha · 2020-05-22T04:50:21Z

docs/api-reference/time-series-root-cause-localization.md

+
+Where $Ent(D^v)$ is the entropy of set points in D for which dimension $a$ is equal to $v$, $|D|$ is the total number of points in dataset $D$.  $|D^V|$ is the total number of points in dataset $D$ for which dimension $a$ is equal to $v$.
+
+For all aggregated dimensions, we calculate the information for each dimension. The greater the reduction in this uncertainty, the more information is gained about D from dimension $a$.


D should be in dollar signs?
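
For context, the quoted paragraph appears to describe the standard information-gain definition. The formula itself is not visible in this hunk, so the block below is an assumption based on the symbols the text defines ($Ent(D^v)$, $|D|$, $|D^v|$), with $V$ denoting the set of values that dimension $a$ takes in $D$.

```latex
\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v \in V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
```

Under this reading, the $|D^V|$ in the quoted line would be $|D^v|$ for consistency with $Ent(D^v)$, and the bare D in the last sentence would be written $D$.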

gvashishtha · 2020-05-22T04:54:38Z

docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/TimeSeries/LocalizeRootCause.cs

+{
+    public static class LocalizeRootCause
+    {
+        private static string AGG_SYMBOL = "##SUM##";


What is AGG_SYMBOL for here? I notice that on line 19, you have both AGG_SYMBOL and AggregateType.Sum, and that some of the points have AGG_SYMBOL passed in instead of strings like "DC1."

Can you add a few comments explaining what AGG_SYMBOL is and why it is used?

gvashishtha · 2020-05-22T04:55:50Z

src/Microsoft.ML.TimeSeries/ExtensionsCatalog.cs

@@ -143,9 +147,53 @@ public static class TimeSeriesCatalog
        /// </format>
        /// </example>
        public static SrCnnAnomalyEstimator DetectAnomalyBySrCnn(this TransformsCatalog catalog, string outputColumnName, string inputColumnName,
-            int windowSize=64, int backAddWindowSize=5, int lookaheadWindowSize=5, int averageingWindowSize=3, int judgementWindowSize=21, double threshold=0.3)
+            int windowSize = 64, int backAddWindowSize = 5, int lookaheadWindowSize = 5, int averageingWindowSize = 3, int judgementWindowSize = 21, double threshold = 0.3)


I believe it's spelled "averaging" (without an "e"). If this is an easy change to make, would be a good way to keep our code base looking high quality.

gvashishtha · 2020-05-22T04:57:34Z

src/Microsoft.ML.TimeSeries/ExtensionsCatalog.cs

+        /// </summary>
+        /// <param name="catalog">The anomaly detection catalog.</param>
+        /// <param name="src">Root cause's input. The data is an instance of <see cref="Microsoft.ML.TimeSeries.RootCauseLocalizationInput"/>.</param>
+        /// <param name="beta">Beta is a weight parameter for user to choose. It is used when score is calculated for each root cause item. The range of beta should be in [0,1]. For a larger beta, root cause point which has a large difference between value and expected value will get a high score. On the contrary, for a small beta, root cause items which has a high relative change will get a high score.</param>


You say "on the contrary," but the two scenarios you describe don't seem to be opposites. One is about "relative change" and one is about difference between expected value and actual value. Can you make this explanation clearer?

Additionally, you mention "score," but it's not clear what score is, exactly.

add root cause localization transformer

d5ee205

suxi-ms requested a review from a team as a code owner March 10, 2020 01:30

mstfbl suggested changes Mar 10, 2020

suxi-ms added 15 commits March 16, 2020 22:26

add test cases

f727a79

revert sln changes

92de1dc

add evaluation

798289c

temp save for internal review

f2e128d

rename function

51569e3

temp save bottom up points for switch desktop

59c6e89

update from laptop

29216e0

save for add test

69da330

add root cause localization algorithm

e1c5432

add root cause localization algorithm

3a1d1c5

print score, path and directions in sample

8f97602

merge with master

48123f4

extract root cause analyzer

c47302f

refine code

b07ad28

merge with master

c729877

yaeldekel reviewed Mar 24, 2020

ganik reviewed Mar 24, 2020

suxi-ms added 3 commits March 26, 2020 16:48

update for algorithm

ebbdb0d

add evaluatin

0d43b0d

some refine for code

5778eed