feat: add streaming API for MVAD #1893
Conversation
Hey @serena-ruan 👋! We use semantic commit messages to streamline the release process. Examples of semantic prefixes for commit messages include fix:, feat:, and docs:.
To test your commit locally, please follow our guide on building from source.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
@@            Coverage Diff            @@
##           master    #1893      +/-   ##
==========================================
  Coverage   85.83%   85.83%
==========================================
  Files         301      301
  Lines       15622    15677      +55
  Branches      813      815       +2
==========================================
+ Hits        13409    13457      +48
- Misses       2213     2220       +7
... and 3 files with indirect coverage changes
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
MADUtils.checkModelStatus(getUrl, getModelId, getSubscriptionKey)

val convertTimeFormatUdf = UDFUtils.oldUdf(
  { value: String => convertTimeFormat("Timestamp column", value) },
What is the purpose of "Timestamp column" here? It doesn't look like it's necessary.
val window = Window.partitionBy("group").rowsBetween(-getBatchSize, 0)
var collectedDF = formattedDF
var columnNames = Array(getTimestampCol) ++ getInputVariablesCols
for (columnName <- columnNames) {
  collectedDF = collectedDF.withColumn(s"${columnName}_list", collect_list(columnName).over(window))
}
collectedDF = collectedDF.drop("group")
columnNames = columnNames.map(name => s"${name}_list")

val testDF = getInternalTransformer(collectedDF.schema).transform(collectedDF)

testDF
  .withColumn("isAnomaly", when(col(getOutputCol).isNotNull,
    col(s"$getOutputCol.results.value.isAnomaly")(0)).otherwise(null))
  .withColumn("DetectDataTimestamp", when(col(getOutputCol).isNotNull,
    col(s"$getOutputCol.results.timestamp")(0)).otherwise(null))
  .drop(columnNames: _*)
Looks like a lot of the column names here are hard-coded. Will there be any instances where this doesn't work with a given input df or setting of the params?
The hard-coded part is the "_list" suffix in "{name}_list": I use collect_list to aggregate values from different rows into a single row holding a list of values. Only that suffix is hard-coded; those columns are dropped at the end and the values are mapped back to the original dataframe.
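For illustration, a minimal standalone sketch (toy data and column names, not the PR's code) of what collect_list over a trailing window produces:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, lit}

// Minimal sketch: collect_list over a trailing window yields a "<name>_list" column
// holding the last few rows' values for each input column.
object CollectListSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
  import spark.implicits._

  val toy = Seq(("t1", 1.0), ("t2", 2.0), ("t3", 3.0)).toDF("timestamp", "sensor")
  // Single partition, ordered by timestamp, trailing window of the current row plus 2 preceding rows.
  val window = Window.partitionBy(lit(0)).orderBy("timestamp").rowsBetween(-2, 0)

  val withLists = Seq("timestamp", "sensor").foldLeft(toy) { (acc, name) =>
    acc.withColumn(s"${name}_list", collect_list(name).over(window))
  }
  withLists.show(truncate = false)
  // The *_list columns are intermediate only; the transformer drops them after mapping results back.
}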
override protected def getInternalTransformer(schema: StructType): PipelineModel = {
  val dynamicParamColName = DatasetExtensions.findUnusedColumnName("dynamic", schema)
  val lambda = Lambda(_.withColumn(dynamicParamColName, struct(
    s"${getTimestampCol}_list", getInputVariablesCols.map(name => s"${name}_list"): _*)))
likewise here
lazy val df: DataFrame = spark.read.format("csv")
  .option("header", "true").schema(fileSchema).load(fileLocation)
Does this not infer the proper schema here?
Yes... I don't know why, but it seems to fail to infer the schema :(
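For context, a hedged sketch of the two alternatives being discussed; spark and fileLocation come from the surrounding test class, and the fields in exampleSchema are illustrative, not the test's actual schema:

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Alternative raised by the reviewer: let Spark infer column types
// (costs an extra pass over the CSV, and here it apparently mis-infers the types).
val inferred = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(fileLocation)

// What the test does instead: declare the schema up front (field names here are illustrative).
val exampleSchema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("feature_0", DoubleType, nullable = true),
  StructField("feature_1", DoubleType, nullable = true)))
val explicit = spark.read.format("csv")
  .option("header", "true")
  .schema(exampleSchema)
  .load(fileLocation)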
val hc: Configuration = spark.sparkContext.hadoopConfiguration
hc.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hc.set(s"fs.azure.account.keyprovider.$storageAccount.blob.core.windows.net",
  "org.apache.hadoop.fs.azure.SimpleKeyProvider")
hc.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey)
Make this lazy, then print it in a test if you need to trigger it.
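One possible shape for that suggestion, a sketch only (the test name is illustrative; spark, storageAccount, and storageKey come from the test class): wrap the configuration in a lazy val and reference it from a test so the block only runs when triggered:

import org.apache.hadoop.conf.Configuration

// Sketch of the reviewer's suggestion: defer the Hadoop config mutation until something uses it.
lazy val hadoopConf: Configuration = {
  val hc = spark.sparkContext.hadoopConfiguration
  hc.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
  hc.set(s"fs.azure.account.keyprovider.$storageAccount.blob.core.windows.net",
    "org.apache.hadoop.fs.azure.SimpleKeyProvider")
  hc.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey)
  hc
}

test("hadoop configuration is set") { // illustrative test name
  // Reading the lazy val here evaluates the block above exactly once.
  println(hadoopConf.get("fs.azure"))
}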
val sfma: SimpleFitMultivariateAnomaly = new SimpleFitMultivariateAnomaly()
  .setSubscriptionKey(anomalyKey)
  .setLocation(anomalyLocation)
  .setOutputCol("result")
  .setStartTime(startTime)
  .setEndTime(endTime)
  .setIntermediateSaveDir(intermediateSaveDir)
  .setTimestampCol(timestampColumn)
  .setInputCols(inputColumns)
  .setSlidingWindow(50)

val model: SimpleDetectMultivariateAnomaly = sfma.fit(df)
val modelId: String = model.getModelId

MADUtils.CreatedModels += modelId
Make this stuff lazy, e.g. add:
lazy val modelId = {
  val model = sfma.fit(df)
  MADUtils.CreatedModels += model.getModelId
  model.getModelId
}
test("Error if batch size is smaller than sliding window") { | ||
val result = dlma.setBatchSize(10).transform(df.limit(50)) | ||
result.show(50, truncate = false) | ||
assert(result.collect().head.getAs[StringType](dlma.getErrorCol).toString.contains("NotEnoughData")) | ||
} |
Can you provide more details on why this is a failure? I'm a little confused about what's going on here.
Because the number of data points used for inference must be larger than the sliding window. Say your sliding window is 30; then you need at least the first 30 data points to infer the 31st data point. In this test case, batchSize is 10, but the model sfma has a sliding window of 50 (.setSlidingWindow(50)), so it couldn't infer any data points from only 10 data points.
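Put concretely (illustrative values only, not part of the PR's tests): with a sliding window of 50, each inference batch needs more than 50 trailing points before anything can be scored:

// Illustrative only: sfma was fit with .setSlidingWindow(50).
// A batch of 10 rows cannot cover the 50-point window, so the error column reports NotEnoughData.
val failing = dlma.setBatchSize(10).transform(df.limit(50))

// With a batch large enough to cover the sliding window, anomalies can be returned.
val passing = dlma.setBatchSize(100).transform(df.limit(200))
passing.select("isAnomaly", "DetectDataTimestamp").show(5, truncate = false)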
Lovely work! A few open questions for you
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Related Issues/PRs
#xxx
What changes are proposed in this pull request?
Briefly describe the changes included in this Pull Request.
How is this patch tested?
Does this PR change any dependencies?
Does this PR add a new feature? If so, have you added samples on website?
- Add the sample under the website/docs/documentation folder.
- Make sure you choose the correct class (estimators/transformers) and namespace.
- Make sure the DocTable points to the correct API link.
- Run yarn run start to make sure the website renders correctly.
- Add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
- Make sure the WebsiteSamplesTests job passes in the pipeline.