feat: add streaming API for MVAD #1893

Merged: 12 commits, Apr 21, 2023
Conversation

serena-ruan
Contributor

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Briefly describe the changes included in this Pull Request.

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following the steps below.
  1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder.
    Make sure you choose the correct class (estimators/transformers) and namespace.
  2. Follow the pattern in the markdown file and add another section for your new API, including PySpark, Scala (and potentially .NET) samples.
  3. Make sure the DocTable points to the correct API link.
  4. Navigate to the website folder and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
  6. Make sure the WebsiteSamplesTests job passes in the pipeline.

@github-actions

Hey @serena-ruan 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Mar 28, 2023

Codecov Report

Merging #1893 (685f0e6) into master (87d5bc5) will increase coverage by 0.00%.
The diff coverage is 93.05%.

@@           Coverage Diff           @@
##           master    #1893   +/-   ##
=======================================
  Coverage   85.83%   85.83%           
=======================================
  Files         301      301           
  Lines       15622    15677   +55     
  Branches      813      815    +2     
=======================================
+ Hits        13409    13457   +48     
- Misses       2213     2220    +7     
Impacted Files Coverage Δ
...gnitive/anomaly/MultivariateAnomalyDetection.scala 90.34% <92.85%> (+1.16%) ⬆️
...e/anomaly/MultivariateAnomalyDetectorSchemas.scala 59.25% <100.00%> (+3.25%) ⬆️

... and 3 files with indirect coverage changes


@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

MADUtils.checkModelStatus(getUrl, getModelId, getSubscriptionKey)

val convertTimeFormatUdf = UDFUtils.oldUdf(
  { value: String => convertTimeFormat("Timestamp column", value) },
Collaborator

What is the purpose of "Timestamp column" here? It doesn't look like it's necessary.

Comment on lines +705 to +722
val window = Window.partitionBy("group").rowsBetween(-getBatchSize, 0)
var collectedDF = formattedDF
var columnNames = Array(getTimestampCol) ++ getInputVariablesCols
for (columnName <- columnNames) {
  collectedDF = collectedDF.withColumn(s"${columnName}_list", collect_list(columnName).over(window))
}
collectedDF = collectedDF.drop("group")
columnNames = columnNames.map(name => s"${name}_list")

val testDF = getInternalTransformer(collectedDF.schema).transform(collectedDF)

testDF
  .withColumn("isAnomaly", when(col(getOutputCol).isNotNull,
    col(s"$getOutputCol.results.value.isAnomaly")(0)).otherwise(null))
  .withColumn("DetectDataTimestamp", when(col(getOutputCol).isNotNull,
    col(s"$getOutputCol.results.timestamp")(0)).otherwise(null))
  .drop(columnNames: _*)

Collaborator

Looks like a lot of the column names here are hard-coded. Will there be any instances where this doesn't work with a given input df or setting of the params?

Contributor Author

The hard-coded part is the "{name}_list" suffix: I use collect_list to aggregate values from different rows into a single row holding a list of values. Only the '_list' suffix is hard-coded; those columns are dropped at the end, and the values are mapped back to the original dataframe.
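
For illustration, here is a minimal, self-contained sketch of that collect_list pattern; the group key, column names, and batch size are made up for the example and are not the exact ones from this PR:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical input: one reading per row.
val readings = Seq(
  (0, "2023-01-01T00:00:00Z", 1.0),
  (0, "2023-01-01T00:01:00Z", 2.0),
  (0, "2023-01-01T00:02:00Z", 3.0)
).toDF("group", "timestamp", "feature")

// Collect the trailing batch of each column into a "<name>_list" column,
// then drop the helper columns once their values have been consumed.
val batchSize = 2
val window = Window.partitionBy("group").orderBy("timestamp").rowsBetween(-batchSize, 0)
val withLists = Seq("timestamp", "feature").foldLeft(readings) { (acc, name) =>
  acc.withColumn(s"${name}_list", collect_list(name).over(window))
}
withLists.show(truncate = false)
withLists.drop("timestamp_list", "feature_list").show(truncate = false)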

override protected def getInternalTransformer(schema: StructType): PipelineModel = {
  val dynamicParamColName = DatasetExtensions.findUnusedColumnName("dynamic", schema)
  val lambda = Lambda(_.withColumn(dynamicParamColName, struct(
    s"${getTimestampCol}_list", getInputVariablesCols.map(name => s"${name}_list"): _*)))
Collaborator

Likewise here.

Comment on lines +60 to +61
lazy val df: DataFrame = spark.read.format("csv")
.option("header", "true").schema(fileSchema).load(fileLocation)
Collaborator

Does this not infer the proper schema here?

Contributor Author

Yes... I don't know why, but it seems to fail to infer the schema :(
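
For reference, a minimal sketch of pinning the schema explicitly instead of relying on inference; the column names and types below are placeholders, not the actual fileSchema from the test:

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical explicit schema; since CSV inference reportedly comes back wrong here,
// the types are pinned up front.
val fileSchema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("feature0", DoubleType, nullable = true),
  StructField("feature1", DoubleType, nullable = true)))

// The inference-based alternative that apparently fails in this case would be:
//   spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileLocation)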

Comment on lines 277 to 281
val hc: Configuration = spark.sparkContext.hadoopConfiguration
hc.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hc.set(s"fs.azure.account.keyprovider.$storageAccount.blob.core.windows.net",
"org.apache.hadoop.fs.azure.SimpleKeyProvider")
hc.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey)
Collaborator

Make this lazy, then print it in the test where you need to trigger it.
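
One possible shape for that change, reusing spark, storageAccount, and storageKey from the surrounding test class (hadoopConf is a hypothetical name, not part of the PR):

import org.apache.hadoop.conf.Configuration

// Lazily configure the Azure blob filesystem; nothing runs until a test
// actually touches hadoopConf (for example by printing it).
lazy val hadoopConf: Configuration = {
  val hc = spark.sparkContext.hadoopConfiguration
  hc.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
  hc.set(s"fs.azure.account.keyprovider.$storageAccount.blob.core.windows.net",
    "org.apache.hadoop.fs.azure.SimpleKeyProvider")
  hc.set(s"fs.azure.account.key.$storageAccount.blob.core.windows.net", storageKey)
  hc
}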

Comment on lines 283 to 297
val sfma: SimpleFitMultivariateAnomaly = new SimpleFitMultivariateAnomaly()
  .setSubscriptionKey(anomalyKey)
  .setLocation(anomalyLocation)
  .setOutputCol("result")
  .setStartTime(startTime)
  .setEndTime(endTime)
  .setIntermediateSaveDir(intermediateSaveDir)
  .setTimestampCol(timestampColumn)
  .setInputCols(inputColumns)
  .setSlidingWindow(50)

val model: SimpleDetectMultivariateAnomaly = sfma.fit(df)
val modelId: String = model.getModelId

MADUtils.CreatedModels += modelId
Collaborator

@mhamilton723 mhamilton723 Apr 12, 2023

Make this stuff lazy, e.g. add the following:

lazy val modelId = {
  val model = sfma.fit(df)
  MADUtils.CreatedModels += model.getModelId
  model.getModelId
}

Comment on lines +316 to +320
test("Error if batch size is smaller than sliding window") {
val result = dlma.setBatchSize(10).transform(df.limit(50))
result.show(50, truncate = false)
assert(result.collect().head.getAs[StringType](dlma.getErrorCol).toString.contains("NotEnoughData"))
}
Collaborator

Can you provide more details on why this is a failure? I'm a little confused about what's going on here.

Contributor Author

Because the number of data points available for inference has to be larger than the sliding window. Say your sliding window is 30: then the first 30 data points are only enough to infer the 31st. In this test case batchSize is 10, but the model sfma has a sliding window of 50 (.setSlidingWindow(50)), so with only 10 data points it can't infer anything.
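
As a concrete illustration, a configuration like the following would satisfy the constraint; the numbers are illustrative, and dlma and df are the objects from the test above:

// Hypothetical passing case: the batch (60 rows) is larger than the sliding
// window (50), so there is enough history plus points left to score.
val enoughData = dlma.setBatchSize(60).transform(df.limit(100))
enoughData.show(5, truncate = false)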

Collaborator

@mhamilton723 mhamilton723 left a comment

Lovely work! A few open questions for you

@serena-ruan
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit 0d0d10c into microsoft:master Apr 21, 2023