Unable to run saveNativeModel for VWRegressionModel #1364

Closed
arka-nitd opened this issue Jan 25, 2022 · 5 comments · Fixed by #1366

Comments

@arka-nitd

arka-nitd commented Jan 25, 2022

Environment
Databricks 10.1ML Runtime.

To Reproduce
I am trying to run the provided Linear Regression example:

from synapse.ml.vw import VowpalWabbitRegressor

# Load the triazines dataset (libsvm format) and split it into train and test sets.
triazines = spark.read.format("libsvm")\
    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight")
train, test = triazines.randomSplit([0.85, 0.15], seed=1)

# Train with quantile loss and quadratic interactions across all namespaces (-q ::).
model = (VowpalWabbitRegressor(numPasses=20, args="--holdout_off --loss_function quantile -q :: -l 0.1")
            .fit(train))

Now, when I try to save the model using .saveNativeModel:

model.saveNativeModel("dbfs:/mnt/analysis/arka/m4/testmodel")

I get the following error:

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2828)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2775)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2769)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2769)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1305)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1305)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1305)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3036)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2977)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2965)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1067)
	at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2477)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2460)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:274)
	... 36 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:396)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:284)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:91)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.NullPointerException
	at com.microsoft.azure.synapse.ml.io.binary.BinaryOutputWriter.write(BinaryFileFormat.scala:231)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:143)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:375)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1654)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:383)
	... 19 more

Saving with .save works, but the saved model is in a binary format that is not human-readable.
Directory structure:

/analysis/arka/m4/model
-> testmodel
- - metadata
- - complexParams
- - - model
- - - performanceStatistics

End Goal
To be able to save the model in a readable format that preserves the feature names along with their coefficients.

I was referring to PR #821, but the code that generated a readable model with feature names was removed in a later commit, and I am not sure why.
Also, model.getReadableModel() shows only the indexes and the coefficients. If the feature names can be derived when using VWFeaturizer, how can it be done? I was unable to find any examples.

Tried with the following environments and versions:
10.1ML + 0.9.4 & 0.9.5
9.1ML LTS + 0.9.4 & 0.9.5

So my questions are:

  1. How can I generate a readable version of the VWRegressionModel and dump it to Azure Blob (the location is mounted in dbfs)?
  2. How can I read the generated model back as a VWRegressionModel and run predictions later, for evaluation or sharing?
  3. When using model.printReadableModel(), the output has a huge number of hashes and their weights. Assuming one hash is created for each feature, how is this possible?

Thanks,
Arka
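
On question 3, one likely contributor (an assumption, not something confirmed in this thread) is the `-q ::` argument in the training call, which asks VW for quadratic interactions between all namespace pairs, so the model holds weights for feature pairs as well as for single features. The sketch below only does the arithmetic. The function name is hypothetical, the count of 60 input features is an assumed example value, and whether VW pairs a feature with itself or loses entries to hash collisions depends on the VW version and options, so treat the result as a rough upper bound.

```python
def count_weights_upper_bound(num_features, include_self_pairs=True, include_constant=True):
    """Rough upper bound on weights for linear plus quadratic terms in one namespace.

    Hypothetical helper for illustration; it does not call VW.
    """
    linear = num_features
    if include_self_pairs:
        # Unordered pairs, allowing a feature to pair with itself.
        pairs = num_features * (num_features + 1) // 2
    else:
        # Unordered pairs of two distinct features.
        pairs = num_features * (num_features - 1) // 2
    constant = 1 if include_constant else 0  # intercept term
    return linear + pairs + constant


# 60 is an assumed feature count, used only as an example.
print(count_weights_upper_bound(60))
```

Under these assumptions, a few dozen input features can already lead to well over a thousand weights.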

@arka-nitd
Author

@memoryz Thanks for the PR. Do you also have an example of how to dump the model in a readable state (feature names and their coefficients in a readable format) and deserialise it back to a VWRegressionModel for future predictions?

@memoryz
Contributor

memoryz commented Jan 28, 2022

@arka-nitd, the fix will allow you to save the native model as a binary file on your storage. To dump the model in readable state, does print(model.getReadableModel()) help?
For serialization and deserialization, can you just use the standard Spark ML pipeline interface?
Serialization:

model = ...
model.save(path)

Deserialization:

from synapse.ml.vw import VowpalWabbitRegressionModel
model = VowpalWabbitRegressionModel.load(path)

@arka-nitd
Author

@memoryz I tried print(model.getReadableModel()), but this returns the feature hashes and their coefficients. What I need is the feature names and their coefficients, similar to what the --invert_hash parameter would output.
What would be a way to achieve this? An example would really help.
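
As a starting point for working with the hash/coefficient output, here is a minimal sketch that turns readable-model text into a dictionary of hash index to weight. It assumes the weight lines have the form `<index>:<weight>` and that header lines do not start with a digit followed by a colon. The sample text is made up for illustration and the function name is hypothetical, so the pattern may need adjusting against the real output of model.getReadableModel().

```python
import re

# Assumed shape of a weight line: "<hash index>:<weight>", e.g. "116060:0.0123".
_WEIGHT_LINE = re.compile(r"^(\d+):(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$")


def parse_readable_model(text):
    """Return {hash_index: weight} for every line matching the assumed pattern."""
    weights = {}
    for line in text.splitlines():
        match = _WEIGHT_LINE.match(line.strip())
        if match:
            weights[int(match.group(1))] = float(match.group(2))
    return weights


# Made-up sample; real output will contain different header lines and values.
sample = """bits:18
:0
12:0.5
4096:-1.25e-2
"""
print(parse_readable_model(sample))
```

Lines that do not match, such as the header, are skipped rather than raising an error.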

@memoryz
Contributor

memoryz commented Feb 3, 2022

Sorry, I'm not familiar with the internals of VW. Can you post a new issue for your questions? Maybe @eisber can help answer.

@eisber
Collaborator

eisber commented Feb 3, 2022

@arka-nitd can you share a small repro of your current feature/training pipeline? Unfortunately, when using VWFeaturizer the reverse mapping is lost. Since this request comes up repeatedly, it might be a good new feature. @jackgerrits and I brainstormed a bit, but it's not straightforward. The VWFeaturizer already hashes the data, but without namespaces (at least from what I remember), so namespaces add complexity. Additionally, how the features are mapped to weights depends on the learning algorithm.
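
To illustrate the idea of keeping a reverse mapping, here is a minimal sketch that hashes feature names and records which names land on each index, so that weights keyed by index can be relabelled. Everything in it is an assumption for illustration: `toy_hash` uses CRC32 as a stand-in and is not the hash VW or VWFeaturizer uses, namespaces and interactions are ignored, and the function names are hypothetical. It shows the bookkeeping only, not a drop-in solution.

```python
import zlib


def toy_hash(name, num_bits=18):
    """Stand-in hash (CRC32 masked to num_bits); NOT the hash used by VW."""
    return zlib.crc32(name.encode("utf-8")) & ((1 << num_bits) - 1)


def build_reverse_map(feature_names, num_bits=18):
    """Map each hash index to the list of names that hash to it (collisions kept)."""
    reverse = {}
    for name in feature_names:
        reverse.setdefault(toy_hash(name, num_bits), []).append(name)
    return reverse


def name_weights(weights, reverse):
    """Relabel {index: weight} with feature names; unknown indexes get a placeholder."""
    named = {}
    for index, weight in weights.items():
        label = ",".join(reverse.get(index, [f"<unknown:{index}>"]))
        named[label] = weight
    return named


names = ["age", "height"]  # hypothetical feature names
reverse = build_reverse_map(names)
print(name_weights({toy_hash("age"): 0.5}, reverse))
```

Keeping a list per index makes hash collisions visible instead of silently overwriting one name with another.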
