Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support mixed model for RandomForestClassificationModel #200

Merged
merged 3 commits into from
Apr 12, 2023

Conversation

wbo4958
Copy link
Collaborator

@wbo4958 wbo4958 commented Apr 4, 2023

This PR implemented cpu() to convert Cuml RandomForestClassificationModel to Spark RandomForestClassificationModel, and a simple test to predict a single instance can pass. Still, this PR needs to be refactored.

For simplicity, This PR just returned the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but it increases the bandwidth when collecting from executor to driver which may cause perf issue. We need to find a way to combine them.

There's some difference between Cuml RandomForestClassificationModel and Spark RandomForestClassificationModel.

  1. Cuml RandomForestClassificationModel doesn't include GiniStat in the InternalNode
  2. Cuml RandomForestClassificationModel uses probability as the leaf value, while Spark RandomForestClassificationModel uses stats (how many instances each label contains), but we can convert the probs to the stats.
  3. Cuml RandomForestClassificationModel doesn't have impurity information, but we can calculate it according to the leaf value.

Right now, this PR only supports Gini impurity, I will file a followup to support entropy and variance.

@leewyang
Copy link
Collaborator

leewyang commented Apr 4, 2023

For simplicity, This PR just returned the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but it increases the bandwidth when collecting from executor to driver which may cause perf issue. We need to find a way to combine them.

Wonder if you can avoid serializing the model JSON string by just using treelite.dump_as_json() on the driver side, after collecting the trees from the executors. Not sure if that JSON format is compatible/sufficient, but it would avoid serializing the model essentially twice (in treelite and JSON formats).

from pyspark.ml.common import _py2java

example = _py2java(spark.sparkContext, Vectors.dense(1.0))
print(z.predict(example))
Copy link
Collaborator

@lijinf2 lijinf2 Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need assert statements to ensure the predicted results are the same as pyspark predictions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. The predictions between cuML and pyspark may be different. It's hard to compare the prediction results.

This PR only supports converting trees when impurity is gini.

Signed-off-by: Bobby Wang <wbo4958@gmail.com>
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Apr 11, 2023

build

@wbo4958 wbo4958 marked this pull request as ready for review April 11, 2023 07:58
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Apr 11, 2023

For simplicity, This PR just returned the model JSON string (for CPU inference) alongside the model bytes (for GPU inference), but it increases the bandwidth when collecting from executor to driver which may cause perf issue. We need to find a way to combine them.

Wonder if you can avoid serializing the model JSON string by just using treelite.dump_as_json() on the driver side, after collecting the trees from the executors. Not sure if that JSON format is compatible/sufficient, but it would avoid serializing the model essentially twice (in treelite and JSON formats).

Thx @leewyang, it seems that cuML model is not compatible with treelite model.

treelite_json = cu_rf.convert_to_treelite_model().dump_as_json()
AttributeError: 'cuml.fil.fil.TreeliteModel' object has no attribute 'dump_as_json'

@wbo4958 wbo4958 requested review from leewyang and eordentlich April 11, 2023 08:03
@leewyang
Copy link
Collaborator

leewyang commented Apr 11, 2023

Wonder if this might be useful: rapidsai/cuml#3853
Sounds like it'd require an extra serialize/deserialize to disk to get it to the actual treelite model class, but presumably this would only need to be done on the driver.

Otherwise, LGTM. (Also, pls remove "[DRAFT]" if ready for review).

"""
return self.cpu().predictProbability(value)

def evaluate(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider a gpu version of this where we have all the supporting metrics needed to construct the summary also implemented on gpu.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, Let me file a task for it.

@wbo4958 wbo4958 changed the title [Draft] Support mixed model for RandomForestClassificationModel Support mixed model for RandomForestClassificationModel Apr 11, 2023
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Apr 11, 2023

Wonder if this might be useful: rapidsai/cuml#3853 Sounds like it'd require an extra serialize/deserialize to disk to get it to the actual treelite model class, but presumably this would only need to be done on the driver.

Otherwise, LGTM. (Also, pls remove "[DRAFT]" if ready for review).

Really good, Seems we can just return the json strings of treelite in the executor side according to the example of that PR. Great. Thx, I will file a follow up for it.

leewyang
leewyang previously approved these changes Apr 11, 2023
@wbo4958
Copy link
Collaborator Author

wbo4958 commented Apr 12, 2023

build

@wbo4958
Copy link
Collaborator Author

wbo4958 commented Apr 12, 2023

build

@wbo4958 wbo4958 merged commit 34ce8bc into NVIDIA:branch-23.04 Apr 12, 2023
@wbo4958 wbo4958 deleted the rf-mixed-model branch April 12, 2023 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants