Spark logistic regression (for comparison) #20

Open
szilard opened this issue May 10, 2019 · 3 comments

szilard commented May 10, 2019

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Load the one-hot encoded train/test sets and mark them for caching
val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
// count() forces both datasets to be read and cached before timing starts
(d_train.count(), d_test.count())

// Logistic regression with default settings
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// Time the training step only (seconds)
val now = System.nanoTime
val model = pipeline.fit(d_train)
val elapsed = ( System.nanoTime - now )/1e9
elapsed

val predictions = model.transform(d_test)

// AUC on the test set
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("probability").setMetricName("areaUnderROC")
evaluator.evaluate(predictions)
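
The `System.nanoTime` bookkeeping above can be wrapped in a small helper so the same pattern can time other steps. This is only a sketch in plain Scala: `timed` is a hypothetical helper name, not part of Spark, and the summation is a stand-in workload so the snippet runs without a cluster.

```scala
// Hypothetical helper (not part of Spark): runs a block once and returns
// its result together with the elapsed wall-clock time in seconds.
def timed[A](block: => A): (A, Double) = {
  val start = System.nanoTime
  val result = block
  val seconds = (System.nanoTime - start) / 1e9
  (result, seconds)
}

// Stand-in workload; in the benchmark this would be pipeline.fit(d_train).
val (total, seconds) = timed { (1L to 1000000L).sum }
println(f"sum = $total, elapsed = $seconds%.3f s")
```

In spark-shell, `val (model, elapsed) = timed { pipeline.fit(d_train) }` should measure the same interval as the manual version.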

szilard commented May 10, 2019

Instance: r4.8xlarge

10M:
data RAM 10G
training time 20 sec
AUC 0.7093
total RAM 22G

100M:
data RAM 60G
training time 155 sec
AUC 0.7093
total RAM 110G


szilard commented May 10, 2019

Compare to h2o (R):

library(h2o)
h2o.init()

dx_train <- h2o.importFile("train-10m.csv")
dx_test <- h2o.importFile("test.csv")

Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]

system.time({
  md <- h2o.glm(x = Xnames, y = "dep_delayed_15min", training_frame = dx_train, family = "binomial")
})

h2o.auc(h2o.performance(md, dx_test))

10M:
data RAM 4G
training time 6 sec
AUC 0.7081992
total RAM 6G

100M (the 10M training set stacked 10 times):

dx_train0 <- h2o.importFile("train-10m.csv")
dx_train <- h2o.rbind(dx_train0, dx_train0, dx_train0, dx_train0, dx_train0, dx_train0, dx_train0, dx_train0, dx_train0, dx_train0)

data RAM 6G
training time 36 sec
AUC 0.7081992
total RAM 11G


szilard commented May 10, 2019

| | 10M Spark | 10M h2o | 100M Spark | 100M h2o |
|---|---|---|---|---|
| time [s] | 20 | 6 | 155 | 36 |
| AUC | 0.709 | 0.708 | 0.709 | 0.708 |
| data RAM [GB] | 10 | 4 | 60 | 6 |
| data+train RAM [GB] | 22 | 6 | 110 | 11 |
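
To make the gap easier to read, the Spark-to-h2o ratios can be computed from the figures in the summary table above. The following is a minimal sketch in plain Scala (run as a script or pasted into a REPL). The `Run` case class and its field names are made up for illustration; the numbers are copied from the table.

```scala
// Figures copied from the summary table; Run is a hypothetical record type.
final case class Run(size: String, tool: String, timeSec: Double, totalRamGb: Double)

val runs = Seq(
  Run("10M", "Spark", 20, 22),
  Run("10M", "h2o", 6, 6),
  Run("100M", "Spark", 155, 110),
  Run("100M", "h2o", 36, 11)
)

// For each data size, divide the Spark figure by the h2o figure.
val ratios: Map[String, (Double, Double)] =
  runs.groupBy(_.size).map { case (size, rs) =>
    val spark = rs.find(_.tool == "Spark").get
    val h2o = rs.find(_.tool == "h2o").get
    (size, (spark.timeSec / h2o.timeSec, spark.totalRamGb / h2o.totalRamGb))
  }

for (size <- Seq("10M", "100M")) {
  val (timeRatio, ramRatio) = ratios(size)
  println(f"$size%-5s time ratio $timeRatio%.1f  RAM ratio $ramRatio%.1f")
}
```

On these numbers Spark takes roughly 3.3x (10M) and 4.3x (100M) the h2o training time, and about 3.7x and 10x the total RAM, at essentially the same AUC.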
