remove parameter K and set default K on pyspark
Signed-off-by: minmingzhu <minming.zhu@intel.com>
minmingzhu committed Jul 10, 2023
1 parent 2f1a63f commit 54872c3
Showing 4 changed files with 6 additions and 9 deletions.
9 changes: 4 additions & 5 deletions examples/python/pca-pyspark/pca-pyspark.py
@@ -27,21 +27,20 @@
         .appName("PCAExample")\
         .getOrCreate()

-    if (len(sys.argv) != 3) :
-        print("bin/spark-submit pca-pyspark.py <data_set.csv> <param_K>")
+    if (len(sys.argv) != 2) :
+        print("bin/spark-submit pca-pyspark.py <data_set.csv>")
         sys.exit(1)

     input = spark.read.load(sys.argv[1], format="csv", inferSchema="true", header="false")
-    K = int(sys.argv[2])

     assembler = VectorAssembler(
         inputCols=input.columns,
         outputCol="features")

     dataset = assembler.transform(input)
     dataset.show()

-    pca = PCA(k=K, inputCol="features", outputCol="pcaFeatures")
+    pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
     model = pca.fit(dataset)

     print("Principal Components: ", model.pc, sep='\n')
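The net effect of this commit on pca-pyspark.py can be sketched as the consolidated script below. This is an illustrative reconstruction, not the repository file verbatim: the `check_usage` helper and `DEFAULT_K` constant are names introduced here for clarity, and the Spark imports are deferred into `main` so the argument check is exercisable without a Spark installation.

```python
import sys

# This commit hard-codes the component count instead of reading it
# from the old <param_K> CLI argument.
DEFAULT_K = 3

def check_usage(argv):
    # After the commit the script takes exactly one argument: the CSV path.
    return len(argv) == 2

def main(argv):
    if not check_usage(argv):
        print("bin/spark-submit pca-pyspark.py <data_set.csv>")
        sys.exit(1)

    # Deferred imports: only needed once we actually run a Spark job.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import PCA, VectorAssembler

    spark = SparkSession.builder.appName("PCAExample").getOrCreate()
    data = spark.read.load(argv[1], format="csv",
                           inferSchema="true", header="false")

    # Pack all CSV columns into a single vector column for PCA.
    assembler = VectorAssembler(inputCols=data.columns, outputCol="features")
    dataset = assembler.transform(data)

    pca = PCA(k=DEFAULT_K, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(dataset)
    print("Principal Components: ", model.pc, sep="\n")
    spark.stop()
```

In the real example the logic runs at module level (or via a trailing `main(sys.argv)` call) under spark-submit. Note the trade-off this commit makes: with `k` fixed at 3, a dataset needing a different number of components now requires editing the script rather than passing an argument.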
3 changes: 1 addition & 2 deletions examples/python/pca-pyspark/run-cpu.sh
@@ -9,7 +9,6 @@ DATA_FILE=$HDFS_ROOT/data/pca_data.csv

 DEVICE=CPU
 APP_PY=pca-pyspark.py
-K=3

 time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
     --num-executors $SPARK_NUM_EXECUTORS \
@@ -27,5 +26,5 @@ time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
     --conf "spark.network.timeout=1200s" \
     --conf "spark.task.maxFailures=1" \
     --jars $OAP_MLLIB_JAR \
-    $APP_PY $DATA_FILE $K \
+    $APP_PY $DATA_FILE \
     2>&1 | tee PCA-$(date +%m%d_%H_%M_%S).log
1 change: 0 additions & 1 deletion examples/python/pca-pyspark/run-gpu.sh
@@ -14,7 +14,6 @@ EXECUTOR_GPU_AMOUNT=1
 TASK_GPU_AMOUNT=1
 APP_PY=pca-pyspark.py
-

 # Should run in standalone mode
 time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
     --num-executors $SPARK_NUM_EXECUTORS \
2 changes: 1 addition & 1 deletion examples/scala/pca-scala/run-cpu.sh
@@ -26,5 +26,5 @@ time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
     --conf "spark.task.maxFailures=1" \
     --jars $OAP_MLLIB_JAR \
     --class $APP_CLASS \
-    $APP_JAR $DATA_FILE $K \
+    $APP_JAR \
     2>&1 | tee PCA-$(date +%m%d_%H_%M_%S).log
