Project 4 for Topicos Especiales de Telematica, by Pablo Cano
Spark text clustering with k-means
This project requires a Spark and Hadoop Distributed File System (HDFS) environment to run.
Choose text collection
documents = sc.wholeTextFiles("hdfs:///distributed/file/directory")
Edit this path so that it points to the files to be clustered; they are loaded into the documents variable.
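For reference, below is a minimal sketch of the kind of pipeline sKmeans.py is assumed to implement: load the documents, build TF-IDF vectors with MLlib, cluster them with k-means, and print each file's cluster. The path, the whitespace tokenizer, and k=4 are illustrative placeholders, not values taken from the project.

# Sketch only, not the actual sKmeans.py: TF-IDF + k-means over whole text files
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="sKmeans")

# (filename, content) pairs for every file in the directory (placeholder path)
documents = sc.wholeTextFiles("hdfs:///distributed/file/directory")

# Split each document into lowercase tokens (simplistic tokenizer, assumption)
tokens = documents.map(lambda doc: doc[1].lower().split())

# Hash term frequencies into fixed-size sparse vectors and weight them by IDF
tf = HashingTF().transform(tokens)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Cluster the TF-IDF vectors; k=4 is an arbitrary example value
model = KMeans.train(tfidf, k=4, maxIterations=20)

# Print the cluster assigned to each file
names = documents.map(lambda doc: doc[0])
for name, cluster in zip(names.collect(), model.predict(tfidf).collect()):
    print(name, cluster)

sc.stop()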
To run the application, enter one of the following commands:
To run on client
$ spark-submit --master yarn --deploy-mode client sKmeans.py
To run on cluster
$ spark-submit --master yarn --deploy-mode cluster --executor-memory 4G --num-executors 4 sKmeans.py
Note: if the job requires a lot of resources, run it in cluster mode.
If the application was run in cluster mode, the driver output is not printed to the console, so the results have to be checked in the YARN logs. To read the logs, enter the following command:
$ yarn logs -applicationId <APP_ID>
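If the application ID is not known, it can be looked up with the YARN CLI; by default only running applications are listed, so pass -appStates ALL to include finished ones:
$ yarn application -list -appStates ALL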