About

Practice Course on Big Data

https://bigdatateam.org/big-data-course

Spark Configuration

Use the following comand to start interactive Jupyter PySpark session (set Python v.3.6 as the default version):

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_PYTHON=python3.6 PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0 --port=port_1' pyspark --conf spark.ui.port=port_2 --driver-memory 512m --master yarn --num-executors 2 --executor-cores 1

Add the following rule to ssh forwarding:

-L port_1:localhost:port_1

Open the following URL in you favourite browser:

http://localhost:port_1

Spark Streaming and Kafka will require to add extra flags:

--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0

see more details at:

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

Spark Cassandra will require two following flags:

--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2
--conf spark.cassandra.connection.host=brain-node1

Useful Spark documentation links:

PySpark: https://spark.apache.org/docs/2.4.7/api.html
Python API: https://spark.apache.org/docs/2.4.7/api/python/index.html
PySpark SQL API: https://spark.apache.org/docs/2.4.7/api/python/pyspark.sql.html

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
public_examples		public_examples
.gitignore		.gitignore
README.md		README.md
hadoop_dataset.txt		hadoop_dataset.txt
hdfs_quiz_advanced.yml		hdfs_quiz_advanced.yml
hdfs_quiz_beginner.yml		hdfs_quiz_beginner.yml
hdfs_quiz_intermediate.yml		hdfs_quiz_intermediate.yml
hdfs_quiz_template.yml		hdfs_quiz_template.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Spark Configuration

About

Releases

Packages

Languages

nonflame/big-data-course

Folders and files

Latest commit

History

Repository files navigation

About

Spark Configuration

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages