From b5dddda0d3dbeba0f12b2baee8f39c9f4fe73f82 Mon Sep 17 00:00:00 2001
From: Wil Roberts <47739563+robertswh@users.noreply.github.com>
Date: Wed, 13 Dec 2023 10:51:44 +0000
Subject: [PATCH] new local session section in Spark session guide (#106)

* new local session section, first draft

* taking on review comments

* fixed link
---
 .../spark-overview/example-spark-sessions.md | 52 ++++++++++++++++---
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/ons-spark/spark-overview/example-spark-sessions.md b/ons-spark/spark-overview/example-spark-sessions.md
index d78c4c7e..58cc9dda 100644
--- a/ons-spark/spark-overview/example-spark-sessions.md
+++ b/ons-spark/spark-overview/example-spark-sessions.md
@@ -1,20 +1,59 @@
 ## Example Spark Sessions
 
-This document gives some example Spark sessions. For more information on Spark sessions and why you need to be careful with memory usage, please consult the [Guidance on Spark Sessions](../spark-overview/spark-session-guidance) and [Configuration Hierarchy and `spark-defaults.conf`](../spark-overview/spark-defaults).
+This article gives some example Spark sessions, also referred to as Spark applications. For more information on Spark sessions and why you need to be careful with memory usage, please consult the [Guidance on Spark Sessions](../spark-overview/spark-session-guidance) and [Configuration Hierarchy and `spark-defaults.conf`](../spark-overview/spark-defaults).
 
-Remember to only use a Spark session for as long as you need. It's good etiquette to use [`spark.stop()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.stop.html) (for PySpark) or [`spark_disconnect(sc)`](https://spark.rstudio.com/packages/sparklyr/latest/reference/spark-connections.html) (for sparklyr) in your scripts. Stopping the CDSW or Jupyter Notebook session will also close the Spark session if one is running.
+Remember to only use a Spark session for as long as you need. It's good etiquette to use [`spark.stop()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.stop.html) (for PySpark) or [`spark_disconnect(sc)`](https://spark.rstudio.com/packages/sparklyr/latest/reference/spark-connections.html) (for sparklyr) in your scripts. Stopping the container or Jupyter Notebook session will also close the Spark session if one is running.
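+
+For example, at the end of a script (a minimal sketch, assuming a PySpark session named `spark` or a sparklyr connection named `sc`, as created in the examples below):
+
+````{tabs}
+```{code-tab} py
+# stop the Spark session once the work is finished
+spark.stop()
+```
+
+```{code-tab} r R
+# disconnect from Spark once the work is finished
+sparklyr::spark_disconnect(sc)
+```
+````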
 
-### Default/Blank Session
+### Local mode
 
-As a starting point you can create a Spark session with all the default options. This is the bare minimum you need to create a Spark session and will work fine for many DAP users.
+As mentioned at the top of the [Guidance on Spark Sessions](../spark-overview/spark-session-guidance) article, there are two modes of running a Spark application: local mode (this example) and cluster mode (all the other examples below). Local mode can be used when running a Spark application on a single computer or node.
 
-Please use this session by default or if unsure in any way about your resource requirements. 
+Details:
+- Utilises the resources of a single node or machine
+- This example uses 2 cores
+- Amount of memory available depends on the node or machine
+
+Use case:
+- Developing code using dummy or synthetic data, or a small sample of data
+- Writing unit tests
+
+Example of actual usage:
+- Pipeline development using dummy data
+- Throughout this book
+
+````{tabs}
+```{code-tab} py
+from pyspark.sql import SparkSession
 
-Note that for PySpark, `.config("spark.ui.showConsoleProgress", "false")` is still recommended for use with this session; this will stop the console progress in Spark, which sometimes obscures results from displaying properly.
+# local[2] runs Spark locally on 2 cores
+spark = (
+    SparkSession.builder.master("local[2]")
+    .appName("local_session")
+    .getOrCreate()
+)
+```
+
+```{code-tab} r R
+library(sparklyr)
+
+# master = "local[2]" runs Spark locally on 2 cores
+sc <- sparklyr::spark_connect(
+    master = "local[2]",
+    app_name = "local-session",
+    config = sparklyr::spark_config())
+```
+````
+
+Note that all dependencies must also be in place to run a Spark application on your laptop; see the Setting up resources section of the [Getting Started with Spark](../spark-overview/spark-start) article for further information.
+
+### Default Session
+
+As a starting point you can create a Spark session with all the default options. This is the bare minimum you need to create a Spark session and will work fine in most cases.
+
+Please use this session by default or if unsure in any way about your resource requirements.
 
 Details:
 - Will give you the default config options
+- Amount of resource depends on your specific platform
 
 Use case:
 - When unsure of your requirements
@@ -33,6 +72,7 @@ spark = (
     .getOrCreate()
 )
 ```
+
 ```{code-tab} r R
 library(sparklyr)