adaltas/ece-spark-2024-fall-gr03

Spark

Introduction

Apache Spark is an open-source framework for distributed processing of large datasets, covering both batch workloads and near-real-time analytics. In this course, you will learn how to use Spark to process large datasets and run complex data analyses.

Educational goals

  • Discover the main features of Apache Spark and why it is so widely adopted.
  • Understand the internals of Spark.
  • Learn to use Spark for batch and streaming data analytics.

Prerequisites

Python programming knowledge and basic knowledge of the Linux/Unix shell.

Modules

Module 1 - Introduction to Spark & RDDs

  • Presentation
  • Spark in Hadoop ecosystem
  • Use cases
  • Spark ecosystem
  • Internals
  • Data structures
  • Operations
  • Resilient Distributed Datasets (RDDs)
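
As a rough illustration of the "Operations" and "RDDs" topics above, the following plain-Python sketch mimics how transformations on a partitioned dataset are only recorded and then executed when an action is called. `LazyDataset` is a hypothetical toy class written for this README, not part of Spark's API; a real RDD also adds distribution across machines and fault tolerance, which this sketch does not attempt.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    Transformations (map, filter) only record a step and return a new
    object; actions (collect, reduce) actually run the recorded steps.
    """

    def __init__(self, partitions, steps=()):
        self._partitions = partitions  # list of lists, one per "partition"
        self._steps = steps            # recorded transformations, in order

    # Transformations: lazy, return a new dataset, leave this one unchanged.
    def map(self, fn):
        return LazyDataset(self._partitions, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._partitions, self._steps + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step to a single partition.
        items = iter(part)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation and return a plain Python value.
    def collect(self):
        return [x for part in self._partitions for x in self._run_partition(part)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


if __name__ == "__main__":
    ds = LazyDataset([[1, 2], [3, 4]]).map(lambda x: x * 2).filter(lambda x: x > 2)
    # Nothing has been computed yet; collect() is the action that runs the steps.
    print(ds.collect())
```

In PySpark the corresponding method names on an RDD are also `map`, `filter`, `collect`, and `reduce`, which is why the toy class borrows them.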

Module 2 - Spark SQL and DataFrames

  • RDDs: Pros and Cons
  • DataFrames
  • RDDs vs DataFrames
  • Working with DataFrames
  • Why SQL?

Module 3 - Spark Structured Streaming

  • Streaming introduction
  • Difference between batch and stream processing
  • Stream processing models
  • Different processing semantics
  • Programming model
  • Event-time vs. processing time
  • Windows: tumbling, overlapping
  • Handling late data and how long to wait
  • Vocabulary
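
To make the window and late-data topics above more concrete, here is a small plain-Python sketch of how an event time can be assigned to tumbling and overlapping (sliding) windows, and how a watermark decides when a window stops accepting late data. The function names are hypothetical and the watermark rule is a simplified model (maximum event time seen minus an allowed delay); it is meant as a study aid, not as Spark's implementation.

```python
from datetime import datetime, timedelta

# Windows are aligned to a fixed origin so that every event maps to the same grid.
EPOCH = datetime(1970, 1, 1)


def tumbling_window(event_time, size):
    """Return the single [start, end) window of length `size` containing event_time."""
    start = event_time - ((event_time - EPOCH) % size)
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every [start, end) window of length `size`, starting every `slide`,
    that contains event_time. With slide < size the windows overlap."""
    start = event_time - ((event_time - EPOCH) % slide)  # latest window start
    windows = []
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def window_is_closed(window_end, max_event_time_seen, allowed_delay):
    """Simplified watermark rule: once the watermark passes the end of a window,
    late events for that window are no longer accepted."""
    watermark = max_event_time_seen - allowed_delay
    return watermark > window_end


if __name__ == "__main__":
    event = datetime(2024, 10, 1, 12, 7)
    print(tumbling_window(event, timedelta(minutes=10)))
    print(sliding_windows(event, timedelta(minutes=10), timedelta(minutes=5)))
```

With a 10-minute window sliding every 5 minutes, each event belongs to two windows, which is the sense in which the windows are "overlapping". The allowed delay answers the "how long to wait" question: a longer delay accepts more late events but keeps window state around for longer.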

Resource

You can freely download the book used for this course:

Learning Spark, 2nd Edition
