Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

# Brief History

Spark was originally developed at the University of California, Berkeley's AMPLab starting in 2009. In 2013, the Spark codebase was donated to the Apache Software Foundation, which has maintained it since.

# Overview

Spark's architecture is based on the concept of the resilient distributed dataset (RDD), an immutable, distributed collection of objects that can be operated on in parallel across a cluster. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API.

In Spark 1.x, the RDD was the primary application programming interface (API). As of Spark 2.x, use of the Dataset API is encouraged, even though the RDD API is not deprecated; the RDD technology still underlies the Dataset API.
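
To make the contrast concrete, here is a minimal PySpark sketch (the data and names are invented for illustration; the typed Dataset API exists only in Scala and Java, so the Python example shows the RDD and DataFrame sides). The same filter is expressed once against an RDD and once against a DataFrame, whose schema lets Spark optimize the query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: a low-level, distributed collection of plain Python objects.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
adults_rdd = rdd.filter(lambda row: row[1] >= 40)

# DataFrame: the same data with a schema, so Spark can plan and optimize.
df = spark.createDataFrame(rdd, schema=["name", "age"])
adults_df = df.filter(df.age >= 40)

print(adults_rdd.collect())
adults_df.show()
```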

RDDs were developed in 2012 in response to limitations in the MapReduce cluster-computing paradigm, which forces distributed programs into a rigid map-then-reduce structure and writes intermediate results to disk. RDDs instead allow working data to be kept in memory across operations, which greatly speeds up iterative algorithms and interactive analysis.

Inside Apache Spark, the workflow is managed as a directed acyclic graph (DAG): nodes represent RDDs, while edges represent the operations applied to them.
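
A minimal PySpark sketch of that idea (the numbers are arbitrary): transformations such as map and filter only extend the DAG, and nothing runs until an action such as collect is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-dag").getOrCreate()
sc = spark.sparkContext

# Transformations only describe the DAG; nothing executes yet.
numbers = sc.parallelize(range(10))              # source RDD (node)
squares = numbers.map(lambda x: x * x)           # edge: map
evens = squares.filter(lambda x: x % 2 == 0)     # edge: filter

# An action forces Spark to schedule the DAG into stages and tasks.
print(evens.collect())

# toDebugString() shows the lineage (the DAG) Spark has recorded.
print(evens.toDebugString())
```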

# Spark Core

# Outline

The content below is organized according to the Databricks Certified Associate Developer for Apache Spark certification, along with the Apache Spark documentation.

Furthermore, I will be using the following courses, which Databricks provides for free:

  1. Introduction to Apache Spark™
  2. Developing Applications with Apache Spark™
  3. Stream Processing and Analysis with Apache Spark™
  4. Monitoring and Optimizing Apache Spark™ Workloads on Databricks

| Section | Topic | Key Concepts & Components |
| --- | --- | --- |
| 1. Architecture | Core Infrastructure | Driver: Orchestrates execution. Worker/Executors: Run tasks. Cluster Manager: Allocates resources. |
| | Data Structures | RDD: Low-level API. DataFrame/Dataset: High-level structured APIs. |
| | Execution | Lazy Evaluation: Plan created, execution delayed until an Action. Hierarchy: Application → Job → Stage → Task. |
| | Memory & Storage | Caching/Persistence: Storing data in memory/disk. Storage Levels: MEMORY_ONLY, DISK_ONLY, etc. |
| 2. Spark SQL | Data Sources | Reading/Writing: JDBC, CSV, JSON, ORC, Parquet, and Delta Lake. |
| | Save Modes | append, overwrite, errorIfExists, ignore. |
| | Views & Tables | createOrReplaceTempView for SQL access; saving to persistent tables with partitionBy. |
| 3. DataFrame API | Basic Operations | select, filter, withColumn, drop, rename, explode. |
| | Aggregations | groupBy, count, approx_count_distinct, mean, summary. |
| | Joins | Inner, Left, Cross, Union/Union All, and Broadcast Joins (small table in memory). |
| | Functions | UDFs: Custom logic. Date/Time: Converting Unix epoch, extracting year/month. |
| | Shared Variables | Broadcast Variables: Read-only data on all nodes. Accumulators: Write-only counters. |
| 4. Tuning | Partitioning | Coalesce: Decreasing partitions. Repartition: Increasing/reshuffling partitions. |
| | Optimization | AQE (Adaptive Query Execution): Adjusts plans at runtime. Data Skew: Handling uneven data distribution. |
| | Monitoring | Analyzing Driver/Executor logs for OOM (Out of Memory) errors and resource usage. |
| 5. Streaming | Logic | Micro-batching: Processing data in small increments. Exactly-once: Guarantees no data loss/duplication. |
| | State Management | Watermarking: Handling late data. Deduplication: Removing duplicates in the stream. |
| | Output Modes | Append, Complete, Update. |
| 6. Deployment | Spark Connect | Decouples client applications from the Spark server for better stability and remote access. |
| | Modes | Local: Single machine. Client: Driver on the local machine. Cluster: Driver on the cluster. |
| 7. Pandas API | Scaling Pandas | Run pandas code on Spark distributed clusters using pyspark.pandas. |
| | Pandas UDFs | Vectorized UDFs: Higher performance using Apache Arrow for data transfer. |
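
To make the outline above concrete, the sketches below walk through the sections in order; all data, paths, and names in them are illustrative assumptions, not prescribed examples. Starting with the caching and storage-level rows of Section 1: cache() uses a default storage level, while persist() lets you choose one explicitly.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

df = spark.range(1_000_000)

# persist() with an explicit storage level, e.g. DISK_ONLY or MEMORY_ONLY.
df.persist(StorageLevel.DISK_ONLY)
df.count()        # the first action materializes the cached data
df.unpersist()    # release the storage once it is no longer needed
```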
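
For Section 2, a sketch of reading a data source, writing with an explicit save mode and partitioning, and exposing a DataFrame to SQL via a temporary view. The path and the category column are placeholders for whatever the input actually contains.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasources-sketch").getOrCreate()

# Read a CSV file with a header row, letting Spark infer the schema.
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/tmp/input/orders.csv")
)

# Write Parquet with an explicit save mode and partition layout; the other
# supported modes are append, ignore, and errorIfExists.
(
    orders.write
    .mode("overwrite")
    .partitionBy("category")
    .parquet("/tmp/output/orders")
)

# A temporary view makes the DataFrame queryable from SQL.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT category, COUNT(*) AS n FROM orders GROUP BY category").show()
```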
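
For Section 3, a sketch combining basic operations, a Unix-epoch conversion, a broadcast join against a small lookup table, and a grouped aggregation. The order and category data are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-api-sketch").getOrCreate()

# Hypothetical order data, used only for illustration.
orders = spark.createDataFrame(
    [(1, "books", 12.50, 1704412800),
     (2, "books", 7.00, 1704499200),
     (3, "games", 59.99, 1704499200)],
    ["order_id", "category", "amount", "unix_ts"],
)
categories = spark.createDataFrame(
    [("books", "media"), ("games", "entertainment")],
    ["category", "department"],
)

# Basic operations plus date/time handling (converting a Unix epoch).
enriched = (
    orders
    .withColumn("order_ts", F.from_unixtime("unix_ts").cast("timestamp"))
    .withColumn("year", F.year("order_ts"))
    .withColumnRenamed("amount", "revenue")
    .filter(F.col("revenue") > 5.0)
    .drop("unix_ts")
)

# A broadcast join ships the small lookup table to every executor.
joined = enriched.join(F.broadcast(categories), on="category", how="inner")

# Aggregations: groupBy with count and mean.
joined.groupBy("department").agg(
    F.count("*").alias("orders"),
    F.mean("revenue").alias("avg_revenue"),
).show()
```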
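
For Section 4, a sketch of the partitioning and AQE rows: AQE is enabled through configuration, repartition() performs a full shuffle, and coalesce() only merges existing partitions. The sizes and partition counts are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# AQE re-optimizes plans at runtime, e.g. coalescing shuffle partitions
# and mitigating skewed joins.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# repartition() shuffles and can increase the number of partitions;
# coalesce() cannot increase them, but avoids a full shuffle when reducing.
shuffled = df.repartition(200, "key")
merged = shuffled.coalesce(10)

print(shuffled.rdd.getNumPartitions(), merged.rdd.getNumPartitions())
```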
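
For Section 5, a sketch of a micro-batch streaming query with a watermark, a windowed aggregation, and the update output mode. It uses the built-in rate source so it runs without external infrastructure; the window and watermark durations are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits rows with a timestamp and a value.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The watermark tells Spark how late data may arrive (here up to 1 minute),
# so state for old windows can eventually be dropped.
counts = (
    events
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .count()
)

# "update" emits only the rows changed by each micro-batch; "append" and
# "complete" are the other output modes. Stream deduplication would be
# expressed with events.dropDuplicates(["value", "timestamp"]).
query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```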
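
For Section 6, a sketch of a Spark Connect client session (available from Spark 3.4 onward). It assumes a Spark Connect server is already running; the URL is a placeholder for that server's address.

```python
from pyspark.sql import SparkSession

# The client builds a remote session instead of embedding a driver;
# "sc://localhost:15002" points at a separately started Connect server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.range(5).show()
spark.stop()
```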
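
Finally, for Section 7, a sketch of the pandas API on Spark and a vectorized (pandas) UDF; the temperature conversion is just a stand-in for any per-batch computation.

```python
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-on-spark").getOrCreate()

# pandas API on Spark: familiar pandas syntax, distributed execution.
psdf = ps.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
print(psdf.mean())

# A vectorized (pandas) UDF receives whole pandas Series batches via Apache
# Arrow, avoiding the per-row overhead of ordinary Python UDFs.
@pandas_udf(DoubleType())
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

sdf = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])
sdf.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()
```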