Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

# Brief History

Spark was originally developed at the University of California, Berkeley's AMPLab starting in 2009. In 2013, the Spark codebase was donated to the Apache Software Foundation, which has maintained it since.

# Overview

Spark's architecture is based on the concept of the resilient distributed dataset (RDD), an immutable, distributed collection of objects that can be operated on in parallel across a cluster. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API.

In Spark 1.x, the RDD was the primary application programming interface (API). As of Spark 2.x, use of the Dataset API is encouraged, even though the RDD API is not deprecated; the RDD technology still underlies the Dataset API.
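
To make the contrast concrete, here is a minimal PySpark sketch (the data and names are invented for illustration; the typed Dataset API exists only in Scala and Java, so the Python example shows the RDD and DataFrame sides). The same filter is expressed once against an RDD and once against a DataFrame, whose schema lets Spark optimize the query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: a low-level, distributed collection of plain Python objects.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
adults_rdd = rdd.filter(lambda row: row[1] >= 40)

# DataFrame: the same data with a schema, so Spark can plan and optimize.
df = spark.createDataFrame(rdd, schema=["name", "age"])
adults_df = df.filter(df.age >= 40)

print(adults_rdd.collect())
adults_df.show()
```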

RDDs were developed in 2012 in response to limitations in the MapReduce cluster-computing paradigm, which forces distributed programs into a rigid map-then-reduce structure and writes intermediate results to disk. RDDs instead allow working data to be kept in memory across operations, which greatly speeds up iterative algorithms and interactive analysis.

Inside Apache Spark, the workflow is managed as a directed acyclic graph (DAG): nodes represent RDDs, while edges represent the operations applied to them.
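
A minimal PySpark sketch of that idea (the numbers are arbitrary): transformations such as map and filter only extend the DAG, and nothing runs until an action such as collect is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-dag").getOrCreate()
sc = spark.sparkContext

# Transformations only describe the DAG; nothing executes yet.
numbers = sc.parallelize(range(10))              # source RDD (node)
squares = numbers.map(lambda x: x * x)           # edge: map
evens = squares.filter(lambda x: x % 2 == 0)     # edge: filter

# An action forces Spark to schedule the DAG into stages and tasks.
print(evens.collect())

# toDebugString() shows the lineage (the DAG) Spark has recorded.
print(evens.toDebugString())
```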

# Spark Core

# Outline

The content below is organized according to the Databricks Certified Associate Developer for Apache Spark certification, along with the Apache Spark documentation.

Furthermore, I will be using the following courses, which Databricks provides for free:

  1. Introduction to Apache Spark™
  2. Developing Applications with Apache Spark™
  3. Stream Processing and Analysis with Apache Spark™
  4. Monitoring and Optimizing Apache Spark™ Workloads on Databricks

| Section | Topic | Key Concepts & Components |
| --- | --- | --- |
| 1. Architecture | Core Infrastructure | Driver: Orchestrates execution. Worker/Executors: Run tasks. Cluster Manager: Allocates resources. |
| | Data Structures | RDD: Low-level API. DataFrame/Dataset: High-level structured APIs. |
| | Execution | Lazy Evaluation: Plan created, execution delayed until an Action. Hierarchy: Application → Job → Stage → Task. |
| | Memory & Storage | Caching/Persistence: Storing data in memory/disk. Storage Levels: MEMORY_ONLY, DISK_ONLY, etc. |
| 2. Spark SQL | Data Sources | Reading/Writing: JDBC, CSV, JSON, ORC, Parquet, and Delta Lake. |
| | Save Modes | append, overwrite, errorIfExists, ignore. |
| | Views & Tables | createOrReplaceTempView for SQL access; saving to persistent tables with partitionBy. |
| 3. DataFrame API | Basic Operations | select, filter, withColumn, drop, rename, explode. |
| | Aggregations | groupBy, count, approx_count_distinct, mean, summary. |
| | Joins | Inner, Left, Cross, Union/Union All, and Broadcast Joins (small table in memory). |
| | Functions | UDFs: Custom logic. Date/Time: Converting Unix epoch, extracting year/month. |
| | Shared Variables | Broadcast Variables: Read-only data on all nodes. Accumulators: Write-only counters. |
| 4. Tuning | Partitioning | Coalesce: Decreasing partitions. Repartition: Increasing/reshuffling partitions. |
| | Optimization | AQE (Adaptive Query Execution): Adjusts plans at runtime. Data Skew: Handling uneven data distribution. |
| | Monitoring | Analyzing Driver/Executor logs for OOM (Out of Memory) errors and resource usage. |
| 5. Streaming | Logic | Micro-batching: Processing data in small increments. Exactly-once: Guarantees no data loss/duplication. |
| | State Management | Watermarking: Handling late data. Deduplication: Removing duplicates in the stream. |
| | Output Modes | Append, Complete, Update. |
| 6. Deployment | Spark Connect | Decouples client applications from the Spark server for better stability and remote access. |
| | Modes | Local: Single machine. Client: Driver on the local machine. Cluster: Driver on the cluster. |
| 7. Pandas API | Scaling Pandas | Run pandas code on Spark distributed clusters using pyspark.pandas. |
| | Pandas UDFs | Vectorized UDFs: Higher performance using Apache Arrow for data transfer. |
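
To make the outline above concrete, the sketches below walk through the sections in order; all data, paths, and names in them are illustrative assumptions, not prescribed examples. Starting with the caching and storage-level rows of Section 1: cache() uses a default storage level, while persist() lets you choose one explicitly.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

df = spark.range(1_000_000)

# persist() with an explicit storage level, e.g. DISK_ONLY or MEMORY_ONLY.
df.persist(StorageLevel.DISK_ONLY)
df.count()        # the first action materializes the cached data
df.unpersist()    # release the storage once it is no longer needed
```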
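
For Section 2, a sketch of reading a data source, writing with an explicit save mode and partitioning, and exposing a DataFrame to SQL via a temporary view. The path and the category column are placeholders for whatever the input actually contains.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasources-sketch").getOrCreate()

# Read a CSV file with a header row, letting Spark infer the schema.
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/tmp/input/orders.csv")
)

# Write Parquet with an explicit save mode and partition layout; the other
# supported modes are append, ignore, and errorIfExists.
(
    orders.write
    .mode("overwrite")
    .partitionBy("category")
    .parquet("/tmp/output/orders")
)

# A temporary view makes the DataFrame queryable from SQL.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT category, COUNT(*) AS n FROM orders GROUP BY category").show()
```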
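
For Section 3, a sketch combining basic operations, a Unix-epoch conversion, a broadcast join against a small lookup table, and a grouped aggregation. The order and category data are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-api-sketch").getOrCreate()

# Hypothetical order data, used only for illustration.
orders = spark.createDataFrame(
    [(1, "books", 12.50, 1704412800),
     (2, "books", 7.00, 1704499200),
     (3, "games", 59.99, 1704499200)],
    ["order_id", "category", "amount", "unix_ts"],
)
categories = spark.createDataFrame(
    [("books", "media"), ("games", "entertainment")],
    ["category", "department"],
)

# Basic operations plus date/time handling (converting a Unix epoch).
enriched = (
    orders
    .withColumn("order_ts", F.from_unixtime("unix_ts").cast("timestamp"))
    .withColumn("year", F.year("order_ts"))
    .withColumnRenamed("amount", "revenue")
    .filter(F.col("revenue") > 5.0)
    .drop("unix_ts")
)

# A broadcast join ships the small lookup table to every executor.
joined = enriched.join(F.broadcast(categories), on="category", how="inner")

# Aggregations: groupBy with count and mean.
joined.groupBy("department").agg(
    F.count("*").alias("orders"),
    F.mean("revenue").alias("avg_revenue"),
).show()
```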
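
For Section 4, a sketch of the partitioning and AQE rows: AQE is enabled through configuration, repartition() performs a full shuffle, and coalesce() only merges existing partitions. The sizes and partition counts are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# AQE re-optimizes plans at runtime, e.g. coalescing shuffle partitions
# and mitigating skewed joins.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# repartition() shuffles and can increase the number of partitions;
# coalesce() cannot increase them, but avoids a full shuffle when reducing.
shuffled = df.repartition(200, "key")
merged = shuffled.coalesce(10)

print(shuffled.rdd.getNumPartitions(), merged.rdd.getNumPartitions())
```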
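
For Section 5, a sketch of a micro-batch streaming query with a watermark, a windowed aggregation, and the update output mode. It uses the built-in rate source so it runs without external infrastructure; the window and watermark durations are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits rows with a timestamp and a value.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The watermark tells Spark how late data may arrive (here up to 1 minute),
# so state for old windows can eventually be dropped.
counts = (
    events
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .count()
)

# "update" emits only the rows changed by each micro-batch; "append" and
# "complete" are the other output modes. Stream deduplication would be
# expressed with events.dropDuplicates(["value", "timestamp"]).
query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```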
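
For Section 6, a sketch of a Spark Connect client session (available from Spark 3.4 onward). It assumes a Spark Connect server is already running; the URL is a placeholder for that server's address.

```python
from pyspark.sql import SparkSession

# The client builds a remote session instead of embedding a driver;
# "sc://localhost:15002" points at a separately started Connect server.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.range(5).show()
spark.stop()
```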
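
Finally, for Section 7, a sketch of the pandas API on Spark and a vectorized (pandas) UDF; the temperature conversion is just a stand-in for any per-batch computation.

```python
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-on-spark").getOrCreate()

# pandas API on Spark: familiar pandas syntax, distributed execution.
psdf = ps.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
print(psdf.mean())

# A vectorized (pandas) UDF receives whole pandas Series batches via Apache
# Arrow, avoiding the per-row overhead of ordinary Python UDFs.
@pandas_udf(DoubleType())
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

sdf = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])
sdf.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()
```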