Concepts

This section explains Spark’s core components, how Spark operates internally, and the concepts you need for efficient distributed processing.

When Java/Spring Developers Learn Spark

For Java/Spring developers, Spark feels familiar yet introduces a new paradigm. It uses a functional style similar to the Stream API, but distributed execution requires different thinking in some areas.

Familiar Aspects

Functional APIs like filter(), map(), groupBy()
Using Java lambda expressions
SQL query support

New Concepts to Understand

Lazy Evaluation: Transformation calls only record what to do; nothing runs until an Action is called
Serialization Constraints: Objects captured by closures must be serializable, because the closures are shipped to Executors
Shuffle Cost: Moving data between partitions involves network I/O, which makes it expensive
Immutable Data: RDDs/DataFrames cannot be modified; every Transformation returns a new object
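The serialization constraint can be tried out without a cluster. The following is a minimal plain-Java sketch, not Spark code: SerializableFunction, PrefixService, and canSerialize are hypothetical names introduced for illustration. It uses standard Java serialization to show that a lambda capturing a non-serializable object cannot be written to a byte stream, which is the same kind of failure a distributed framework runs into when it tries to ship a closure to a worker.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class ClosureSerializationSketch {

    // Hypothetical interface: a Function that is also Serializable,
    // so lambdas assigned to it become serializable lambdas.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Hypothetical helper that does NOT implement Serializable.
    static class PrefixService {
        final String prefix = "user-";
        String decorate(String s) { return prefix + s; }
    }

    // Returns true if the object survives standard Java serialization.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // something reachable from the object is not serializable
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Captures a non-serializable object: the closure cannot be shipped.
    static SerializableFunction<String, String> capturingService() {
        PrefixService service = new PrefixService();
        return s -> service.decorate(s);
    }

    // Captures only a String, which is serializable: the closure can be shipped.
    static SerializableFunction<String, String> capturingString() {
        String prefix = "user-";
        return s -> prefix + s;
    }
}
```

Under these assumptions, canSerialize(capturingService()) returns false and canSerialize(capturingString()) returns true. A common way out is to capture only the plain values the function needs, or to create the helper object inside the function.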

Once you understand these differences, Spark becomes a powerful tool for large-scale data processing. Each concept document explains its topic in detail with Java code examples.

Learning Path

Following this order will give you a systematic understanding from Spark basics to operations.

Fundamentals

First, understand Spark’s core structure and APIs:

Architecture - Roles and interactions of Driver, Executor, Cluster Manager
RDD Basics - Spark’s basic abstraction, distributed collection concepts
DataFrame and Dataset - Modern high-level APIs
Spark SQL - Distributed data processing with SQL
Transformations and Actions - Core of lazy evaluation and execution

Advanced Concepts

After understanding the basics, learn performance optimization and advanced features:

Partitioning and Shuffle - Core of distributed processing, data distribution strategies
Caching and Persistence - In-memory processing optimization
Structured Streaming - Real-time stream data processing
MLlib - Distributed machine learning

Operations

Knowledge for operating Spark in production environments:

Performance Tuning - Memory, partition, shuffle optimization
Deployment and Cluster Management - Standalone, YARN, Kubernetes environments
Spark Connect - Thin client architecture (Spark 3.4+)

Core Concepts Summary

This section briefly introduces the concepts essential to understanding Spark. Each concept is covered in detail in its own document.

Fundamentals — Core components and data processing model of Spark

Concept	Description
Driver	Runs the application’s main(), orchestrates work
Executor	Worker process that performs actual data processing
Cluster Manager	Resource allocation (Standalone, YARN, K8s)
Job	Unit of work corresponding to one Action
Stage	Set of Tasks divided by shuffle boundaries
Task	Smallest unit of work executed on a single partition

The Driver coordinates overall work, while actual data processing runs in parallel across multiple Executors.
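The division of labor between the Driver and the Executors can be approximated in plain Java. This is only an analogy under stated assumptions, not Spark's API: a thread pool stands in for the Executors, each submitted callable stands in for a Task on one partition, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DriverExecutorSketch {

    // Splits the data into roughly equal partitions (the last one may be smaller).
    static List<List<Integer>> partition(List<Integer> data, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        int size = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
        for (int start = 0; start < data.size(); start += size) {
            partitions.add(data.subList(start, Math.min(start + size, data.size())));
        }
        return partitions;
    }

    // "Driver" logic: create one task per partition, run the tasks on worker
    // threads (standing in for Executors), then combine the partial results.
    static long sumOfSquares(List<Integer> data, int numPartitions, int numWorkers) {
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Future<Long>> tasks = new ArrayList<>();
            for (List<Integer> p : partition(data, numPartitions)) {
                tasks.add(workers.submit(() -> {
                    long partial = 0;
                    for (int x : p) partial += (long) x * x;
                    return partial; // result of one task on one partition
                }));
            }
            long total = 0;
            for (Future<Long> t : tasks) total += t.get(); // gather on the "driver"
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workers.shutdown();
        }
    }
}
```

For example, sumOfSquares(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 4, 2) creates four tasks, runs them on two worker threads, and returns 385. In a real cluster the workers are separate processes on other machines, which is why tasks and their closures have to be serialized.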

Advanced Concepts — Execution model and mechanisms that affect performance

Concept	Description
Transformation	Lazy evaluation, returns new RDD/DataFrame (map, filter, groupBy)
Action	Triggers execution; returns a result to the Driver or produces output (collect, count, show)
Lazy Evaluation	Batch all Transformations and optimize at Action time
Narrow vs Wide	Narrow: each parent partition feeds at most one child partition (no shuffle); Wide: a child partition reads from many parent partitions (shuffle occurs)

Calling a Transformation does not execute it. Spark records the operation, then builds an optimized execution plan and runs it when an Action is called.
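Java's Stream API behaves in a comparable way, which makes it a convenient place to observe lazy evaluation. The sketch below is an analogy in plain Java rather than Spark code: intermediate operations (filter, map) play the role of Transformations and the terminal operation (collect) plays the role of an Action. The log list is a hypothetical device for recording when each step actually runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyEvaluationSketch {

    // Builds a pipeline, then triggers it, and returns the order in which
    // the steps really executed.
    static List<String> run() {
        List<String> log = new ArrayList<>();

        // Defining the pipeline runs nothing yet.
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4)
                .filter(x -> { log.add("filter " + x); return x % 2 == 0; })
                .map(x -> { log.add("map " + x); return x * 10; });
        log.add("pipeline defined");

        // The terminal operation triggers the whole pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());
        log.add("result " + result);
        return log;
    }
}
```

The returned log starts with "pipeline defined", followed by "filter 1", "filter 2", "map 2", "filter 3", "filter 4", "map 4", and finally "result [20, 40]". No filter or map call runs before the terminal operation, and each element flows through both steps before the next element is read. The analogy has limits: a Stream can be consumed only once and runs in one JVM, while Spark also optimizes the recorded plan and distributes it across Executors.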

Operational Concepts — Performance optimization and internal workings

Concept	Description
Caching	Store frequently used data in memory for fast reuse
Broadcast	Send a read-only copy of a small dataset to every Executor so joins against it avoid a shuffle
DAG	Directed acyclic graph of operation dependencies
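The relationship between the DAG, wide operations, and Stages can be illustrated with a toy model. The sketch below is a deliberate simplification and not Spark's scheduler: it assumes a linear chain of operations (no joins or branches), the Op class and the wide flag are hypothetical, and the rule it applies is that every wide operation starts a new Stage.

```java
import java.util.ArrayList;
import java.util.List;

class StageSplitSketch {

    // Hypothetical model of one step in a linear chain of operations.
    static final class Op {
        final String name;
        final boolean wide; // true if the step requires a shuffle
        Op(String name, boolean wide) { this.name = name; this.wide = wide; }
    }

    // Cuts the chain at shuffle boundaries: consecutive narrow steps stay in
    // the same stage, and each wide step opens a new one.
    static List<List<String>> splitIntoStages(List<Op> lineage) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Op op : lineage) {
            if (op.wide && !current.isEmpty()) {
                stages.add(current);
                current = new ArrayList<>();
            }
            current.add(op.name);
        }
        if (!current.isEmpty()) stages.add(current);
        return stages;
    }

    // Example chain: two narrow steps, a shuffle, a narrow step, another shuffle.
    static List<List<String>> example() {
        return splitIntoStages(List.of(
                new Op("filter", false),
                new Op("map", false),
                new Op("reduceByKey", true),
                new Op("mapValues", false),
                new Op("sortByKey", true)));
    }
}
```

In this model, example() yields three stages: [filter, map], [reduceByKey, mapValues], and [sortByKey]. The point of the exercise is that narrow steps are pipelined inside one Stage, so the number of Stages in a Job tracks the number of shuffles rather than the number of operations.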

Data Abstractions

Comparison of characteristics of Spark’s three data APIs:

API	Type Safety	Optimization	When to Use
RDD	Yes (generics)	Limited	When low-level control needed
DataFrame	No (Row)	Catalyst optimization	SQL-style processing
Dataset	Yes (a Scala case class, comparable to a Java record; Java code typically uses a JavaBean class)	Catalyst optimization	Type safety + optimization

In most cases, use DataFrame, and choose Dataset when compile-time type checking is needed.
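The type-safety column in the table can be made concrete with a plain-Java analogy. This is not Spark code: a Map keyed by column name stands in for an untyped Row, a small Person class stands in for the element type of a typed Dataset, and all names are hypothetical.

```java
import java.util.List;
import java.util.Map;

class TypeSafetySketch {

    // Typed element, comparable to the class behind a typed Dataset.
    static final class Person {
        private final String name;
        private final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    // Untyped access: the column is looked up by name and cast by the caller,
    // so a misspelled name is only discovered when the code runs.
    static int totalAgeUntyped(List<Map<String, Object>> rows, String column) {
        int total = 0;
        for (Map<String, Object> row : rows) {
            Object value = row.get(column);
            if (value == null) {
                throw new IllegalArgumentException("Unknown column: " + column);
            }
            total += (Integer) value;
        }
        return total;
    }

    // Typed access: a misspelled accessor such as p.getAgee() would be
    // rejected by the compiler before the program ever runs.
    static int totalAgeTyped(List<Person> people) {
        int total = 0;
        for (Person p : people) total += p.getAge();
        return total;
    }
}
```

With rows for Ann (30) and Bob (41), both methods return 71 when used correctly, but totalAgeUntyped(rows, "agee") fails only at runtime with an IllegalArgumentException. That difference in when an error surfaces is the trade-off the table describes.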