Understand Spark’s core components and how they work. This section covers how Spark operates internally and the concepts you need for efficient distributed processing.
When Java/Spring Developers Learn Spark
For Java/Spring developers, Spark is at once familiar and a new paradigm. Its functional style resembles the Stream API, but distributed execution requires a different way of thinking in some areas.
Familiar Aspects
- Functional APIs like filter(), map(), groupBy()
- Using Java lambda expressions
- SQL query support
New Concepts to Understand
- Lazy Evaluation: Transformation calls do not execute immediately; work starts only when an action is called
- Serialization Constraints: Objects used in closures must be serializable
- Shuffle Cost: Data movement involves network I/O, making it expensive
- Immutable Data: RDDs and DataFrames cannot be modified; every transformation returns a new object
Once you understand these differences, Spark becomes a powerful tool for large-scale data processing. Each concept document explains its topic in detail with Java code examples.
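The serialization constraint listed above can be tried out without a cluster. The following is a minimal sketch in plain Java, not Spark code: `SerializableFunction` is a hypothetical stand-in for Spark's Java function interfaces (which extend `Serializable`), and `BonusService` is a made-up helper that plays the role of a non-serializable object such as a Spring-managed service. It assumes plain Java serialization is used for the closure.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

final class ClosureSerializationSketch {

    // Hypothetical stand-in for a serializable function interface.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Hypothetical helper that does NOT implement Serializable.
    static final class BonusService {
        final int bonus;
        BonusService(int bonus) { this.bonus = bonus; }
    }

    // The lambda captures the whole service object, so the service
    // would have to travel with the closure.
    static SerializableFunction<Integer, Integer> capturingService() {
        BonusService service = new BonusService(5);
        return n -> n + service.bonus;
    }

    // The lambda captures only an int copied out of the service.
    static SerializableFunction<Integer, Integer> capturingValue() {
        int bonus = new BonusService(5).bonus;
        return n -> n + bonus;
    }

    // Returns true if Java serialization can write the object.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("captures service: " + canSerialize(capturingService()));
        System.out.println("captures value:   " + canSerialize(capturingValue()));
    }
}
```

Both lambdas compute the same result, but only the second can be serialized. Copying the needed value into a local variable before building the closure is one common way to stay within the constraint.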
Learning Path
Following this order will give you a systematic understanding from Spark basics to operations.
Fundamentals
First, understand Spark’s core structure and APIs:
- Architecture - Roles and interactions of Driver, Executor, Cluster Manager
- RDD Basics - Spark’s basic abstraction, distributed collection concepts
- DataFrame and Dataset - Modern high-level APIs
- Spark SQL - Distributed data processing with SQL
- Transformations and Actions - Core of lazy evaluation and execution
Advanced Concepts
After understanding the basics, learn performance optimization and advanced features:
- Partitioning and Shuffle - Core of distributed processing, data distribution strategies
- Caching and Persistence - In-memory processing optimization
- Structured Streaming - Real-time stream data processing
- MLlib - Distributed machine learning
Operations
Knowledge for operating Spark in production environments:
- Performance Tuning - Memory, partition, shuffle optimization
- Deployment and Cluster Management - Standalone, YARN, Kubernetes environments
- Spark Connect - Thin client architecture (Spark 3.4+)
Core Concepts Summary
This is a brief introduction to the concepts essential for understanding Spark. Each concept is covered in detail in its own document.
Fundamentals — Core components and data processing model of Spark
| Concept | Description |
|---|---|
| Driver | Runs the application’s main(), orchestrates work |
| Executor | Worker process that performs actual data processing |
| Cluster Manager | Resource allocation (Standalone, YARN, K8s) |
| Job | Unit of work corresponding to one Action |
| Stage | Set of Tasks divided by shuffle boundaries |
| Task | Smallest unit of work executed on a single partition |
The Driver coordinates overall work, while actual data processing runs in parallel across multiple Executors.
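To make the Job, Stage, and Task relationship concrete, here is a rough counting model in plain Java. It is a simplification, not how Spark's scheduler is implemented: it assumes one Action over a linear chain of steps, treats every wide step as a shuffle boundary, keeps the partition count unchanged across narrow steps, and ignores optimizations that can skip or merge work. The `Step` class and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class StagePlanSketch {

    // One step of a linear pipeline. `wide` marks a shuffle boundary;
    // `partitionsAfterShuffle` is only read when `wide` is true.
    static final class Step {
        final String name;
        final boolean wide;
        final int partitionsAfterShuffle;

        Step(String name, boolean wide, int partitionsAfterShuffle) {
            this.name = name;
            this.wide = wide;
            this.partitionsAfterShuffle = partitionsAfterShuffle;
        }
    }

    // Stages in this model = shuffle boundaries + 1.
    static int countStages(List<Step> steps) {
        int stages = 1;
        for (Step step : steps) {
            if (step.wide) {
                stages++;
            }
        }
        return stages;
    }

    // Tasks in this model = one task per partition in each stage.
    static int countTasks(int inputPartitions, List<Step> steps) {
        int total = 0;
        int current = inputPartitions;
        for (Step step : steps) {
            if (step.wide) {
                total += current;                      // close the current stage
                current = step.partitionsAfterShuffle; // next stage starts here
            }
        }
        return total + current;                        // tasks of the final stage
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.asList(
                new Step("filter", false, 0),
                new Step("map", false, 0),
                new Step("groupBy", true, 4));
        System.out.println("stages = " + countStages(steps));   // 2
        System.out.println("tasks  = " + countTasks(8, steps)); // 8 + 4 = 12
    }
}
```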
Advanced Concepts — Execution model and mechanisms that affect performance
| Concept | Description |
|---|---|
| Transformation | Lazy evaluation, returns new RDD/DataFrame (map, filter, groupBy) |
| Action | Triggers execution; returns a result to the driver or produces output (collect, count, show) |
| Lazy Evaluation | Batch all Transformations and optimize at Action time |
| Narrow vs Wide | Narrow: each input partition feeds at most one output partition (no shuffle); Wide: output partitions depend on many input partitions (shuffle occurs) |
Transformations are not executed when they are called. Spark records them and runs them under an optimized execution plan once an Action is called.
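Java's Stream API behaves in a loosely similar way, which makes it a convenient way to observe laziness locally. The sketch below uses only the JDK and is an analogy, not Spark code: intermediate stream operations stand in for Transformations and the terminal `collect` stands in for an Action. Streams do not build an optimized plan the way Spark does, so only the deferred execution carries over.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyEvaluationSketch {

    // Holds what was observed before and after the terminal operation.
    static final class Outcome {
        final int callsBeforeAction;
        final int callsAfterAction;
        final List<Integer> result;

        Outcome(int callsBeforeAction, int callsAfterAction, List<Integer> result) {
            this.callsBeforeAction = callsBeforeAction;
            this.callsAfterAction = callsAfterAction;
            this.result = result;
        }
    }

    static Outcome run() {
        AtomicInteger filterCalls = new AtomicInteger();

        // "Transformations": only describe the pipeline, nothing runs yet.
        Stream<Integer> pipeline = Arrays.asList(1, 2, 3, 4).stream()
                .filter(n -> {
                    filterCalls.incrementAndGet();
                    return n % 2 == 0;
                })
                .map(n -> n * 10);

        int before = filterCalls.get(); // still 0: the filter has not been invoked

        // "Action": the terminal operation pulls data through the pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());

        return new Outcome(before, filterCalls.get(), result);
    }

    public static void main(String[] args) {
        Outcome o = run();
        System.out.println("filter calls before action: " + o.callsBeforeAction); // 0
        System.out.println("filter calls after action:  " + o.callsAfterAction);  // 4
        System.out.println("result: " + o.result);                                // [20, 40]
    }
}
```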
Operational Concepts — Performance optimization and internal workings
| Concept | Description |
|---|---|
| Caching | Store frequently used data in memory for fast reuse |
| Broadcast | Distribute a read-only copy of small data to every node so joins avoid shuffling the large side |
| DAG | Directed acyclic graph of operation dependencies |
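The idea behind a broadcast join can be sketched in plain Java. This is an illustration under simplifying assumptions, not Spark's implementation: partitions are modeled as nested lists, the small table is an in-memory `Map` that every partition can read, and all names are hypothetical. The point is that each partition is joined locally, so no rows of the large side move between partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BroadcastJoinSketch {

    // Inner-joins each partition of ids against the shared lookup table.
    // Every partition is processed independently of the others.
    static List<List<String>> joinPerPartition(List<List<Integer>> partitions,
                                               Map<Integer, String> broadcastTable) {
        List<List<String>> joinedPartitions = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            List<String> joined = new ArrayList<>();
            for (Integer id : partition) {
                String name = broadcastTable.get(id);
                if (name != null) {            // inner join: drop unmatched ids
                    joined.add(id + ":" + name);
                }
            }
            joinedPartitions.add(joined);
        }
        return joinedPartitions;
    }

    public static void main(String[] args) {
        // Large side, already split into two partitions.
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3, 9));

        // Small side, copied to wherever the partitions are processed.
        Map<Integer, String> countries = new HashMap<>();
        countries.put(1, "KR");
        countries.put(2, "US");
        countries.put(3, "JP");

        System.out.println(joinPerPartition(partitions, countries));
        // [[1:KR, 2:US], [3:JP]]
    }
}
```

The output keeps the same partition layout as the input, which is what distinguishes this approach from a join that first redistributes both sides by key.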
Data Abstractions
Spark’s three data APIs compare as follows:
| API | Type Safety | Optimization | When to Use |
|---|---|---|---|
| RDD | Yes (generics) | Limited | When low-level control needed |
| DataFrame | No (Row) | Catalyst optimization | SQL-style processing |
| Dataset | Yes (Scala case class, similar to a Java record; in Java, typically a bean class with Encoders.bean) | Catalyst optimization | Type safety + optimization |
In most cases, use DataFrame, and choose Dataset when compile-time type checking is needed.
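The type-safety column can be illustrated with a plain-Java analogy. The sketch below does not use Spark: a `Map<String, Object>` plays the role of an untyped Row, and the hypothetical `User` class plays the role of a Dataset element type. In real Spark, a bad column reference on a DataFrame surfaces as an analysis error rather than a `ClassCastException`, but the timing is the same: the mistake shows up at run time, not at compile time.

```java
import java.util.HashMap;
import java.util.Map;

final class TypeSafetySketch {

    // Hypothetical typed element, the analog of a Dataset's element type.
    static final class User {
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Untyped access: compiles for any row, the cast is checked only at run time.
    static int ageFromRow(Map<String, Object> row) {
        return (Integer) row.get("age");
    }

    // Typed access: the compiler guarantees `age` exists and is an int.
    static int ageFromUser(User user) {
        return user.age;
    }

    // Returns true if reading the age from the untyped row fails with a bad cast.
    static boolean untypedReadFails(Map<String, Object> row) {
        try {
            ageFromRow(row);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> badRow = new HashMap<>();
        badRow.put("name", "kim");
        badRow.put("age", "42"); // a String where an Integer is expected

        System.out.println("untyped read fails at run time: " + untypedReadFails(badRow));
        System.out.println("typed read: " + ageFromUser(new User("kim", 42)));
    }
}
```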