Understand Spark’s core components and how they work. This section covers how Spark operates internally and concepts you need to know for efficient distributed processing.
When Java/Spring Developers Learn Spark
For Java/Spring developers, Spark is both familiar and a new paradigm. While it uses a functional style similar to the Stream API, distributed execution requires different thinking in some areas.
Familiar Aspects
- Functional APIs like filter(), map(), groupBy()
- Using Java lambda expressions
- SQL query support
New Concepts to Understand
- Lazy Evaluation: Transformations don’t run when called; execution is deferred until an Action triggers it
- Serialization Constraints: Objects used in closures must be serializable
- Shuffle Cost: Data movement involves network I/O, making it expensive
- Immutable Data: RDDs/DataFrames cannot be modified in place; every operation returns a new object
Once you understand these differences, Spark becomes a powerful tool for large-scale data processing. Each concept document explains these topics in detail with Java code examples.
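The serialization constraint above can be felt in plain Java, with no Spark dependency: when a closure is shipped to remote workers, the lambda and everything it captures must be serializable. The sketch below (class names are illustrative, not from Spark) serializes one lambda successfully and shows how capturing a non-serializable object fails:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureSerialization {
    // A serializable function type, analogous to the function interfaces in Spark's Java API
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    static byte[] serialize(Object o) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);  // throws NotSerializableException if o or its captures aren't serializable
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        int threshold = 10;  // captured primitive: serializes fine
        SerFunction<Integer, Boolean> pred = x -> x > threshold;
        System.out.println("lambda serialized to " + serialize(pred).length + " bytes");

        Thread t = new Thread();  // Thread is not Serializable
        Function<Integer, String> bad = (Function<Integer, String> & Serializable) x -> x + t.getName();
        try {
            serialize(bad);
        } catch (java.io.NotSerializableException e) {
            System.out.println("failed to serialize closure capturing: " + e.getMessage());
        }
    }
}
```

This is exactly the trap in real Spark jobs: a lambda that innocently captures a Spring bean or a JDBC connection fails at runtime with a Task not serializable error.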
Learning Path
Following this order will give you a systematic understanding from Spark basics to operations.
Fundamentals
First, understand Spark’s core structure and APIs:
- Architecture - Roles and interactions of Driver, Executor, Cluster Manager
- RDD Basics - Spark’s basic abstraction, distributed collection concepts
- DataFrame and Dataset - Modern high-level APIs
- Spark SQL - Distributed data processing with SQL
- Transformations and Actions - Core of lazy evaluation and execution
Advanced Concepts
After understanding the basics, learn performance optimization and advanced features:
- Partitioning and Shuffle - Core of distributed processing, data distribution strategies
- Caching and Persistence - In-memory processing optimization
- Structured Streaming - Real-time stream data processing
- MLlib - Distributed machine learning
Operations
Knowledge for operating Spark in production environments:
- Performance Tuning - Memory, partition, shuffle optimization
- Deployment and Cluster Management - Standalone, YARN, Kubernetes environments
- Spark Connect - Thin client architecture (Spark 3.4+)
Core Concepts Summary
Brief introduction to essential concepts for understanding Spark. Detailed content for each concept is covered in individual documents.
Execution Model
Core components for understanding how Spark applications execute:
| Concept | Description |
|---|---|
| Driver | Runs the application’s main(), orchestrates work |
| Executor | Worker process that performs actual data processing |
| Cluster Manager | Resource allocation (Standalone, YARN, K8s) |
| Job | Unit of work corresponding to one Action |
| Stage | Set of Tasks divided by shuffle boundaries |
| Task | Smallest unit of work executed on a single partition |
The Driver coordinates overall work, while actual data processing runs in parallel across multiple Executors.
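A loose plain-Java analogy for this split (not actual Spark code; the names are illustrative): the "Driver" plans the work and splits the data into partitions, a worker pool plays the "Executors", and each submitted task processes one partition in parallel before the Driver combines the results:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class DriverExecutorAnalogy {
    public static void main(String[] args) throws Exception {
        // "Driver": plans the work and splits the data into partitions
        List<int[]> partitions = List.of(
                new int[]{1, 2, 3}, new int[]{4, 5, 6}, new int[]{7, 8, 9});

        // "Executors": a fixed pool of workers processing tasks in parallel
        ExecutorService executors = Executors.newFixedThreadPool(3);

        // One "Task" per partition; each computes a partial sum independently
        List<Future<Integer>> futures = partitions.stream()
                .map(p -> executors.submit(() -> IntStream.of(p).sum()))
                .toList();

        // The "Driver" collects partial results and combines them (like an Action)
        int total = 0;
        for (Future<Integer> f : futures) total += f.get();
        System.out.println("total = " + total);

        executors.shutdown();
    }
}
```

The analogy breaks down at scale, of course: Spark's Executors are separate JVM processes on other machines, which is precisely why serialization and shuffle cost matter.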
Data Abstractions
Comparison of characteristics of Spark’s three data APIs:
| API | Type Safety | Optimization | When to Use |
|---|---|---|---|
| RDD | Yes (generics) | Limited | When low-level control needed |
| DataFrame | No (Row) | Catalyst optimization | SQL-style processing |
| Dataset | Yes (typed objects) | Catalyst optimization | Type safety + optimization |
In most cases, use DataFrame, and choose Dataset when compile-time type checking is needed (in Java, DataFrame is simply an alias for Dataset&lt;Row&gt;).
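The type-safety trade-off in the table can be felt even without Spark. A rough plain-Java analogy: a DataFrame Row is like a map looked up by column name, where a typo only surfaces at runtime, while a Dataset of a typed class catches the same mistake at compile time:

```java
import java.util.Map;

public class TypedVsUntyped {
    // Typed record: like a Dataset<Person>; field access is checked at compile time
    record Person(String name, int age) {}

    public static void main(String[] args) {
        // Untyped "row": like DataFrame's Row, fields looked up by name at runtime
        Map<String, Object> row = Map.of("name", "Alice", "age", 30);
        // A typo like row.get("agee") compiles fine and only fails (returns null) at runtime
        System.out.println("row age: " + row.get("age"));

        Person p = new Person("Alice", 30);
        // p.agee() would be a compile error, not a runtime surprise
        System.out.println("typed age: " + p.age());
    }
}
```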
Operation Types
Spark operations are broadly divided into Transformations and Actions:
| Type | Characteristics | Examples |
|---|---|---|
| Transformation | Lazy evaluation, returns new RDD/DataFrame | map, filter, groupBy |
| Action | Immediate execution, returns value | collect, count, show |
Transformations are not executed when called; instead, Spark records them and runs an optimized execution plan only when an Action is invoked.
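The Stream API analogy mentioned earlier makes this behavior easy to observe in plain Java: intermediate operations (like Transformations) only build a pipeline, and nothing runs until a terminal operation (like an Action) is called:

```java
import java.util.List;

public class LazyEvaluation {
    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5);

        // Intermediate operations build the pipeline but process nothing yet;
        // the peek() side effect reveals exactly when work actually happens
        var pipeline = data.stream()
                .peek(x -> System.out.println("processing " + x))
                .map(x -> x * 2)
                .filter(x -> x > 4);
        System.out.println("pipeline built, nothing processed yet");

        // A terminal operation triggers execution of the whole pipeline
        long count = pipeline.count();
        System.out.println("count = " + count);
    }
}
```

Running this prints "pipeline built, nothing processed yet" before any "processing ..." line, mirroring how Spark defers map/filter until collect or count forces execution.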
Narrow vs Wide Transformation
Transformations are divided into two types depending on whether shuffle occurs:
| Type | Shuffle | Examples |
|---|---|---|
| Narrow | No (1:1 partition mapping) | map, filter, union |
| Wide | Yes (shuffle occurs) | groupBy, join, reduceByKey |
Wide Transformations incur network I/O and can dominate job performance. The individual concept documents cover how these pieces connect in detail.
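The distinction can be sketched in plain Java by treating nested lists as "partitions" (an illustrative simplification, not Spark's actual mechanics): a map-like narrow transformation works on each partition independently, while a groupBy-like wide transformation must combine records from all partitions, which is what the shuffle does over the network:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class NarrowVsWide {
    public static void main(String[] args) {
        // Two "partitions" of words
        List<List<String>> partitions = List.of(
                List.of("a", "b", "a"), List.of("b", "b", "c"));

        // Narrow (map-like): each partition is processed independently, 1:1, no data movement
        List<List<String>> upper = partitions.stream()
                .map(p -> p.stream().map(String::toUpperCase).toList())
                .toList();
        System.out.println("narrow result: " + upper);

        // Wide (groupBy-like): counting keys needs records from every partition,
        // i.e. a "shuffle" -- simulated here by flattening everything into one place first
        Map<String, Long> counts = partitions.stream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
        System.out.println("wide result: " + counts);
    }
}
```

In a single JVM the "shuffle" is just a flatMap; on a cluster it means serializing and sending every record for a given key to the same Executor, which is why wide transformations are the first place to look when tuning.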