Understand Spark’s core components and how they work together. This section covers how Spark operates internally and the concepts you need for efficient distributed processing.

When Java/Spring Developers Learn Spark

For Java/Spring developers, Spark is both familiar and a new paradigm. It uses a functional style similar to the Java Stream API, but because the work runs distributed across many machines, some areas require a different way of thinking.

Familiar Aspects

  • Functional APIs like filter(), map(), groupBy()
  • Using Java lambda expressions
  • SQL query support

New Concepts to Understand

  • Lazy Evaluation: Transformation calls only build up a plan; nothing executes until an Action is called
  • Serialization Constraints: Objects used in closures must be serializable, because the closure is sent to the Executors
  • Shuffle Cost: Moving data between partitions involves network I/O, which makes it expensive
  • Immutable Data: RDDs/DataFrames cannot be modified; every Transformation returns a new object

Once you understand these differences, Spark becomes a powerful tool for large-scale data processing. Each concept document explains its topic in detail with Java code examples.
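
The serialization constraint listed above can be reproduced without a cluster. The sketch below uses only the JDK: it serializes two lambdas with Java serialization, which is roughly what happens before a closure is shipped to an Executor. `SerializableFunction` and `PrefixFormatter` are hypothetical names made up for this illustration and are not part of Spark.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class ClosureSerializationDemo {

    // A function type that is also Serializable, so lambdas assigned to it
    // can be written with Java serialization (hypothetical, for illustration).
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // Hypothetical helper that does NOT implement Serializable.
    static final class PrefixFormatter {
        private final String prefix;
        PrefixFormatter(String prefix) { this.prefix = prefix; }
        String format(String s) { return prefix + s; }
    }

    // Returns true if the object can be written with Java serialization.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // something reachable from the object is not Serializable
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // The lambda captures only a String, which is Serializable.
    static boolean capturingPlainString() {
        String prefix = "user-";
        SerializableFunction<String, String> f = s -> prefix + s;
        return canSerialize(f);
    }

    // The lambda captures a PrefixFormatter, which is not Serializable.
    static boolean capturingNonSerializableObject() {
        PrefixFormatter formatter = new PrefixFormatter("user-");
        SerializableFunction<String, String> f = s -> formatter.format(s);
        return canSerialize(f);
    }

    public static void main(String[] args) {
        System.out.println(capturingPlainString());           // true
        System.out.println(capturingNonSerializableObject()); // false
    }
}
```

In a Spring application the captured object is often an injected bean or `this`. A common way around the problem is to capture only plain values, or to create the helper inside the function so it is never serialized.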

Learning Path

Following this order will give you a systematic understanding from Spark basics to operations.

Fundamentals

First, understand Spark’s core structure and APIs:

  1. Architecture - Roles and interactions of Driver, Executor, Cluster Manager
  2. RDD Basics - Spark’s basic abstraction, distributed collection concepts
  3. DataFrame and Dataset - Modern high-level APIs
  4. Spark SQL - Distributed data processing with SQL
  5. Transformations and Actions - Core of lazy evaluation and execution

Advanced Concepts

After understanding the basics, learn performance optimization and advanced features:

  1. Partitioning and Shuffle - Core of distributed processing, data distribution strategies
  2. Caching and Persistence - In-memory processing optimization
  3. Structured Streaming - Real-time stream data processing
  4. MLlib - Distributed machine learning

Operations

Knowledge for operating Spark in production environments:

  1. Performance Tuning - Memory, partition, shuffle optimization
  2. Deployment and Cluster Management - Standalone, YARN, Kubernetes environments
  3. Spark Connect - Thin client architecture (Spark 3.4+)

Core Concepts Summary

This is a brief introduction to the concepts that are essential for understanding Spark. Each concept is covered in detail in its own document.

Fundamentals — Core components and data processing model of Spark

| Concept | Description |
| --- | --- |
| Driver | Runs the application’s main(), orchestrates work |
| Executor | Worker process that performs actual data processing |
| Cluster Manager | Resource allocation (Standalone, YARN, K8s) |
| Job | Unit of work corresponding to one Action |
| Stage | Set of Tasks divided by shuffle boundaries |
| Task | Smallest unit of work executed on a single partition |

The Driver coordinates overall work, while actual data processing runs in parallel across multiple Executors.
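
How a Job breaks down into Stages and Tasks can be sketched with a toy model. The code below is plain Java, not Spark's scheduler: `Op`, `splitIntoStages`, and `taskCount` are hypothetical names, the pipeline is assumed to be a simple linear chain, and every Stage is assumed to have the same number of partitions.

```java
import java.util.ArrayList;
import java.util.List;

class StageSplitSketch {

    // Hypothetical description of one operation in a linear pipeline.
    static final class Op {
        final String name;
        final boolean wide; // true if the operation needs a shuffle
        Op(String name, boolean wide) { this.name = name; this.wide = wide; }
    }

    // Splits a chain of operations into stages. Each wide operation starts
    // a new stage, because its input has to be shuffled first.
    static List<List<String>> splitIntoStages(List<Op> ops) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (Op op : ops) {
            if (op.wide && !current.isEmpty()) {
                stages.add(current);
                current = new ArrayList<>();
            }
            current.add(op.name);
        }
        if (!current.isEmpty()) {
            stages.add(current);
        }
        return stages;
    }

    // One task per partition per stage (simplifying assumption:
    // every stage has the same partition count).
    static int taskCount(int stageCount, int partitionsPerStage) {
        return stageCount * partitionsPerStage;
    }

    // The operations behind a single Action, i.e. one Job.
    static List<List<String>> exampleStages() {
        List<Op> ops = new ArrayList<>();
        ops.add(new Op("filter", false));
        ops.add(new Op("map", false));
        ops.add(new Op("reduceByKey", true));
        ops.add(new Op("mapValues", false));
        return splitIntoStages(ops);
    }

    public static void main(String[] args) {
        List<List<String>> stages = exampleStages();
        System.out.println(stages);                      // [[filter, map], [reduceByKey, mapValues]]
        System.out.println(taskCount(stages.size(), 4)); // 8
    }
}
```

The narrow operations before the shuffle are pipelined into one Stage, so each Task in that Stage applies filter and map to its own partition in a single pass.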

Advanced Concepts — Execution model and mechanisms that affect performance

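| Concept | Description |
| --- | --- |
| Transformation | Lazily evaluated; returns a new RDD/DataFrame (map, filter, groupBy) |
| Action | Triggers execution; returns a result to the Driver or produces output (collect, count, show) |
| Lazy Evaluation | Transformations are only recorded; the accumulated plan is optimized and run when an Action is called |
| Narrow vs Wide | Narrow: each input partition feeds at most one output partition (no shuffle); Wide: output partitions depend on many input partitions (shuffle occurs) |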

Transformations are not executed immediately when called, but processed with an optimized execution plan when an Action is called.
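
The Java Stream API behaves in a comparable way, which makes it a convenient analogy. In the sketch below, intermediate operations play the role of Transformations and the terminal operation plays the role of an Action. This is plain JDK code rather than Spark, and the class and method names are made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyEvaluationSketch {

    // Returns { filter calls before the terminal op, filter calls after it, result }.
    static long[] run() {
        AtomicInteger filterCalls = new AtomicInteger();

        // Intermediate operations only describe the pipeline
        // (comparable to Spark Transformations).
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5, 6)
                .filter(n -> {
                    filterCalls.incrementAndGet();
                    return n % 2 == 0;
                })
                .map(n -> n * 10);

        int before = filterCalls.get(); // nothing has run yet

        // The terminal operation triggers execution (comparable to a Spark Action).
        long sum = pipeline.mapToLong(Integer::longValue).sum();

        int after = filterCalls.get();
        return new long[] { before, after, sum };
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println("filter calls before terminal op: " + r[0]); // 0
        System.out.println("filter calls after terminal op:  " + r[1]); // 6
        System.out.println("sum: " + r[2]);                             // 120
    }
}
```

The analogy has limits: a Stream can be consumed only once, whereas a Spark RDD or DataFrame can be used by several Actions, and each of them re-runs the plan unless the data is cached.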

Operational Concepts — Performance optimization and internal workings

| Concept | Description |
| --- | --- |
| Caching | Store frequently used data in memory for fast reuse |
| Broadcast | Distribute small data copies to all nodes for efficient joins |
| DAG | Directed acyclic graph of operation dependencies |
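
The idea behind a broadcast join can be sketched in plain Java. Every partition of the large side is joined against a read-only copy of the small lookup table, so the large side never has to be regrouped by key. `Order`, `broadcastJoin`, and the sample data are hypothetical; in Spark the copy would be shipped to the Executors rather than shared in one JVM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BroadcastJoinSketch {

    // Hypothetical record type for the large side of the join.
    static final class Order {
        final int userId;
        final String item;
        Order(int userId, String item) { this.userId = userId; this.item = item; }
    }

    // Inner join of every partition against the small table, partition by partition.
    static List<List<String>> broadcastJoin(List<List<Order>> partitions,
                                            Map<Integer, String> smallTable) {
        // Read-only copy that every task can look up locally.
        Map<Integer, String> broadcast =
                Collections.unmodifiableMap(new HashMap<>(smallTable));

        List<List<String>> result = new ArrayList<>();
        for (List<Order> partition : partitions) { // in Spark, one task per partition
            List<String> joined = new ArrayList<>();
            for (Order order : partition) {
                String country = broadcast.get(order.userId);
                if (country != null) {             // drop rows without a match
                    joined.add(order.item + "@" + country);
                }
            }
            result.add(joined);
        }
        return result;
    }

    static List<List<String>> example() {
        List<List<Order>> partitions = Arrays.asList(
                Arrays.asList(new Order(1, "book"), new Order(2, "pen")),
                Arrays.asList(new Order(3, "lamp"), new Order(1, "desk")));
        Map<Integer, String> countries = new HashMap<>();
        countries.put(1, "KR");
        countries.put(2, "US");
        return broadcastJoin(partitions, countries);
    }

    public static void main(String[] args) {
        System.out.println(example()); // [[book@KR, pen@US], [desk@KR]]
    }
}
```

The output keeps the partitioning of the large side unchanged, which is the property that lets a broadcast join avoid a shuffle. The trade-off is that the small table must fit comfortably in memory on every Executor.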

Data Abstractions

Comparison of characteristics of Spark’s three data APIs:

| API | Type Safety | Optimization | When to Use |
| --- | --- | --- | --- |
| RDD | Yes (generics) | Limited | When low-level control is needed |
| DataFrame | No (Row) | Catalyst optimization | SQL-style processing |
| Dataset | Yes (typed objects, e.g. a Scala case class, which is similar to a Java record) | Catalyst optimization | Type safety + optimization |

In most cases, use DataFrame, and choose Dataset when compile-time type checking is needed.
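
The practical difference between the untyped and typed styles can be illustrated without Spark. In the sketch below a `Map<String, Object>` stands in for an untyped Row and a small class stands in for a typed Dataset element. All names are hypothetical, and the exact exception Spark raises for a misspelled column depends on the API used.

```java
import java.util.HashMap;
import java.util.Map;

class TypeSafetySketch {

    // Typed element, comparable to the T in Dataset<T>.
    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    // Typed access: a typo such as user.agee would be rejected by the compiler.
    static int ageOf(User user) {
        return user.age;
    }

    // Untyped access: the column name is just a string, checked only at run time.
    static Object getField(Map<String, Object> row, String column) {
        if (!row.containsKey(column)) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        return row.get(column);
    }

    // Shows that a misspelled column name compiles but fails when executed.
    static boolean typoFailsAtRuntime() {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "kim");
        row.put("age", 30);
        try {
            getField(row, "agee"); // typo compiles fine
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(ageOf(new User("kim", 30))); // 30
        System.out.println(typoFailsAtRuntime());       // true
    }
}
```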