Understand Spark’s core components and how they work. This section covers how Spark operates internally and concepts you need to know for efficient distributed processing.

When Java/Spring Developers Learn Spark#

For Java/Spring developers, Spark is at once familiar and a new paradigm. While its functional style resembles the Java Stream API, distributed execution demands different thinking in several areas.

Familiar Aspects

  • Functional APIs like filter(), map(), groupBy()
  • Using Java lambda expressions
  • SQL query support

New Concepts to Understand

  • Lazy Evaluation: Method calls don’t execute immediately
  • Serialization Constraints: Objects used in closures must be serializable
  • Shuffle Cost: Data movement involves network I/O, making it expensive
  • Immutable Data: RDDs/DataFrames cannot be modified, always return new objects

Once you understand these differences, Spark becomes a powerful tool for large-scale data processing. Each concept document explains them in detail with Java code examples.
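The serialization constraint can be felt in plain Java, without Spark: a lambda shipped to an executor must be serializable, which is why Spark's Java function interfaces (e.g. org.apache.spark.api.java.function.Function) extend java.io.Serializable. This sketch (class and method names are illustrative, not Spark API) shows that an ordinary lambda fails serialization while an intersection-type cast produces one that succeeds:

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

public class ClosureSerialization {
    // A plain lambda is NOT serializable: shipping it to an executor would fail.
    static final Runnable plain = () -> System.out.println("hi");

    // The intersection-type cast makes the JVM emit a serializable lambda,
    // mirroring what Spark requires of functions sent to executors.
    static final Runnable serializable =
            (Runnable & Serializable) () -> System.out.println("hi");

    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(OutputStream.nullOutputStream())) {
            out.writeObject(o);   // throws NotSerializableException for `plain`
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(plain));        // false
        System.out.println(canSerialize(serializable)); // true
    }
}
```

The same rule applies to any object a Spark closure captures: fields of the enclosing class, loggers, and Spring beans are all dragged into the serialized closure.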

Learning Path#

Following this order will give you a systematic understanding from Spark basics to operations.

Fundamentals

First, understand Spark’s core structure and APIs:

  1. Architecture - Roles and interactions of Driver, Executor, Cluster Manager
  2. RDD Basics - Spark’s basic abstraction, distributed collection concepts
  3. DataFrame and Dataset - Modern high-level APIs
  4. Spark SQL - Distributed data processing with SQL
  5. Transformations and Actions - Core of lazy evaluation and execution

Advanced Concepts

After understanding the basics, learn performance optimization and advanced features:

  1. Partitioning and Shuffle - Core of distributed processing, data distribution strategies
  2. Caching and Persistence - In-memory processing optimization
  3. Structured Streaming - Real-time stream data processing
  4. MLlib - Distributed machine learning

Operations

Knowledge for operating Spark in production environments:

  1. Performance Tuning - Memory, partition, shuffle optimization
  2. Deployment and Cluster Management - Standalone, YARN, Kubernetes environments
  3. Spark Connect - Thin client architecture (Spark 3.4+)

Core Concepts Summary#

Brief introduction to essential concepts for understanding Spark. Detailed content for each concept is covered in individual documents.

Execution Model

Core components for understanding how Spark applications execute:

| Concept | Description |
|---|---|
| Driver | Runs the application’s main(), orchestrates work |
| Executor | Worker process that performs actual data processing |
| Cluster Manager | Resource allocation (Standalone, YARN, K8s) |
| Job | Unit of work corresponding to one Action |
| Stage | Set of Tasks divided by shuffle boundaries |
| Task | Smallest unit of work executed on a single partition |

The Driver coordinates overall work, while actual data processing runs in parallel across multiple Executors.
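This division of labor can be sketched in plain Java (this is a thread-pool analogy, not Spark API; all names are illustrative): a "driver" splits the data into partitions, submits one task per partition to a pool of "executors", and aggregates the partial results, as a Spark action like count() would:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MiniDriver {
    // "Driver" side: fan one task per partition out to the executor pool,
    // then aggregate the per-partition counts.
    public static long countEvens(List<List<Integer>> partitions) {
        ExecutorService executors =
                Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                // One Task per partition, run on an "executor" thread.
                futures.add(executors.submit(() ->
                        partition.stream().filter(n -> n % 2 == 0).count()));
            }
            long total = 0;                    // driver-side aggregation
            for (Future<Long> f : futures) total += f.get();
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executors.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> parts =
                List.of(List.of(1, 2, 3), List.of(4, 5, 6), List.of(7, 8));
        System.out.println(countEvens(parts)); // 4
    }
}
```

In a real cluster the "executors" are separate JVM processes on other machines, which is why the closures they run must be serializable.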

Data Abstractions

Comparison of characteristics of Spark’s three data APIs:

| API | Type Safety | Optimization | When to Use |
|---|---|---|---|
| RDD | Yes (generics) | Limited | When low-level control is needed |
| DataFrame | No (Row) | Catalyst optimization | SQL-style processing |
| Dataset | Yes (JavaBeans via Encoders) | Catalyst optimization | Type safety + optimization |

In most cases, use DataFrame, and choose Dataset when compile-time type checking is needed.

Operation Types

Spark operations are broadly divided into Transformations and Actions:

| Type | Characteristics | Examples |
|---|---|---|
| Transformation | Lazy evaluation, returns a new RDD/DataFrame | map, filter, groupBy |
| Action | Triggers execution, returns a value | collect, count, show |

Transformations do not execute when called; Spark only builds an optimized execution plan and runs it when an Action is invoked.
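The same laziness exists in the Java Stream API, which makes for a Spark-free demonstration (an analogy, not Spark code; the counter proves when elements are actually touched). Intermediate operations like peek() and map() play the role of transformations; a terminal operation plays the role of an action:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyPipeline {
    // Chain "transformations" without running them.
    public static int transformationsOnly(AtomicInteger touched) {
        Stream<Integer> pipeline = List.of(1, 2, 3, 4).stream()
                .peek(n -> touched.incrementAndGet())   // side effect marks execution
                .map(n -> n * 10);                      // pipeline built, not run
        return touched.get();                           // 0: nothing executed yet
    }

    // Add a terminal operation (the "action") to trigger execution.
    public static int withAction(AtomicInteger touched) {
        int sum = List.of(1, 2, 3, 4).stream()
                .peek(n -> touched.incrementAndGet())
                .mapToInt(n -> n * 10)
                .sum();                                 // forces the whole pipeline to run
        return touched.get();                           // 4: every element was processed
    }

    public static void main(String[] args) {
        System.out.println(transformationsOnly(new AtomicInteger())); // 0
        System.out.println(withAction(new AtomicInteger()));          // 4
    }
}
```

Spark goes further than Streams: because nothing runs until the action, the Catalyst optimizer can rewrite the whole plan (reorder filters, prune columns) before any data moves.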

Narrow vs Wide Transformation

Transformations are divided into two types depending on whether shuffle occurs:

| Type | Shuffle | Examples |
|---|---|---|
| Narrow | No (1:1 partition mapping) | map, filter, union |
| Wide | Yes (shuffle occurs) | groupBy, join, reduceByKey |

Wide transformations incur network I/O and can significantly impact performance. The individual concept documents cover how these ideas connect in detail.
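The distinction can be simulated in plain Java (a sketch with lists standing in for partitions, not Spark API): a narrow operation processes each partition independently, while a wide operation must pull matching records from every partition, which is exactly what the shuffle does over the network:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NarrowVsWide {
    // Narrow: each output partition depends on exactly one input partition,
    // so no data crosses partition boundaries.
    public static List<List<Integer>> mapPartitions(List<List<Integer>> partitions) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            out.add(p.stream().map(n -> n * 2).collect(Collectors.toList()));
        }
        return out;
    }

    // Wide: grouping by key must gather records from EVERY input partition --
    // on a cluster this regrouping is the shuffle, i.e. network I/O between executors.
    public static Map<Integer, List<Integer>> groupByParity(List<List<Integer>> partitions) {
        Map<Integer, List<Integer>> grouped = new HashMap<>();
        for (List<Integer> p : partitions) {
            for (int n : p) {
                grouped.computeIfAbsent(n % 2, k -> new ArrayList<>()).add(n);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = List.of(List.of(1, 2), List.of(3, 4));
        System.out.println(mapPartitions(parts)); // each partition doubled in place
        System.out.println(groupByParity(parts)); // evens and odds regrouped across partitions
    }
}
```

Note that mapPartitions never looks outside the current partition, while groupByParity cannot produce a single group without visiting all of them; that forced regrouping is why groupBy, join, and reduceByKey mark stage boundaries.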