Glossary#
Key terminology and concepts used in Spark. Each term links to related documentation.
Core Concepts#
Action#
Operations that trigger actual computation on an RDD/DataFrame and return results. Examples include count(), collect(), show(), and write().
→ Transformations and Actions
Application#
A Spark program submitted by the user. Consists of a Driver and Executors. → Architecture
Broadcast Variable#
A shared variable distributed to all nodes as read-only. Used for efficiently sharing small datasets. → Performance Tuning
Catalyst Optimizer#
Spark SQL’s query optimization engine. Transforms logical plans into optimized physical plans. → Spark SQL
Checkpoint#
A mechanism that saves RDD/DataFrame to reliable storage, breaking the Lineage and enabling faster failure recovery. → Caching and Persistence
Cluster Manager#
An external service that manages cluster resources. Options include Standalone, YARN, Kubernetes, and Mesos. → Architecture, Deployment and Cluster Management
Coalesce#
An operation that reduces the number of partitions by merging existing ones, avoiding a Shuffle. → Partitioning and Shuffle
DAG (Directed Acyclic Graph)#
A directed acyclic graph representing the dependency relationships of Transformations. Used by Spark to optimize execution plans. → Architecture
DataFrame#
A distributed data collection organized into named columns. In Java, it’s represented as Dataset<Row>.
→ DataFrame and Dataset
Dataset#
A distributed data collection with a specific type. Provides compile-time type safety. → DataFrame and Dataset
Driver#
The process that runs the Spark application’s main() function and creates the SparkSession. → Architecture
Executor#
A JVM process running on Worker nodes. Executes Tasks and stores data. → Architecture
Job#
A unit of parallel computation corresponding to a single Action. Consists of multiple Stages. → Architecture
Lazy Evaluation#
A mechanism where Transformations are not executed immediately but deferred until an Action is called. → Transformations and Actions
Lineage#
Information about how an RDD was created through Transformations. Used for failure recovery. → RDD Fundamentals
Narrow Transformation#
A Transformation where each input partition contributes to at most one output partition. No Shuffle occurs. → Transformations and Actions
Partition#
A logical unit of data division in RDD/DataFrame. Each partition is processed on a single node in the cluster. → Partitioning and Shuffle
Persist#
Storing RDD/DataFrame in memory or disk with a specified Storage Level. → Caching and Persistence
RDD (Resilient Distributed Dataset)#
Spark’s fundamental data abstraction. An immutable, distributed, fault-tolerant data collection. → RDD Fundamentals
Repartition#
An operation that changes the number of partitions. Causes Shuffle. → Partitioning and Shuffle
Serialization#
The process of converting objects to byte streams. Required for network transfer or disk storage. → Performance Tuning
Shuffle#
Data redistribution across partitions. Occurs in Wide Transformations and significantly impacts performance. → Partitioning and Shuffle
SparkContext#
An object representing the connection to a Spark cluster. Since Spark 2.0 it is wrapped by SparkSession and remains accessible as spark.sparkContext. → Architecture
SparkSession#
The unified entry point for Spark applications. Includes SparkContext, SQLContext, and HiveContext. → Quick Start, Architecture
Stage#
A set of Tasks separated by Shuffle boundaries. A single Job consists of multiple Stages. → Architecture
Storage Level#
Specifies how data is stored when caching. Options include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc. → Caching and Persistence
Task#
The smallest unit of work executed on a single Partition. Runs on Executors. → Architecture
Transformation#
Operations that create new RDDs/DataFrames from existing ones. Subject to Lazy Evaluation. → Transformations and Actions
Tungsten#
Spark’s execution engine optimization project. Improves memory management, code generation, etc. → Performance Tuning
Wide Transformation#
A Transformation where multiple input partitions contribute to a single output partition. Causes Shuffle. → Transformations and Actions
Worker Node#
A cluster node that runs Executors. → Architecture, Deployment and Cluster Management
Streaming Terms#
Micro-Batch#
The default mode of Structured Streaming that processes stream data in small batches. → Structured Streaming
Trigger#
Configuration that determines when stream processing is executed. → Structured Streaming
Watermark#
A threshold that specifies how long to wait for late-arriving data before finalizing windowed results. → Structured Streaming
Window#
Operations for time-based grouping. Includes Tumbling, Sliding, and Session Windows. → Structured Streaming
Machine Learning Terms#
Estimator#
An algorithm that is trained on data via the fit() method, producing a Transformer (a fitted model). → MLlib
Pipeline#
A workflow connecting multiple Estimators and Transformers. → MLlib
Transformer (ML)#
A component that transforms data using the transform() method. → MLlib
Configuration Related#
AQE (Adaptive Query Execution)#
A feature that dynamically optimizes query plans at runtime. Available in Spark 3.0+. → Performance Tuning
Broadcast Join#
A join method that distributes small tables to all nodes to join without Shuffle. → Performance Tuning
CBO (Cost-Based Optimization)#
An optimization technique that selects the optimal execution plan based on table statistics. → Performance Tuning
Dynamic Allocation#
A feature that automatically adjusts the number of Executors based on workload. → Performance Tuning, Deployment and Cluster Management