Glossary#

Key terminology and concepts used in Spark. Each term links to related documentation.

Core Concepts#

Action#

An operation that triggers actual computation on an RDD/DataFrame and returns a result to the Driver (or writes it out). Examples include count(), collect(), show(), and write(). → Transformations and Actions

Application#

A Spark program submitted by the user. Consists of one Driver and one or more Executors. → Architecture

Broadcast Variable#

A read-only shared variable distributed to every node. Used for efficiently sharing small datasets such as lookup tables. → Performance Tuning

Catalyst Optimizer#

Spark SQL’s query optimization engine. Transforms logical plans into optimized physical plans. → Spark SQL

Checkpoint#

A mechanism that saves RDD/DataFrame to reliable storage, breaking the Lineage and enabling faster failure recovery. → Caching and Persistence

Cluster Manager#

An external service that manages cluster resources. Options include Standalone, YARN, Kubernetes, and Mesos. → Architecture, Deployment and Cluster Management

Coalesce#

An operation that reduces the number of partitions. Merges partitions without Shuffle. → Partitioning and Shuffle

DAG (Directed Acyclic Graph)#

A directed acyclic graph representing the dependency relationships of Transformations. Used by Spark to optimize execution plans. → Architecture

DataFrame#

A distributed data collection organized into named columns. In Scala and Java it is represented as Dataset<Row>. → DataFrame and Dataset

Dataset#

A strongly typed distributed data collection. Provides compile-time type safety (available in Scala and Java). → DataFrame and Dataset

Driver#

The process that runs the Spark application’s main() function and creates the SparkSession. → Architecture

Executor#

A JVM process running on Worker nodes. Executes Tasks and stores data. → Architecture

Job#

A unit of parallel computation corresponding to a single Action. Consists of one or more Stages. → Architecture

Lazy Evaluation#

A mechanism where Transformations are not executed immediately but deferred until an Action is called. → Transformations and Actions

Lineage#

Information about how an RDD was created through Transformations. Used for failure recovery. → RDD Fundamentals

Narrow Transformation#

A Transformation where each input partition contributes to at most one output partition. No Shuffle occurs. → Transformations and Actions

Partition#

A logical unit of data division in RDD/DataFrame. Each partition is processed on a single node in the cluster. → Partitioning and Shuffle

Persist#

Storing RDD/DataFrame in memory or disk with a specified Storage Level. → Caching and Persistence

RDD (Resilient Distributed Dataset)#

Spark’s fundamental data abstraction. An immutable, distributed, fault-tolerant data collection. → RDD Fundamentals

Repartition#

An operation that changes the number of partitions. Causes Shuffle. → Partitioning and Shuffle

Serialization#

The process of converting objects to byte streams. Required for network transfer or disk storage. → Performance Tuning

Shuffle#

Data redistribution across partitions. Occurs in Wide Transformations and significantly impacts performance. → Partitioning and Shuffle

SparkContext#

An object representing the connection to a Spark cluster. Since Spark 2.0 it is wrapped by SparkSession and accessible as spark.sparkContext. → Architecture

SparkSession#

The unified entry point for Spark applications. Includes SparkContext, SQLContext, and HiveContext. → Quick Start, Architecture

Stage#

A set of Tasks separated by Shuffle boundaries. A single Job consists of one or more Stages. → Architecture

Storage Level#

Specifies how data is stored when caching. Options include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc. → Caching and Persistence

Task#

The smallest unit of work executed on a single Partition. Runs on Executors. → Architecture

Transformation#

Operations that create new RDD/DataFrames from existing ones. Subject to Lazy Evaluation. → Transformations and Actions

Tungsten#

Spark’s execution engine optimization project. Improves memory management, code generation, etc. → Performance Tuning

Wide Transformation#

A Transformation where a single input partition can contribute to multiple output partitions (and each output partition can depend on multiple input partitions). Causes a Shuffle. → Transformations and Actions

Worker Node#

A cluster node that runs Executors. → Architecture, Deployment and Cluster Management

Streaming Terms#

Micro-Batch#

The default mode of Structured Streaming that processes stream data in small batches. → Structured Streaming

Trigger#

Configuration that determines when stream processing is executed. → Structured Streaming

Watermark#

A threshold specifying how long to wait for late-arriving data before the state for old windows can be dropped. → Structured Streaming

Window#

Operations for time-based grouping. Includes Tumbling, Sliding, and Session Windows. → Structured Streaming

Machine Learning Terms#

Estimator#

An algorithm whose fit() method trains on a DataFrame and produces a Transformer (a fitted Model). → MLlib

Pipeline#

A workflow connecting multiple Estimators and Transformers. → MLlib

Transformer (ML)#

A component whose transform() method maps one DataFrame to another, for example by appending a prediction column. → MLlib

Performance Terms#

AQE (Adaptive Query Execution)#

A feature that dynamically re-optimizes query plans at runtime using statistics from completed stages. Introduced in Spark 3.0 and enabled by default since Spark 3.2. → Performance Tuning

Broadcast Join#

A join method that distributes small tables to all nodes to join without Shuffle. → Performance Tuning

CBO (Cost-Based Optimization)#

An optimization technique that selects the optimal execution plan based on table statistics. → Performance Tuning

Dynamic Allocation#

A feature that automatically adjusts the number of Executors based on workload. → Performance Tuning, Deployment and Cluster Management