Target Audience: Java developers interested in big data processing Prerequisites: Basic Java syntax, SQL fundamentals, Gradle/Maven experience After reading: You’ll be able to create a Spark project locally and query/aggregate CSV data with DataFrame
TL;DR
- SparkSession is Spark’s unified entry point, created with
SparkSession.builder().getOrCreate()- DataFrame API allows filtering, aggregation, and SQL queries
local[*]mode enables testing in local development environment
Run a Spark application and process data in 5 minutes. Following this guide, you’ll experience the entire process from project creation to data querying.
Overview#
The diagram below shows the Quick Start flow:
flowchart LR
A[1. Create Project] --> B[2. Add Dependencies]
B --> C[3. Create SparkSession]
C --> D[4. Process Data]
D --> E[5. View Results]This diagram shows the five-step Quick Start flow from project creation to viewing results.
Prerequisites#
Prepare the following environment before starting:
- Java 17+ (Java 8, 11 also supported, but 17 recommended)
- Gradle or Maven
- IDE (IntelliJ IDEA, VS Code, etc.)
Pre-start Verification
| Item | Command | Expected Result |
|---|---|---|
| Java | java -version | openjdk version "17.x.x" or higher |
| Gradle | gradle --version | Gradle 8.x or higher |
| Maven (alternative) | mvn --version | Apache Maven 3.x.x |
Step 1/5: Create Project (~1 min)#
Create a Java project using Spring Initializr or your IDE. This example starts with a plain Java project.
mkdir spark-quickstart
cd spark-quickstartStep 2/5: Gradle Setup (~2 min)#
Create a build.gradle file:
plugins {
id 'java'
id 'application'
}
group = 'com.example'
version = '1.0.0'
java {
sourceCompatibility = '17'
}
repositories {
mavenCentral()
}
dependencies {
// Spark Core
implementation 'org.apache.spark:spark-core_2.13:3.5.1'
// Spark SQL (for DataFrame, Dataset)
implementation 'org.apache.spark:spark-sql_2.13:3.5.1'
// Logging
implementation 'org.slf4j:slf4j-simple:2.0.9'
}
application {
mainClass = 'com.example.SparkQuickStart'
}
// Prevent Spark JAR conflicts
configurations.all {
exclude group: 'org.slf4j', module: 'slf4j-log4j12'
}Version note: The
2.13inspark-core_2.13is the Scala version. Even when using Java, the Scala runtime is required, so it must be specified.
Using Maven
If you prefer Maven, use the following pom.xml:
pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>spark-quickstart</artifactId>
<version>1.0.0</version>
<packaging>jar</packaging>
<properties>
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
<spark.version>3.5.1</spark.version>
<scala.version>2.13</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>2.0.9</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<mainClass>com.example.SparkQuickStart</mainClass>
</configuration>
</plugin>
</plugins>
</build>
</project>Run:
mvn compile exec:javaStep 3/5: Create Sample Data (~1 min)#
Create src/main/resources/employees.csv file:
id,name,department,salary
1,John Smith,Engineering,5000
2,Jane Doe,Marketing,4500
3,Bob Johnson,Engineering,5500
4,Alice Williams,Sales,4000
5,Charlie Brown,Engineering,6000
6,Diana Lee,Marketing,4800
7,Edward Kim,Sales,4200
8,Fiona Chen,Engineering,5200Step 4/5: Write Spark Application (~3 min)#
src/main/java/com/example/SparkQuickStart.java:
package com.example;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;
public class SparkQuickStart {
public static void main(String[] args) {
// 1. Create SparkSession - Entry point to Spark
SparkSession spark = SparkSession.builder()
.appName("Quick Start")
.master("local[*]") // Local mode, use all cores
.getOrCreate();
// Adjust log level (prevent excessive logs)
spark.sparkContext().setLogLevel("WARN");
System.out.println("=== Spark Quick Start ===\n");
// 2. Read CSV file
Dataset<Row> employees = spark.read()
.option("header", "true") // First line as header
.option("inferSchema", "true") // Auto-infer types
.csv("src/main/resources/employees.csv");
// 3. View data
System.out.println("=== All Employee Data ===");
employees.show();
// 4. Check schema
System.out.println("=== Schema ===");
employees.printSchema();
// 5. Filter - Salary >= 5000
System.out.println("=== Employees with Salary >= 5000 ===");
employees.filter(col("salary").geq(5000)).show();
// 6. Aggregate - Average salary by department
System.out.println("=== Average Salary by Department ===");
employees.groupBy("department")
.agg(
avg("salary").alias("avg_salary"),
count("*").alias("employee_count")
)
.orderBy(desc("avg_salary"))
.show();
// 7. SQL - Same operation with SQL
employees.createOrReplaceTempView("employees");
System.out.println("=== SQL Query: Engineering Department ===");
spark.sql("""
SELECT name, salary
FROM employees
WHERE department = 'Engineering'
ORDER BY salary DESC
""").show();
// 8. Close SparkSession
spark.stop();
System.out.println("=== Complete ===");
}
}Step 5/5: Run (~1 min)#
./gradlew runFor Windows:
gradlew.bat runExpected Output#
When run successfully, you should see output like this:
=== Spark Quick Start ===
=== All Employee Data ===
+---+--------------+-----------+------+
| id| name| department|salary|
+---+--------------+-----------+------+
| 1| John Smith|Engineering| 5000|
| 2| Jane Doe| Marketing| 4500|
| 3| Bob Johnson|Engineering| 5500|
| 4|Alice Williams| Sales| 4000|
| 5| Charlie Brown|Engineering| 6000|
| 6| Diana Lee| Marketing| 4800|
| 7| Edward Kim| Sales| 4200|
| 8| Fiona Chen|Engineering| 5200|
+---+--------------+-----------+------+
=== Schema ===
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- department: string (nullable = true)
|-- salary: integer (nullable = true)
=== Employees with Salary >= 5000 ===
+---+-------------+-----------+------+
| id| name| department|salary|
+---+-------------+-----------+------+
| 1| John Smith|Engineering| 5000|
| 3| Bob Johnson|Engineering| 5500|
| 5|Charlie Brown|Engineering| 6000|
| 8| Fiona Chen|Engineering| 5200|
+---+-------------+-----------+------+
=== Average Salary by Department ===
+-----------+----------+--------------+
| department|avg_salary|employee_count|
+-----------+----------+--------------+
|Engineering| 5425.0| 4|
| Marketing| 4650.0| 2|
| Sales| 4100.0| 2|
+-----------+----------+--------------+
=== SQL Query: Engineering Department ===
+-------------+------+
| name|salary|
+-------------+------+
|Charlie Brown| 6000|
| Bob Johnson| 5500|
| Fiona Chen| 5200|
| John Smith| 5000|
+-------------+------+
=== Complete ===Congratulations! You’ve successfully run your first Spark application.
Production-Level Code#
Advanced Topic
This section covers advanced patterns for production environments. You can skip this section if you are just learning the Quick Start.
In real production environments, exception handling and resource cleanup are essential:
package com.example;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.AnalysisException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import static org.apache.spark.sql.functions.*;
public class SparkQuickStartProduction {
private static final Logger logger = LoggerFactory.getLogger(SparkQuickStartProduction.class);
public static void main(String[] args) {
SparkSession spark = null;
int exitCode = 0;
try {
// Create SparkSession
spark = SparkSession.builder()
.appName("Quick Start Production")
.master("local[*]")
.config("spark.sql.session.timeZone", "UTC")
.getOrCreate();
spark.sparkContext().setLogLevel("WARN");
logger.info("SparkSession created successfully");
// Read data (with error mode)
Dataset<Row> employees = spark.read()
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "FAILFAST") // Fail immediately on bad data
.csv("src/main/resources/employees.csv");
// Validate schema
validateSchema(employees);
// Process data
Dataset<Row> result = employees
.filter(col("salary").isNotNull())
.groupBy("department")
.agg(
avg("salary").alias("avg_salary"),
count("*").alias("count")
);
// Output results
result.show();
logger.info("Processing complete: {} departments", result.count());
} catch (AnalysisException e) {
logger.error("Data analysis error: {}", e.getMessage());
exitCode = 1;
} catch (Exception e) {
logger.error("Spark job failed", e);
exitCode = 1;
} finally {
// Resource cleanup (always execute)
if (spark != null) {
spark.stop();
logger.info("SparkSession closed");
}
}
System.exit(exitCode);
}
private static void validateSchema(Dataset<Row> df) {
String[] requiredColumns = {"id", "name", "department", "salary"};
for (String col : requiredColumns) {
if (!java.util.Arrays.asList(df.columns()).contains(col)) {
throw new IllegalArgumentException("Missing required column: " + col);
}
}
}
}Key Points for Production Code:
The table below summarizes essential code patterns for production environments:
| Item | Description |
|---|---|
try-finally | Ensures SparkSession is always cleaned up |
option("mode", "FAILFAST") | Fail immediately when bad data is found |
Logger | Structured logging instead of System.out |
exitCode | Exit code for script/CI integration |
validateSchema | Runtime schema validation |
Each item helps prevent or debug issues that can occur in production environments.
What Just Happened?#
Let’s examine how each step works in the code above.
1. SparkSession Creation
SparkSession spark = SparkSession.builder()
.appName("Quick Start")
.master("local[*]")
.getOrCreate();SparkSession: Unified entry point since Spark 2.0. Consolidates previousSparkContext,SQLContext, andHiveContextappName: Application name displayed in Spark UImaster("local[*]"): Run in local mode,*uses all available CPU coreslocal: Single threadlocal[4]: 4 threadslocal[*]: All cores- In cluster environments, use
spark://master:7077,yarn, etc.
2. Reading Data
Dataset<Row> employees = spark.read()
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/employees.csv");Dataset<Row>: Spark’s distributed data collection. In Java, Dataset ofRowtype serves as DataFrameoption("inferSchema", "true"): Samples data to auto-infer column types- Spark supports various data sources: CSV, JSON, Parquet, JDBC, etc.
3. DataFrame Operations
employees.filter(col("salary").geq(5000))- Similar to Java’s Stream API, but distributed
filter,select,groupBy, etc. are Transformations — lazily evaluated (not executed immediately; instead, they are batched and processed together when an Action is called)show,collect,count, etc. are Actions — execute actual computation
4. Using SQL
employees.createOrReplaceTempView("employees");
spark.sql("SELECT * FROM employees WHERE ...");- Register DataFrame as a temporary view to enable SQL queries
- Leverage existing SQL knowledge directly
- Uses the same execution engine (Catalyst Optimizer) internally
Java Developer Comparison#
Comparing existing Java code with Spark code makes similarities and differences easy to understand.
Java Stream vs Spark DataFrame
// Java Stream (single JVM)
List<Employee> highEarners = employees.stream()
.filter(e -> e.getSalary() >= 5000)
.collect(Collectors.toList());
// Spark DataFrame (distributed processing)
Dataset<Row> highEarners = employees
.filter(col("salary").geq(5000));The two code snippets are very similar, but:
- Java Stream: Runs in single JVM, memory limited
- Spark DataFrame: Distributed across multiple nodes, can process tens of TBs
Troubleshooting#
Common problems and solutions when running Spark.
Too Many Logs
Spark outputs many logs by default. Add a log4j2.properties file to src/main/resources or:
spark.sparkContext().setLogLevel("WARN"); // or "ERROR"Java Version Error
Spark 3.5 supports Java 8, 11, and 17. Java 21 is not yet officially supported.
Error: A JNI error has occurred→ Check Java version: java -version
OutOfMemoryError
Default memory may be insufficient for local execution:
./gradlew run -Dorg.gradle.jvmargs="-Xmx2g"Or add to build.gradle:
application {
applicationDefaultJvmArgs = ['-Xmx2g']
}Hadoop-related Errors on Windows
When running on Windows, you may see warnings related to winutils.exe. This doesn’t affect functionality, but to resolve:
- Download from winutils
- Save to
C:\hadoop\bin\winutils.exe - Set environment variable
HADOOP_HOME=C:\hadoop
Checking Spark UI#
While the Spark application is running, access http://localhost:4040 to view the Spark UI:
- Jobs: List and status of executed Jobs
- Stages: Task distribution for each Stage
- Storage: Cached RDD/DataFrame information
- Environment: Spark configuration values
Note: The UI shuts down when the application terminates. To check before shutdown, add
Thread.sleep(60000);beforespark.stop().
Next Steps#
After completing Quick Start, choose your next document based on your learning goals:
| Goal | Recommended Document |
|---|---|
| Understand Spark internals | Architecture |
| Learn RDD basics | RDD Basics |
| Deep dive into DataFrame | DataFrame and Dataset |
| Spring Boot integration | Environment Setup |
To understand Spark’s overall operation, read the Architecture document first. For a practice-focused approach, start with the Environment Setup document.