Target Audience: Java developers interested in big data processing
Prerequisites: Basic Java syntax, SQL fundamentals, Gradle/Maven experience
After reading: You'll be able to create a Spark project locally and query/aggregate CSV data with the DataFrame API
TL;DR
- SparkSession is Spark's unified entry point, created with `SparkSession.builder().getOrCreate()`
- The DataFrame API allows filtering, aggregation, and SQL queries
- `local[*]` mode enables testing in a local development environment

Run a Spark application and process data in 5 minutes. Following this guide, you'll experience the entire process from project creation to data querying.
Overview#
The diagram below shows the Quick Start flow:
```mermaid
flowchart LR
    A[1. Create Project] --> B[2. Add Dependencies]
    B --> C[3. Create SparkSession]
    C --> D[4. Process Data]
    D --> E[5. View Results]
```

Prerequisites#
Prepare the following environment before starting:
- Java 17+ (Java 8, 11 also supported, but 17 recommended)
- Gradle or Maven
- IDE (IntelliJ IDEA, VS Code, etc.)
Pre-start Verification
| Item | Command | Expected Result |
|---|---|---|
| Java | `java -version` | openjdk version "17.x.x" or higher |
| Gradle | `gradle --version` | Gradle 8.x or higher |
| Maven (alternative) | `mvn --version` | Apache Maven 3.x.x |
Step 1/5: Create Project (~1 min)#
Create a Java project using Spring Initializr or your IDE. This example starts with a plain Java project.
```shell
mkdir spark-quickstart
cd spark-quickstart
```

Step 2/5: Gradle Setup (~2 min)#
Create a build.gradle file:
```groovy
plugins {
    id 'java'
    id 'application'
}

group = 'com.example'
version = '1.0.0'

java {
    sourceCompatibility = '17'
}

repositories {
    mavenCentral()
}

dependencies {
    // Spark Core
    implementation 'org.apache.spark:spark-core_2.13:3.5.1'
    // Spark SQL (for DataFrame, Dataset)
    implementation 'org.apache.spark:spark-sql_2.13:3.5.1'
    // Logging
    implementation 'org.slf4j:slf4j-simple:2.0.9'
}

application {
    mainClass = 'com.example.SparkQuickStart'
}

// Prevent Spark JAR conflicts
configurations.all {
    exclude group: 'org.slf4j', module: 'slf4j-log4j12'
}
```

Version note: The `2.13` in `spark-core_2.13` is the Scala version. Even when using Java, the Scala runtime is required, so it must be specified.
Using Maven
If you prefer Maven, use the following pom.xml:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>spark-quickstart</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <spark.version>3.5.1</spark.version>
        <scala.version>2.13</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>2.0.9</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <mainClass>com.example.SparkQuickStart</mainClass>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
```

Run:

```shell
mvn compile exec:java
```

Step 3/5: Create Sample Data (~1 min)#
Create src/main/resources/employees.csv file:
```csv
id,name,department,salary
1,John Smith,Engineering,5000
2,Jane Doe,Marketing,4500
3,Bob Johnson,Engineering,5500
4,Alice Williams,Sales,4000
5,Charlie Brown,Engineering,6000
6,Diana Lee,Marketing,4800
7,Edward Kim,Sales,4200
8,Fiona Chen,Engineering,5200
```

Step 4/5: Write Spark Application (~3 min)#
src/main/java/com/example/SparkQuickStart.java:
```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.*;

public class SparkQuickStart {
    public static void main(String[] args) {
        // 1. Create SparkSession - Entry point to Spark
        SparkSession spark = SparkSession.builder()
                .appName("Quick Start")
                .master("local[*]") // Local mode, use all cores
                .getOrCreate();

        // Adjust log level (prevent excessive logs)
        spark.sparkContext().setLogLevel("WARN");

        System.out.println("=== Spark Quick Start ===\n");

        // 2. Read CSV file
        Dataset<Row> employees = spark.read()
                .option("header", "true")      // First line as header
                .option("inferSchema", "true") // Auto-infer types
                .csv("src/main/resources/employees.csv");

        // 3. View data
        System.out.println("=== All Employee Data ===");
        employees.show();

        // 4. Check schema
        System.out.println("=== Schema ===");
        employees.printSchema();

        // 5. Filter - Salary >= 5000
        System.out.println("=== Employees with Salary >= 5000 ===");
        employees.filter(col("salary").geq(5000)).show();

        // 6. Aggregate - Average salary by department
        System.out.println("=== Average Salary by Department ===");
        employees.groupBy("department")
                .agg(
                        avg("salary").alias("avg_salary"),
                        count("*").alias("employee_count")
                )
                .orderBy(desc("avg_salary"))
                .show();

        // 7. SQL - Same operation with SQL
        employees.createOrReplaceTempView("employees");
        System.out.println("=== SQL Query: Engineering Department ===");
        spark.sql("""
                SELECT name, salary
                FROM employees
                WHERE department = 'Engineering'
                ORDER BY salary DESC
                """).show();

        // 8. Close SparkSession
        spark.stop();
        System.out.println("=== Complete ===");
    }
}
```

Step 5/5: Run (~1 min)#
```shell
./gradlew run
```

For Windows:

```shell
gradlew.bat run
```

Expected Output#
When run successfully, you should see output like this:
```text
=== Spark Quick Start ===

=== All Employee Data ===
+---+--------------+-----------+------+
| id|          name| department|salary|
+---+--------------+-----------+------+
|  1|    John Smith|Engineering|  5000|
|  2|      Jane Doe|  Marketing|  4500|
|  3|   Bob Johnson|Engineering|  5500|
|  4|Alice Williams|      Sales|  4000|
|  5| Charlie Brown|Engineering|  6000|
|  6|     Diana Lee|  Marketing|  4800|
|  7|    Edward Kim|      Sales|  4200|
|  8|    Fiona Chen|Engineering|  5200|
+---+--------------+-----------+------+

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)

=== Employees with Salary >= 5000 ===
+---+-------------+-----------+------+
| id|         name| department|salary|
+---+-------------+-----------+------+
|  1|   John Smith|Engineering|  5000|
|  3|  Bob Johnson|Engineering|  5500|
|  5|Charlie Brown|Engineering|  6000|
|  8|   Fiona Chen|Engineering|  5200|
+---+-------------+-----------+------+

=== Average Salary by Department ===
+-----------+----------+--------------+
| department|avg_salary|employee_count|
+-----------+----------+--------------+
|Engineering|    5425.0|             4|
|  Marketing|    4650.0|             2|
|      Sales|    4100.0|             2|
+-----------+----------+--------------+

=== SQL Query: Engineering Department ===
+-------------+------+
|         name|salary|
+-------------+------+
|Charlie Brown|  6000|
|  Bob Johnson|  5500|
|   Fiona Chen|  5200|
|   John Smith|  5000|
+-------------+------+

=== Complete ===
```

Congratulations! You've successfully run your first Spark application.
Production-Level Code#
In real production environments, exception handling and resource cleanup are essential:
```java
package com.example;

import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.apache.spark.sql.functions.*;

public class SparkQuickStartProduction {
    private static final Logger logger = LoggerFactory.getLogger(SparkQuickStartProduction.class);

    public static void main(String[] args) {
        SparkSession spark = null;
        int exitCode = 0;
        try {
            // Create SparkSession
            spark = SparkSession.builder()
                    .appName("Quick Start Production")
                    .master("local[*]")
                    .config("spark.sql.session.timeZone", "UTC")
                    .getOrCreate();
            spark.sparkContext().setLogLevel("WARN");
            logger.info("SparkSession created successfully");

            // Read data (with error mode)
            Dataset<Row> employees = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .option("mode", "FAILFAST") // Fail immediately on bad data
                    .csv("src/main/resources/employees.csv");

            // Validate schema
            validateSchema(employees);

            // Process data
            Dataset<Row> result = employees
                    .filter(col("salary").isNotNull())
                    .groupBy("department")
                    .agg(
                            avg("salary").alias("avg_salary"),
                            count("*").alias("count")
                    );

            // Output results
            result.show();
            logger.info("Processing complete: {} departments", result.count());
        } catch (AnalysisException e) {
            logger.error("Data analysis error: {}", e.getMessage());
            exitCode = 1;
        } catch (Exception e) {
            logger.error("Spark job failed", e);
            exitCode = 1;
        } finally {
            // Resource cleanup (always execute)
            if (spark != null) {
                spark.stop();
                logger.info("SparkSession closed");
            }
        }
        System.exit(exitCode);
    }

    private static void validateSchema(Dataset<Row> df) {
        String[] requiredColumns = {"id", "name", "department", "salary"};
        for (String col : requiredColumns) {
            if (!java.util.Arrays.asList(df.columns()).contains(col)) {
                throw new IllegalArgumentException("Missing required column: " + col);
            }
        }
    }
}
```

Key Points for Production Code:
The table below summarizes essential code patterns for production environments:
| Item | Description |
|---|---|
| `try`/`finally` | Ensures SparkSession is always cleaned up |
| `option("mode", "FAILFAST")` | Fail immediately when bad data is found |
| `Logger` | Structured logging instead of `System.out` |
| `exitCode` | Exit code for script/CI integration |
| `validateSchema` | Runtime schema validation |
Each item helps prevent or debug issues that can occur in production environments.
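`FAILFAST` is only one of the three parse modes the CSV reader accepts; the others trade strictness for availability. The sketch below (class name `ParseModeDemo` is illustrative, assuming the employees.csv from Step 3) shows the more tolerant `DROPMALFORMED` mode:

```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParseModeDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Parse Modes")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("WARN");

        // CSV parse modes:
        //   PERMISSIVE (default): malformed fields become null
        //   DROPMALFORMED:        rows that fail to parse are dropped
        //   FAILFAST:             throw on the first bad row (used above)
        Dataset<Row> tolerant = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .option("mode", "DROPMALFORMED")
                .csv("src/main/resources/employees.csv");

        tolerant.show();
        spark.stop();
    }
}
```

`DROPMALFORMED` suits exploratory work; for pipelines where silent row loss is unacceptable, `FAILFAST` plus schema validation, as in the production example, is the safer default.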
What Just Happened?#
Let’s examine how each step works in the code above.
1. SparkSession Creation

```java
SparkSession spark = SparkSession.builder()
        .appName("Quick Start")
        .master("local[*]")
        .getOrCreate();
```

- `SparkSession`: Unified entry point since Spark 2.0. Consolidates the previous `SparkContext`, `SQLContext`, and `HiveContext`
- `appName`: Application name displayed in the Spark UI
- `master("local[*]")`: Run in local mode; `*` uses all available CPU cores
  - `local`: Single thread
  - `local[4]`: 4 threads
  - `local[*]`: All cores
  - In cluster environments, use `spark://master:7077`, `yarn`, etc.
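The name `getOrCreate` is literal: it returns the already-active session if one exists in the JVM instead of building a second one. A minimal sketch (class name `SessionReuseDemo` is illustrative, assuming Spark 3.5 on the classpath):

```java
package com.example;

import org.apache.spark.sql.SparkSession;

public class SessionReuseDemo {
    public static void main(String[] args) {
        SparkSession first = SparkSession.builder()
                .appName("Quick Start")
                .master("local[2]") // two worker threads instead of all cores
                .getOrCreate();

        // A second builder call does NOT create a new session:
        // it hands back the active one, ignoring new builder settings
        SparkSession second = SparkSession.builder().getOrCreate();

        System.out.println(first == second); // prints "true"
        first.stop();
    }
}
```

This is why libraries can safely call the builder themselves: everyone in the JVM shares one session.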
2. Reading Data
```java
Dataset<Row> employees = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("src/main/resources/employees.csv");
```

- `Dataset<Row>`: Spark's distributed data collection. In Java, a Dataset of `Row` serves as the DataFrame
- `option("inferSchema", "true")`: Samples the data to auto-infer column types
- Spark supports various data sources: CSV, JSON, Parquet, JDBC, etc.
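Schema inference costs an extra pass over the file and can guess wrong on messy data. As an alternative, the schema can be declared up front with `StructType`; a sketch (class name `ExplicitSchemaDemo` is illustrative, column names match the CSV from Step 3):

```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Explicit Schema")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("WARN");

        // Declare column names and types explicitly:
        // no sampling pass, and type drift in the file surfaces as an error
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("name", DataTypes.StringType)
                .add("department", DataTypes.StringType)
                .add("salary", DataTypes.IntegerType);

        Dataset<Row> employees = spark.read()
                .option("header", "true")
                .schema(schema) // replaces inferSchema
                .csv("src/main/resources/employees.csv");

        employees.printSchema();
        spark.stop();
    }
}
```

For quick experiments `inferSchema` is fine; for anything repeatable, an explicit schema is the usual choice.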
3. DataFrame Operations
```java
employees.filter(col("salary").geq(5000))
```

- Similar to Java's Stream API, but distributed
- `filter`, `select`, `groupBy`, etc. are Transformations (lazily evaluated)
- `show`, `collect`, `count`, etc. are Actions (execute the actual computation)
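The transformation/action split is easy to observe yourself: a chain of transformations returns instantly, and work only happens at the action. A sketch (class name `LazyEvalDemo` is illustrative, assuming the employees.csv from Step 3):

```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class LazyEvalDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Lazy Evaluation")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("WARN");

        Dataset<Row> employees = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("src/main/resources/employees.csv");

        // Transformation: nothing is filtered yet,
        // Spark only records the logical plan
        Dataset<Row> highEarners = employees.filter(col("salary").geq(5000));

        // Action: the whole plan (read -> filter -> count) runs here
        long n = highEarners.count();
        System.out.println("High earners: " + n); // 4 with the sample data

        spark.stop();
    }
}
```

Laziness lets the Catalyst optimizer see the full pipeline before running it, so it can, for example, push the filter down into the file scan.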
4. Using SQL
```java
employees.createOrReplaceTempView("employees");
spark.sql("SELECT * FROM employees WHERE ...");
```

- Register the DataFrame as a temporary view to enable SQL queries
- Leverage existing SQL knowledge directly
- Uses the same execution engine (Catalyst Optimizer) internally
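Because both paths share the same engine, the SQL and DataFrame forms of a query are interchangeable. A sketch showing the same aggregation both ways (class name `SqlVsApiDemo` is illustrative, assuming the employees.csv from Step 3):

```java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.avg;

public class SqlVsApiDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SQL vs API")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("WARN");

        Dataset<Row> employees = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("src/main/resources/employees.csv");
        employees.createOrReplaceTempView("employees");

        // SQL form - familiar to anyone who knows SQL
        Dataset<Row> viaSql = spark.sql(
                "SELECT department, AVG(salary) AS avg_salary "
                + "FROM employees GROUP BY department");

        // DataFrame form - compiles to the same logical plan
        Dataset<Row> viaApi = employees.groupBy("department")
                .agg(avg("salary").alias("avg_salary"));

        viaSql.show();
        viaApi.show();
        spark.stop();
    }
}
```

Both print the same three department averages; the choice is a matter of team preference, not performance.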
Java Developer Comparison#
Comparing existing Java code with Spark code makes similarities and differences easy to understand.
Java Stream vs Spark DataFrame
```java
// Java Stream (single JVM)
List<Employee> highEarners = employees.stream()
        .filter(e -> e.getSalary() >= 5000)
        .collect(Collectors.toList());

// Spark DataFrame (distributed processing)
Dataset<Row> highEarners = employees
        .filter(col("salary").geq(5000));
```

The two code snippets are very similar, but:
- Java Stream: Runs in single JVM, memory limited
- Spark DataFrame: Distributed across multiple nodes, can process tens of TBs
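The two worlds also connect: a small Spark result can be pulled back into ordinary Java collections with `collectAsList()`. A sketch (class name `CollectDemo` is illustrative, assuming the employees.csv from Step 3):

```java
package com.example;

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class CollectDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Collect Demo")
                .master("local[*]")
                .getOrCreate();
        spark.sparkContext().setLogLevel("WARN");

        Dataset<Row> employees = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("src/main/resources/employees.csv");

        // collectAsList() materializes the result in the driver JVM.
        // Only do this after filtering/aggregating down to something
        // that fits in driver memory.
        List<Row> rows = employees
                .filter(col("salary").geq(5000))
                .collectAsList();

        for (Row r : rows) {
            // Columns by position: 0=id, 1=name, 2=department, 3=salary
            System.out.println(r.getString(1) + " -> " + r.getInt(3));
        }
        spark.stop();
    }
}
```

From here the rows are plain Java objects and can feed any existing Stream-based code.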
Troubleshooting#
Common problems and solutions when running Spark.
Too Many Logs
Spark outputs many logs by default. Add a log4j2.properties file to src/main/resources, or set the level in code:

```java
spark.sparkContext().setLogLevel("WARN"); // or "ERROR"
```

Java Version Error

Spark 3.5 supports Java 8, 11, and 17. Java 21 is not yet officially supported.

`Error: A JNI error has occurred` → Check your Java version: `java -version`
OutOfMemoryError
Default memory may be insufficient for local execution:
```shell
./gradlew run -Dorg.gradle.jvmargs="-Xmx2g"
```

Or add to build.gradle:

```groovy
application {
    applicationDefaultJvmArgs = ['-Xmx2g']
}
```

Hadoop-related Errors on Windows
When running on Windows, you may see warnings related to winutils.exe. This doesn’t affect functionality, but to resolve:
- Download winutils.exe from the winutils repository
- Save it to `C:\hadoop\bin\winutils.exe`
- Set the environment variable `HADOOP_HOME=C:\hadoop`
Checking Spark UI#
While the Spark application is running, access http://localhost:4040 to view the Spark UI:
- Jobs: List and status of executed Jobs
- Stages: Task distribution for each Stage
- Storage: Cached RDD/DataFrame information
- Environment: Spark configuration values
Note: The UI shuts down when the application terminates. To inspect it before shutdown, add `Thread.sleep(60000);` before `spark.stop()`.
Next Steps#
After completing Quick Start, choose your next document based on your learning goals:
| Goal | Recommended Document |
|---|---|
| Understand Spark internals | Architecture |
| Learn RDD basics | RDD Basics |
| Deep dive into DataFrame | DataFrame and Dataset |
| Spring Boot integration | Environment Setup |
To understand Spark’s overall operation, read the Architecture document first. For a practice-focused approach, start with the Environment Setup document.