Target Audience: Java developers interested in big data processing Prerequisites: Basic Java syntax, SQL fundamentals, Gradle/Maven experience After reading: You’ll be able to create a Spark project locally and query/aggregate CSV data with DataFrame

TL;DR
  • SparkSession is Spark’s unified entry point, created with SparkSession.builder().getOrCreate()
  • DataFrame API allows filtering, aggregation, and SQL queries
  • local[*] mode enables testing in local development environment

Run a Spark application and process data in 5 minutes. Following this guide, you’ll experience the entire process from project creation to data querying.

Overview#

The diagram below shows the Quick Start flow:

flowchart LR
    A[1. Create Project] --> B[2. Add Dependencies]
    B --> C[3. Create SparkSession]
    C --> D[4. Process Data]
    D --> E[5. View Results]

This diagram shows the five-step Quick Start flow from project creation to viewing results.

Prerequisites#

Prepare the following environment before starting:

  • Java 17+ (Java 8, 11 also supported, but 17 recommended)
  • Gradle or Maven
  • IDE (IntelliJ IDEA, VS Code, etc.)

Pre-start Verification

ItemCommandExpected Result
Javajava -versionopenjdk version "17.x.x" or higher
Gradlegradle --versionGradle 8.x or higher
Maven (alternative)mvn --versionApache Maven 3.x.x

Step 1/5: Create Project (~1 min)#

Create a Java project using Spring Initializr or your IDE. This example starts with a plain Java project.

mkdir spark-quickstart
cd spark-quickstart

Step 2/5: Gradle Setup (~2 min)#

Create a build.gradle file:

plugins {
    id 'java'
    id 'application'
}

group = 'com.example'
version = '1.0.0'

java {
    sourceCompatibility = '17'
}

repositories {
    mavenCentral()
}

dependencies {
    // Spark Core
    implementation 'org.apache.spark:spark-core_2.13:3.5.1'
    // Spark SQL (for DataFrame, Dataset)
    implementation 'org.apache.spark:spark-sql_2.13:3.5.1'

    // Logging
    implementation 'org.slf4j:slf4j-simple:2.0.9'
}

application {
    mainClass = 'com.example.SparkQuickStart'
}

// Prevent Spark JAR conflicts
configurations.all {
    exclude group: 'org.slf4j', module: 'slf4j-log4j12'
}

Version note: The 2.13 in spark-core_2.13 is the Scala version. Even when using Java, the Scala runtime is required, so it must be specified.

Using Maven

If you prefer Maven, use the following pom.xml:

pom.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>spark-quickstart</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <spark.version>3.5.1</spark.version>
        <scala.version>2.13</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>2.0.9</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <mainClass>com.example.SparkQuickStart</mainClass>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Run:

mvn compile exec:java

Step 3/5: Create Sample Data (~1 min)#

Create src/main/resources/employees.csv file:

id,name,department,salary
1,John Smith,Engineering,5000
2,Jane Doe,Marketing,4500
3,Bob Johnson,Engineering,5500
4,Alice Williams,Sales,4000
5,Charlie Brown,Engineering,6000
6,Diana Lee,Marketing,4800
7,Edward Kim,Sales,4200
8,Fiona Chen,Engineering,5200

Step 4/5: Write Spark Application (~3 min)#

src/main/java/com/example/SparkQuickStart.java:

package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class SparkQuickStart {

    public static void main(String[] args) {
        // 1. Create SparkSession - Entry point to Spark
        SparkSession spark = SparkSession.builder()
                .appName("Quick Start")
                .master("local[*]")  // Local mode, use all cores
                .getOrCreate();

        // Adjust log level (prevent excessive logs)
        spark.sparkContext().setLogLevel("WARN");

        System.out.println("=== Spark Quick Start ===\n");

        // 2. Read CSV file
        Dataset<Row> employees = spark.read()
                .option("header", "true")       // First line as header
                .option("inferSchema", "true")  // Auto-infer types
                .csv("src/main/resources/employees.csv");

        // 3. View data
        System.out.println("=== All Employee Data ===");
        employees.show();

        // 4. Check schema
        System.out.println("=== Schema ===");
        employees.printSchema();

        // 5. Filter - Salary >= 5000
        System.out.println("=== Employees with Salary >= 5000 ===");
        employees.filter(col("salary").geq(5000)).show();

        // 6. Aggregate - Average salary by department
        System.out.println("=== Average Salary by Department ===");
        employees.groupBy("department")
                .agg(
                    avg("salary").alias("avg_salary"),
                    count("*").alias("employee_count")
                )
                .orderBy(desc("avg_salary"))
                .show();

        // 7. SQL - Same operation with SQL
        employees.createOrReplaceTempView("employees");

        System.out.println("=== SQL Query: Engineering Department ===");
        spark.sql("""
            SELECT name, salary
            FROM employees
            WHERE department = 'Engineering'
            ORDER BY salary DESC
            """).show();

        // 8. Close SparkSession
        spark.stop();

        System.out.println("=== Complete ===");
    }
}

Step 5/5: Run (~1 min)#

./gradlew run

For Windows:

gradlew.bat run

Expected Output#

When run successfully, you should see output like this:

=== Spark Quick Start ===

=== All Employee Data ===
+---+--------------+-----------+------+
| id|          name| department|salary|
+---+--------------+-----------+------+
|  1|    John Smith|Engineering|  5000|
|  2|      Jane Doe|  Marketing|  4500|
|  3|   Bob Johnson|Engineering|  5500|
|  4|Alice Williams|      Sales|  4000|
|  5| Charlie Brown|Engineering|  6000|
|  6|     Diana Lee|  Marketing|  4800|
|  7|    Edward Kim|      Sales|  4200|
|  8|    Fiona Chen|Engineering|  5200|
+---+--------------+-----------+------+

=== Schema ===
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)

=== Employees with Salary >= 5000 ===
+---+-------------+-----------+------+
| id|         name| department|salary|
+---+-------------+-----------+------+
|  1|   John Smith|Engineering|  5000|
|  3|  Bob Johnson|Engineering|  5500|
|  5|Charlie Brown|Engineering|  6000|
|  8|   Fiona Chen|Engineering|  5200|
+---+-------------+-----------+------+

=== Average Salary by Department ===
+-----------+----------+--------------+
| department|avg_salary|employee_count|
+-----------+----------+--------------+
|Engineering|    5425.0|             4|
|  Marketing|    4650.0|             2|
|      Sales|    4100.0|             2|
+-----------+----------+--------------+

=== SQL Query: Engineering Department ===
+-------------+------+
|         name|salary|
+-------------+------+
|Charlie Brown|  6000|
|  Bob Johnson|  5500|
|   Fiona Chen|  5200|
|   John Smith|  5000|
+-------------+------+

=== Complete ===

Congratulations! You’ve successfully run your first Spark application.


Production-Level Code#

Advanced Topic
This section covers advanced patterns for production environments. You can skip this section if you are just learning the Quick Start.

In real production environments, exception handling and resource cleanup are essential:

package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.AnalysisException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.apache.spark.sql.functions.*;

public class SparkQuickStartProduction {
    private static final Logger logger = LoggerFactory.getLogger(SparkQuickStartProduction.class);

    public static void main(String[] args) {
        SparkSession spark = null;
        int exitCode = 0;

        try {
            // Create SparkSession
            spark = SparkSession.builder()
                    .appName("Quick Start Production")
                    .master("local[*]")
                    .config("spark.sql.session.timeZone", "UTC")
                    .getOrCreate();

            spark.sparkContext().setLogLevel("WARN");
            logger.info("SparkSession created successfully");

            // Read data (with error mode)
            Dataset<Row> employees = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .option("mode", "FAILFAST")  // Fail immediately on bad data
                    .csv("src/main/resources/employees.csv");

            // Validate schema
            validateSchema(employees);

            // Process data
            Dataset<Row> result = employees
                    .filter(col("salary").isNotNull())
                    .groupBy("department")
                    .agg(
                        avg("salary").alias("avg_salary"),
                        count("*").alias("count")
                    );

            // Output results
            result.show();
            logger.info("Processing complete: {} departments", result.count());

        } catch (AnalysisException e) {
            logger.error("Data analysis error: {}", e.getMessage());
            exitCode = 1;
        } catch (Exception e) {
            logger.error("Spark job failed", e);
            exitCode = 1;
        } finally {
            // Resource cleanup (always execute)
            if (spark != null) {
                spark.stop();
                logger.info("SparkSession closed");
            }
        }

        System.exit(exitCode);
    }

    private static void validateSchema(Dataset<Row> df) {
        String[] requiredColumns = {"id", "name", "department", "salary"};
        for (String col : requiredColumns) {
            if (!java.util.Arrays.asList(df.columns()).contains(col)) {
                throw new IllegalArgumentException("Missing required column: " + col);
            }
        }
    }
}

Key Points for Production Code:

The table below summarizes essential code patterns for production environments:

ItemDescription
try-finallyEnsures SparkSession is always cleaned up
option("mode", "FAILFAST")Fail immediately when bad data is found
LoggerStructured logging instead of System.out
exitCodeExit code for script/CI integration
validateSchemaRuntime schema validation

Each item helps prevent or debug issues that can occur in production environments.


What Just Happened?#

Let’s examine how each step works in the code above.

1. SparkSession Creation

SparkSession spark = SparkSession.builder()
        .appName("Quick Start")
        .master("local[*]")
        .getOrCreate();
  • SparkSession: Unified entry point since Spark 2.0. Consolidates previous SparkContext, SQLContext, and HiveContext
  • appName: Application name displayed in Spark UI
  • master("local[*]"): Run in local mode, * uses all available CPU cores
    • local: Single thread
    • local[4]: 4 threads
    • local[*]: All cores
    • In cluster environments, use spark://master:7077, yarn, etc.

2. Reading Data

Dataset<Row> employees = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("src/main/resources/employees.csv");
  • Dataset<Row>: Spark’s distributed data collection. In Java, Dataset of Row type serves as DataFrame
  • option("inferSchema", "true"): Samples data to auto-infer column types
  • Spark supports various data sources: CSV, JSON, Parquet, JDBC, etc.

3. DataFrame Operations

employees.filter(col("salary").geq(5000))
  • Similar to Java’s Stream API, but distributed
  • filter, select, groupBy, etc. are Transformations — lazily evaluated (not executed immediately; instead, they are batched and processed together when an Action is called)
  • show, collect, count, etc. are Actions — execute actual computation

4. Using SQL

employees.createOrReplaceTempView("employees");
spark.sql("SELECT * FROM employees WHERE ...");
  • Register DataFrame as a temporary view to enable SQL queries
  • Leverage existing SQL knowledge directly
  • Uses the same execution engine (Catalyst Optimizer) internally

Java Developer Comparison#

Comparing existing Java code with Spark code makes similarities and differences easy to understand.

Java Stream vs Spark DataFrame

// Java Stream (single JVM)
List<Employee> highEarners = employees.stream()
        .filter(e -> e.getSalary() >= 5000)
        .collect(Collectors.toList());

// Spark DataFrame (distributed processing)
Dataset<Row> highEarners = employees
        .filter(col("salary").geq(5000));

The two code snippets are very similar, but:

  • Java Stream: Runs in single JVM, memory limited
  • Spark DataFrame: Distributed across multiple nodes, can process tens of TBs

Troubleshooting#

Common problems and solutions when running Spark.

Too Many Logs

Spark outputs many logs by default. Add a log4j2.properties file to src/main/resources or:

spark.sparkContext().setLogLevel("WARN");  // or "ERROR"

Java Version Error

Spark 3.5 supports Java 8, 11, and 17. Java 21 is not yet officially supported.

Error: A JNI error has occurred

→ Check Java version: java -version

OutOfMemoryError

Default memory may be insufficient for local execution:

./gradlew run -Dorg.gradle.jvmargs="-Xmx2g"

Or add to build.gradle:

application {
    applicationDefaultJvmArgs = ['-Xmx2g']
}

Hadoop-related Errors on Windows

When running on Windows, you may see warnings related to winutils.exe. This doesn’t affect functionality, but to resolve:

  1. Download from winutils
  2. Save to C:\hadoop\bin\winutils.exe
  3. Set environment variable HADOOP_HOME=C:\hadoop

Checking Spark UI#

While the Spark application is running, access http://localhost:4040 to view the Spark UI:

  • Jobs: List and status of executed Jobs
  • Stages: Task distribution for each Stage
  • Storage: Cached RDD/DataFrame information
  • Environment: Spark configuration values

Note: The UI shuts down when the application terminates. To check before shutdown, add Thread.sleep(60000); before spark.stop().


Next Steps#

After completing Quick Start, choose your next document based on your learning goals:

GoalRecommended Document
Understand Spark internalsArchitecture
Learn RDD basicsRDD Basics
Deep dive into DataFrameDataFrame and Dataset
Spring Boot integrationEnvironment Setup

To understand Spark’s overall operation, read the Architecture document first. For a practice-focused approach, start with the Environment Setup document.