Estimated Time: About 20 minutes
TL;DR
- Find slow Jobs in the Jobs tab, then identify the bottleneck Stage in the Stages tab
- If Duration Min/Max differs by more than 10x in the Tasks tab, you have data skew
- If the SQL tab shows many Exchange nodes, you need to reduce shuffles
Problem Definition#
You’ve opened the Spark UI but there are multiple tabs, and you’re not sure what to look for in each one. Follow this guide to find performance bottlenecks in Jobs → Stages → Tasks order.
What this guide covers:
- How to read the key metrics in each Spark UI tab
- The order to narrow down bottleneck causes
What this guide does not cover:
- Prometheus/Grafana-based monitoring setup → Monitoring Setup
- Resolving OOM errors → Troubleshooting OutOfMemoryError
Prerequisites#
| Item | Requirement | How to Verify |
|---|---|---|
| Spark Version | 2.4 or higher (3.x recommended) | spark-submit --version |
| Java Version | 8, 11 (Spark 3.0+), or 17 (Spark 3.3+) | java -version |
| Spark UI | Accessible | Open http://localhost:4040 in browser |
Environment Verification#
# Check Spark version
spark-submit --version
# Verify Spark UI access (while application is running)
curl -s http://localhost:4040/api/v1/applications | head -5
Expected output (abbreviated; the actual response is pretty-printed across several lines):
[{"id":"local-1234567890","name":"MyApp",...}]
If there is no output, see the Troubleshooting section.
Step 1/7: Accessing the Spark UI#
The URL varies depending on your environment.
| Environment | URL | Notes |
|---|---|---|
| Local / Standalone | http://localhost:4040 | Only accessible while app is running |
| YARN | YARN ResourceManager → Application → ApplicationMaster link | Check app ID with yarn application -list |
| Kubernetes | kubectl port-forward <driver-pod> 4040:4040 then localhost:4040 | Find Driver Pod with kubectl get pods |
| History Server | http://localhost:18080 | Available even after app terminates |
Verifying Access#
# Local environment (-L follows the redirect from / to /jobs/)
curl -s -L -o /dev/null -w "%{http_code}" http://localhost:4040
# YARN environment: check app list
yarn application -list 2>/dev/null | grep RUNNING
# Kubernetes environment: Driver Pod port forwarding (replace svc/spark-driver with your own Service or Pod name)
kubectl port-forward svc/spark-driver 4040:4040
If the HTTP response is 200, everything is working. Proceed to the next step.
Step 2/7: Jobs Tab — Getting the Big Picture#
Click the Jobs tab. This shows the full list of jobs.
What to Look For#
| Item | Meaning | What to Check |
|---|---|---|
| Duration | Job execution time | Look for jobs that are abnormally longer than others |
| Stages | Number of Stages in a Job | Check the Succeeded/Failed ratio |
| Status | Completed/Failed/Running | If any are Failed, click that Job |
Job = one Action. Each Action call such as count(), collect(), or save() creates one Job.
Decision Criteria#
- If a specific Job’s Duration is 3x or more longer than others, click that Job
- If the Failed count in the Stages column is not 0, investigate immediately
Once you find a slow Job, click it to navigate to its Stage list.
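The 3x rule above can be sketched in a few lines of plain Java. This is only an illustration: the durations are hypothetical values copied by hand from the Jobs tab, and using the median as the baseline for "others" is an assumption, not something the Spark UI computes for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Flags jobs whose duration is at least 3x the median duration of all jobs. */
final class SlowJobFinder {

    // Threshold taken from the decision criteria in this step.
    static final double SLOW_FACTOR = 3.0;

    /** Median of the given durations in milliseconds; the array must be non-empty. */
    static double median(long[] durationsMs) {
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Returns the indexes of jobs that take at least SLOW_FACTOR times the median. */
    static List<Integer> slowJobIndexes(long[] durationsMs) {
        List<Integer> slow = new ArrayList<>();
        if (durationsMs.length == 0) {
            return slow;
        }
        double baseline = median(durationsMs);
        for (int i = 0; i < durationsMs.length; i++) {
            if (durationsMs[i] >= SLOW_FACTOR * baseline) {
                slow.add(i);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Hypothetical job durations in milliseconds, in Job ID order.
        long[] durationsMs = {4_000, 5_000, 4_500, 61_000, 5_200};
        System.out.println("Slow job indexes: " + slowJobIndexes(durationsMs));
    }
}
```

The median is less sensitive to the slow Job itself than the mean would be, which is why it is used as the baseline here.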
Step 3/7: Stages Tab — Finding the Bottleneck#
Click the Stages tab. Alternatively, clicking a slow Job from Step 2 shows that Job’s Stage list.
Key Metrics#
| Metric | Meaning | Warning Sign |
|---|---|---|
| Duration | Stage execution time | A Stage consuming 80%+ of total Job time |
| Shuffle Read | Data size read from previous Stage | Shuffle optimization needed if several GB or more |
| Shuffle Write | Data size sent to next Stage | Check alongside Shuffle Read |
| Spill (Memory) | Deserialized in-memory size of the data that was spilled | Non-zero means the Stage ran short of execution memory |
| Spill (Disk) | Serialized on-disk size of the spilled data | Large values mean heavy disk I/O caused by memory pressure |
When Spill occurs, performance degrades sharply due to disk I/O. Increase Executor memory or raise the partition count.
Decision Criteria#
Once you find the slow Stage, click it to view Task-level details.
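The warning signs in the Key Metrics table can be expressed as simple checks. The following is a minimal sketch in plain Java, assuming you have read the Stage duration, Job duration, and spill sizes off the Stages tab yourself; the class and method names are hypothetical and not part of any Spark API.

```java
/** Applies the Stages-tab warning signs to one Stage's metrics. */
final class StageCheck {

    // A Stage consuming 80% or more of the Job time is treated as the bottleneck.
    static final double BOTTLENECK_SHARE = 0.80;

    /** Fraction of the Job's time spent in this Stage. */
    static double shareOfJob(long stageMs, long jobMs) {
        if (jobMs <= 0) {
            throw new IllegalArgumentException("jobMs must be positive");
        }
        return (double) stageMs / jobMs;
    }

    static boolean isBottleneck(long stageMs, long jobMs) {
        return shareOfJob(stageMs, jobMs) >= BOTTLENECK_SHARE;
    }

    /** Any non-zero spill value is a warning sign. */
    static boolean hasSpill(long spillMemoryBytes, long spillDiskBytes) {
        return spillMemoryBytes > 0 || spillDiskBytes > 0;
    }

    public static void main(String[] args) {
        // Hypothetical values: a 90 s Stage inside a 100 s Job, with 2 GiB spilled to disk.
        long stageMs = 90_000;
        long jobMs = 100_000;
        long spillMemoryBytes = 6L * 1024 * 1024 * 1024;
        long spillDiskBytes = 2L * 1024 * 1024 * 1024;

        System.out.println("Bottleneck stage: " + isBottleneck(stageMs, jobMs));
        System.out.println("Spill occurring:  " + hasSpill(spillMemoryBytes, spillDiskBytes));
    }
}
```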
Step 4/7: Tasks Tab — Analyzing Individual Tasks#
Clicking a Stage reveals the Task list and statistics. This is where you find the root cause of the bottleneck.
Key Metrics#
| Metric | Normal Range | Warning Sign | Cause |
|---|---|---|---|
| Duration (Min/Max) | Similar (within 2x) | 10x+ difference | Data skew |
| GC Time | Less than 5% of total | 10% or more | Insufficient memory |
| Shuffle Read Size (Min/Max) | Even distribution | Only some tasks are large | Data skew |
| Locality Level | PROCESS_LOCAL | ANY | Data locality issue |
Diagnosing Data Skew#
If the Min/Max difference in Task Duration is 10x or more, you have data skew.
import static org.apache.spark.sql.functions.*;
// Check data distribution per partition
df.groupBy(spark_partition_id().alias("partition"))
    .count()
    .orderBy(col("count").desc())
    .show(10);
Expected output (when skew exists):
+---------+-------+
|partition|  count|
+---------+-------+
|        5|1000000| <-- abnormally large
|        3|   5000|
|        1|   4800|
+---------+-------+
If you found data skew → see Resolving Data Skew.
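The 10x rule for Task Duration can also be checked mechanically. Below is a minimal sketch in plain Java, assuming the task durations have been copied from the Tasks table (the values shown are hypothetical). Clamping the minimum to 1 ms is an added guard against division by zero, not part of the rule itself.

```java
import java.util.Arrays;

/** Applies the "Max is 10x or more of Min" skew rule to task durations. */
final class SkewCheck {

    static final double SKEW_RATIO = 10.0;

    /** Ratio of the slowest task to the fastest task. */
    static double maxToMinRatio(long[] taskDurationsMs) {
        if (taskDurationsMs.length == 0) {
            throw new IllegalArgumentException("at least one task duration is required");
        }
        long min = Arrays.stream(taskDurationsMs).min().getAsLong();
        long max = Arrays.stream(taskDurationsMs).max().getAsLong();
        // Clamp to 1 ms so that sub-millisecond tasks do not cause division by zero.
        return (double) max / Math.max(min, 1);
    }

    static boolean isSkewed(long[] taskDurationsMs) {
        return maxToMinRatio(taskDurationsMs) >= SKEW_RATIO;
    }

    public static void main(String[] args) {
        // Hypothetical task durations in milliseconds: one task is far slower than the rest.
        long[] durationsMs = {500, 480, 520, 9_000};
        System.out.println("Max/Min ratio: " + maxToMinRatio(durationsMs));
        System.out.println("Skewed: " + isSkewed(durationsMs));
    }
}
```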
Diagnosing GC Issues#
If GC Time is 10% or more of total Task Duration, memory is insufficient.
# <app-id> can be found at the top of the Jobs tab or via:
# curl -s http://localhost:4040/api/v1/applications | python3 -m json.tool
# Check Stage metrics via REST API
# (-i makes the match case-insensitive, because the JSON field is spelled jvmGcTime)
curl -s http://localhost:4040/api/v1/applications/<app-id>/stages \
| python3 -m json.tool | grep -iE "gcTime|executorRunTime"
If you found GC issues → see Troubleshooting OutOfMemoryError.
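Once you have the two numbers, the 10% rule is a single division. Here is a minimal sketch in plain Java, assuming both values are in milliseconds and were read from the REST output or the Tasks table. The parameter names are hypothetical.

```java
/** Applies the "GC Time is 10% or more of run time" rule. */
final class GcCheck {

    static final double GC_WARNING_FRACTION = 0.10;

    /** Fraction of executor run time spent in garbage collection. */
    static double gcFraction(long gcTimeMs, long executorRunTimeMs) {
        if (executorRunTimeMs <= 0) {
            throw new IllegalArgumentException("executorRunTimeMs must be positive");
        }
        return (double) gcTimeMs / executorRunTimeMs;
    }

    static boolean isGcHeavy(long gcTimeMs, long executorRunTimeMs) {
        return gcFraction(gcTimeMs, executorRunTimeMs) >= GC_WARNING_FRACTION;
    }

    public static void main(String[] args) {
        // Hypothetical values: 1.5 s of GC during 10 s of executor run time.
        long gcTimeMs = 1_500;
        long executorRunTimeMs = 10_000;
        System.out.println("GC fraction: " + gcFraction(gcTimeMs, executorRunTimeMs));
        System.out.println("GC heavy: " + isGcHeavy(gcTimeMs, executorRunTimeMs));
    }
}
```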
Step 5/7: Storage Tab — Checking Cache#
Click the Storage tab. This shows the list of RDDs/DataFrames cached with cache() or persist().
What to Check#
| Item | Meaning | What to Check |
|---|---|---|
| RDD Name | Name of cached data | Verify the intended data is cached |
| Storage Level | Storage location (memory/disk) | cache() defaults to MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames/Datasets |
| Cached Partitions | Number of cached partitions | Compare against total partition count |
| Fraction Cached | Cache ratio | If not 100%, memory is insufficient |
| Size in Memory | Memory usage | Verify it’s reasonable relative to Executor memory |
If Fraction Cached is below 100%, some partitions were evicted from cache due to insufficient memory. Change the Storage Level to MEMORY_AND_DISK.
Step 6/7: Environment Tab — Checking Configuration#
Click the Environment tab. This shows all currently applied Spark configurations.
Key Settings to Check#
| Setting | Default | What to Check |
|---|---|---|
| spark.executor.memory | 1g | Verify it’s not too small for your workload |
| spark.executor.cores | 1 on YARN; all available cores on Standalone | Verify roughly 5GB of memory per core (balances GC overhead and parallelism) |
| spark.sql.shuffle.partitions | 200 | Verify it’s tuned for your data size |
| spark.sql.adaptive.enabled | true (3.2+) | Verify AQE is enabled |
| spark.serializer | JavaSerializer | Verify KryoSerializer is being used |
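The memory-per-core check in the table is simple arithmetic on two settings. The sketch below, in plain Java, assumes spark.executor.memory is written with a single-letter m, g, or t suffix. Spark accepts more forms than this, and the parser here is a hypothetical helper, not Spark's own.

```java
import java.util.Locale;

/** Computes executor memory per core from the two Environment-tab settings. */
final class MemoryPerCore {

    /** Parses a subset of Spark size strings ("512m", "8g", "1t") into MiB. */
    static long toMiB(String size) {
        String s = size.trim().toLowerCase(Locale.ROOT);
        long number = Long.parseLong(s.substring(0, s.length() - 1));
        char unit = s.charAt(s.length() - 1);
        switch (unit) {
            case 'm': return number;
            case 'g': return number * 1024;
            case 't': return number * 1024 * 1024;
            default: throw new IllegalArgumentException("Unsupported unit in: " + size);
        }
    }

    /** Memory available per core, in GiB. */
    static double gibPerCore(String executorMemory, int executorCores) {
        if (executorCores <= 0) {
            throw new IllegalArgumentException("executorCores must be positive");
        }
        return toMiB(executorMemory) / 1024.0 / executorCores;
    }

    public static void main(String[] args) {
        // Hypothetical settings: spark.executor.memory=20g, spark.executor.cores=4.
        System.out.println("GiB per core: " + gibPerCore("20g", 4));
    }
}
```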
# Find <app-id>: curl -s http://localhost:4040/api/v1/applications | python3 -c "import sys,json;print(json.load(sys.stdin)[0]['id'])"
# Check settings via REST API
curl -s http://localhost:4040/api/v1/applications/<app-id>/environment \
| python3 -m json.tool | grep -E "executor.memory|shuffle.partitions"
If your intended settings are not reflected, check the priority order of spark-submit options, spark-defaults.conf, and in-code settings. The .config() setting in code has the highest priority, followed by spark-submit options, then spark-defaults.conf.
Step 7/7: SQL Tab — Analyzing Execution Plans#
Click the SQL tab. This shows the execution plans (DAGs) for queries executed via Spark SQL or the DataFrame API.
Key Items to Check#
| Node | Meaning | Warning Sign |
|---|---|---|
| Exchange | Shuffle occurred | Too many Exchange nodes means shuffle optimization is needed |
| BroadcastHashJoin | Broadcast join | If not used for small tables, check settings |
| SortMergeJoin | Shuffle-based join | If the table is small, switch to broadcast join |
| Scan | Data read | A full scan means partition pruning failed |
| Filter | Filter condition | Should be directly above Scan for pushdown to work |
Verifying Partition Pruning#
// Check execution plan
df.explain(true);
If PartitionFilters is empty in the output, partition pruning has failed.
// Partition pruning succeeded
+- FileScan parquet [...] PartitionFilters: [date >= 2026-01-01]
// Partition pruning failed
+- FileScan parquet [...] PartitionFilters: []
If shuffles are excessive → see Optimizing Shuffles.
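If you save the explain output as text, both checks from this step can be automated. The following is a minimal sketch in plain Java. The plan string is a hypothetical, simplified example, and the regular expression is an assumption about how plan lines are laid out (tree-drawing characters followed by the operator name). It counts only plain Exchange operators, not BroadcastExchange or ReusedExchange.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Scans explain() text for shuffle (Exchange) nodes and empty PartitionFilters. */
final class PlanCheck {

    // An Exchange operator at the start of a plan line, after tree-drawing characters
    // such as "+- " or ":  ". Names like BroadcastExchange do not match.
    private static final Pattern EXCHANGE =
            Pattern.compile("(?m)^[^A-Za-z\\n]*Exchange\\b");

    static int countExchanges(String plan) {
        Matcher matcher = EXCHANGE.matcher(plan);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** True when a scan reports no partition filters, meaning pruning did not apply. */
    static boolean hasEmptyPartitionFilters(String plan) {
        return plan.contains("PartitionFilters: []");
    }

    public static void main(String[] args) {
        // Hypothetical, simplified physical plan text.
        String plan = String.join("\n",
                "*(2) HashAggregate(keys=[id#0], functions=[count(1)])",
                "+- Exchange hashpartitioning(id#0, 200)",
                "   +- *(1) HashAggregate(keys=[id#0], functions=[partial_count(1)])",
                "      +- *(1) FileScan parquet [id#0] PartitionFilters: []");

        System.out.println("Exchange nodes: " + countExchanges(plan));
        System.out.println("Pruning failed: " + hasEmptyPartitionFilters(plan));
    }
}
```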
History Server Setup (Optional)#
To access the Spark UI even after your app terminates, set up the History Server.
1. Enable Event Logging#
SparkSession spark = SparkSession.builder()
.appName("My App")
.config("spark.eventLog.enabled", "true")
.config("spark.eventLog.dir", "/var/log/spark/events")
.config("spark.eventLog.compress", "true")
.getOrCreate();
2. Start the History Server#
# Create event log directory
mkdir -p /var/log/spark/events
# Add to spark-defaults.conf
# spark.history.fs.logDirectory=/var/log/spark/events
# spark.history.ui.port=18080
# Start History Server
$SPARK_HOME/sbin/start-history-server.sh
3. Verify Access#
curl -s -o /dev/null -w "%{http_code}" http://localhost:18080
If 200 is returned, it’s working. Access http://localhost:18080 in your browser.
Troubleshooting#
| Symptom | Cause | Solution |
|---|---|---|
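| UI won’t open (Connection refused) | Spark app is not running | Verify the app is running via spark-submit |
| UI is not on port 4040 (port conflict) | Port 4040 is used by another app, so Spark bound the UI to the next free port (4041, 4042, ...) | Check the driver log for the bound UI address, or set spark.ui.port explicitly |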
| UI won’t open (firewall) | Port is blocked on remote server | Tunnel with ssh -L 4040:localhost:4040 <server> |
| UI is disabled | spark.ui.enabled=false is set | Check spark-defaults.conf, spark-submit options, and in-code settings, then set spark.ui.enabled=true |
| Previous jobs not in History Server | Event logging not configured | Set spark.eventLog.enabled=true |
| Previous jobs not in History Server | Log path mismatch | Verify spark.eventLog.dir and spark.history.fs.logDirectory match |
| Cannot access UI on YARN | Need ApplicationMaster URL | Click the app in YARN ResourceManager and use the ApplicationMaster link |
Decision Tree#
When a performance issue occurs, diagnose in the following order.
flowchart TD
A["Performance issue detected"] --> B["Step 2: Check Jobs tab"]
B --> C{"Slow Job<br>found?"}
C -->|Yes| D["Step 3: Check Stages tab<br>for that Job"]
C -->|No| E["Step 7: Check execution<br>plan in SQL tab"]
D --> F{"Spill<br>occurring?"}
F -->|Yes| G["Increase Executor memory<br>or partition count"]
F -->|No| H["Step 4: Check Tasks tab"]
H --> I{"Task Duration<br>Min/Max diff<br>10x or more?"}
I -->|Yes| J["Data skew<br>→ data-skew guide"]
I -->|No| K{"GC Time<br>10% or more?"}
K -->|Yes| L["Insufficient memory<br>→ OOM guide"]
K -->|No| M["Step 5: Check cache<br>in Storage tab"]
E --> N{"Many Exchange<br>nodes?"}
N -->|Yes| O["Shuffle optimization<br>→ shuffle-optimization<br>guide"]
N -->|No| P["Step 6: Check settings<br>in Environment tab"]
Diagram summary: Performance issue → Check Jobs tab for slow Jobs → Check Stages tab for Spill → Check Tasks tab for skew (Duration difference) and GC → Branch to cause-specific guide. If many Exchange nodes, shuffle optimization is needed.
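The same decision tree can be written as one function, which is useful if you want to script a first-pass triage. This is a minimal sketch in plain Java. The inputs are values you would collect by hand from the tabs, and the class, method, and parameter names are hypothetical.

```java
/** Encodes the decision tree above as a single function. */
final class BottleneckTriage {

    static String diagnose(boolean slowJobFound,
                           boolean spillOccurring,
                           double durationMaxToMinRatio,
                           double gcFraction,
                           boolean manyExchangeNodes) {
        if (!slowJobFound) {
            // No slow Job: look at the execution plan in the SQL tab (Step 7).
            return manyExchangeNodes
                    ? "Shuffle optimization: see the shuffle-optimization guide"
                    : "Check settings in the Environment tab (Step 6)";
        }
        if (spillOccurring) {
            return "Increase Executor memory or partition count";
        }
        if (durationMaxToMinRatio >= 10.0) {
            return "Data skew: see the data-skew guide";
        }
        if (gcFraction >= 0.10) {
            return "Insufficient memory: see the OOM guide";
        }
        return "Check cache in the Storage tab (Step 5)";
    }

    public static void main(String[] args) {
        // Hypothetical case: a slow Job, no spill, slowest task 18x the fastest, 3% GC time.
        System.out.println(diagnose(true, false, 18.0, 0.03, false));
    }
}
```

The order of the checks mirrors the flowchart: spill is tested before skew, and skew before GC, so only the first matching cause is reported.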
Verification#
After completing this guide, you should achieve the following.
| Item | Success Criteria |
|---|---|
| Bottleneck located | Identified which Stage of which Job is slow |
| Root cause diagnosed | Determined cause as skew/GC/shuffle/spill |
| Next action | Ready to proceed to the cause-specific resolution guide |
Key Takeaways
- Analysis order: Jobs → Stages → Tasks → branch by cause
- Data skew: Task Duration Min/Max difference of 10x or more
- Insufficient memory: GC Time 10%+ or Spill occurring
- Excessive shuffles: Many Exchange nodes in SQL tab
- Settings not applied: Verify intended values in Environment tab
Next Steps#
- Troubleshooting OutOfMemoryError - Resolve GC/memory issues
- Resolving Data Skew - Fix uneven Task distribution
- Optimizing Shuffles - Reduce network I/O
- Performance Tuning - Comprehensive performance optimization