This guide walks you through finding and optimizing performance bottlenecks in Scala applications.

Estimated time: About 20-25 minutes

TL;DR
  • CPU profiling: Identify hotspots with JFR (Java Flight Recorder)
  • Memory analysis: Analyze heap dumps with VisualVM or MAT
  • Collection selection: Choose the right collection for the job
  • Optimization techniques: @specialized, @tailrec, avoiding boxing

Problems This Guide Solves

Use this guide in the following situations:

  • Your application’s response time has suddenly become slow
  • Memory usage is continuously increasing (suspected memory leak)
  • You need to decide which collection to use from a performance perspective
  • GC pauses are occurring frequently

What This Guide Does Not Cover

  • JVM tuning (GC options, etc.): See the official JVM documentation
  • Distributed system performance optimization: This is a separate topic
  • Cats Effect / ZIO performance optimization: See the respective library documentation

Before You Begin

Verify the following environment is ready:

Item                  Requirement                       How to Verify
JDK version           11+ (JFR included by default)     java -version
Scala version         2.13.x or 3.x                     scala -version
VisualVM (optional)   Latest version                    visualvm --version

# Check JDK version (JFR is included by default in JDK 11+)
java -version
# Example output: openjdk version "17.0.8"

# Install VisualVM (macOS)
brew install --cask visualvm

Step 1: Choosing a Profiling Tool

Choose the appropriate tool based on the problem you are trying to solve:

flowchart TD
    A["Performance issue"] --> B{"What kind of<br>problem?"}
    B -->|"High CPU usage"| C["CPU profiling<br>with JFR"]
    B -->|"Out of memory<br>OOM"| D["Heap dump analysis<br>jmap + MAT"]
    B -->|"Frequent GC<br>pauses"| E["GC log analysis<br>JFR + GCViewer"]
    B -->|"Inconsistent<br>response times"| F["Latency analysis<br>with JFR"]
    C --> G["Identify hotspot methods"]
    D --> H["Identify large objects/leaks"]
    E --> I["GC tuning or<br>allocation optimization"]
    F --> J["Thread contention<br>I/O wait analysis"]

Step 2: CPU/Memory Profiling with JFR

2.1 Basic JFR Usage

JFR (Java Flight Recorder) is a low-overhead profiler built into JDK 11+:

# Attach JFR to a running application
# 1. First, find the PID
jps -l
# Example output: 12345 com.example.MyApp

# 2. Start a JFR recording (60 seconds)
jcmd 12345 JFR.start duration=60s filename=recording.jfr

# 3. Check the recording
jcmd 12345 JFR.check

# 4. Stop the recording (manual stop before duration)
jcmd 12345 JFR.stop

Or enable it at JVM startup:

# Enable JFR in sbt
sbt -J-XX:StartFlightRecording=duration=120s,filename=app.jfr run

# Or configure in build.sbt (takes effect for `run` only when fork := true)
javaOptions += "-XX:StartFlightRecording=duration=120s,filename=app.jfr"

2.2 Analyzing JFR Results

# Analyze with JDK Mission Control (JMC)
jmc  # Launch the GUI tool and open the .jfr file

# Quick check via CLI
jfr summary recording.jfr

# Filter events
jfr print --events jdk.CPULoad recording.jfr
jfr print --events jdk.ObjectAllocationSample recording.jfr

Key events to check:

Event                        Description
jdk.CPULoad                  CPU utilization
jdk.ExecutionSample          Hotspot methods (CPU profiling)
jdk.ObjectAllocationSample   Object allocation frequency
jdk.GCPhasePause             GC pause duration
jdk.ThreadPark               Thread wait
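To get familiar with reading these events, it helps to profile a workload you control. The following is an illustrative sketch (AllocationWorkload and churn are made-up names, not part of JFR): it deliberately churns through short-lived objects, so recording it with the flags from 2.1 should surface jdk.ObjectAllocationSample and jdk.ExecutionSample events pointing at churn.

```scala
// Toy workload for exercising JFR allocation sampling (illustrative only).
object AllocationWorkload {
  // Deliberately allocates many short-lived objects: each iteration
  // builds a List of boxed Ints plus temporary Strings.
  def churn(iterations: Int): Int = {
    var checksum = 0
    var i = 0
    while (i < iterations) {
      val boxed: List[Int] = List(i, i + 1, i + 2)        // boxes each Int
      checksum += boxed.map(_.toString).map(_.length).sum // temporary Strings
      i += 1
    }
    checksum
  }
}
```

Call churn from your application's entry point while a recording is active; its allocation sites should then dominate the jdk.ObjectAllocationSample view.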

Step 3: Heap Dump Analysis

3.1 Generating a Heap Dump

# Heap dump of a running process
jmap -dump:format=b,file=heap.hprof 12345

# Automatic dump on OOM
sbt -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:HeapDumpPath=./heap.hprof run

3.2 Analyzing with VisualVM

# Launch VisualVM
visualvm

# 1. File > Load... > Select heap.hprof
# 2. Check the largest objects in the Summary tab
# 3. Check classes with the most instances in the Classes tab

Key things to check:

Indicator                                        Suspicious Situation
Abnormally high instance count                   Possible object leak
Many large arrays                                Oversized collection allocation
Excessively many String instances                String duplication or leak
Instance count of the same class keeps growing   Check for references that are never GC’d
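One common cause of the last symptom, an instance count that keeps growing, is a cache with no eviction. A minimal sketch (UserCache and its size limit are hypothetical; production code would use a real cache library):

```scala
import scala.collection.mutable

object UserCache {
  // Wrong: entries are never removed, so the map grows for the lifetime
  // of the process and shows up in heap dumps as an ever-increasing
  // instance count.
  private val unbounded = mutable.Map.empty[Long, String]
  def putUnbounded(id: Long, name: String): Unit = unbounded.update(id, name)

  // Better: cap the size, evicting the oldest entry (insertion order).
  private val maxEntries = 1000
  private val bounded = mutable.LinkedHashMap.empty[Long, String]
  def putBounded(id: Long, name: String): Unit = {
    if (bounded.size >= maxEntries) bounded.remove(bounded.head._1)
    bounded.update(id, name)
  }
  def boundedSize: Int = bounded.size
}
```

In a heap dump, the unbounded variant appears as a Map whose entry count only ever rises; the bounded one plateaus at maxEntries.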

3.3 Common Memory Leak Patterns in Scala

// Wrong: closure captures the entire outer object
class DataProcessor {
  val largeData: Array[Byte] = new Array[Byte](100 * 1024 * 1024) // 100MB

  def getProcessor(): () => Unit = {
    // This closure holds a reference to the entire DataProcessor (including largeData)
    () => println("Processing...")
  }
}

// Correct: capture only the needed data
class DataProcessor {
  val largeData: Array[Byte] = new Array[Byte](100 * 1024 * 1024)

  def getProcessor(): () => Unit = {
    val message = "Processing..."  // Copy only the needed value locally
    () => println(message)
  }
}

Step 4: Collection Performance Characteristics

4.1 Major Collection Comparison

Operation      List   Vector     Array   ArrayBuffer
head           O(1)   O(1)       O(1)    O(1)
Index access   O(n)   O(log n)   O(1)    O(1)
append         O(n)   O(1)*      O(n)    O(1)*
prepend        O(1)   O(1)*      O(n)    O(n)
Traversal      O(n)   O(n)       O(n)    O(n)

* Amortized (effectively constant) time complexity
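The append row for List is a frequent trap: each :+ copies the entire list, so appending in a loop is O(n^2) overall. A standard workaround, sketched below, is to prepend in O(1) and reverse once at the end (or use List.newBuilder):

```scala
// Wrong: :+ copies the whole list every iteration -> O(n^2) total
def buildAppend(n: Int): List[Int] = {
  var acc = List.empty[Int]
  for (i <- 1 to n) acc = acc :+ i
  acc
}

// Correct: prepend in O(1) each step, then one O(n) reverse -> O(n) total
def buildPrepend(n: Int): List[Int] = {
  var acc = List.empty[Int]
  for (i <- 1 to n) acc = i :: acc
  acc.reverse
}
```

Both functions return the same list; only the cost differs, and the gap becomes dramatic beyond a few thousand elements.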

4.2 Choosing Collections by Use Case

// 1. Frequent prepend/remove from front -> List
val stack = List(1, 2, 3)
val pushed = 0 :: stack        // O(1) prepend
val (head, tail) = (stack.head, stack.tail)  // O(1)

// 2. Random access needed -> Vector or Array
val indexed = Vector(1, 2, 3)
indexed(1)                     // O(log n), practically close to O(1)

// 3. Performance-critical numeric operations -> Array
val numbers = Array(1.0, 2.0, 3.0)
numbers(0)                     // O(1), no boxing (primitive)

// 4. Mutable collection needed -> ArrayBuffer
import scala.collection.mutable.ArrayBuffer
val buffer = ArrayBuffer(1, 2, 3)
buffer += 4                    // O(1) amortized append

4.3 Performance Benchmark Example

Add JMH (Java Microbenchmark Harness) to your sbt project:

// project/plugins.sbt
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.4.7")

// build.sbt
enablePlugins(JmhPlugin)

Benchmark code:

import org.openjdk.jmh.annotations._
import java.util.concurrent.TimeUnit

@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
class CollectionBenchmark {
  val size = 10000
  val list: List[Int] = (1 to size).toList
  val vector: Vector[Int] = (1 to size).toVector
  val array: Array[Int] = (1 to size).toArray

  @Benchmark
  def listIndexAccess(): Int = list(size / 2)

  @Benchmark
  def vectorIndexAccess(): Int = vector(size / 2)

  @Benchmark
  def arrayIndexAccess(): Int = array(size / 2)
}

# Run the benchmark
sbt "jmh:run -i 10 -wi 5 -f 2 CollectionBenchmark"

Step 5: Boxing/Unboxing Overhead

5.1 Understanding the Problem

Scala generics are type-erased on the JVM, so primitive types like Int and Double get boxed:

// Code that causes boxing
def sum[A](list: List[A])(implicit num: Numeric[A]): A = {
  list.foldLeft(num.zero)(num.plus)
  // Int gets repeatedly boxed/unboxed to java.lang.Integer
}

// Using primitive types directly avoids boxing
def sumInts(list: Array[Int]): Int = {
  var total = 0
  var i = 0
  while (i < list.length) {
    total += list(i)
    i += 1
  }
  total
}

5.2 @specialized Annotation

The @specialized annotation makes the compiler generate separate implementations for the listed primitive types:

// Without @specialized: all types are boxed as Object
class Container[A](val value: A)

// With @specialized: separate classes generated for Int, Double, etc.
class Container[@specialized(Int, Double, Long) A](val value: A)

// No boxing when used
val intContainer = new Container[Int](42)       // Uses Container$mcI$sp
val doubleContainer = new Container[Double](3.14) // Uses Container$mcD$sp

Scala 3 Note

In Scala 3, @specialized is ignored by the compiler; for zero-overhead wrappers, opaque types are recommended instead:

opaque type Meters = Double
object Meters:
  def apply(d: Double): Meters = d
  extension (m: Meters) def value: Double = m
// Handled as Double at runtime without boxing

Step 6: Tail Recursion Optimization

6.1 @tailrec Annotation

The @tailrec annotation makes the compiler verify that a method is tail-recursive (and can therefore be compiled to a loop), preventing stack overflow:

import scala.annotation.tailrec

// Wrong: not tail-recursive (stack overflow risk)
def factorialNaive(n: Long): Long = {
  if (n <= 1) 1
  else n * factorialNaive(n - 1)  // Multiplication happens after the recursive call
}

// Correct: tail-recursive (compiler transforms to a loop)
@tailrec
def factorial(n: Long, acc: Long = 1): Long = {
  if (n <= 1) acc
  else factorial(n - 1, n * acc)  // Recursive call is the last operation
}

factorial(100000)  // No stack overflow

6.2 When @tailrec Fails

import scala.annotation.tailrec

// Compilation error: could not optimize @tailrec annotated method
// Reason: the recursive call is not the last operation
// @tailrec
// def sum(list: List[Int]): Int = list match {
//   case Nil => 0
//   case head :: tail => head + sum(tail)  // + operation is last
// }

// Solution: use the accumulator pattern
@tailrec
def sum(list: List[Int], acc: Int = 0): Int = list match {
  case Nil => acc
  case head :: tail => sum(tail, acc + head)  // Recursive call is last
}

6.3 Optimizing Mutual Recursion with Trampolining

Mutual recursion cannot be optimized with @tailrec:

import scala.util.control.TailCalls._

def isEven(n: Long): TailRec[Boolean] = {
  if (n == 0) done(true)
  else tailcall(isOdd(n - 1))
}

def isOdd(n: Long): TailRec[Boolean] = {
  if (n == 0) done(false)
  else tailcall(isEven(n - 1))
}

// Runs without stack overflow
isEven(1000000).result  // true

Step 7: Common Mistakes and Solutions

7.1 Unnecessary Intermediate Collections

// Wrong: creates 3 intermediate collections
val result = (1 to 1000000)
  .map(_ * 2)       // Intermediate collection 1
  .filter(_ > 100)  // Intermediate collection 2
  .take(10)         // Intermediate collection 3

// Correct: lazy evaluation with view
val result = (1 to 1000000).view
  .map(_ * 2)
  .filter(_ > 100)
  .take(10)
  .toList  // Only the final result is materialized

// Or use iterator
val result = (1 to 1000000).iterator
  .map(_ * 2)
  .filter(_ > 100)
  .take(10)
  .toList

7.2 String Concatenation Performance

// Wrong: O(n^2) - creates a new String on every concatenation
var result = ""
for (i <- 1 to 10000) {
  result += s"item-$i,"  // Copy on every iteration
}

// Correct: use StringBuilder - O(n)
val sb = new StringBuilder
for (i <- 1 to 10000) {
  sb.append(s"item-$i,")
}
val result = sb.toString()

// Or use mkString
val result = (1 to 10000).map(i => s"item-$i").mkString(",")

7.3 Excessive Pattern Matching Nesting

// Deeply nested matching that may affect performance
def process(data: Any): String = data match {
  case list: List[_] => list.headOption match {
    case Some(map: Map[_, _]) => map.headOption match {
      case Some((k, v)) => s"$k -> $v"
      case None => "empty map"
    }
    case _ => "not a map list"
  }
  case _ => "unknown"
}

// Simplify pattern matching with clear types
case class ProcessInput(data: List[Map[String, String]])

def process(input: ProcessInput): String = {
  input.data.headOption
    .flatMap(_.headOption)
    .map { case (k, v) => s"$k -> $v" }
    .getOrElse("empty")
}

Checklist

Items to verify for performance optimization:

  • Have you profiled first? - Optimize based on measurements, not guesses
  • Have you identified hotspots with JFR recording? - Methods consuming the most CPU
  • Are there any memory leaks? - Verify with heap dump analysis
  • Are you using appropriate collections? - Choose the right collection for the use case
  • Is there unnecessary boxing? - Use primitive types for numeric operations
  • Are you using view/iterator for lazy evaluation? - Eliminate intermediate collections

If you have checked all items and performance is still insufficient, write JMH benchmarks to measure the exact bottleneck.