<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>How-To Guides on Advanced Beginner</title><link>https://advanced-beginner.github.io/en/docs/spark/howto/</link><description>Recent content in How-To Guides on Advanced Beginner</description><generator>Hugo</generator><language>en-US</language><managingEditor>d8lzz1gpw@mozmail.com (kimbenji)</managingEditor><webMaster>d8lzz1gpw@mozmail.com (kimbenji)</webMaster><lastBuildDate>Mon, 23 Mar 2026 19:08:15 +0900</lastBuildDate><atom:link href="https://advanced-beginner.github.io/en/docs/spark/howto/index.xml" rel="self" type="application/rss+xml"/><item><title>Troubleshooting OutOfMemoryError</title><link>https://advanced-beginner.github.io/en/docs/spark/howto/oom-troubleshooting/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/spark/howto/oom-troubleshooting/</guid><description>&lt;blockquote class="book-hint info"&gt;&lt;strong&gt;Estimated Time&lt;/strong&gt;: About 15 minutes
&lt;/blockquote&gt;

&lt;blockquote class="book-hint info"&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Driver OOM&lt;/strong&gt;: Reduce &lt;code&gt;collect()&lt;/code&gt; result size, increase &lt;code&gt;spark.driver.memory&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Executor OOM&lt;/strong&gt;: Increase partition count (&lt;code&gt;repartition&lt;/code&gt;), increase &lt;code&gt;spark.executor.memory&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Diagnose First&lt;/strong&gt;: Check Spark UI to identify where OOM occurs&lt;/li&gt;
&lt;/ul&gt;

&lt;/blockquote&gt;

&lt;h2 id="problem-definition"&gt;Problem Definition&lt;a class="anchor" href="#problem-definition"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The following error occurs during Spark application execution:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;java.lang.OutOfMemoryError: Java heap space&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Or:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Container killed by YARN for exceeding memory limits&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This guide explains step-by-step how to diagnose and resolve OOM errors.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="prerequisites"&gt;Prerequisites&lt;a class="anchor" href="#prerequisites"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Item&lt;/th&gt;
 &lt;th&gt;Requirement&lt;/th&gt;
 &lt;th&gt;How to Verify&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Spark Version&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;2.4 or higher (3.x recommended)&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;spark-submit --version&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Java Version&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;8, 11, or 17&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;java -version&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Spark UI&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Accessible&lt;/td&gt;
 &lt;td&gt;Open &lt;code&gt;http://localhost:4040&lt;/code&gt; in browser&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;strong&gt;Permissions&lt;/strong&gt;&lt;/td&gt;
 &lt;td&gt;Can modify Spark settings&lt;/td&gt;
 &lt;td&gt;Verify spark-submit execution permissions&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Supported Environments&lt;/strong&gt;: Linux, macOS, Windows (WSL2 recommended)&lt;/p&gt;</description></item><item><title>Resolving Data Skew</title><link>https://advanced-beginner.github.io/en/docs/spark/howto/data-skew/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/spark/howto/data-skew/</guid><description>&lt;blockquote class="book-hint info"&gt;&lt;strong&gt;Estimated Time&lt;/strong&gt;: About 20 minutes
&lt;/blockquote&gt;

&lt;blockquote class="book-hint info"&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Diagnosis&lt;/strong&gt;: Compare Task Duration Min/Max in Spark UI Stages tab (10x+ difference = skew)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable AQE&lt;/strong&gt;: &lt;code&gt;spark.sql.adaptive.skewJoin.enabled=true&lt;/code&gt; (Spark 3.0+)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual Fix&lt;/strong&gt;: Distribute hot keys using Salting technique&lt;/li&gt;
&lt;/ul&gt;

&lt;/blockquote&gt;

&lt;h2 id="problem-definition"&gt;Problem Definition&lt;a class="anchor" href="#problem-definition"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Most tasks in a Spark job complete quickly, but &lt;strong&gt;a few tasks take much longer&lt;/strong&gt;:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Stage 3: 199/200 tasks completed... (last 1 running for tens of minutes)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This is &lt;strong&gt;Data Skew&lt;/strong&gt; - data is concentrated on specific keys, causing those partitions to be overloaded.&lt;/p&gt;</description></item><item><title>Optimizing Shuffle</title><link>https://advanced-beginner.github.io/en/docs/spark/howto/shuffle-optimization/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/spark/howto/shuffle-optimization/</guid><description>&lt;blockquote class="book-hint info"&gt;&lt;strong&gt;Estimated Time&lt;/strong&gt;: About 20 minutes
&lt;/blockquote&gt;

&lt;blockquote class="book-hint info"&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Check Shuffle&lt;/strong&gt;: &lt;code&gt;Exchange&lt;/code&gt; node in &lt;code&gt;df.explain()&lt;/code&gt; = shuffle occurs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eliminate Unnecessary Shuffles&lt;/strong&gt;: Combine multiple aggregations into a single &lt;code&gt;groupBy&lt;/code&gt; instead of grouping repeatedly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broadcast Join&lt;/strong&gt;: Use &lt;code&gt;broadcast()&lt;/code&gt; for small tables (tens of MB)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shuffle Partition Count&lt;/strong&gt;: Adjust &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; (default 200)&lt;/li&gt;
&lt;/ul&gt;

&lt;/blockquote&gt;

&lt;h2 id="problem-definition"&gt;Problem Definition&lt;a class="anchor" href="#problem-definition"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Shuffle optimization is needed when you see these symptoms:&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Symptom&lt;/th&gt;
 &lt;th&gt;Where to Check&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Stage transitions take 10+ seconds&lt;/td&gt;
 &lt;td&gt;Spark UI → Jobs tab&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Shuffle Read/Write is tens of GB or more&lt;/td&gt;
 &lt;td&gt;Spark UI → Stages tab&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Timeout due to network I/O&lt;/td&gt;
 &lt;td&gt;Application logs&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Many &lt;code&gt;Exchange&lt;/code&gt; nodes in &lt;code&gt;explain()&lt;/code&gt;&lt;/td&gt;
 &lt;td&gt;Execution plan output&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Shuffle is the most expensive operation in Spark. The following operations cause shuffle:&lt;/p&gt;</description></item></channel></rss>