<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Appendix on Advanced Beginner</title><link>https://advanced-beginner.github.io/en/docs/spark/appendix/</link><description>Recent content in Appendix on Advanced Beginner</description><generator>Hugo</generator><language>en-US</language><managingEditor>d8lzz1gpw@mozmail.com (kimbenji)</managingEditor><webMaster>d8lzz1gpw@mozmail.com (kimbenji)</webMaster><lastBuildDate>Mon, 23 Mar 2026 19:08:15 +0900</lastBuildDate><atom:link href="https://advanced-beginner.github.io/en/docs/spark/appendix/index.xml" rel="self" type="application/rss+xml"/><item><title>Glossary</title><link>https://advanced-beginner.github.io/en/docs/spark/appendix/glossary/</link><pubDate>Wed, 07 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/spark/appendix/glossary/</guid><description>&lt;h1 id="glossary"&gt;Glossary&lt;a class="anchor" href="#glossary"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Key terminology and concepts used in Spark. Each term links to related documentation.&lt;/p&gt;
&lt;h2 id="core-concepts"&gt;Core Concepts&lt;a class="anchor" href="#core-concepts"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="action"&gt;Action&lt;a class="anchor" href="#action"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Operations that trigger actual computation on an RDD or DataFrame and either return a result to the driver or write it to external storage. Examples include &lt;code&gt;count()&lt;/code&gt;, &lt;code&gt;collect()&lt;/code&gt;, &lt;code&gt;show()&lt;/code&gt;, and &lt;code&gt;write().save()&lt;/code&gt;.
→ &lt;a href="https://advanced-beginner.github.io/en/docs/spark/concepts/transformations-actions/"&gt;Transformations and Actions&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="application"&gt;Application&lt;a class="anchor" href="#application"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A Spark program submitted by the user. Consists of a &lt;a href="#driver"&gt;Driver&lt;/a&gt; and &lt;a href="#executor"&gt;Executors&lt;/a&gt;.
→ &lt;a href="https://advanced-beginner.github.io/en/docs/spark/concepts/architecture/"&gt;Architecture&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="broadcast-variable"&gt;Broadcast Variable&lt;a class="anchor" href="#broadcast-variable"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A read-only shared variable that Spark sends to each executor once and caches there, instead of shipping a copy with every task. Used to share small datasets, such as lookup tables, efficiently.
→ &lt;a href="https://advanced-beginner.github.io/en/docs/spark/concepts/tuning/"&gt;Performance Tuning&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="catalyst-optimizer"&gt;Catalyst Optimizer&lt;a class="anchor" href="#catalyst-optimizer"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spark SQL&amp;rsquo;s query optimization engine. Transforms logical plans into optimized physical plans.
→ &lt;a href="https://advanced-beginner.github.io/en/docs/spark/concepts/spark-sql/"&gt;Spark SQL&lt;/a&gt;&lt;/p&gt;</description></item><item><title>FAQ</title><link>https://advanced-beginner.github.io/en/docs/spark/appendix/faq/</link><pubDate>Wed, 07 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/spark/appendix/faq/</guid><description>&lt;h1 id="faq"&gt;FAQ&lt;a class="anchor" href="#faq"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Frequently asked questions and solutions to common problems.&lt;/p&gt;
&lt;h2 id="general-questions"&gt;General Questions&lt;a class="anchor" href="#general-questions"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="which-java-versions-does-spark-support"&gt;Which Java versions does Spark support?&lt;a class="anchor" href="#which-java-versions-does-spark-support"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spark 3.5 supports Java 8, 11, and 17. Java 21 is not officially supported in Spark 3.5; official support arrives with Spark 4.0.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;java -version &lt;span class="c1"&gt;# Check version&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="can-i-use-spark-with-java-only-without-scala"&gt;Can I use Spark with Java only, without Scala?&lt;a class="anchor" href="#can-i-use-spark-with-java-only-without-scala"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Yes. Spark provides a full Java API. However, because the Spark runtime is written in Scala, the Scala libraries are still pulled in as dependencies.&lt;/p&gt;
&lt;h3 id="whats-the-difference-between-dataframe-and-dataset"&gt;What&amp;rsquo;s the difference between DataFrame and Dataset?&lt;a class="anchor" href="#whats-the-difference-between-dataframe-and-dataset"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DataFrame&lt;/strong&gt; (&lt;code&gt;Dataset&amp;lt;Row&amp;gt;&lt;/code&gt;): Has a schema, but column names and types are checked only at runtime, not at compile time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dataset&lt;/strong&gt; (&lt;code&gt;Dataset&amp;lt;T&amp;gt;&lt;/code&gt;): Maps each row to a typed class such as a POJO, which provides compile-time type safety&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In Scala, &lt;code&gt;DataFrame&lt;/code&gt; is a type alias for &lt;code&gt;Dataset[Row]&lt;/code&gt;. The Java API has no separate &lt;code&gt;DataFrame&lt;/code&gt; type, so Java code uses &lt;code&gt;Dataset&amp;lt;Row&amp;gt;&lt;/code&gt; directly.&lt;/p&gt;</description></item><item><title>References</title><link>https://advanced-beginner.github.io/en/docs/spark/appendix/references/</link><pubDate>Wed, 07 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/spark/appendix/references/</guid><description>&lt;h1 id="references"&gt;References&lt;a class="anchor" href="#references"&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Official documentation and additional resources for learning Apache Spark.&lt;/p&gt;
&lt;h2 id="official-documentation"&gt;Official Documentation&lt;a class="anchor" href="#official-documentation"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="apache-spark-official-site"&gt;Apache Spark Official Site&lt;a class="anchor" href="#apache-spark-official-site"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/"&gt;Spark Official Site&lt;/a&gt;&lt;/strong&gt; — Downloads, news, release information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/3.5.7/"&gt;Spark 3.5 Documentation&lt;/a&gt;&lt;/strong&gt; — Current stable version documentation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/"&gt;Spark Latest Documentation&lt;/a&gt;&lt;/strong&gt; — Latest version documentation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="programming-guides"&gt;Programming Guides&lt;a class="anchor" href="#programming-guides"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html"&gt;RDD Programming Guide&lt;/a&gt;&lt;/strong&gt; — Detailed RDD API explanation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/sql-programming-guide.html"&gt;Spark SQL, DataFrames and Datasets Guide&lt;/a&gt;&lt;/strong&gt; — SQL and DataFrame API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html"&gt;Structured Streaming Programming Guide&lt;/a&gt;&lt;/strong&gt; — Real-time stream processing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/ml-guide.html"&gt;MLlib Guide&lt;/a&gt;&lt;/strong&gt; — Machine learning library&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/graphx-programming-guide.html"&gt;GraphX Programming Guide&lt;/a&gt;&lt;/strong&gt; — Graph processing&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="operations-guides"&gt;Operations Guides&lt;a class="anchor" href="#operations-guides"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/cluster-overview.html"&gt;Cluster Overview&lt;/a&gt;&lt;/strong&gt; — Cluster architecture&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/tuning.html"&gt;Tuning Guide&lt;/a&gt;&lt;/strong&gt; — Performance tuning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/monitoring.html"&gt;Monitoring Guide&lt;/a&gt;&lt;/strong&gt; — Monitoring&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/configuration.html"&gt;Configuration&lt;/a&gt;&lt;/strong&gt; — Configuration options&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/security.html"&gt;Security&lt;/a&gt;&lt;/strong&gt; — Security settings&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="cluster-manager-guides"&gt;Cluster Manager Guides&lt;a class="anchor" href="#cluster-manager-guides"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/spark-standalone.html"&gt;Standalone Mode&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/running-on-yarn.html"&gt;YARN&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html"&gt;Kubernetes&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="api-documentation"&gt;API Documentation&lt;a class="anchor" href="#api-documentation"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="java-api"&gt;Java API&lt;a class="anchor" href="#java-api"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/api/java/index.html"&gt;Spark Java API (Javadoc)&lt;/a&gt;&lt;/strong&gt; — Java API reference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html"&gt;Dataset&amp;lt;Row&amp;gt;&lt;/a&gt;&lt;/strong&gt; — DataFrame class&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html"&gt;SparkSession&lt;/a&gt;&lt;/strong&gt; — Entry point class&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html"&gt;functions&lt;/a&gt;&lt;/strong&gt; — Built-in functions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="scala-api"&gt;Scala API&lt;a class="anchor" href="#scala-api"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html"&gt;Spark Scala API (Scaladoc)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="additional-learning-resources"&gt;Additional Learning Resources&lt;a class="anchor" href="#additional-learning-resources"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="online-courses"&gt;Online Courses&lt;a class="anchor" href="#online-courses"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.databricks.com/learn"&gt;Databricks Academy&lt;/a&gt;&lt;/strong&gt; — Official training from Databricks, the company founded by Spark&amp;rsquo;s original creators&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.coursera.org/learn/scala-spark-big-data"&gt;Coursera: Big Data Analysis with Scala and Spark&lt;/a&gt;&lt;/strong&gt; — EPFL&amp;rsquo;s Scala/Spark course&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.edx.org/learn/big-data/university-of-california-san-diego-big-data-analytics-using-spark"&gt;edX: Big Data Analytics Using Spark&lt;/a&gt;&lt;/strong&gt; — UC San Diego course&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="blogs-and-documentation"&gt;Blogs and Documentation&lt;a class="anchor" href="#blogs-and-documentation"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.databricks.com/blog"&gt;Databricks Blog&lt;/a&gt;&lt;/strong&gt; — Latest Spark technology and use cases&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://sparkbyexamples.com/"&gt;Spark By Examples&lt;/a&gt;&lt;/strong&gt; — Java, Scala, Python examples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://www.baeldung.com/apache-spark"&gt;Baeldung Spark Tutorials&lt;/a&gt;&lt;/strong&gt; — Spark tutorials for Java developers&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="community"&gt;Community&lt;a class="anchor" href="#community"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://stackoverflow.com/questions/tagged/apache-spark"&gt;Stack Overflow - apache-spark&lt;/a&gt;&lt;/strong&gt; — Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/community.html"&gt;Spark Mailing Lists&lt;/a&gt;&lt;/strong&gt; — Developer mailing lists&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/apache/spark"&gt;GitHub - apache/spark&lt;/a&gt;&lt;/strong&gt; — Source code and issue tracker&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="related-technology-documentation"&gt;Related Technology Documentation&lt;a class="anchor" href="#related-technology-documentation"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="data-sources"&gt;Data Sources&lt;a class="anchor" href="#data-sources"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://kafka.apache.org/documentation/"&gt;Kafka&lt;/a&gt;&lt;/strong&gt; — Streaming data source&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html"&gt;HDFS&lt;/a&gt;&lt;/strong&gt; — Distributed file system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://parquet.apache.org/docs/"&gt;Parquet&lt;/a&gt;&lt;/strong&gt; — Columnar format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://docs.delta.io/"&gt;Delta Lake&lt;/a&gt;&lt;/strong&gt; — Storage with ACID transaction support&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="cluster-environments"&gt;Cluster Environments&lt;a class="anchor" href="#cluster-environments"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html"&gt;Hadoop YARN&lt;/a&gt;&lt;/strong&gt; — Resource management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://kubernetes.io/docs/home/"&gt;Kubernetes&lt;/a&gt;&lt;/strong&gt; — Container orchestration&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="cloud-services"&gt;Cloud Services&lt;a class="anchor" href="#cloud-services"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/emr/"&gt;AWS EMR&lt;/a&gt;&lt;/strong&gt; — AWS managed Spark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/dataproc/docs"&gt;Google Dataproc&lt;/a&gt;&lt;/strong&gt; — GCP managed Spark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/hdinsight/"&gt;Azure HDInsight&lt;/a&gt;&lt;/strong&gt; — Azure managed Hadoop/Spark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://docs.databricks.com/"&gt;Databricks&lt;/a&gt;&lt;/strong&gt; — Unified Data Analytics Platform&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="version-release-notes"&gt;Version Release Notes&lt;a class="anchor" href="#version-release-notes"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/releases/spark-release-3-5-0.html"&gt;Spark 3.5 Release&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/releases/spark-release-3-4-0.html"&gt;Spark 3.4 Release&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://spark.apache.org/releases/spark-release-3-3-0.html"&gt;Spark 3.3 Release&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="performance-benchmarks"&gt;Performance Benchmarks&lt;a class="anchor" href="#performance-benchmarks"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://www.tpc.org/tpcds/"&gt;TPC-DS Benchmark&lt;/a&gt;&lt;/strong&gt; — Decision support system benchmark&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/databricks/spark-sql-perf"&gt;Spark SQL Performance Tests&lt;/a&gt;&lt;/strong&gt; — Databricks performance testing tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="recommended-books"&gt;Recommended Books&lt;a class="anchor" href="#recommended-books"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="beginner"&gt;Beginner&lt;a class="anchor" href="#beginner"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Learning Spark, 2nd Edition&lt;/strong&gt; (O&amp;rsquo;Reilly) — Jules S. Damji et al.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark: The Definitive Guide&lt;/strong&gt; (O&amp;rsquo;Reilly) — Bill Chambers, Matei Zaharia&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="advanced"&gt;Advanced&lt;a class="anchor" href="#advanced"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High Performance Spark&lt;/strong&gt; (O&amp;rsquo;Reilly) — Holden Karau, Rachel Warren&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark in Action, 2nd Edition&lt;/strong&gt; (Manning) — Jean-Georges Perrin&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>