<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Troubleshooting on Advanced Beginner</title><link>https://advanced-beginner.github.io/en/docs/observability/howto/</link><description>Recent content in Troubleshooting on Advanced Beginner</description><generator>Hugo</generator><language>en-US</language><managingEditor>d8lzz1gpw@mozmail.com (kimbenji)</managingEditor><webMaster>d8lzz1gpw@mozmail.com (kimbenji)</webMaster><lastBuildDate>Fri, 16 Jan 2026 09:24:28 +0000</lastBuildDate><atom:link href="https://advanced-beginner.github.io/en/docs/observability/howto/index.xml" rel="self" type="application/rss+xml"/><item><title>Debugging High Latency</title><link>https://advanced-beginner.github.io/en/docs/observability/howto/debug-high-latency/</link><pubDate>Mon, 12 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/observability/howto/debug-high-latency/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Target Scenario&lt;/strong&gt;: P99 response time exceeds the SLA (500ms)
&lt;strong&gt;Goal&lt;/strong&gt;: Identify and resolve the bottleneck
&lt;strong&gt;Duration&lt;/strong&gt;: 15-30 minutes (depending on problem complexity)
&lt;strong&gt;Success Criteria&lt;/strong&gt;: P99 response time recovers to below the SLA threshold (500ms)&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="problem-scenario"&gt;Problem Scenario&lt;a class="anchor" href="#problem-scenario"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Alert: HighP99Latency
Service: order-service
P99: 2.5s (Threshold: 500ms)
Duration: 10 minutes&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="diagnostic-workflow"&gt;Diagnostic Workflow&lt;a class="anchor" href="#diagnostic-workflow"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;pre class="mermaid"&gt;graph TD
 A[&amp;#34;1. Scope Analysis&amp;lt;br&amp;gt;Which service? Since when?&amp;#34;]
 B[&amp;#34;2. Segment Analysis&amp;lt;br&amp;gt;Where is it slow?&amp;#34;]
 C[&amp;#34;3. Resource Check&amp;lt;br&amp;gt;CPU/Memory/DB?&amp;#34;]
 D[&amp;#34;4. Root Cause&amp;lt;br&amp;gt;Code? Query? External?&amp;#34;]
 E[&amp;#34;5. Resolution&amp;#34;]

 A --&amp;gt; B --&amp;gt; C --&amp;gt; D --&amp;gt; E&lt;/pre&gt;&lt;h2 id="step-1-scope-analysis"&gt;Step 1: Scope Analysis&lt;a class="anchor" href="#step-1-scope-analysis"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="check-impact-scope"&gt;Check Impact Scope&lt;a class="anchor" href="#check-impact-scope"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Which service is slow?&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;topk&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;histogram_quantile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;le&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# When did it become slow?&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;histogram_quantile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;le&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;&amp;#34;&lt;/span&gt;&lt;span class="s"&gt;order-service&lt;/span&gt;&lt;span class="p"&gt;&amp;#34;}[&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# → Graph this over the last 1 hour to see when P99 started rising&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="check-specific-endpoints"&gt;Check Specific Endpoints&lt;a class="anchor" href="#check-specific-endpoints"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-promql" data-lang="promql"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# P99 by endpoint&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kr"&gt;histogram_quantile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;le&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;http_request_duration_seconds_bucket&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;&amp;#34;&lt;/span&gt;&lt;span class="s"&gt;order-service&lt;/span&gt;&lt;span class="p"&gt;&amp;#34;}[&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: The &lt;code&gt;/orders&lt;/code&gt; POST endpoint is slow.&lt;/p&gt;</description></item><item><title>Optimizing Metric Cardinality</title><link>https://advanced-beginner.github.io/en/docs/observability/howto/reduce-cardinality/</link><pubDate>Mon, 12 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/observability/howto/reduce-cardinality/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Target Scenario&lt;/strong&gt;: Prometheus memory or storage usage spikes and queries become slow
&lt;strong&gt;Goal&lt;/strong&gt;: Optimize costs by reducing unnecessary time series
&lt;strong&gt;Duration&lt;/strong&gt;: 30 minutes to 1 hour (depending on analysis and fix complexity)
&lt;strong&gt;Success Criteria&lt;/strong&gt;: Time series count reduced below the target and memory usage stabilized&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="problem-scenario"&gt;Problem Scenario&lt;a class="anchor" href="#problem-scenario"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;Alert: PrometheusHighCardinality
Active Series: 2,500,000 (Threshold: 1,000,000)
Memory Usage: 32GB&lt;/code&gt;&lt;/pre&gt;&lt;h2 id="what-is-cardinality"&gt;What is Cardinality?&lt;a class="anchor" href="#what-is-cardinality"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Cardinality = Number of unique time series&lt;/strong&gt;&lt;/p&gt;
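&lt;p&gt;Because cardinality is simply a count of series, it can be measured directly with PromQL. The queries below are a minimal sketch, assuming Prometheus scrapes its own metrics so that &lt;code&gt;prometheus_tsdb_head_series&lt;/code&gt; is available. The &lt;code&gt;topk&lt;/code&gt; limit of 10 is an arbitrary choice, and the second query touches every series, so it may be slow on a large server.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-promql" data-lang="promql"&gt;# Total number of active series in the TSDB head block
prometheus_tsdb_head_series

# The 10 metric names that contribute the most series
topk(10, count by (__name__) ({__name__=~&amp;#34;.+&amp;#34;}))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A single label with many distinct values is enough to inflate these counts, as in this example:&lt;/p&gt;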
&lt;pre tabindex="0"&gt;&lt;code&gt;http_requests_total{method=&amp;#34;GET&amp;#34;, status=&amp;#34;200&amp;#34;, path=&amp;#34;/api/users&amp;#34;} # 1 series
http_requests_total{method=&amp;#34;GET&amp;#34;, status=&amp;#34;200&amp;#34;, path=&amp;#34;/api/users/123&amp;#34;} # Another 1!
http_requests_total{method=&amp;#34;GET&amp;#34;, status=&amp;#34;200&amp;#34;, path=&amp;#34;/api/users/456&amp;#34;} # Another 1!&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: If the user ID is part of the &lt;code&gt;path&lt;/code&gt; label, a new time series is created for every user.&lt;/p&gt;</description></item><item><title>Managing Alert Fatigue</title><link>https://advanced-beginner.github.io/en/docs/observability/howto/manage-alert-fatigue/</link><pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate><author>d8lzz1gpw@mozmail.com (kimbenji)</author><guid>https://advanced-beginner.github.io/en/docs/observability/howto/manage-alert-fatigue/</guid><description>&lt;blockquote class='book-hint '&gt;
&lt;p&gt;&lt;strong&gt;Situation&lt;/strong&gt;: Receiving dozens to hundreds of alerts daily and missing the critical ones
&lt;strong&gt;Goal&lt;/strong&gt;: Receive only alerts that require action
&lt;strong&gt;Time Required&lt;/strong&gt;: 1-2 hours (to analyze and modify alert rules)
&lt;strong&gt;Success Criteria&lt;/strong&gt;: Daily alert count reduced to a manageable level (e.g., 10 or fewer)&lt;/p&gt;
&lt;/blockquote&gt;&lt;h2 id="before-you-begin"&gt;Before You Begin&lt;a class="anchor" href="#before-you-begin"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="required-environment"&gt;Required Environment&lt;a class="anchor" href="#required-environment"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Component&lt;/th&gt;
 &lt;th&gt;Version&lt;/th&gt;
 &lt;th&gt;Verification&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;Prometheus&lt;/td&gt;
 &lt;td&gt;2.40+&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;prometheus --version&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;Alertmanager&lt;/td&gt;
 &lt;td&gt;0.25+&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;alertmanager --version&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;amtool&lt;/td&gt;
 &lt;td&gt;0.25+&lt;/td&gt;
 &lt;td&gt;&lt;code&gt;amtool --version&lt;/code&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="required-permissions"&gt;Required Permissions&lt;a class="anchor" href="#required-permissions"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Write access to the Prometheus configuration file (&lt;code&gt;prometheus.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Write access to the Alertmanager configuration file (&lt;code&gt;alertmanager.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Permission to restart Prometheus and Alertmanager&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="environment-check"&gt;Environment Check&lt;a class="anchor" href="#environment-check"&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Check Prometheus status&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;curl -sf http://localhost:9090/-/healthy &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Prometheus OK&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Check Alertmanager status (-f makes curl fail on HTTP errors)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;curl -sf http://localhost:9093/-/healthy &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Alertmanager OK&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Check amtool configuration&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;amtool config show --alertmanager.url&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9093&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="problem-scenario"&gt;Problem Scenario&lt;a class="anchor" href="#problem-scenario"&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;# Yesterday&amp;#39;s alert summary
Critical: 15 (HighCPU 8, HighMemory 7)
Warning: 87 (SlowResponse 45, HighLatency 32, PodRestart 10)
Total: 102

# Actual incidents: 1
# Missed alerts: 1 (buried in HighCPU alerts)&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When alert fatigue occurs:&lt;/p&gt;</description></item></channel></rss>