<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Pragmatic Tech Lead]]></title><description><![CDATA[The Pragmatic Tech Lead is your go-to guide for navigating the real-world challenges of software engineering, and career growth.]]></description><link>https://www.rahuldhar.me</link><image><url>https://substackcdn.com/image/fetch/$s_!tzua!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd327426d-6453-47e5-9f52-6e6205106a71_750x750.png</url><title>The Pragmatic Tech Lead</title><link>https://www.rahuldhar.me</link></image><generator>Substack</generator><lastBuildDate>Wed, 13 May 2026 10:03:37 GMT</lastBuildDate><atom:link href="https://www.rahuldhar.me/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Rahul Dhar]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rahuldhar47@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rahuldhar47@substack.com]]></itunes:email><itunes:name><![CDATA[Rahul Dhar]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rahul Dhar]]></itunes:author><googleplay:owner><![CDATA[rahuldhar47@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rahuldhar47@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rahul Dhar]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Secret to Netflix’s Lightning-Fast Counters]]></title><description><![CDATA[How distributed systems, batching, and eventual consistency make millions of increments possible every second.]]></description><link>https://www.rahuldhar.me/p/the-secret-to-netflixs-lightning</link><guid 
isPermaLink="false">https://www.rahuldhar.me/p/the-secret-to-netflixs-lightning</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Sun, 26 Oct 2025 18:45:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lby_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>Counting sounds easy, right?<br>You add one number to another, maybe you keep a counter in a database or increment a value in memory. Simple.</p><p>But at <strong>Netflix&#8217;s scale</strong>, even something as basic as &#8220;counting&#8221; becomes a serious engineering challenge.</p><p>When you have <strong>millions of users streaming content every minute</strong> across <strong>thousands of servers</strong> distributed around the world, the act of &#8220;adding one more view&#8221; to a counter quickly turns into a distributed systems problem.</p><p>Let&#8217;s dive into what the problem was, why traditional methods fail, and what we can learn from how Netflix approached it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lby_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lby_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!lby_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lby_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lby_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lby_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg" width="1312" height="736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1312,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1056979,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lby_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lby_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lby_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lby_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f7d4fad-82f4-4632-88f5-6b558481c2de_1312x736.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3>The Problem: Counting at Netflix Scale</h3><p>Imagine you are tracking how many people are watching <em>Stranger Things</em> right now.</p><p>Every time someone presses play, one counter somewhere needs to go up by one.<br>Now imagine this happening across <strong>hundreds of data centers</strong>, each handling its own share of users.</p><p>That&#8217;s <strong>millions of increments per second</strong>.</p><p>If you had just one database where all those servers sent an &#8220;increment&#8221; request (<code>INCR</code>), you would have:</p><ul><li><p>A massive bottleneck with too many requests hitting the same database</p></li><li><p>Network delays and contention</p></li><li><p>Possible data loss if that central database crashes</p></li></ul><p>And worst of all, your &#8220;total view count&#8221; could become inconsistent or outdated.</p><p>Netflix engineers needed a way to count <strong>accurately</strong>, <strong>fast</strong>, and <strong>reliably</strong>, even when data was spread across thousands of machines.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/comm/mynetwork/discovery-see-all?usecase=PEOPLE_FOLLOWS&amp;followMember=iamrahuldhar&quot;,&quot;text&quot;:&quot;Follow me on Linkedin&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.linkedin.com/comm/mynetwork/discovery-see-all?usecase=PEOPLE_FOLLOWS&amp;followMember=iamrahuldhar"><span>Follow me on Linkedin</span></a></p><div><hr></div><h3>Why Traditional Counting 
Doesn&#8217;t Work</h3><p>Let&#8217;s take the simple approach most of us use in smaller systems:</p><pre><code>count = count + 1</code></pre><p>It works beautifully on a single machine. But the moment you distribute this logic across hundreds of servers, problems begin.</p><ol><li><p><strong>Concurrency Issues:</strong> Two servers might try to update the same counter at the same time, causing incorrect totals</p></li><li><p><strong>Latency:</strong> The more servers that try to reach the central counter, the slower it gets</p></li><li><p><strong>Failure Handling:</strong> What happens if one server crashes mid-update?</p></li><li><p><strong>Scalability:</strong> A single counter database cannot handle millions of updates per second</p></li></ol><p>Netflix realized that instead of making one machine handle all counts, they needed <strong>each machine to count locally</strong>, and then <strong>combine those counts efficiently</strong>.</p><div><hr></div><h3>Netflix&#8217;s Solution: Decentralized, Event-Driven Counting</h3><p>Here&#8217;s the key insight:<br>Do not count everything in one place. Let every server count independently and merge those results later.</p><p>This is called <strong>decentralized counting</strong>.</p><p>Each server does its own small counting for a short time window. Then, it sends those local totals, called <strong>deltas</strong>, to a central pipeline like Kafka.</p><p>Think of it like this:</p><ul><li><p>Every store branch counts how many customers visited today</p></li><li><p>At the end of the day, each branch sends its total to the head office</p></li><li><p>The head office adds all those branch totals together to get the company-wide total</p></li></ul><p>Netflix does exactly this, but continuously and in real time.</p><div><hr></div><h3>How It Works Step by Step</h3><ol><li><p><strong>Local Counting</strong><br>Each Netflix microservice instance keeps a small in-memory counter. 
It increments locally whenever a new event happens, like a stream start or a like</p></li><li><p><strong>Batching</strong><br>Instead of sending every single increment, each instance batches its counts for a short period, for example every few seconds</p></li><li><p><strong>Publishing</strong><br>These batched counts are published to <strong>Kafka</strong>, the distributed event streaming platform Netflix uses for this pipeline</p></li><li><p><strong>Aggregation</strong><br>A specialized service consumes these delta events and aggregates them by keys, for example show ID, region, or time window</p></li><li><p><strong>Storage</strong><br>The final aggregated counts are stored in a scalable database like Cassandra or Redis, which can handle huge read and write volumes</p></li></ol><div><hr></div><h3>Handling Delays, Duplicates, and Failures</h3><p>In distributed systems, you cannot assume everything arrives perfectly on time or only once.</p><p>Some deltas might get delayed due to network lag, and some might even get sent twice during retries.</p><p>Netflix solved this by designing the system to be <strong>idempotent</strong>, meaning even if the same update is applied twice, the result remains correct.</p><p>Each delta has:</p><ul><li><p>A <strong>unique ID</strong> to detect duplicates</p></li><li><p>A <strong>timestamp window</strong> to know when the counts happened</p></li><li><p>The <strong>increment value</strong> for how much to add</p></li></ul><p>Even if Kafka replays an event or a message is retried, the aggregator can safely ignore duplicates and apply counts correctly.</p><p>This ensures that the final global count is <strong>accurate</strong>, even when updates arrive late or out of order.</p><div><hr></div><h3>In-Depth Architecture</h3><p>For those who&#8217;ve stayed with us this far, let&#8217;s dive deeper into the architecture. After evaluating multiple approaches, Netflix ultimately adopted a hybrid, event-driven solution for distributed counting. 
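</p><p>Before going deeper, the client-side half of the pipeline above, local counting, batched deltas, and idempotent aggregation, can be sketched in a few lines of Python. This is an illustration only, not Netflix&#8217;s actual code; names like <code>LocalCounter</code> and <code>IdempotentAggregator</code> are invented for the example.</p>

```python
import uuid
from collections import defaultdict

class LocalCounter:
    """Per-instance counter: increments stay in memory until flushed."""
    def __init__(self):
        self.pending = defaultdict(int)

    def increment(self, counter_name, delta=1):
        self.pending[counter_name] += delta

    def flush(self):
        """Emit one delta event per counter, e.g. every few seconds."""
        batch = [
            {"event_id": str(uuid.uuid4()), "counter": name, "delta": total}
            for name, total in self.pending.items()
        ]
        self.pending.clear()
        return batch  # in production these would be published to Kafka

class IdempotentAggregator:
    """Central aggregator: applies each delta event at most once."""
    def __init__(self):
        self.totals = defaultdict(int)
        self.seen = set()

    def apply(self, event):
        if event["event_id"] in self.seen:
            return  # duplicate delivery (a retry or replay): safe to ignore
        self.seen.add(event["event_id"])
        self.totals[event["counter"]] += event["delta"]

# Two instances count independently; the aggregator merges their deltas.
a, b, agg = LocalCounter(), LocalCounter(), IdempotentAggregator()
for _ in range(3):
    a.increment("stranger_things")
b.increment("stranger_things", 2)
batch = a.flush() + b.flush()
for event in batch + [batch[0]]:  # deliver one event twice on purpose
    agg.apply(event)
print(agg.totals["stranger_things"])  # 5, despite the duplicate
```

<p>Even though the first delta is delivered twice, the total is still 5. That is the idempotency property in action: duplicates are detected by <code>event_id</code> and ignored.</p><p>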
This final design strikes a careful balance between low-latency reads, high throughput, and strong durability.</p><h4>1. Event Logging in TimeSeries Store</h4><p>Every counter increment is treated as an <strong>immutable event</strong> and ingested into Netflix&#8217;s <strong>TimeSeries abstraction</strong>, which acts as the event store. Each event record includes:</p><ul><li><p><code>event_time</code>: Timestamp of the increment</p></li><li><p><code>event_id</code>: Unique identifier for idempotency</p></li><li><p><code>counter_name</code> and <code>namespace</code>: To identify which counter it belongs to</p></li><li><p><code>delta</code>: The increment value (positive or negative)</p></li></ul><p>The TimeSeries abstraction, typically backed by <strong>Cassandra</strong>, ensures high availability, partition tolerance, and fast writes. Events are partitioned using <code>time_bucket</code> and <code>event_bucket</code> columns to prevent <strong>wide partitions</strong> that could degrade read performance.</p><p><strong>Benefits:</strong></p><ul><li><p>Immutable events prevent accidental data loss</p></li><li><p>Built-in retention policies automatically purge old events</p></li><li><p>Supports safe retries due to idempotency keys</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bO9S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bO9S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 424w, 
https://substackcdn.com/image/fetch/$s_!bO9S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 848w, https://substackcdn.com/image/fetch/$s_!bO9S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 1272w, https://substackcdn.com/image/fetch/$s_!bO9S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bO9S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp" width="1456" height="694" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!bO9S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 424w, https://substackcdn.com/image/fetch/$s_!bO9S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 848w, https://substackcdn.com/image/fetch/$s_!bO9S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 1272w, https://substackcdn.com/image/fetch/$s_!bO9S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2281626b-d4ab-418c-8811-79e158b63105_1600x763.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>Handling Wide Partitions</strong>: The <em>time_bucket</em> and <em>event_bucket</em> columns play a crucial role in breaking up a wide partition, preventing high-throughput counter events from overwhelming a given partition. <em>For more information regarding this, refer to this <a href="https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8">blog</a></em>.</p></blockquote><div><hr></div><h4>2. Background Aggregation (Rollups)</h4><p>Reading individual events for each request is expensive. To solve this, Netflix performs <strong>continuous aggregation</strong> of events in the background.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aPar!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aPar!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 424w, https://substackcdn.com/image/fetch/$s_!aPar!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 848w, 
https://substackcdn.com/image/fetch/$s_!aPar!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 1272w, https://substackcdn.com/image/fetch/$s_!aPar!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aPar!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp" width="1400" height="939" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:939,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128638,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aPar!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 424w, 
https://substackcdn.com/image/fetch/$s_!aPar!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 848w, https://substackcdn.com/image/fetch/$s_!aPar!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 1272w, https://substackcdn.com/image/fetch/$s_!aPar!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71a2ba15-ebeb-4e0d-9247-c690f238c1a2_1400x939.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Key points:</p><ul><li><p><strong>Rollup windows:</strong> Aggregation occurs within a defined time window to ensure consistency</p></li><li><p><strong>lastRollupTs:</strong> The most recent time at which the counter&#8217;s value was aggregated. For a counter receiving events for the first time, this timestamp defaults to a reasonable time in the past.</p></li><li><p><strong>Immutable window:</strong> Aggregation can only occur safely within an immutable window that is no longer receiving counter events. The &#8220;acceptLimit&#8221; parameter of the TimeSeries Abstraction plays a crucial role here, as it rejects incoming events with timestamps beyond this limit. During aggregations, this window is pushed slightly further back to account for clock skew.</p></li></ul><p>This results in a <strong>rollup count</strong>, which represents the sum of all events in the aggregation window.</p><div><hr></div><h4>3. 
Rollup Storage</h4><p>Aggregated counts are stored in a <strong>persistent Rollup store</strong>, again often <strong>Cassandra</strong>, with one table per dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-3Z_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-3Z_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 424w, https://substackcdn.com/image/fetch/$s_!-3Z_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 848w, https://substackcdn.com/image/fetch/$s_!-3Z_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 1272w, https://substackcdn.com/image/fetch/$s_!-3Z_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-3Z_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp" width="1400" height="635" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-3Z_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 424w, https://substackcdn.com/image/fetch/$s_!-3Z_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 848w, https://substackcdn.com/image/fetch/$s_!-3Z_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 1272w, https://substackcdn.com/image/fetch/$s_!-3Z_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbff02645-9028-41dd-b22d-1d9fbcd8fe6b_1400x635.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>LastWriteTs</strong>: Every time a given counter receives a write, they also log a <strong>last-write-timestamp</strong> as a columnar update in this table. This is done using Cassandra&#8217;s <a href="https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlInsert.html#cqlInsert__timestamp-value">USING TIMESTAMP</a> feature to predictably apply the Last-Write-Win (LWW) semantics. This timestamp is the same as the <em>event_time</em> for the event. 
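</p><p>To make the rollup mechanics concrete, here is a small Python sketch of the checkpointed aggregation loop. The field and parameter names (<code>last_rollup_ts</code>, <code>last_write_ts</code>, <code>accept_limit</code>) follow the description above, but the code itself is illustrative, not Netflix&#8217;s implementation.</p>

```python
from dataclasses import dataclass

@dataclass
class RollupRow:
    """One row of the rollup store (illustrative schema)."""
    counter_name: str
    last_rollup_count: int = 0
    last_rollup_ts: float = 0.0   # end of the last aggregated window
    last_write_ts: float = 0.0    # last-write-wins: only moves forward

    def record_write(self, event_time):
        # Mirrors Cassandra's USING TIMESTAMP semantics: largest timestamp wins
        self.last_write_ts = max(self.last_write_ts, event_time)

def rollup(row, events, now, accept_limit=2.0):
    """Aggregate events in the immutable window ending at now - accept_limit."""
    window_end = now - accept_limit        # pushed back for clock skew
    if row.last_rollup_ts >= row.last_write_ts:
        return row                         # nothing new since the checkpoint
    for event_time, delta in events:
        if event_time > row.last_rollup_ts and window_end >= event_time:
            row.last_rollup_count += delta
    row.last_rollup_ts = window_end        # advance the checkpoint
    return row

row = RollupRow("stranger_things:views")
events = [(1.0, 5), (2.0, 3), (9.5, 1)]   # (event_time, delta) pairs
for t, _ in events:
    row.record_write(t)
rollup(row, events, now=10.0)             # event at 9.5 is still mutable: skipped
print(row.last_rollup_count)              # 8
rollup(row, events, now=13.0)             # picks up the straggler
print(row.last_rollup_count)              # 9
```

<p>Each rollup resumes from the previous checkpoint, so already-aggregated events are never double counted and a read only pays for the deltas that arrived since the last aggregation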
</p><ul><li><p>Each row contains: <code>counter_name</code>, <code>lastRollupCount</code>, <code>lastRollupTs</code></p></li><li><p><code>LastWriteTs</code> is recorded for every write to track whether the counter has new events not yet aggregated</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d4qs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d4qs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 424w, https://substackcdn.com/image/fetch/$s_!d4qs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 848w, https://substackcdn.com/image/fetch/$s_!d4qs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 1272w, https://substackcdn.com/image/fetch/$s_!d4qs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d4qs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp" width="1456" height="746" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:746,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d4qs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 424w, https://substackcdn.com/image/fetch/$s_!d4qs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 848w, https://substackcdn.com/image/fetch/$s_!d4qs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 1272w, https://substackcdn.com/image/fetch/$s_!d4qs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F983de0cd-487e-4e89-aaac-d41b3cb39d44_1600x820.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Subsequent aggregations continue from the last checkpoint</p></li></ul><p><strong>Benefit:</strong> Efficient reads with only incremental aggregation needed</p><div><hr></div><h4>4. 
Caching Layer (EVCache)</h4><p>To provide <strong>low-latency reads</strong>, the latest aggregated counts are cached in <strong>EVCache</strong>.</p><ul><li><p>Cached value = <code>{lastRollupCount, lastRollupTs}</code></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xp2A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xp2A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 424w, https://substackcdn.com/image/fetch/$s_!Xp2A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 848w, https://substackcdn.com/image/fetch/$s_!Xp2A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 1272w, https://substackcdn.com/image/fetch/$s_!Xp2A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xp2A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp" width="1193" height="844" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1193,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xp2A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 424w, https://substackcdn.com/image/fetch/$s_!Xp2A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 848w, https://substackcdn.com/image/fetch/$s_!Xp2A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 1272w, https://substackcdn.com/image/fetch/$s_!Xp2A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0032dc33-7b2d-4250-a3b1-b3d6668d4a3b_1193x844.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Reads return the cached value, with a slight acceptable lag</p></li><li><p>Triggers a rollup in the background if needed to catch up on events</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k22c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k22c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 424w, 
https://substackcdn.com/image/fetch/$s_!k22c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 848w, https://substackcdn.com/image/fetch/$s_!k22c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 1272w, https://substackcdn.com/image/fetch/$s_!k22c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k22c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp" width="1400" height="718" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133666,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!k22c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 424w, https://substackcdn.com/image/fetch/$s_!k22c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 848w, https://substackcdn.com/image/fetch/$s_!k22c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 1272w, https://substackcdn.com/image/fetch/$s_!k22c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F563bc362-8ab1-459c-acb7-a451985557c2_1400x718.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This allows <strong>single-digit millisecond read latencies</strong> even at massive scale.</p><div><hr></div><h4>5. Event-Driven Rollup Pipeline</h4><p>Rollups are triggered using a <strong>lightweight rollup event</strong> per counter.</p><pre><code>rollupEvent: {
  "namespace": "my_dataset",
  "counter": "counter123"
}</code></pre><p>This event doesn&#8217;t contain the actual number that was added to the counter. Instead, it <strong>just tells the Rollup server</strong> that this counter was updated and needs to be processed. By sending these lightweight &#8220;update signals,&#8221; the system knows <strong>exactly which counters to aggregate</strong>, so it doesn&#8217;t have to go through all the events in the database every time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Q4H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Q4H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 424w, https://substackcdn.com/image/fetch/$s_!-Q4H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 848w, https://substackcdn.com/image/fetch/$s_!-Q4H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 1272w, https://substackcdn.com/image/fetch/$s_!-Q4H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-Q4H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp" width="1400" height="567" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-Q4H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 424w, https://substackcdn.com/image/fetch/$s_!-Q4H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 848w, https://substackcdn.com/image/fetch/$s_!-Q4H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 1272w, https://substackcdn.com/image/fetch/$s_!-Q4H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3394bd9-e93f-4aa2-a45a-7211c82f5160_1400x567.webp 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>In-Memory Rollup Queues:</strong> Each Rollup server keeps several queues in memory to collect rollup events and process multiple counters at the same time. Using in-memory queues makes the system simpler, cheaper, and easier to adjust if they need more or fewer queues. The trade-off is that if a server crashes, some rollup events in memory might be lost.</p><p><strong>Minimize Duplicate Effort</strong>: Netflix uses a fast non-cryptographic hash like <a href="https://xxhash.com/">XXHash</a> to ensure that the same set of counters end up on the same queue. 
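</p><p>A hypothetical sketch of that routing (Python&#8217;s <code>zlib.crc32</code> standing in here for a fast non-cryptographic hash like XXHash): hashing the counter key to a fixed queue index guarantees that every rollup event for a given counter lands in the same in-memory queue.</p>

```python
import zlib

NUM_QUEUES = 8  # illustrative; real queue counts are tuned per deployment

def queue_for(namespace: str, counter: str) -> int:
    # A fast, stable hash of the counter key picks the queue, so
    # duplicate rollup events for one counter always collide in one place.
    key = f"{namespace}:{counter}".encode()
    return zlib.crc32(key) % NUM_QUEUES

# The same counter always routes to the same queue:
assert queue_for("my_dataset", "counter123") == queue_for("my_dataset", "counter123")
```

<p>Events that collide in a queue can then be coalesced into a single aggregation pass for that counter.</p><p>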
Further, Netflix tries to minimize the amount of duplicate aggregation work by having a separate rollup stack that chooses to run <em>fewer</em> <em>beefier</em> instances.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j8Tr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j8Tr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 424w, https://substackcdn.com/image/fetch/$s_!j8Tr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 848w, https://substackcdn.com/image/fetch/$s_!j8Tr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 1272w, https://substackcdn.com/image/fetch/$s_!j8Tr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j8Tr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp" width="1377" height="864" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1377,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53038,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j8Tr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 424w, https://substackcdn.com/image/fetch/$s_!j8Tr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 848w, https://substackcdn.com/image/fetch/$s_!j8Tr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 1272w, https://substackcdn.com/image/fetch/$s_!j8Tr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f7a825b-d3fb-4ffe-97f4-6612ab135cb1_1377x864.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Dynamic Batching</strong>: The Rollup server dynamically adjusts the number of time partitions that need to be scanned based on cardinality of counters in order to prevent overwhelming the underlying store with many parallel read requests.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GOG3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!GOG3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 424w, https://substackcdn.com/image/fetch/$s_!GOG3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 848w, https://substackcdn.com/image/fetch/$s_!GOG3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 1272w, https://substackcdn.com/image/fetch/$s_!GOG3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GOG3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp" width="1302" height="1036" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1036,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/177182468?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GOG3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 424w, https://substackcdn.com/image/fetch/$s_!GOG3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 848w, https://substackcdn.com/image/fetch/$s_!GOG3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 1272w, https://substackcdn.com/image/fetch/$s_!GOG3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8eeaa00b-493b-496f-83a6-ae96904ef912_1302x1036.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Adaptive Back-Pressure:</strong> Each consumer waits for one batch of counters to finish processing before starting the next batch. If the previous batch took longer, the system waits a bit more before starting the next one. This helps prevent the underlying TimeSeries database from getting overloaded with too many requests at once.</p><p><strong>Key Considerations:</strong></p><ul><li><p>Aggregation occurs without distributed locks, using <strong>immutable windows</strong> to ensure correctness</p></li><li><p>Counters are re-queued if new writes occur, ensuring convergence</p></li><li><p>Stale counts self-remediate during the next read-triggered rollup</p></li></ul><div><hr></div><h4>6. Idempotency and Safety</h4><p>Each increment event includes <code>event_id</code> + <code>event_time</code> as a <strong>unique key</strong>. This ensures:</p><ul><li><p>Safe retries without over-counting</p></li><li><p>Accurate aggregation across distributed consumers</p></li><li><p>Reliable reset operations without race conditions</p></li></ul><div><hr></div><h4>7. 
Experimental &#8220;Accurate&#8221; Counter</h4><p>For use cases requiring <strong>near real-time accurate counts</strong>:</p><ul><li><p>The delta between the last rollup and current events is computed in <strong>real-time</strong> during read</p></li><li><p><code>currentAccurateCount = lastRollupCount + delta</code></p></li><li><p>Batched processing is still applied to avoid overwhelming the underlying TimeSeries store</p></li></ul><p>This allows clients to see <strong>almost real-time counts</strong>, while the system maintains high throughput for millions of counters globally.</p><div><hr></div><h3>Putting It All Together</h3><p>The <strong>final architecture</strong> combines:</p><ol><li><p><strong>Event logging:</strong> Every increment is stored as an immutable event in TimeSeries</p></li><li><p><strong>Background aggregation:</strong> Continuous rollups ensure efficient reads</p></li><li><p><strong>Rollup storage:</strong> Persistent storage for aggregated counts</p></li><li><p><strong>Caching:</strong> EVCache for near-instant reads</p></li><li><p><strong>Event-driven pipeline:</strong> Efficient, parallel, and idempotent rollups</p></li><li><p><strong>Optional real-time delta:</strong> For &#8220;accurate&#8221; counter reads</p></li></ol><p>This combination of <strong>event logging, aggregation, caching, and idempotency</strong> is what allows Netflix to:</p><ul><li><p>Handle <strong>millions of simultaneous increments</strong></p></li><li><p>Provide <strong>near real-time counts</strong></p></li><li><p>Avoid <strong>overloading any single server</strong></p></li><li><p>Maintain <strong>global durability and reliability</strong></p></li></ul><div><hr></div><h3>Key Design Takeaways</h3><p>Netflix&#8217;s distributed counting system is a perfect example of how <strong>simple ideas need solid architecture at scale</strong>.</p><ol><li><p><strong>Push computation closer to data</strong><br>Let each node do part of the work instead of relying on one big 
system</p></li><li><p><strong>Batch operations</strong><br>Do not send every single event. Aggregate small chunks to reduce load</p></li><li><p><strong>Design for eventual consistency</strong><br>It is okay if data takes a few seconds to settle as long as it converges correctly</p></li><li><p><strong>Use idempotent events</strong><br>Avoid double-counting when messages are retried</p></li><li><p><strong>Stream, don&#8217;t store</strong><br>Use event-driven architecture for scalability and resilience</p></li></ol><blockquote><p>Reference: <a href="https://netflixtechblog.com/netflixs-distributed-counter-abstraction-8d0c45eb66b2">Netflix&#8217;s Distributed Counter Abstraction</a></p></blockquote><div><hr></div><h3>Final Thoughts</h3><p>I love how Netflix engineers turn &#8220;simple problems&#8221; into lessons in elegant system design.</p><p>Distributed counting teaches us that even a basic <code>count++</code> can become complex when scaled across thousands of machines. With the right architecture, batching, and event streaming, it becomes not just manageable but efficient.</p><p>It is a reminder that <strong>great engineering is about scaling simplicity</strong>, not complexity.</p><p>Next time you design a system, ask yourself:</p><blockquote><p>What happens when this &#8220;simple&#8221; feature needs to scale to a billion users?</p></blockquote><p>That is where system design truly begins.</p>]]></content:encoded></item><item><title><![CDATA[Inside Netflix’s 100× Speed Boost: The Maestro Rewrite]]></title><description><![CDATA[Why Netflix ditched polling and stateless workers to build a lightning-fast, event-driven workflow engine.]]></description><link>https://www.rahuldhar.me/p/inside-netflixs-100-speed-boost-the</link><guid isPermaLink="false">https://www.rahuldhar.me/p/inside-netflixs-100-speed-boost-the</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Sat, 11 Oct 2025 02:30:32 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!pRu6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>Today, we&#8217;re diving into a fascinating story from Netflix&#8217;s engineering team, a story that shows how even world-class systems can hit scaling walls when the use case evolves, and how a smart architectural rethink can unlock massive performance gains. In Netflix&#8217;s case, the result was a <strong>100&#215; improvement</strong>.</p><p>In this post, we&#8217;ll cover:</p><ul><li><p><strong>The challenge Netflix faced</strong> with its Maestro workflow engine, and why tasks that used to run fine started slowing down</p></li><li><p><strong>The architecture behind the slowdown</strong>, including stateless workers, polling queues, and database overhead</p></li><li><p><strong>The redesign that supercharged Maestro</strong>, introducing stateful actors, in-memory workflows, task grouping, and event-driven execution</p></li><li><p><strong>The results and takeaways</strong>, showing how a well-thought-out system design can make even massive workflows snappy</p></li></ul><p>To set the stage, imagine if every task in your system took an extra <strong>5 seconds</strong> to start. Not a big deal, right? </p><p>Now multiply that by <strong>millions of executions a day</strong>. That&#8217;s the hidden performance tax Netflix uncovered in Maestro. 
And their solution is a brilliant example of <strong>architecture beating brute force</strong>.</p><blockquote><p><strong>Reference:</strong> <a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041">Netflix Tech Blog &#8212; </a><em><a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041">100&#215; Faster: How We Supercharged Netflix Maestro&#8217;s Workflow Engine</a></em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pRu6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!pRu6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pRu6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pRu6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pRu6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pRu6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg" width="1312" height="736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1312,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:472484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/175812093?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pRu6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pRu6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pRu6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pRu6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06e8f56-5f37-4875-8876-660c0b616566_1312x736.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h3><strong>The Problem Netflix Faced</strong></h3><p>Netflix has a system called <strong>Maestro</strong>. Think of Maestro as the &#8220;task manager&#8221; that controls when and how Netflix runs thousands of background jobs &#8212; like moving data, training ML models, generating recommendations, and more.</p><p>It&#8217;s like a <strong>conductor of an orchestra</strong>, telling each musician (task) when to start, stop, or wait.</p><p>For years, Maestro worked perfectly because most jobs ran only once a day or once every few hours. A few seconds of delay didn&#8217;t matter.</p><p>But then Netflix started using Maestro for <strong>real-time use cases</strong> &#8212; like ads, games, or quick data pipelines that run every few minutes or even seconds. Suddenly, they noticed:</p><blockquote><p>Every task took an extra <strong>5&#8211;10 seconds just to start</strong>.</p></blockquote><p>That was overhead inside Maestro itself &#8212; before the actual work even began. 
This made fast workflows feel slow and frustrated developers who had to wait on every change.</p><div><hr></div><h3><strong>Why It Was Slow</strong></h3><p>Originally, Maestro&#8217;s design worked like this:</p><ul><li><p>A <strong>central system</strong> put tasks into a queue (like a to-do list).</p></li><li><p><strong>Stateless workers</strong> (servers with no memory of previous tasks) kept polling the queue:</p></li></ul><blockquote><p>&#8220;Do you have something for me?&#8221;</p></blockquote><ul><li><p>When they got something, they loaded task details from a <strong>database</strong>, did the work, then wrote results back.</p></li></ul><p>This worked great at scale, but:</p><ul><li><p>Polling every few seconds caused delays.</p></li><li><p>Repeatedly loading and saving state slowed things further.</p></li><li><p>Using multiple systems (queues, databases, workers) added more lag.</p></li></ul><p>It&#8217;s like every time you wanted to start a task, you had to:</p><ol><li><p>Call your manager</p></li><li><p>Check your email</p></li><li><p>Unlock your computer</p></li><li><p>Then finally start working.</p></li></ol><p>This overhead is fine once a day, but crippling if you&#8217;re doing it every minute.</p><div><hr></div><h3><strong>What Netflix Did to Fix It</strong></h3><p>Netflix didn&#8217;t just tweak Maestro, they <strong>redesigned its brain</strong> for real-time responsiveness.</p><p>Here&#8217;s how they did it:</p><h4>1. <strong>Stateful Actors: Giving Each Task a &#8220;Memory&#8221;</strong></h4><p>Instead of stateless workers constantly asking &#8220;what next?&#8221;, each workflow or task got its own <strong>actor</strong>, a lightweight in-memory brain that:</p><ul><li><p>Lives in memory (JVM)</p></li><li><p>Knows its next steps</p></li><li><p>Reacts instantly to events</p></li><li><p>Avoids repetitive database reads</p></li></ul><p>Result: Tasks now start in <strong>milliseconds</strong>, not seconds.</p><div><hr></div><h4>2. 
<strong>Keep Data in Memory, Sync in Background</strong></h4><p>They used the database only as a <strong>backup</strong>, not as the main source for every step.<br>The real-time info lives in memory, which is far faster.</p><p>Think of it like working on a Google Doc in real-time, but everything is saved in the background. You don&#8217;t hit &#8220;Save&#8221; every time.</p><div><hr></div><h4>3. <strong>Group Tasks Logically</strong></h4><p>They split workflows into &#8220;groups&#8221; and assigned each group to a specific server.<br>This way, related tasks stay on the same machine, with less shuffling around and faster execution.</p><p>And if a server goes down, another can take over the group.</p><p>Think of it like assigning each teacher a set of students. If one teacher is sick, another takes over their class. But students don&#8217;t keep running between classrooms unnecessarily.</p><div><hr></div><h4>4. <strong>Removed the Middleman Queue</strong></h4><p>Instead of using an external queue where workers kept asking for tasks, they:</p><ul><li><p>Used <strong>in-memory queues</strong> for instant communication inside the system, and</p></li><li><p>Added a small database-backed mechanism to make sure tasks don&#8217;t get lost.</p></li></ul><p>Like passing a note directly to a colleague instead of sending an email and waiting for them to check their inbox.</p><div><hr></div><h4>5. 
<strong>Gradual Migration with Real Testing</strong></h4><p>They built a <strong>testing framework</strong> to run real production workflows in parallel on the new system to catch bugs.<br>Then they moved users over gradually.</p><p>Like test-driving a new engine before swapping it into the airplane mid-flight.</p><div><hr></div><h3><strong>The Results</strong></h3><p>The startup delay per task went from <strong>5 seconds &#8594; 50 milliseconds</strong>.<br>That&#8217;s a <strong>100&#215; speedup</strong>.</p><ul><li><p>Developers no longer have to wait between iterations.</p></li><li><p>Real-time workflows run smoothly.</p></li><li><p>Maestro remains scalable but is now <strong>fast</strong> too.</p></li></ul><div><hr></div><h3><strong>Why This Matters (Even If You&#8217;re Not Netflix)</strong></h3><p>Many engineering teams optimize for <strong>scale</strong>, but ignore <strong>latency inside the system</strong>.</p><p>Polling, stateless workers, and external queues are simple to build, but add hidden costs.</p><p>By shifting to:</p><ul><li><p><strong>Event-driven designs</strong></p></li><li><p><strong>In-memory state</strong></p></li><li><p><strong>Actor models</strong></p></li></ul><p>&#8230;you can unlock massive speed gains without simply throwing more servers at the problem.</p><p>Netflix&#8217;s journey is a reminder: sometimes, <strong>true performance comes from rethinking architecture</strong>, not just optimizing code.</p><div><hr></div><h3><strong>Conclusion</strong></h3><p>Netflix&#8217;s Maestro evolution is more than just a speed hack. It&#8217;s a <strong>masterclass in system design</strong>.</p><p>When workloads change, the architecture that once worked well can become the bottleneck. 
By moving from a centralized, polling-based design to a decentralized, actor-driven model, Netflix made Maestro real-time ready.</p><p>For the rest of us, the takeaway is clear:<br>&#9989; Don&#8217;t underestimate internal latency.<br>&#9989; Event-driven + in-memory approaches can deliver big wins.<br>&#9989; Architectural redesign can unlock performance leaps that brute force never will.</p>]]></content:encoded></item><item><title><![CDATA[Why Being 'Busy' Doesn't Mean Being Productive]]></title><description><![CDATA[It all started with what I thought was a "quick question."]]></description><link>https://www.rahuldhar.me/p/context-switching-kills-productivity</link><guid isPermaLink="false">https://www.rahuldhar.me/p/context-switching-kills-productivity</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Thu, 25 Sep 2025 03:00:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_O55!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>It all started with a single &#8220;quick question&#8221;, the kind that sneaks into your day like a ninja.</p><p>I was deep into building a tricky feature when a junior engineer popped by. &#8220;Hey, can you help me debug this?&#8221; I thought, five minutes won&#8217;t hurt. Fifteen minutes later, I finally got back to my code, only to be dragged into an ad-hoc meeting about some random issue. While I was on the call, Slack exploded with messages from other team members. By the time I returned to my code, I couldn&#8217;t even remember what I was working on. My focus was gone.</p><p>And that&#8217;s when it hit me: I wasn&#8217;t multitasking. I was constantly switching contexts, and it was silently sabotaging my productivity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_O55!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_O55!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_O55!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_O55!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!_O55!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_O55!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:962178,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/174462052?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_O55!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_O55!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!_O55!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_O55!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F542de9f8-afb3-4363-9816-1aea02a84d95_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h3>The Real Problem: Context Switching</h3><p>The human brain isn&#8217;t wired to handle two demanding tasks 
at the same time. Every time we switch from coding to answering questions, to attending a meeting, to replying to Slack, our focus fractures. Research suggests it can take <strong>around 25 minutes</strong> to regain full attention after an interruption.</p><p>The cost isn&#8217;t just time. Constant context switching leads to mental fatigue, slower problem-solving, and less creativity. And in a fast-paced environment, with ad-hoc meetings, junior engineers needing guidance, and endless pings, this can quietly eat away at your entire day.</p><div><hr></div><h3>How I Fixed It</h3><p>Once I realized context switching was the culprit, I started experimenting with ways to protect my attention. Here&#8217;s what worked for me:</p><h4>1. Prioritize Ruthlessly</h4><p>Each morning, I pick <strong>1&#8211;2 tasks that truly matter</strong> and treat them like non-negotiable appointments. Emails, bug fixes, and minor questions come later. Saying &#8220;not now&#8221; felt uncomfortable at first, but finishing meaningful work made it worth it.</p><h4>2. Batch Interruptions</h4><p>Instead of answering Slack messages and emails the moment they arrive, I created <strong>specific time blocks</strong> for them. Junior engineers learned I had &#8220;collaboration meetings&#8221; for questions and problem solving. I could focus without guilt, and the team still got the help they needed.</p><h4>3. Time-Block Deep Work</h4><p>I reserved chunks of time purely for coding or writing and <strong>blocked them on my calendar as &#8220;Focus mode&#8221;</strong>. Notifications off, door closed, deep work mode on. For the first time, I could actually get into flow and stay there.</p><h4>4. Minimize Distractions</h4><p>Phone on silent, unnecessary tabs closed, notifications off. These small boundaries prevented tiny interruptions from breaking my focus and stealing 15&#8211;20 minutes at a time.</p><h4>5. Single-Task, Fully</h4><p>I stopped juggling multiple tasks. One task, full attention. 
Ironically, I finished things faster and with better quality. Letting go of the &#8220;always busy&#8221; mindset was surprisingly liberating.</p><h4>6. Build a Buffer for Chaos</h4><p>Unexpected things still happen, urgent meetings, production bugs, random questions. But leaving <strong>small gaps between tasks</strong> absorbed the chaos instead of letting it derail the day. My stress levels dropped, and I stayed in control.</p><div><hr></div><h3>Conclusion</h3><p>Multitasking might feel like a superpower, but it&#8217;s a trap. Context switching is the silent productivity killer that drains energy, focus, and creativity. Once I started protecting my attention with these habits, I finished tasks faster, felt more energized, and finally regained control over my workday.</p><p>Focus on one thing at a time, protect your flow, and let real productivity happen.</p>]]></content:encoded></item><item><title><![CDATA[How to be more assertive at your workplace]]></title><description><![CDATA[Finding the balance between silence and aggression at work]]></description><link>https://www.rahuldhar.me/p/how-to-be-more-assertive-at-your</link><guid isPermaLink="false">https://www.rahuldhar.me/p/how-to-be-more-assertive-at-your</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Mon, 15 Sep 2025 14:09:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DfYS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Dear readers,</strong></p><p>A few years ago, I found myself in a situation at work that taught me one of the most valuable lessons of my career.</p><p>I was leading a project that had tight deadlines. A colleague kept missing updates I needed from them, which started delaying my work. At first, I stayed silent&#8212;I didn&#8217;t want to come across as pushy. 
But the silence only made things worse. My manager asked why timelines were slipping, and I realized I was taking the hit for something I hadn&#8217;t spoken up about.</p><p>That was my turning point. I had to learn how to express my needs clearly without sounding rude. Instead of saying, <em>&#8220;You never give me updates,&#8221;</em> I started saying, <em>&#8220;I need the report by Wednesday so I can finalize the deck by Friday.&#8221;</em> That shift&#8212;from vague frustration to clear, respectful communication&#8212;changed everything. My colleague responded better, deadlines stopped slipping, and I felt a sense of control return to my work.</p><p>That&#8217;s the power of <strong>assertiveness</strong>. It&#8217;s not aggression, and it&#8217;s not silence. It&#8217;s about communicating in a way that values both your perspective and the other person&#8217;s.</p><p>In today&#8217;s fast-paced work environments, technical skills alone aren&#8217;t enough to thrive. Whether you&#8217;re presenting ideas to stakeholders, negotiating deadlines, or collaborating with peers, <strong>how you communicate</strong> makes a world of difference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DfYS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DfYS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!DfYS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DfYS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DfYS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DfYS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg" width="1440" height="1440" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1440,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:463363,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/173417946?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!DfYS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!DfYS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DfYS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DfYS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d6ea969-f28d-4e49-8348-0f80313ae5a3_1440x1440.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>1. Understand Assertiveness (and How It Differs from Aggression)</h2><ul><li><p><strong>Aggressive communication</strong> pushes your perspective at the cost of others.</p></li><li><p><strong>Passive communication</strong> avoids conflict but often leaves you unheard.</p></li><li><p><strong>Assertive communication</strong> strikes the balance: it&#8217;s direct, clear, respectful, and solution-focused.</p></li></ul><p>For example, instead of writing:<br><em>&#8220;You never respond to my emails.&#8221;</em><br>Say:<br><em>&#8220;I sent you an email last week regarding the project estimation. Could you share the details by today so I can finalize the report?&#8221;</em></p><p>This way, you&#8217;re stating facts, setting expectations, and staying professional.</p><div><hr></div><h2>2. Use the Pyramid Principle for Clarity</h2><p>When making your point, avoid burying the key message under unnecessary details. Instead, follow a <strong>top-down structure</strong>:</p><ul><li><p><strong>Answer (Assertion):</strong> State the most important idea upfront.</p></li><li><p><strong>Argument:</strong> Support it with logical reasoning.</p></li><li><p><strong>Evidence:</strong> Back it up with data.</p></li></ul><p>For example, instead of:<br><em>&#8220;We should consider using the new tool because it might help some processes, though I know there&#8217;s resistance.&#8221;</em><br>Say:<br><em>&#8220;The new tool can save employees 10 hours a week (Answer). It automates repetitive workflows (Argument), and three other teams have already reported efficiency gains (Evidence).&#8221;</em></p><div><hr></div><h2>3. 
Master Assertiveness Tools</h2><p>Beyond structuring your message, there are practical conversational techniques you can use when facing resistance.</p><ul><li><p><strong>Broken Record:</strong> Calmly repeat your request without getting sidetracked.</p></li><li><p><strong>Fogging:</strong> Acknowledge the valid part of criticism without absorbing exaggerations.</p></li><li><p><strong>Negative Inquiry:</strong> Ask questions to clarify vague criticism (e.g., <em>&#8220;Could you tell me what you&#8217;d like improved in my emails?&#8221;</em>).</p></li><li><p><strong>Positive No (Yes-No-Yes):</strong> Say no respectfully by affirming the other person&#8217;s needs, stating your refusal, and ending with a constructive alternative.</p></li></ul><p>Example:<br><em>&#8220;I understand this feature is important to you (Yes). However, we can&#8217;t deliver all ten changes in this release (No). What we can do is prioritize the top three and revisit the rest next sprint (Yes).&#8221;</em></p><div><hr></div><h2>4. Be Specific and Sensitive</h2><p>Generalizations like <em>&#8220;You&#8217;re always late&#8221;</em> or <em>&#8220;This timeline is impossible&#8221;</em> spark defensiveness. Assertiveness relies on <strong>specific facts and empathy</strong>:</p><ul><li><p>Instead of <em>&#8220;He is impatient&#8221;</em>, say <em>&#8220;He asked for updates five times in two hours yesterday, which disrupted the flow.&#8221;</em></p></li><li><p>Instead of <em>&#8220;The client is demanding&#8221;</em>, say <em>&#8220;The client requested five design changes within two days.&#8221;</em></p></li></ul><div><hr></div><h2>5. Say &#8220;No&#8221; Without Burning Bridges</h2><p>Many professionals struggle with saying no because they fear being seen as uncooperative.
But over-committing leads to stress and poor delivery.</p><p>The <strong>Yes-No-Yes approach</strong> helps:</p><ul><li><p><strong>Yes:</strong> Show understanding (&#8220;I know this project is important to you&#8221;).</p></li><li><p><strong>No:</strong> State your boundary firmly (&#8220;We can&#8217;t add these changes to this release&#8221;).</p></li><li><p><strong>Yes:</strong> Offer a constructive way forward (&#8220;We can include them in the next sprint&#8221;).</p></li></ul><div><hr></div><h2>6. Match Your Communication to Context</h2><p>Assertiveness also means being mindful of <strong>when and how you communicate</strong>:</p><ul><li><p>Avoid sending emotional or confrontational emails&#8212;opt for a call or face-to-face chat.</p></li><li><p>Don&#8217;t schedule heavy discussions at times when people are mentally checked out (like late Friday evenings).</p></li><li><p>Keep your language human and jargon-free to ensure clarity.</p></li></ul><div><hr></div><h2>7. Build Influence Without Authority</h2><p>Being assertive isn&#8217;t just about voicing your needs&#8212;it&#8217;s about influencing others while respecting their perspectives. Use the <strong>six principles of persuasion</strong> to strengthen your stance:</p><ul><li><p><strong>Reciprocity:</strong> Offer help before asking for it.</p></li><li><p><strong>Authority:</strong> Share expertise to earn credibility.</p></li><li><p><strong>Social Proof:</strong> Highlight how others have adopted your idea.</p></li><li><p><strong>Scarcity:</strong> Emphasize opportunities that may not last.</p></li><li><p><strong>Consistency:</strong> Link your request to past commitments.</p></li><li><p><strong>Liking:</strong> Build rapport so people <em>want</em> to say yes.</p></li></ul><div><hr></div><h2>Final Thoughts</h2><p>Assertiveness is not about being the loudest voice in the room. It&#8217;s about being <strong>clear, confident, and considerate</strong>.
When you communicate assertively, you:</p><ul><li><p>Earn respect without creating resentment</p></li><li><p>Manage conflicts productively</p></li><li><p>Influence decisions even without formal authority</p></li><li><p>Create stronger, more collaborative workplaces</p></li></ul><p>Start small&#8212;whether it&#8217;s rewriting one email to be more assertive, practicing a Positive No in your next meeting, or using the pyramid principle in your presentations. Over time, assertiveness becomes less of a skill you practice and more of a professional habit that elevates your impact.</p>]]></content:encoded></item><item><title><![CDATA[A Layman’s Guide to Dead Letter Queues (DLQ) with Google Pub/Sub]]></title><description><![CDATA[A guide to designing and operating DLQs in production-grade Pub/Sub systems]]></description><link>https://www.rahuldhar.me/p/a-laymans-guide-to-dead-letter-queues</link><guid isPermaLink="false">https://www.rahuldhar.me/p/a-laymans-guide-to-dead-letter-queues</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Thu, 11 Sep 2025 19:48:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!k937!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>A few years back, I was working on a system that sent welcome emails to new users during sign-up. The design was simple: a user performed an action, an event was published to a queue, a consumer picked it up and called the SendGrid API to deliver the email.</p><p>One day, we hit an unexpected snag. Our consumer was calling the SendGrid API just fine, but due to a small bug, it never marked the event as processed (acknowledged). The queue assumed the event wasn&#8217;t handled, so after the ack deadline expired, it redelivered the same event.
The consumer retried the call, SendGrid happily sent the email again, and this loop kept repeating.</p><p>To make matters worse, we only had a single consumer instance. That one stuck event essentially blocked the entire queue. What looked like a harmless bug turned into dozens of duplicate emails and a completely clogged pipeline.</p><p>This kind of issue is known as the &#8220;poison pill&#8221; problem: one bad or unacknowledged message keeps coming back, poisoning the queue, and preventing healthy messages from being processed.</p><div><hr></div><h3>From Poison Pills to Dead Letter Queues</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k937!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k937!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 424w, https://substackcdn.com/image/fetch/$s_!k937!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 848w, https://substackcdn.com/image/fetch/$s_!k937!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!k937!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!k937!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg" width="728" height="372.53125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:524,&quot;width&quot;:1024,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:85905,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/173287768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ee0496a-9662-45d6-b03d-90dbc2bcfe79_1024x1024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k937!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 424w, https://substackcdn.com/image/fetch/$s_!k937!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 848w, https://substackcdn.com/image/fetch/$s_!k937!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!k937!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b1dfb65-e66e-440a-ae13-a5416d1d59d7_1024x524.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In this post, we&#8217;ll see how to stop a &#8220;poison pill&#8221; from poisoning our system. But before diving into the technical details, let&#8217;s imagine a simple real-world queue.</p><p>Imagine you&#8217;re at a bank where people line up to deposit checks.
Each person hands their check to the cashier, who processes it one by one.</p><p>Now, if someone hands in a torn or illegible check, the cashier can&#8217;t process it. If they just keep holding on to that bad check, the entire line behind them comes to a halt.</p><p>A smarter system would be: put that bad check aside in a separate tray, let the rest of the line move forward, and later have a manager review the bad check to see what went wrong.</p><p>That&#8217;s exactly what a Dead Letter Queue (DLQ) does:</p><ul><li><p>If a message in the queue can&#8217;t be processed (like the invalid check), it gets moved aside into a special holding area (the DLQ).</p></li><li><p>The rest of the messages keep flowing smoothly.</p></li><li><p>Later, engineers can look at the failed messages in the DLQ, fix them, and reprocess them if needed.</p></li></ul><div><hr></div><h3>Designing the Dead Letter Queue</h3><p>Now that we know what a Dead Letter Queue (DLQ) is, the real question is: <em>how do we actually implement one?</em></p><p>Before jumping into code, it&#8217;s important to step back and design the mechanics carefully. 
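</p><p>As a reference point before making those decisions: Pub/Sub supports dead-letter topics natively. Below is a minimal configuration sketch using the <code>google-cloud-pubsub</code> Python client; the project, topic, and subscription names are hypothetical placeholders, and the exact limits should be checked against the official documentation.</p>

```python
# Sketch: create a subscription whose repeatedly failing messages are
# routed to a dead-letter topic after a bounded number of delivery attempts.
# All resource names below are hypothetical; requires google-cloud-pubsub
# and GCP credentials to actually run.
from google.cloud import pubsub_v1

project = "my-project"  # hypothetical project ID
subscriber = pubsub_v1.SubscriberClient()

dead_letter_policy = pubsub_v1.types.DeadLetterPolicy(
    dead_letter_topic=f"projects/{project}/topics/signup-events-dlq",
    max_delivery_attempts=5,  # after 5 failed deliveries, route to the DLQ
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": f"projects/{project}/subscriptions/signup-events-sub",
            "topic": f"projects/{project}/topics/signup-events",
            "ack_deadline_seconds": 60,  # how long a consumer has to ack
            "dead_letter_policy": dead_letter_policy,
        }
    )
```

<p>Note that the Pub/Sub service account also needs permission to publish to the dead-letter topic and to subscribe to the source subscription; without those grants, dead-letter forwarding may not work.</p><p>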
A DLQ isn&#8217;t just a &#8220;dumping ground&#8221; for failed messages &#8212; it&#8217;s a deliberate safety net that requires us to make a few key decisions:</p><ol><li><p><strong>Identify failure modes</strong></p><ul><li><p>What kinds of errors can occur in your system?</p></li><li><p>Are they transient (like a temporary network blip or API timeout) or permanent (like malformed data or a missing required field)?</p></li></ul></li><li><p><strong>Define failure handling strategy</strong></p><ul><li><p>Which types of errors should trigger a retry?</p></li><li><p>Which should immediately route the message to the DLQ?</p></li></ul></li><li><p><strong>Set retry policies</strong></p><ul><li><p>How many times should a message be retried before we officially &#8220;give up&#8221;?</p></li><li><p>Should retries follow exponential backoff, fixed intervals, or custom logic?</p></li></ul></li><li><p><strong>Decide on metadata for DLQ messages</strong></p><ul><li><p>What extra information should we attach when moving a message to the DLQ? (e.g., error type, stack trace, retry count, correlation ID).</p></li><li><p>This metadata is crucial for debugging and reprocessing later.</p></li></ul></li><li><p><strong>Establish a &#8220;poison message&#8221; threshold</strong></p><ul><li><p>At what point do we stop retrying and mark a message as <em>poisonous</em>?</p></li><li><p>Once it&#8217;s in the DLQ, we don&#8217;t want it re-entering the main processing loop automatically and causing the same failure cycle.</p></li></ul></li></ol><div><hr></div><h3>Key Failure Modes and DLQ Strategies</h3><p>Once we&#8217;ve defined what a DLQ should capture, the next step is to map out the different <strong>failure modes</strong> that can occur in a Pub/Sub-based system. Each type of failure needs its own handling strategy so that messages don&#8217;t get stuck in endless loops.</p><h4>1. 
Consumer Failure Modes</h4><p>When a Pub/Sub consumer cannot process a message:</p><ul><li><p><strong>Ack deadline</strong>: If the consumer doesn&#8217;t acknowledge within the ack deadline (default 10s, extendable to 600s), Pub/Sub assumes failure and redelivers the message.</p></li><li><p><strong>Retries</strong>: Pub/Sub keeps retrying until the message is acknowledged or until it reaches the maximum delivery attempts (if a DLQ is configured).</p></li><li><p><strong>Poison messages</strong>: Without a DLQ, one bad message can cycle forever, blocking or delaying other messages.</p></li></ul><p><strong>Strategies</strong></p><ul><li><p>Use idempotent consumers to make retries safe and avoid duplicate side effects (e.g., sending duplicate emails).</p></li><li><p>Implement ack deadline extensions for long-running jobs.</p></li><li><p>Configure a DLQ with max delivery attempts to prevent poison loops.</p></li></ul><h4>2. API Call Failures Inside Consumer</h4><p>A common failure scenario is when your consumer depends on an <strong>external API</strong> (e.g., SendGrid, payment gateway) and that API either times out or returns a 5xx error.</p><p><strong>Best practices</strong></p><ul><li><p>Use exponential backoff with jitter to retry API calls locally before giving up.</p></li><li><p>Apply circuit breakers to stop hammering unstable dependencies.</p></li><li><p>Ensure all external calls use idempotency keys so retries don&#8217;t cause duplicates.</p></li><li><p>If retries exceed the configured limit, route the message to DLQ.</p></li></ul><h4>3. Non-Recoverable Application Errors (e.g., NPE)</h4><p>Not all failures are retryable. 
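</p><p>The retry best practices from the previous section (exponential backoff with jitter, idempotency keys) can be sketched in a few lines of plain Python. The names here are illustrative rather than taken from a real service, and the <code>sleep</code> argument is injectable only so the helper can be exercised without real delays:</p>

```python
import random
import time


class TransientError(Exception):
    """A retryable failure, e.g. a timeout or 5xx from an external API."""


def call_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: the caller routes the message to the DLQ
            # Exponential backoff (0.5s, 1s, 2s, ...) plus random jitter to
            # avoid thundering-herd retries against a struggling dependency.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


_seen_keys = set()  # in-memory stand-in for a durable idempotency store


def send_email_once(idempotency_key, send):
    """Perform the side effect at most once per idempotency key."""
    if idempotency_key in _seen_keys:
        return "duplicate_skipped"
    send()
    _seen_keys.add(idempotency_key)
    return "sent"
```

<p>With this shape, a redelivered message either succeeds on a later attempt, is skipped as a duplicate, or exhausts its retries and is handed to the DLQ.</p><p>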
Some represent <strong>logic or data issues</strong> that will never succeed no matter how many times you retry:</p><ul><li><p><strong>Schema mismatch</strong> (e.g., invalid JSON).</p></li><li><p><strong>Missing required fields</strong> in the payload.</p></li><li><p><strong>Null pointer exceptions</strong> or bugs in business logic.</p></li></ul><p><strong>In such cases:</strong></p><ul><li><p>A DLQ helps capture the bad payload for later analysis.</p></li><li><p>But fixing the code (or correcting the data source) is mandatory before replaying.</p></li><li><p>Use a DB error log table to persist stack traces, consumer version, and raw payloads for debugging.</p></li></ul><h4>4. Pub/Sub DLQ vs. Error Log Table in a Database</h4><p>When deciding how to capture failed messages, teams usually debate between using:</p><ol><li><p><strong>Pub/Sub Dead Letter Queues (DLQs)</strong></p></li><li><p><strong>A database-backed error log table</strong></p></li></ol><p>Both approaches have strengths and trade-offs.</p><p><strong>Pub/Sub DLQ (Dead Letter Queue)</strong></p><p><strong>Pros</strong></p><ul><li><p>Automatically supported by Pub/Sub &#8212; easy to configure.</p></li><li><p>Handles high-throughput failures without manual scaling.</p></li><li><p>Allows attaching <strong>message attributes</strong> as metadata.</p></li><li><p>Messages can be re-subscribed to and reprocessed later.</p></li></ul><p><strong>Cons</strong></p><ul><li><p>Limited querying/filtering &#8212; replaying or analyzing requires additional tooling.</p></li><li><p>Costs grow with message volume and retention.</p></li></ul><p><strong>Error Log Table (e.g., Cloud SQL / PostgreSQL)</strong></p><p><strong>Pros</strong></p><ul><li><p>Full control: you can store payload + metadata + stack trace + retry count + consumer version.</p></li><li><p>Easy to query/filter (e.g., &#8220;show me all errors for consumer v2.1 in the last 24h&#8221;).</p></li><li><p>Can enrich with business metadata (user ID, category, 
severity).</p></li></ul><p><strong>Cons</strong></p><ul><li><p>Requires schema design, storage management, and indexing.</p></li><li><p>Not built for massive throughput unless carefully optimized.</p></li><li><p>Reprocessing requires custom scripts/tools.</p></li></ul><p><strong>Rule of thumb</strong>: <em>Use Pub/Sub DLQ for operational resilience and quick retries, and use an error log DB for observability and analytics.</em></p><h4>5. Republishing / Re-consuming Events</h4><p>A DLQ is only useful if you can reprocess messages safely. Here&#8217;s how it works in both approaches:</p><p><strong>From Pub/Sub Dead Letter Topic</strong></p><ul><li><p>Messages sit in a separate subscription.</p></li><li><p>Use <code>gcloud pubsub subscriptions pull</code> or a custom consumer to fetch them.</p></li><li><p>After analysis/fix, republish back to the main topic.</p></li><li><p>Use idempotency keys to avoid re-triggering duplicates.</p></li></ul><p><strong>From Error Log Table</strong></p><ul><li><p>Query failed rows (e.g., based on error category or timestamp).</p></li><li><p>Batch re-publish messages via a script or tool.</p></li><li><p>Ensure the re-publish process:</p><ul><li><p>Updates a <code>reprocessed_at</code> timestamp.</p></li><li><p>Prevents double-sending if the same row is picked up again.</p></li></ul></li></ul><h4>6. 
Hybrid Approach</h4><p>In practice, many production systems use a hybrid strategy:</p><ul><li><p><strong>Pub/Sub DLQ</strong> handles high-throughput retries and keeps the main pipeline healthy.</p></li><li><p><strong>Error Log Table</strong> provides deep visibility into why failures occurred, enriched with stack traces, error categories, and metadata for debugging.</p></li></ul><p><strong>Recommended flow:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rXMN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rXMN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 424w, https://substackcdn.com/image/fetch/$s_!rXMN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 848w, https://substackcdn.com/image/fetch/$s_!rXMN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!rXMN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rXMN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png" width="1124" height="1106" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1106,&quot;width&quot;:1124,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/173287768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rXMN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 424w, https://substackcdn.com/image/fetch/$s_!rXMN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 848w, https://substackcdn.com/image/fetch/$s_!rXMN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!rXMN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759f4418-fedc-410d-a186-2a152313ec65_1124x1106.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Explaining the System Flow</h3><p>This flowchart represents the end-to-end lifecycle of a message in a Google Pub/Sub&#8211;based system with a Dead Letter Queue (DLQ) and a Parked Letter Queue (PLQ), implemented here as an error table.</p><p>Let&#8217;s walk through each step:</p><div><hr></div><h4>1.
Message Entry &#8211; Pub/Sub Topic</h4><ul><li><p>Everything begins with the Pub/Sub Topic.</p></li><li><p>Producers publish events here (e.g., &#8220;user signup,&#8221; &#8220;order created&#8221;).</p></li><li><p>Messages are then delivered to one or more Consumers for processing.</p></li></ul><h4>2. Consumer Processing</h4><ul><li><p>The <strong>Consumer</strong> is the application logic that processes each message.</p></li><li><p>Example: validating the payload, calling an external API, saving to DB, etc.</p></li><li><p>Two outcomes are possible:</p><ul><li><p><strong>Success</strong>: Message is acknowledged and marked as <strong>Processing Complete</strong>.</p></li><li><p><strong>Failure after retries</strong>: Message is handed off to the <strong>Dead Letter Topic (DLQ)</strong>.</p></li></ul></li></ul><h4>3. Dead Letter Topic (DLQ) &#8211; Handling Transient Errors</h4><ul><li><p>The DLQ captures messages that couldn&#8217;t be processed even after retries.</p></li><li><p>These are typically transient issues, such as:</p><ul><li><p>External API outages</p></li><li><p>Temporary DB lock</p></li><li><p>Network timeouts</p></li></ul></li><li><p>To avoid losing data, we don&#8217;t discard these messages immediately. Instead:</p><ul><li><p>A Reprocessor Service picks them up later and republishes them to the main topic for another attempt.</p></li><li><p>If successful, they continue as normal through the pipeline.</p></li></ul></li></ul><h4>4. Parked Letter Queue (PLQ) &#8211; Handling Persistent Failures</h4><ul><li><p>If a DLQ message still fails even after reprocessing, it moves to the Parked Letter Queue (PLQ) or in this case, an error table.</p></li><li><p>The PLQ is the &#8220;quarantine area&#8221; for persistent or poison messages that won&#8217;t succeed automatically.</p></li><li><p>We can enhance the DLQ reprocessor to attach a custom attribute indicating how many times a message has landed in the DLQ. 
This way, the consumer application can use that metadata to decide when to route the message to the error log table.</p></li><li><p>Example causes:</p><ul><li><p>Schema mismatch (e.g., missing required field)</p></li><li><p>Invalid payload (malformed JSON, bad encoding)</p></li><li><p>Application bug (NullPointerException)</p></li></ul></li></ul><h4>5. Persistence and Analytics</h4><ul><li><p>Every message that enters the PLQ is persisted in an Error Log Table.</p></li><li><p>This table contains:</p><ul><li><p>Raw payload</p></li><li><p>Error category</p></li><li><p>Stack trace or exception details</p></li><li><p>Retry count and consumer version</p></li></ul></li><li><p>From here, the data flows into BigQuery for analytics and dashboards.</p><ul><li><p>Teams can analyze trends (e.g., &#8220;What % of errors are API failures vs. schema mismatches?&#8221;).</p></li><li><p>Helps prioritize fixes and monitor long-term stability.</p></li></ul></li></ul><h4>6. Manual Intervention</h4><ul><li><p>Engineers can manually inspect PLQ messages.</p></li><li><p>Based on investigation, they have two options:</p><ul><li><p><strong>Reprocess</strong>: Publish back into the Pub/Sub Topic once the root cause is fixed.</p></li><li><p><strong>Discard</strong>: If the message is invalid and cannot be salvaged.</p></li></ul></li><li><p>This ensures no data is lost silently and every failure path has a resolution.</p></li></ul><div><hr></div><h3>Implementation: Building a True DLQ Pattern with Google Pub/Sub</h3><h4>Overview</h4><p>This basic implementation would demonstrate a true Dead Letter Queue pattern using Google Pub/Sub, where:</p><ul><li><p>ALL failures go to DLQ first (not directly to PLQ)</p></li><li><p>DLQ Reprocessor adds retry count metadata</p></li><li><p>Main Consumer makes PLQ decisions based on retry count</p></li><li><p>Manual intervention available via web dashboard</p></li></ul><p><strong>Key Principle</strong></p><pre><code><code>Message &#8594; Consumer &#8594; Fail 
&#8594; DLQ &#8594; Reprocessor &#8594; Add Retry Count &#8594; Main Topic &#8594; Consumer &#8594; Check Retry Count &#8594; If &#8805;3 &#8594; PLQ (Database)</code></code></pre><blockquote><p>Prefer to skip the explanation and go straight to the implementation? The code is available <a href="https://github.com/rahul-dhar-e5609/dlq-test">here</a>.</p></blockquote><h4><strong>1. Producer Service</strong></h4><p><strong>Location</strong>: <code>producer-service/app.py</code></p><p>The producer publishes messages to the main topic:</p><pre><code><code>@app.route('/publish', methods=['POST'])
def publish_message():
    """Publish a single message to the main topic"""
    try:
        data = request.get_json()
        
        # Add metadata
        message_data = {
            **data,
            "timestamp": datetime.now().isoformat(),
            "producer_version": "1.0"
        }
        
        # Publish to main topic
        message_id = pubsub_manager.publish_message(MAIN_TOPIC, message_data)
        
        return jsonify({
            "message_id": message_id,
            "status": "published", 
            "timestamp": datetime.now().isoformat(),
            "topic": MAIN_TOPIC
        })
    except Exception as e:
        # Surface publish failures to the caller instead of swallowing them
        return jsonify({"status": "error", "error": str(e)}), 500</code></code></pre><p><strong>Key Features</strong>:</p><ul><li><p>REST API for message publishing</p></li><li><p>Automatic timestamp metadata</p></li><li><p>Sample message generation for testing</p></li></ul><h4><strong>2. Consumer Service (Core DLQ Logic)</strong></h4><p><strong>Location</strong>: <code>consumer-service/app.py</code></p><p>This is the heart of the DLQ pattern implementation:</p><pre><code><code>def process_message(self, message_data: Dict[Any, Any], message_id: str, 
                   correlation_id: str = None, dlq_retry_count: int = 0) -&gt; bool:
    """Process message with DLQ retry count awareness"""
    try:
        logger.info(f"Processing message {message_id}, type: {message_data.get('type')}, "
                   f"dlq_retry_count: {dlq_retry_count}")
        
        # KEY LOGIC: Check if message exceeded DLQ retry limit
        if dlq_retry_count &gt;= 3:  # Max DLQ retries
            logger.info(f"Message {message_id} exceeded DLQ retry limit, sending to PLQ")
            self._log_to_plq(
                message_data, message_id, 'dlq_max_retries_exceeded', 
                f'Message failed after {dlq_retry_count} DLQ retries',
                correlation_id=correlation_id, dlq_retry_count=dlq_retry_count
            )
            return True  # Acknowledge the message (handled via PLQ)
        
        # Check for simulated failures
        failure_type = simulate_failure(message_data)
        if failure_type:
            logger.warning(f"Processing failed ({failure_type}), will go to DLQ")
            # ALL failures go to DLQ (not directly to PLQ)
            return False
            
        # Normal processing...
        return True
        
    except Exception as e:
        # Handle unexpected errors
        if dlq_retry_count &gt;= 3:
            self._log_to_plq(message_data, message_id, type(e).__name__, str(e),
                             correlation_id=correlation_id, dlq_retry_count=dlq_retry_count)
            return True
        return False  # Let it go to DLQ for retry
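
# Illustrative trace of the retry gate above (not part of the repo code):
#   pass 1: dlq_retry_count=0 -&gt; nack -&gt; DLQ -&gt; republished with count=1
#   pass 2: dlq_retry_count=1 -&gt; nack -&gt; DLQ -&gt; republished with count=2
#   pass 3: dlq_retry_count=2 -&gt; nack -&gt; DLQ -&gt; republished with count=3
#   pass 4: dlq_retry_count=3 -&gt; gate fires -&gt; _log_to_plq(...) -&gt; ack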
</code></code></pre><p><strong>Message Callback</strong> extracts retry count from attributes:</p><pre><code><code>def callback(self, message):
    """Handle incoming Pub/Sub message"""
    try:
        message_data = json.loads(message.data.decode('utf-8'))
        message_id = message.message_id
        correlation_id = message.attributes.get('correlation_id')
        
        # Extract DLQ retry count from message attributes
        dlq_retry_count = int(message.attributes.get('dlq_retry_count', 0))
        
        # Process with retry count awareness
        success = self.process_message(
            message_data, message_id, correlation_id, dlq_retry_count
        )
        
        if success:
            message.ack()  # Successfully processed
            logger.info(f"Message processed successfully: {message_id}")
        else:
            message.nack()  # Send to DLQ
            logger.warning(f"Message nacked, will go to DLQ: {message_id}")
    except Exception as e:
        # Undecodable or unexpected payloads are nacked too, so they flow to the DLQ
        logger.error(f"Failed to handle message: {e}")
        message.nack()
</code></code></pre><h4><strong>3. DLQ Reprocessor (Retry Count Manager)</strong></h4><p><strong>Location</strong>: <code>dlq-reprocessor/app.py</code></p><p>The reprocessor adds retry count metadata and republishes:</p><pre><code><code>def reprocess_message(self, message_data: Dict[Any, Any], message_id: str, 
                     correlation_id: str = None, dlq_retry_count: int = 0) -&gt; bool:
    """Reprocess a message from DLQ with retry count tracking"""
    try:
        logger.info(f"Reprocessing message {message_id} (DLQ retry: {dlq_retry_count})")
        
        # Check if should reprocess
        should_reprocess, reason = self.should_reprocess(message_data, dlq_retry_count)
        if not should_reprocess:
            logger.info(f"Message not eligible: {reason}")
            return True
        
        # Enrich message with metadata
        enriched_message = self.enrich_message_for_reprocess(message_data, dlq_retry_count)
        
        # Apply exponential backoff: reprocess_delay * 2^retries
        # (e.g. 5s, 10s, 20s for retries 0, 1, 2 with a 5-second base delay)
        delay = self.reprocess_delay * (2 ** dlq_retry_count)
        time.sleep(min(delay, 60))  # Cap the wait at 60 seconds
        
        # KEY: Add DLQ retry count to message attributes
        attributes = {
            'correlation_id': correlation_id or f"dlq_reprocess_{message_id}",
            'dlq_retry_count': str(dlq_retry_count + 1),  # Increment retry count!
            'reprocess_timestamp': datetime.now().isoformat(),
            'from_dlq': 'true'
        }
        
        # Republish to main topic with retry count
        new_message_id = self.pubsub_manager.publish_message(
            self.main_topic, 
            enriched_message,
            attributes  # This adds the retry count header!
        )
        
        logger.info(f"Message {message_id} republished as {new_message_id} "
                   f"with dlq_retry_count={dlq_retry_count + 1}")
        return True
    except Exception as e:
        logger.error(f"Failed to reprocess message {message_id}: {e}")
        return False
</code></code></pre><h4><strong>4. Error Storage (PLQ Implementation)</strong></h4><p><strong>Location</strong>: <code>support/database.py</code></p><p>Failed messages are stored in PostgreSQL for manual intervention:</p><pre><code><code>def log_error(self, message_id: str, message_data: dict, error_type: str, 
              error_message: str, stack_trace: str = None, 
              correlation_id: str = None, dlq_retry_count: int = 0):
    """Log error to PLQ database"""
    try:
        query = """
        INSERT INTO error_log 
        (message_id, message_data, error_type, error_message, stack_trace, 
         correlation_id, dlq_retry_count, created_at)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """
        
        with self.get_connection() as conn:
            with conn.cursor() as cursor:
                cursor.execute(query, (
                    message_id,
                    json.dumps(message_data),
                    error_type,
                    error_message, 
                    stack_trace,
                    correlation_id,
                    dlq_retry_count,  # Track retry count in database
                    datetime.now()
                ))
                conn.commit()
    except Exception as e:
        # A failed PLQ write is logged but must not crash the consumer
        logger.error(f"Failed to persist error log entry: {e}")
</code></code></pre><h4><strong>5. Failure Simulation</strong></h4><p><strong>Location</strong>: <code>support/common.py</code></p><p>Realistic failure scenarios for testing:</p><pre><code><code>def simulate_failure(message_data: Dict[Any, Any]) -&gt; Optional[str]:
    """Simulate various failure scenarios for DLQ testing"""
    
    # Check for explicit failure simulation
    fail_simulation = message_data.get('fail_simulation')
    if fail_simulation:
        return fail_simulation
    
    # Simulate random failures based on message content
    email = message_data.get('email', '')
    if email and '@' not in email:
        return 'validation_error'
    
    message_type = message_data.get('type')
    if message_type == 'malformed_message':
        return 'schema_error'
    
    # Random API timeout (10% chance)
    if random.random() &lt; 0.1:
        return 'api_timeout'
    
    return None  # No failure
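
# Illustrative calls (hypothetical, not in the original post) and the branch
# each hits; each returns before reaching the random api_timeout check:
#   simulate_failure({"fail_simulation": "api_timeout"})  -&gt; "api_timeout"
#   simulate_failure({"email": "no-at-sign"})             -&gt; "validation_error"
#   simulate_failure({"type": "malformed_message"})       -&gt; "schema_error"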
</code></code></pre><div><hr></div><h3>Testing Guide</h3><h4>Step 1: Start the System</h4><pre><code><code># Start all services
./start-system.sh

# Verify services are running
docker-compose ps
</code></code></pre><p><strong>Expected Output</strong>:</p><pre><code><code>NAME                                STATUS                      PORTS                                
dlq-test-consumer-service-1   Up X minutes                                                     
dlq-test-dlq-reprocessor-1    Up X minutes                                                     
dlq-test-postgres-1           Up X minutes (healthy)     0.0.0.0:5432-&gt;5432/tcp               
dlq-test-producer-service-1   Up X minutes               0.0.0.0:8080-&gt;8080/tcp               
dlq-test-pubsub-emulator-1    Up X minutes (healthy)     0.0.0.0:8085-&gt;8085/tcp               
dlq-test-web-dashboard-1      Up X minutes               0.0.0.0:3000-&gt;3000/tcp
</code></code></pre><h4><strong>Step 2: Test Message Publishing</strong></h4><p><strong>Test 1: Valid Message (Should Succeed)</strong></p><pre><code><code>curl -X POST http://localhost:8080/publish \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "user_signup",
    "email": "test@example.com", 
    "name": "Test User"
  }'
</code></code></pre><p><strong>Expected Response</strong>:</p><pre><code><code>{
  "message_id": "1",
  "status": "published",
  "timestamp": "2025-09-11T18:42:31.069694",
  "topic": "user-events"
}
</code></code></pre><h4>Test 2: Invalid Message (Should Fail &#8594; DLQ)</h4><pre><code><code>curl -X POST http://localhost:8080/publish \
  -H 'Content-Type: application/json' \
  -d '{
    "type": "user_signup",
    "email": "invalid-email",
    "name": "Invalid User",
    "fail_simulation": "validation_error"
  }'</code></code></pre><h4>Test 3: Batch Test Messages</h4><pre><code><code>curl -X POST http://localhost:8080/publish/samples</code></code></pre><p><strong>Expected Response</strong>:</p><pre><code><code>{
  "description": "Published sample messages including some that will fail for DLQ testing",
  "messages": [
    {"message_id": "3", "type": "user_signup", "will_fail": false},
    {"message_id": "4", "type": "user_signup", "will_fail": true},
    {"message_id": "5", "type": "user_signup", "will_fail": true}
  ],
  "published_count": 5
}
</code></code></pre><h4><strong>Step 3: Monitor DLQ Flow</strong></h4><p><strong>Check Consumer Logs</strong></p><pre><code><code>docker-compose logs -f consumer-service</code></code></pre><p><strong>Expected Output</strong>:</p><pre><code><code>consumer-service-1  | INFO - Processing message 1, type: user_signup, dlq_retry_count: 0
consumer-service-1  | WARNING - Processing failed (validation_error), will go to DLQ, message_id: 2
consumer-service-1  | WARNING - Message nacked, will go to DLQ: 2</code></code></pre><p><strong>Check DLQ Reprocessor Logs</strong></p><pre><code><code>docker-compose logs -f dlq-reprocessor</code></code></pre><p><strong>Expected Output</strong>:</p><pre><code><code>dlq-reprocessor-1  | INFO - Reprocessing message 2 (DLQ retry: 0)
dlq-reprocessor-1  | INFO - Message 2 republished as 8 with dlq_retry_count=1</code></code></pre><p><strong>Check Consumer Processing Retry</strong></p><pre><code><code>docker-compose logs -f consumer-service | grep "dlq_retry_count"</code></code></pre><p><strong>Expected Output</strong>:</p><pre><code><code>consumer-service-1  | INFO - Processing message 8, type: user_signup, dlq_retry_count: 1
consumer-service-1  | INFO - Processing message 9, type: user_signup, dlq_retry_count: 2
consumer-service-1  | INFO - Message exceeded DLQ retry limit, sending to PLQ</code></code></pre><h4><strong>Step 4: Verify PLQ Storage</strong></h4><pre><code><code>docker-compose exec postgres psql -U dlq_user -d dlq_system</code></code></pre><pre><code><code>-- Check error logs
SELECT id, message_id, error_type, dlq_retry_count, created_at 
FROM error_log 
ORDER BY created_at DESC 
LIMIT 10;
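
-- (Hypothetical query, assuming the error_log schema above) Break down
-- failures by category, mirroring the "% of errors" analysis mentioned earlier:
SELECT error_type, COUNT(*) AS occurrences
FROM error_log
GROUP BY error_type
ORDER BY occurrences DESC;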

-- Check PLQ messages (retry count &gt;= 3)
SELECT message_id, error_type, dlq_retry_count 
FROM error_log 
WHERE dlq_retry_count &gt;= 3;</code></code></pre><h4><strong>Step 5: Web Dashboard Testing</strong></h4><ol><li><p><strong>Open Dashboard</strong>: <a href="http://localhost:3000">http://localhost:3000</a></p></li><li><p><strong>View Error Statistics</strong>: Check failed message counts</p></li><li><p><strong>Analytics Page</strong>: <a href="http://localhost:3000/analytics">http://localhost:3000/analytics</a></p></li><li><p><strong>Manual Intervention</strong>: Click on failed messages for details</p></li></ol><h4><strong>Step 6: Comprehensive Test Script</strong></h4><pre><code><code># Run all test scenarios
./test-dlq-scenarios.sh</code></code></pre><blockquote><p>Access the code here: <strong><a href="https://github.com/rahul-dhar-e5609/dlq-test">Github</a></strong></p></blockquote><div><hr></div><h3>Summary</h3><p>The poison pill problem shows how a single unprocessed message can bring down an entire pipeline. Dead Letter Queues (DLQs) provide a safety net by isolating failed messages so the rest of the system continues running smoothly.</p><p>In this post, we covered:</p><ul><li><p><strong>What DLQs are</strong> and how they solve the poison pill problem.</p></li><li><p><strong>Design decisions</strong> around retries, failure modes, metadata, and thresholds.</p></li><li><p><strong>Strategies</strong> for handling transient vs. permanent failures.</p></li><li><p><strong>Trade-offs</strong> between Pub/Sub DLQs and database error tables.</p></li><li><p><strong>Hybrid approach</strong> for combining operational resilience with deep visibility.</p></li><li><p><strong>Implementation details</strong> with a true DLQ pattern in Google Pub/Sub, complete with reprocessing, retry counts, and error logging.</p></li></ul><p>By designing DLQs carefully, you ensure that your system is:</p><ul><li><p><strong>Resilient</strong>: no single message can clog the pipeline.</p></li><li><p><strong>Visible</strong>: errors are captured and analyzed.</p></li><li><p><strong>Flexible</strong>: transient errors can be retried, permanent ones quarantined.</p></li><li><p><strong>Controllable</strong>: engineers always have the final say through manual intervention.</p></li></ul><p>Dead Letter Queues aren&#8217;t just about &#8220;dumping&#8221; bad messages, they&#8217;re about building <strong>robust, fault-tolerant pipelines</strong> that can gracefully handle the unexpected.</p>]]></content:encoded></item><item><title><![CDATA[Why “Just Publish to Kafka” Almost Sank Our Service]]></title><description><![CDATA[How we went from 4-5 escalations per week to zero by adopting a simple but 
powerful pattern.]]></description><link>https://www.rahuldhar.me/p/how-outbox-pattern-saved-our-service</link><guid isPermaLink="false">https://www.rahuldhar.me/p/how-outbox-pattern-saved-our-service</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Sun, 07 Sep 2025 04:46:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PFBH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>It was 2:00 AM when one of our downstream services pinged us:</p><blockquote><p>&#8220;Our system says a lane exists, but your APIs disagree. Which one&#8217;s right?&#8221;</p></blockquote><p>Slack was on fire. Kafka showed a brand-new lane between two facilities, but when someone queried our service, the database claimed nothing had changed. Downstream systems were already processing this phantom lane.</p><p>That was the night we learned the hard way about <strong>dual transactions</strong>.</p><p>In this post, I&#8217;ll walk you through:</p><ul><li><p><strong>The problem with dual transactions:</strong> why Postgres and Kafka can easily fall out of sync.</p></li><li><p><strong>How the Outbox Pattern works:</strong> the schema, publisher design, and flow.</p></li><li><p><strong>Real-world debugging challenges we faced:</strong> stuck rows, duplicates, backlogs, and how we fixed them.</p></li><li><p><strong>The role of BigQuery:</strong> how we used it for observability, debugging, and analytics.</p></li><li><p><strong>Trade-offs and best practices:</strong> where Outbox shines, where it doesn&#8217;t, and a checklist if you want to adopt it.</p></li><li><p><strong>Business impact:</strong> how we went from weekly 2 AM escalations to zero.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!PFBH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PFBH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!PFBH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!PFBH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!PFBH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PFBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3128481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/172951741?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PFBH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!PFBH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!PFBH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!PFBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10640812-5fa2-4f1b-a5df-072e153e362f_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>The World We Lived In</h2><p>Our service was the <strong>source of truth for all facilities</strong>: warehouses, carrier hubs, depots, cross docks, suppliers. We also managed <strong>lanes</strong>, the entities connecting facilities.</p><p>The rules were simple but strict:</p><ul><li><p>If a facility&#8217;s schedule changed, downstream needed to know.</p></li><li><p>If a lane opened or closed, downstream needed to know.</p></li></ul><p>We used <strong>Postgres</strong> as the database and <strong>Kafka</strong> as the event bus. Updates had to land in both. 
And that&#8217;s where the nightmare began.</p><div><hr></div><h2>Dual transactions, Dual Headaches</h2><p>Two failure modes haunted us:</p><ol><li><p><strong>DB succeeds, Kafka fails</strong></p><ul><li><p>Data saved in Postgres.</p></li><li><p>Event never published.</p></li><li><p>Downstream blind to reality.</p></li></ul></li><li><p><strong>Kafka succeeds, DB rolls back</strong></p><ul><li><p>Event goes out.</p></li><li><p>DB rejects transaction.</p></li><li><p>Downstream believes a lie.</p></li></ul></li></ol><p>Both cases left us with <strong>corrupt, unsynced worlds</strong>. We had dashboards in red, and multiple &#8220;let&#8217;s sync the facility and lanes data&#8221; firefights.</p><p>At our peak, this caused <strong>4-5 escalations per week</strong>. After fixing it, escalations dropped to <strong>zero</strong>.</p><div><hr></div><h2>Enter the Outbox Pattern</h2><p>Instead of trying to write to <strong>two systems atomically</strong> (which distributed systems hate), we decided to only trust <strong>one write path</strong>: the database.</p><p>Here&#8217;s how:</p><p><strong>Write Once</strong></p><ul><li><p>Every DB transaction that modifies business data also inserts an event into an <strong>outbox table</strong>.</p></li><li><p>If the DB commit succeeds, both data + outbox entry are durable.</p></li></ul><pre><code><code>CREATE TABLE facility_outbox (
    id BIGSERIAL PRIMARY KEY,
    aggregate_type VARCHAR(50),
    aggregate_id UUID,
    event_type VARCHAR(50),
    payload JSONB,
    created_at TIMESTAMP DEFAULT now(),
    processed BOOLEAN DEFAULT false
);</code></code></pre><p><strong>Query-based CDC</strong></p><ul><li><p>A lightweight publisher job polls this outbox table.</p></li><li><p>It picks unprocessed rows, marks them, then pushes them to Kafka.</p></li></ul><pre><code><code>SELECT * FROM facility_outbox
WHERE processed = false
FOR UPDATE SKIP LOCKED
LIMIT 100;</code></code></pre><p><strong>Mark as Processed</strong></p><ul><li><p>After a successful publish, mark rows as <code>processed = true</code>.</p></li><li><p>If publish fails, retry safely.</p></li></ul><p>Now the <strong>DB was the single source of truth</strong>, and Kafka simply reflected it. No more split-brain.</p><div><hr></div><h2>Architecture at a Glance</h2><p>Here&#8217;s what the flow looked like after Outbox adoption:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Mz3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Mz3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 424w, https://substackcdn.com/image/fetch/$s_!4Mz3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 848w, https://substackcdn.com/image/fetch/$s_!4Mz3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 1272w, https://substackcdn.com/image/fetch/$s_!4Mz3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4Mz3!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png" width="862" height="252.79807692307693" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:427,&quot;width&quot;:1456,&quot;resizeWidth&quot;:862,&quot;bytes&quot;:102381,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rahuldhar.me/i/172951741?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Mz3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 424w, https://substackcdn.com/image/fetch/$s_!4Mz3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 848w, https://substackcdn.com/image/fetch/$s_!4Mz3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 1272w, 
https://substackcdn.com/image/fetch/$s_!4Mz3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99c994a4-cd60-4c9c-bc58-52dcbdd31fe7_3018x886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Debugging Nightmares We Solved</h2><p>This wasn&#8217;t just plug-and-play. We hit plenty of edge cases:</p><h3>1. <strong>Stuck Rows</strong></h3><ul><li><p>Sometimes publishers crashed mid-batch.</p></li><li><p>Rows got locked forever.</p></li><li><p>Fix: <code>SKIP LOCKED</code> ensured other workers could move on.</p></li></ul><h3>2. 
<strong>Duplicate Events</strong></h3><ul><li><p>Retries led to re-publishing.</p></li><li><p>Solution: consumers became <strong>idempotent</strong>, keyed by <code>event_id</code>.</p></li></ul><h3>3. <strong>Slow Publishers</strong></h3><ul><li><p>Outbox grew faster than we could drain it.</p></li><li><p>Fix: batch publishes + horizontal scaling of publishers.</p></li></ul><h3>4. <strong>Backfill Hell</strong></h3><ul><li><p>New systems wanted <em>all historical events</em>.</p></li><li><p>Answer: replay directly from the outbox table &#8594; no custom scripts.</p></li></ul><div><hr></div><h2>Why BigQuery Sink?</h2><p>We weren&#8217;t just worried about delivery&#8212;we also needed <strong>observability</strong>.</p><ul><li><p>Every Kafka topic had a <strong>BigQuery sink</strong>.</p></li><li><p>This let us:</p><ul><li><p>Run ad-hoc queries: &#8220;Show all lane updates in the last 6h.&#8221;</p></li><li><p>Debug mismatches: compare DB state vs. published events.</p></li><li><p>Prove SLAs: how fast events left the outbox.</p></li></ul></li></ul><p>It was our <strong>black box recorder</strong> for event history.</p><div><hr></div><h2>Trade-offs We Faced</h2><p>The Outbox pattern isn&#8217;t a silver bullet. We had to weigh:</p><ul><li><p><strong>Extra DB Load</strong>: every change means an extra insert.</p></li><li><p><strong>Latency</strong>: events are near-real-time, not instant (polling adds ~200&#8211;500ms).</p></li><li><p><strong>Storage Growth</strong>: outbox table can bloat if not purged/archived.</p></li><li><p><strong>Operational Complexity</strong>: need monitoring, retries, and cleanup policies.</p></li></ul><p>But for us, consistency &gt; absolute latency. 
We&#8217;d rather downstream be <strong>slightly late</strong> than <strong>totally wrong</strong>.</p><div><hr></div><h2>When to (and NOT to) Use Outbox</h2><p><strong>Use Outbox if&#8230;</strong></p><ul><li><p>Your service is a source of truth and must publish events reliably.</p></li><li><p>You&#8217;re dealing with critical entities (orders, payments, facilities, lanes).</p></li><li><p>Dual writes would cause corruption.</p></li></ul><p><strong>Don&#8217;t bother if&#8230;</strong></p><ul><li><p>Events are non-critical logs/metrics.</p></li><li><p>Latency requirements are <em>microseconds</em>.</p></li><li><p>You already have a rock-solid streaming CDC pipeline like Debezium.</p></li></ul><div><hr></div><h2>Checklist for Adopting Outbox</h2><ol><li><p>Design outbox schema (include metadata + payload).</p></li><li><p>Use <code>FOR UPDATE SKIP LOCKED</code> to avoid stuck workers.</p></li><li><p>Make consumers idempotent.</p></li><li><p>Add monitoring: queue size, lag, failed publishes.</p></li><li><p>Decide retention: archive or delete processed rows.</p></li><li><p>(Optional) Sink to BigQuery or warehouse for debugging.</p></li></ol><div><hr></div><h2>Final Thought</h2><p>After moving to Outbox, our 2 AM &#8220;facility lane mismatch&#8221; alerts vanished.<br>No more phantom lanes. No more split-brain facilities.</p><p>The system wasn&#8217;t perfect&#8212;there was extra DB load, and events weren&#8217;t <em>instant</em>. But consistency beat chaos.</p><p>And honestly? With Outbox in place, the only thing waking us up at 3 AM was the baby, not broken events. &#128118;&#10024;</p>]]></content:encoded></item><item><title><![CDATA[Beyond the DLQ: What Really Happens After a Message Fails]]></title><description><![CDATA[How to transform Dead Letter Queues into a source of resilience and learning]]></description><link>https://www.rahuldhar.me/p/beyond-the-dlq-what-really-happens</link><guid isPermaLink="false">https://www.rahuldhar.me/p/beyond-the-dlq-what-really-happens</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Sat, 30 Aug 2025 04:30:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8MDF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>I still remember the moment clearly. I was in the middle of presenting a design for a new event-driven system. I had my architecture diagram up, services, topics, retries, and yes, a neat little Dead Letter Queue (DLQ) sitting off to the side.</p><p>One of the staff engineers raised a hand and asked me a simple question:</p><p><strong>&#8220;Once a message is sent to the DLQ, then what? How do you actually know what the issue is?&#8221;</strong></p><p>For a second, I froze. I had accounted for retries. I had thought about scaling. I had made sure failures wouldn&#8217;t break the main pipeline. But I realized I hadn&#8217;t thought deeply about what happens <em>after</em> something lands in the DLQ.</p><p>That question stuck with me. 
Because the truth is, in many system designs, a DLQ is treated like a checkbox. We draw it in the diagram, it gives a sense of safety, and we move on. But if you&#8217;ve ever actually operated such a system, you know that a DLQ is not the end of the story&#8212;it&#8217;s the beginning of one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8MDF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8MDF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!8MDF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!8MDF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!8MDF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8MDF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2563989,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/172282682?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8MDF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!8MDF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!8MDF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!8MDF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24f88f53-95a8-4485-b796-44e8d6a5a811_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>The Day We Lost Data Quietly</h2><p>Not long after that presentation, I had my wake-up call.</p><p>One of our downstream services had made a schema change. The main queue retried the failing messages, but the consumers could not process them anymore. Eventually, those messages were moved into the DLQ.</p><p>And then nothing happened.</p><p>There was no alert. There was no dashboard entry calling attention to it. The DLQ sat quietly in the background, collecting failed messages that nobody was watching.</p><p>Two weeks later, those messages expired: in AWS SQS, a message is deleted once it outlives the queue&#8217;s retention period, which can be set to at most fourteen days. By the time someone noticed missing data in a downstream report, it was gone forever. We had no way of knowing what the payloads were, no trace IDs, no metadata. 
Just silence.</p><p>That incident drove home a painful lesson: DLQs do not save you. They only delay the inevitable. Unless you monitor them and act, they become silent data graves.</p><div><hr></div><h2>So, What Is a DLQ?</h2><p>Think of a DLQ like the &#8220;Lost and Found&#8221; counter at an airport. When baggage gets misplaced, it doesn&#8217;t vanish, it shows up at the counter. But if no one ever checks the counter, or if the bag doesn&#8217;t have a proper tag, it&#8217;s almost as good as lost.</p><p>That&#8217;s what happens in many systems. Messages that can&#8217;t be processed get pushed into the DLQ. They sit there quietly, like unclaimed baggage, until someone eventually notices there&#8217;s a growing pile of them. By then, the real issue, bad data, a downstream failure, or a misconfiguration, may have already caused damage.</p><p>So the real question isn&#8217;t <em>do you have a DLQ?</em> It&#8217;s <em>what do you do with it?</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Vp0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Vp0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7Vp0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!7Vp0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7Vp0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Vp0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2566301,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/172282682?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Vp0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!7Vp0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7Vp0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7Vp0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5661b5-b959-4cf3-adac-141fb64ac7b5_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>The Questions We Rarely Ask</h2><p>Looking back, I realized that my original design doc never asked the right questions:</p><ul><li><p>What happens when a message actually lands in the DLQ?</p></li><li><p>Do we consume and log it anywhere permanent?</p></li><li><p>Should metadata, like error type, trace ID, and timestamp, be attached so engineers can diagnose the cause?</p></li><li><p>Who owns the DLQ, and how quickly should they act when it fills up?</p></li><li><p>How do we prevent silent expiration?</p></li></ul><p>Without answers, the DLQ is just a checkbox. It exists, but it does not help.</p><div><hr></div><h2>Poison Pills and Infinite Loops</h2><p>Later, I ran into another problem: the poison pill message.</p><p>This is the message that can never succeed. Perhaps the payload is malformed beyond repair, or the downstream service rejects it permanently.</p><p>At one point, we had such a poison pill bouncing endlessly between the main queue and the DLQ because of a naive &#8220;automatic redrive&#8221; process. Every time it was replayed, it failed in the same way.</p><p>Instead of healing the system, our DLQ had turned into a loop of repeated failure. That was the moment I realized: redrives must never be blind. You need context. 
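</p><p>In practice, &#8220;context&#8221; means structured metadata that travels with the failed payload. A minimal sketch of such an envelope; the field names and error categories are illustrative, not tied to any particular broker:</p>

```java
import java.time.Instant;

// Hypothetical DLQ envelope: the original payload plus the context an
// engineer needs to understand why the message is here.
record DeadLetterEnvelope(
        String eventId,        // correlation / trace ID
        String sourceService,  // which service produced the message
        String errorCategory,  // e.g. "validation_error", "downstream_failure"
        int retryCount,        // how many processing attempts were made
        Instant failedAt,      // when the final attempt failed
        String payload) {      // the original message body, untouched

    // Permanent failures (poison pills) must never be blindly redriven.
    boolean isPermanentFailure() {
        return errorCategory.equals("validation_error")
                || errorCategory.equals("business_rule_violation");
    }

    public static void main(String[] args) {
        var e = new DeadLetterEnvelope("evt-123", "billing-service",
                "validation_error", 3, Instant.now(), "{}");
        System.out.println(e.isPermanentFailure()); // prints "true"
    }
}
```

<p>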
You need human eyes or smart triage before you send a message back into the system.</p><div><hr></div><h2>Turning DLQs Into Debugging Tools</h2><p>The breakthrough came when we stopped treating DLQs as storage and started treating them as debugging pipelines.</p><p>Instead of dumping raw payloads, we enriched every DLQ message with structured metadata:</p><ul><li><p>Error category (validation error, downstream failure, business rule violation)</p></li><li><p>Originating service</p></li><li><p>Retry count</p></li><li><p>Trace or correlation ID</p></li><li><p>Timestamp</p></li></ul><p>Now, when a message appeared in the DLQ, engineers could immediately understand why it was there. We had transformed the DLQ from a black box into a window into system health.</p><div><hr></div><h2>From DLQ to PLQ: A Permanent Record</h2><p>Even with metadata, DLQs came with another limitation: expiration.</p><p>We solved this by creating what we called a <strong>Parked Letter Queue (PLQ)</strong>, a database-backed error log that stored all failed messages, enriched with context.</p><p>The way I like to describe it is that the DLQ is like the emergency room: temporary, chaotic, and focused on urgent triage. The PLQ is the hospital&#8217;s archive: a permanent record of every case.</p><p>With the PLQ, we finally had:</p><ul><li><p>A searchable history of failures</p></li><li><p>An audit trail for debugging and compliance</p></li><li><p>The ability to identify recurring failure patterns over time</p></li></ul><p>That single addition meant we no longer woke up to missing data with no explanation.</p><div><hr></div><h2>A Simple Analogy</h2><p>Imagine a conveyor belt in a factory. Most items pass through fine, but occasionally one is defective. Instead of halting the belt, the item is pushed into a <strong>reject bin</strong>, that is your DLQ.</p><p>Now imagine if no one ever checks the reject bin. 
Products keep disappearing, and nobody knows why.</p><p>But if every defective item is labeled with why it failed, which machine broke, and at what time, and those labels are logged in a permanent book, you start to see patterns. Maybe Machine A jams every Monday. Maybe Supplier X always ships faulty parts. Suddenly, the reject bin is not just waste; it is a source of insight.</p><p>That is how DLQs should be treated.</p><div><hr></div><h2>Designing DLQs That Actually Help</h2><p>From these experiences, I realized that a DLQ isn&#8217;t just a &#8220;safety bucket&#8221;, it&#8217;s a tool for learning, accountability, and resilience. But that only works if it&#8217;s designed with intention. Here are some principles that have stuck with me:</p><p><strong>1. Monitor them like you mean it.</strong><br>A DLQ that no one watches is just a black hole. Track not only how many messages land there (depth), but also how long they&#8217;ve been sitting (age) and how fast new ones are arriving (throughput). A sudden spike could mean a broken deployment. A slow accumulation might indicate a silent data quality issue. Alerts on these metrics often become your first signal that the system is quietly burning.</p><p><strong>2. Enrich messages with metadata.</strong><br>A raw payload alone is often meaningless at 3 AM when you are debugging. Include trace IDs, service names, timestamps, error codes, or even a short reason for failure. Think of it as leaving breadcrumbs for your future self (or the unlucky teammate on call). The easier it is to reconstruct the story of &#8220;what went wrong,&#8221; the faster you can fix it.</p><p><strong>3. Redrive with discipline, not desperation.</strong><br>The temptation is to just shovel DLQ messages back into the main queue once the pipeline looks healthy again. That&#8217;s risky. If the root cause hasn&#8217;t been fixed, you&#8217;re just replaying the same movie. 
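</p><p>A controlled redrive can be as simple as a gate that every failed message must pass before it re-enters the main queue. The check below is a hedged sketch; the flag names and retry cap are illustrative, and nothing is replayed until a human (or a deployment marker) confirms the root cause is fixed:</p>

```java
// Hypothetical redrive gate: replaying a DLQ message is a deliberate
// decision, never an automatic reflex. Names and limits are illustrative.
class RedriveGate {
    static final int MAX_REDRIVES = 3;

    // permanentFailure: e.g. a validation error that can never succeed.
    // fixDeployed: set only after the root cause has been addressed.
    static boolean shouldRedrive(int retryCount, boolean permanentFailure, boolean fixDeployed) {
        if (permanentFailure) return false; // poison pills stay parked
        if (!fixDeployed) return false;     // otherwise we replay the same movie
        return retryCount < MAX_REDRIVES;   // cap bounces between queues
    }

    public static void main(String[] args) {
        System.out.println(shouldRedrive(1, false, true));  // prints "true"
        System.out.println(shouldRedrive(1, false, false)); // prints "false"
    }
}
```

<p>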
Instead, validate fixes, test carefully, and only then reintroduce messages. Redriving should be a controlled process, not a panic button.</p><p><strong>4. Consider a PLQ or permanent error log.</strong><br>DLQs often have retention limits. If you don&#8217;t act in time, your evidence just expires. A Parked Letter Queue (PLQ) or dedicated error store solves this by keeping failed messages along with all the metadata, forever if needed. It becomes the system&#8217;s diary of failures, a reliable record you can revisit for audits, debugging, or even training better validation rules.</p><div><hr></div><h2>Final Reflection</h2><p>That simple question during my design presentation, &#8220;Once a message is sent to the DLQ, then what?&#8221; was one of the most valuable lessons of my career.</p><p>DLQs are not the end of the story. They are the beginning of an investigation. They are not insurance policies; they are mirrors reflecting where your system is fragile.</p><p>The real measure of resilience is not whether you have a DLQ. 
It is whether you know what is inside it, and whether you are doing something about it.</p><div><hr></div><p><em>If you would like me to share a guide on how to implement DLQs effectively, whether on AWS SQS, Google Pub/Sub, or Kafka, drop a comment below.</em></p>]]></content:encoded></item><item><title><![CDATA[Idempotent Keys Explained: How Stripe's Approach Prevents Duplicate Requests and Builds Trust]]></title><description><![CDATA[A developer's guide to understanding idempotent keys, why they matter beyond payments, and how to implement them in Spring Boot for safer, more reliable systems.]]></description><link>https://www.rahuldhar.me/p/idempotent-keys-explained-how-stripes</link><guid isPermaLink="false">https://www.rahuldhar.me/p/idempotent-keys-explained-how-stripes</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Wed, 27 Aug 2025 15:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AS5_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear readers,</p><p>Let&#8217;s be honest, most of us have faced that moment of frustration when technology doesn&#8217;t behave the way we expect. Maybe you clicked &#8220;Pay&#8221; on a checkout page, the internet blinked, and you weren&#8217;t sure whether you should hit the button again. Or maybe you submitted a form twice because the first time seemed to fail, only to realize later that your action got recorded multiple times.</p><p>These little glitches don&#8217;t just annoy us as users; they can create <strong>big headaches for the systems and teams behind the scenes</strong>. Duplicate payments, repeated orders, multiple job submissions, these are not just bugs, they&#8217;re trust-breakers. 
When users lose confidence that your system will &#8220;do the right thing&#8221;, they start doubting whether they should use it at all.</p><p>This is where <strong>idempotent keys</strong>, a concept Stripe brought into the spotlight, come in. And trust me, it&#8217;s one of those rare engineering ideas that&#8217;s both simple and deeply impactful.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AS5_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AS5_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!AS5_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!AS5_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!AS5_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AS5_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:926373,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/171959229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AS5_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!AS5_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!AS5_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!AS5_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c16ef62-58c8-40b7-a4d2-e9448561a988_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>What Stripe Did with Idempotent Keys</h2><p>Stripe, being a payments company, had a problem to solve:</p><ul><li><p>Customers might refresh or retry during checkout.</p></li><li><p>Network requests might time out and be retried by clients automatically.</p></li><li><p>Without safeguards, this could result in someone being charged <strong>twice for the same transaction</strong>.</p></li></ul><p>Their solution was beautifully straightforward:</p><p>Every request sent to Stripe can include a unique <strong>idempotent key</strong>, essentially a unique identifier (like a UUID).</p><p>Here&#8217;s how it works:</p><ul><li><p>The <strong>first time</strong> Stripe sees that key, it processes the request and stores the result.</p></li><li><p>If the <strong>same key</strong> shows up again, Stripe doesn&#8217;t 
re-run the process, it simply returns the same result it gave the first time.</p></li></ul><p>The magic? To the customer, the action always feels safe and predictable. No matter how many times they click &#8220;Pay&#8221;, they&#8217;ll only ever be charged once.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8S5z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8S5z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 424w, https://substackcdn.com/image/fetch/$s_!8S5z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 848w, https://substackcdn.com/image/fetch/$s_!8S5z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 1272w, https://substackcdn.com/image/fetch/$s_!8S5z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8S5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png" width="1456" height="1458" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1458,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176153,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/171959229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8S5z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 424w, https://substackcdn.com/image/fetch/$s_!8S5z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 848w, https://substackcdn.com/image/fetch/$s_!8S5z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 1272w, https://substackcdn.com/image/fetch/$s_!8S5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc469a6d-b125-4e23-b08c-ffe10c0c17e4_1505x1507.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Why This Matters Beyond Payments</h2><p>At first glance, idempotent keys sound like a &#8220;payments thing&#8221;. But if you take a closer look, the underlying principle applies everywhere:</p><ul><li><p><strong>APIs &amp; Microservices</strong><br>When your services call each other, retries are inevitable. Idempotency ensures retries don&#8217;t create duplicate records or trigger duplicate jobs.</p></li><li><p><strong>Event-Driven Systems</strong><br>Events often get delivered more than once. Without idempotency, you risk creating duplicate entities or processing the same task twice.</p></li><li><p><strong>Batch Jobs &amp; Uploads</strong><br>Imagine uploading a large file, only for the system to fail midway. 
With idempotency, rerunning the job won&#8217;t duplicate data, it just resumes gracefully.</p></li><li><p><strong>User Actions</strong><br>From submitting feedback forms to booking tickets, idempotency makes sure each &#8220;intention&#8221; from the user gets captured once, not multiple times.</p></li></ul><p>In other words, <strong>anywhere retries or duplicate requests can happen, idempotency helps you stay safe.</strong></p><div><hr></div><h2>Benefits of Idempotent Keys</h2><ol><li><p><strong>Reliability in the face of retries</strong><br>Your system can handle the messy realities of networks and user actions without breaking trust.</p></li><li><p><strong>Better user experience</strong><br>Users feel reassured knowing they won&#8217;t be double-billed, double-registered, or double-subscribed.</p></li><li><p><strong>Operational safety</strong><br>Engineers and support teams spend less time cleaning up duplicate data or refunding accidental charges.</p></li><li><p><strong>Simplicity in design</strong><br>Instead of inventing complex duplicate-handling logic, you get a clean, repeatable pattern.</p></li><li><p><strong>Traceability</strong><br>Each request is tied to a key, making it easier to debug, audit, and reason about the system&#8217;s behaviour.</p></li></ol><div><hr></div><h2>How to Implement Idempotent Keys in Spring Boot</h2><p>Let&#8217;s make this concrete with a <strong>Java Spring Boot</strong> example. We&#8217;ll design a reusable solution using <strong>annotations</strong> so that any endpoint can become idempotent just by adding <code>@Idempotent</code>.</p><blockquote><p>Don&#8217;t feel like reading the full breakdown? Explore the code directly <a href="https://github.com/rahul-dhar-e5609/idempotent-key">here</a>.</p></blockquote><h3>Step 1. Create the Annotation</h3><p>We&#8217;ll mark methods that should support idempotency.</p><pre><code><code>@Target({ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
public @interface Idempotent {
    // Optional: custom expiry time in seconds (default = 24 hours)
    long expirySeconds() default 86400; 
}</code></code></pre><div><hr></div><h3>Step 2. Create the Aspect</h3><p>This Aspect will:</p><ul><li><p>Check for the <code>Idempotency-Key</code> header.</p></li><li><p>If the key exists in Redis, return the cached response.</p></li><li><p>If not, process the request, save the response, and return it.</p></li></ul><pre><code><code>@Aspect
@Component
public class IdempotencyAspect {

    private final RedisTemplate&lt;String, String&gt; redisTemplate;
    private final ObjectMapper objectMapper;

    public IdempotencyAspect(RedisTemplate&lt;String, String&gt; redisTemplate, ObjectMapper objectMapper) {
        this.redisTemplate = redisTemplate;
        this.objectMapper = objectMapper;
    }

    @Around("@annotation(idempotent)")
    public Object handleIdempotency(ProceedingJoinPoint joinPoint, Idempotent idempotent) throws Throwable {
        
        ServletRequestAttributes requestAttributes = (ServletRequestAttributes) RequestContextHolder.getRequestAttributes();
        if (requestAttributes == null) {
            // Not in a web request context, proceed normally
            return joinPoint.proceed();
        }

        HttpServletRequest request = requestAttributes.getRequest();
        HttpServletResponse response = requestAttributes.getResponse();

        String idempotencyKey = request.getHeader("Idempotency-Key");
        if (idempotencyKey == null || idempotencyKey.isEmpty()) {
            if (response != null) {
                response.sendError(HttpServletResponse.SC_BAD_REQUEST, "Missing Idempotency-Key header");
            }
            return null;
        }

        // Add the idempotency key to response headers for visibility
        if (response != null) {
            response.setHeader("X-Idempotency-Key", idempotencyKey);
        }

        // Check if we've seen this key
        String cachedResponse = redisTemplate.opsForValue().get(idempotencyKey);
        if (cachedResponse != null) {
            // Cache HIT - return cached response
            if (response != null) {
                response.setHeader("X-Idempotency-Cache-Status", "HIT");
            }
            
            // For cached responses, we always return a ResponseEntity with the cached body
            // since our controller methods return ResponseEntity
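            // (Sketch limitation, not part of the original design: the replay below
            // always answers 200 OK with the cached body; the original response
            // status code and headers are not preserved.)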
            try {
                Object responseBody = objectMapper.readValue(cachedResponse, Object.class);
                return ResponseEntity.ok(responseBody);
            } catch (Exception e) {
                // If deserialization fails, proceed with normal execution
                // This ensures the system remains robust
                System.err.println("Failed to deserialize cached response: " + e.getMessage());
            }
        }
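        // Note (a sketch going beyond the code above): the GET above and the SET
        // after joinPoint.proceed() are two separate Redis calls, so two concurrent
        // requests with the same key can both miss the cache and execute twice.
        // A stricter variant reserves the key atomically before processing, e.g.:
        //   Boolean reserved = redisTemplate.opsForValue()
        //       .setIfAbsent(idempotencyKey, "IN_PROGRESS", idempotent.expirySeconds(), TimeUnit.SECONDS);
        // and only proceeds with the real work when the reservation succeeds.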

        // Cache MISS - process the request and cache the response
        if (response != null) {
            response.setHeader("X-Idempotency-Cache-Status", "MISS");
        }

        Object result = joinPoint.proceed();

        // Cache the response if it's successful
        if (result != null) {
            try {
                String responseBody;
                if (result instanceof ResponseEntity&lt;?&gt; responseEntity) {
                    // Extract the body from ResponseEntity and serialize it
                    responseBody = objectMapper.writeValueAsString(responseEntity.getBody());
                } else {
                    responseBody = objectMapper.writeValueAsString(result);
                }
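                // This sketch caches any non-null result; a production version might
                // first confirm the response was successful (e.g. via
                // getStatusCode().is2xxSuccessful() on the ResponseEntity) so that
                // error responses are never replayed.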
                
                long expirySeconds = idempotent.expirySeconds();
                redisTemplate.opsForValue().set(idempotencyKey, responseBody, expirySeconds, TimeUnit.SECONDS);
            } catch (Exception e) {
                // Log the exception but don't fail the request
                System.err.println("Failed to cache idempotent response: " + e.getMessage());
            }
        }

        return result;
    }
}
</code></code></pre><div><hr></div><h3>Step 3. Use It in a Controller</h3><p>Now, simply annotate your endpoint:</p><pre><code><code>@RestController
@RequestMapping("/api")
public class SampleController {
    @PostMapping("/orders")
    @Idempotent(expirySeconds = 3600) // 1 hour custom expiry
    public ResponseEntity&lt;Map&lt;String, Object&gt;&gt; createOrder(@RequestBody Map&lt;String, Object&gt; orderRequest) {
        // Simulate order creation
        Map&lt;String, Object&gt; response = new HashMap&lt;&gt;();
        response.put("orderId", UUID.randomUUID().toString());
        response.put("status", "created");
        response.put("timestamp", LocalDateTime.now().toString());
        response.put("customerData", orderRequest);
        
        return ResponseEntity.ok(response);
    }
}</code></code></pre><div><hr></div><h3>Step 4. Client Example</h3><p>Clients must send the header:</p><p>First request: processes and creates an order.</p><pre><code><code>curl -v -X POST http://localhost:8080/api/orders \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: fixed-test-123" \
  -d '{"customerId": "customer-1", "items": [{"id": "item-1", "quantity": 2}]}'

&gt; Content-Type: application/json
&gt; Idempotency-Key: fixed-test-123
&gt; Content-Length: 72
&gt; 
* upload completely sent off: 72 bytes
&lt; HTTP/1.1 200 
&lt; X-Idempotency-Key: fixed-test-123
&lt; X-Idempotency-Cache-Status: MISS
&lt; Content-Type: application/json
&lt; Transfer-Encoding: chunked
&lt; Date: Tue, 26 Aug 2025 16:53:22 GMT
&lt; 
* Connection #0 to host localhost left intact
{"orderId":"c9431b5d-2a01-48b4-b788-442a82f5827c","customerData":{"customerId":"customer-1","items":[{"id":"
item-1","quantity":2}]},"status":"created","timestamp":"2025-08-26T16:53:22.806485375"}</code></code></pre><p>Second request with same key: returns the <em>exact same response</em>.</p><pre><code>curl -v -X POST http://localhost:8080/api/orders \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: fixed-test-123" \
  -d '{"customerId": "customer-1", "items": [{"id": "item-1", "quantity": 2}]}'

&gt; Content-Type: application/json
&gt; Idempotency-Key: fixed-test-123
&gt; Content-Length: 72
&gt; 
* upload completely sent off: 72 bytes
&lt; HTTP/1.1 200 
&lt; X-Idempotency-Key: fixed-test-123
&lt; X-Idempotency-Cache-Status: HIT
&lt; Content-Type: application/json
&lt; Transfer-Encoding: chunked
&lt; Date: Tue, 26 Aug 2025 16:53:30 GMT
&lt; 
* Connection #0 to host localhost left intact
{"orderId":"c9431b5d-2a01-48b4-b788-442a82f5827c","customerData":{"customerId":"customer-1","items":[{"id":"
item-1","quantity":2}]},"status":"created","timestamp":"2025-08-26T16:53:22.806485375"}</code></pre><blockquote><p>Dive into the full code example on <a href="https://github.com/rahul-dhar-e5609/idempotent-key">Github</a>.</p></blockquote><div><hr></div><h2>Things to Keep in Mind</h2><ul><li><p><strong>Expiration</strong><br>Don&#8217;t keep keys forever. Set a time limit that makes sense (e.g., a day for payments, longer for data uploads).</p></li><li><p><strong>Scope</strong><br>Define what the key represents. Is it for a whole order? A single file upload? A single API call?</p></li><li><p><strong>Atomicity</strong><br>Make sure storing and checking the key happen atomically to avoid race conditions.</p></li><li><p><strong>Consistency</strong><br>Be clear about which parts of the request are idempotent. For example, in payments, the <em>amount</em> and <em>currency</em> should remain constant for the same key.</p></li></ul><div><hr></div><h2>Closing Thoughts</h2><p>Idempotent keys are one of those rare engineering practices that quietly do wonders. Stripe made them famous by solving a very human problem: the fear of being charged twice. But the principle is universal, <strong>make actions safe, predictable, and trustworthy.</strong></p><p>Whether you&#8217;re running a payment system, a booking platform, or just an API that retries requests, adopting idempotent keys means you&#8217;re building resilience into your foundation. 
And resilience, at the end of the day, is what separates systems people merely use from systems people deeply trust.</p>]]></content:encoded></item><item><title><![CDATA[The Engineer Who Stayed Quiet]]></title><description><![CDATA[What happens when engineers choose to stay quiet at the wrong moment.]]></description><link>https://www.rahuldhar.me/p/the-engineer-who-stayed-quiet</link><guid isPermaLink="false">https://www.rahuldhar.me/p/the-engineer-who-stayed-quiet</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Tue, 26 Aug 2025 15:31:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CHJo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear Readers,</p><p>In this post, I will talk about:</p><ul><li><p>Why software engineers usually get fired?</p></li><li><p>Kunal&#8217;s story: how not surfacing a design risk led to a failure in production.</p></li><li><p>The difference between invisible work and visible communication.</p></li><li><p>A simple challenge to make your thinking louder in everyday updates.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CHJo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CHJo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!CHJo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!CHJo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!CHJo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CHJo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3525325,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/171882863?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!CHJo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!CHJo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!CHJo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!CHJo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3399b754-cb56-4118-b449-d388e6709bbf_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><p>Most engineers don&#8217;t lose their jobs because they wrote bad code. They lose them because they stayed quiet.</p><p>I used to work with an engineer named Kunal.</p><p>Kunal wasn&#8217;t the loudest voice in the room. In fact, he was the kind of engineer every manager loves: reliable, never missed deadlines, never complained. But that reliability came with silence.</p><p>When requirements were vague, he just assumed and started coding.<br>When timelines were unrealistic, he pulled late nights instead of pushing back.<br>When he noticed flaws in the design that might cause scaling issues down the line, he kept it to himself.</p><p>Each time, silence felt easier than debate. Until it wasn&#8217;t.</p><p>Three months later, exactly what Kunal had feared came true. The API couldn&#8217;t handle the surge in traffic. Latency spiked. Upstream services started failing. Customers complained. Leadership was in midnight war-room calls. And the hardest part? Kunal had seen it coming all along. He just never said it out loud.</p><p>That&#8217;s when I realized something important:<br>You can ship bugs and still keep your job. But if you don&#8217;t communicate risk, you become the risk.</p><p>Invisible work doesn&#8217;t get you promoted.<br>Invisible problems get you replaced.</p><p>And the truth is, many of us have a little bit of Kunal in us. We notice risks but tell ourselves, <em>&#8220;I&#8217;ll just work harder to cover it up.&#8221;</em> We think silence will protect us, but it only isolates us.</p><p>The best engineers I&#8217;ve seen don&#8217;t just write great code. They make their thinking visible. 
They clarify the scope. They surface risks early. They translate technical trade-offs into business clarity. And because of that, people trust them.</p><p>If you want to become indispensable, don&#8217;t just improve your coding skills. Improve how loudly and clearly you share your thinking.</p><p>So here&#8217;s a small challenge for you this week: in your next standup or update, don&#8217;t just share what you did. Share one risk you see, one trade-off you&#8217;re making, and one thing you need clarity on.</p><p>Small, consistent signals build big trust.</p>]]></content:encoded></item><item><title><![CDATA[Managing Managers: A Skill Everyone Needs]]></title><description><![CDATA[You are the main character in your career journey. It is you, not your manager, who leads your career journey.]]></description><link>https://www.rahuldhar.me/p/managing-up-a-skill-everyone-needs</link><guid isPermaLink="false">https://www.rahuldhar.me/p/managing-up-a-skill-everyone-needs</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Sun, 24 Aug 2025 03:30:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EdFp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Dear Readers,</p><p>Most of us will eventually become managers. But all of us, at some point, will have managers.</p><p>And here is the tricky part: very few of us actually <em>manage</em> our managers well.</p><p>I have worked across startups and large organizations, and one pattern I have noticed is this: talented engineers burn out not because of the work itself, but because their relationship with their manager is misaligned. They either see their manager as an obstacle or as a saviour. 
Both are flawed mental models.</p><p>So let us talk about how to make this relationship work, not by waiting for your manager to magically &#8220;get better,&#8221; but by learning how to manage up effectively.</p><p>Warmly,<br>~Rahul</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EdFp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EdFp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!EdFp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!EdFp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!EdFp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EdFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/171711519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EdFp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!EdFp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!EdFp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!EdFp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18892d44-f790-43dd-aa03-822c81aa4475_1920x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h3>In this post</h3><p>We will talk about:</p><ul><li><p>What your manager is not</p></li><li><p>When your relationship with your manager feels tough</p></li><li><p>What a great manager&#8211;report relationship looks like</p></li><li><p>How to think about performance reviews</p></li><li><p>What you should expect from your manager</p></li><li><p>How to give feedback to your manager</p></li><li><p>How I manage sync-ups and alignment with my manager</p></li></ul><h2>What your manager is not</h2><p>When we are new to the workforce, we often put managers into extreme roles.</p><ul><li><p><strong>The executioner.</strong><br>This is the belief that &#8220;If I make a mistake, my manager will punish me, maybe even fire me.&#8221; Every interaction feels like walking on eggshells.</p></li><li><p><strong>The 
saviour.</strong><br>This is the belief that &#8220;My manager will fight for me, protect me, and guarantee my promotion.&#8221; Here, we rely on them to make everything better for us.</p></li></ul><p>Both of these mental models are misleading. Why? Because both give away <em>your power</em>.</p><p>Yes, your manager has influence. They can promote you, give you challenging opportunities, or if things go badly, let you go. But that doesn&#8217;t mean they control your entire career. You do.</p><p>Think of it like this: your career is a book, and you are the author. Your manager? At best, they are a key character, sometimes even a co-author of a few chapters. But they are not the hero, and they are not the one holding the pen.</p><p>The shift begins when you stop asking, <em>&#8220;What will my manager do for me?&#8221;</em> and start asking, <em>&#8220;What can I do to make this relationship better for both of us?&#8221;</em></p><div><hr></div><h2>When your relationship with your manager feels tough</h2><p>Maybe your manager doesn&#8217;t give you feedback. Maybe they never seem to have time for you. Or maybe you feel like they don&#8217;t understand what you are capable of. It happens a lot.</p><p>But here is the good news: you can take steps to improve things.</p><ol><li><p><strong>Know what good looks like.</strong><br>Before trying to fix anything, define what a good working relationship means to you and your manager. Do you want more context about the work or the strategy? Are the goals clear? Is there an honest feedback loop? Write it all down.</p></li><li><p><strong>Talk about it openly.</strong><br>Use your 1:1s. Say something like: <em>&#8220;Here is the kind of working relationship I would love us to have. For example, I would really appreciate more feedback on my projects. Does that make sense to you?&#8221;</em></p></li><li><p><strong>Be clear about your needs.</strong><br>Vague complaints don&#8217;t help. 
Instead of &#8220;I feel unsupported,&#8221; say: <em>&#8220;It would help me if I got feedback after the demo, so I know how to improve next time.&#8221;</em></p></li><li><p><strong>Give your manager feedback.</strong><br>Once a month, tell them what is working. <em>&#8220;I appreciated that you gave me context before that meeting, it made me more confident.&#8221;</em> Then add what you would like more of.</p></li><li><p><strong>Ask for feedback.</strong><br>Don&#8217;t wait for annual reviews. Simple questions work: <em>&#8220;What is one thing I could do differently that would make your job easier?&#8221;</em></p></li><li><p><strong>Escalate if ignored.</strong><br>If you have given clear, specific feedback several times and nothing changes, it is fair to bring it up with their manager. Do it respectfully and factually.</p></li><li><p><strong>Leave if trust is gone.</strong><br>If you have tried everything and still don&#8217;t trust your manager, it may be time to move on. Don&#8217;t let one person&#8217;s limitations block your career.</p></li></ol><div><hr></div><h2>What a great manager&#8211;report relationship looks like</h2><p>The best relationships between managers and their team members are built on two things: <strong>trust and alignment</strong>.</p><p>Ask yourself:</p><ul><li><p>Do I trust my manager to do their job well?</p></li><li><p>Does my manager trust me to do my job well?</p></li></ul><p>If either answer is &#8220;not really,&#8221; there is probably misalignment.</p><p>For example:</p><ul><li><p>You might think your manager&#8217;s role is to back your ideas.</p></li><li><p>But your manager might think their role is to challenge your ideas and push for better solutions.</p></li></ul><p>Neither is wrong, but if you don&#8217;t talk about it, you will constantly misunderstand each other.</p><p>So, ask questions:</p><ul><li><p><em>&#8220;What should I do differently to get more visibility?&#8221;</em></p></li><li><p><em>&#8220;How would you prefer I share feedback with you?&#8221;</em></p></li></ul><p>Once both of you have laid out your definitions of success, you can co-create a shared vision. This clarity transforms the relationship from transactional (&#8220;Do this task&#8221;) to collaborative (&#8220;Let&#8217;s make each other successful&#8221;).</p><div><hr></div><h2>How to think about performance reviews</h2><p>Performance reviews often get a bad reputation as judgment days. But if you approach them differently, they can become powerful growth moments.</p><ul><li><p><strong>Celebrate strengths.</strong> If your manager praises something, don&#8217;t dismiss it. Lean into it. Strengths are often where your biggest opportunities lie.</p></li><li><p><strong>Over-correct on patterns.</strong> If you keep hearing &#8220;speak up more in meetings,&#8221; don&#8217;t just nudge slightly. Set yourself a goal: &#8220;I&#8217;ll contribute at least three points in every meeting.&#8221; It might feel unnatural, but that is often what is needed to shift perception.</p></li><li><p><strong>Shape your next review.</strong> Instead of passively waiting, imagine the review you want six months from now. Do you want to be known as a strong mentor? A clear communicator? Tell your manager: <em>&#8220;By the next review, I would love to be recognized for X. Can you help me get there?&#8221;</em></p></li></ul><div><hr></div><h2>What you should expect from your manager</h2><p>This isn&#8217;t one-sided. Managers owe you certain things. At a minimum, expect them to:</p><ul><li><p>Give you clarity on priorities and goals.</p></li><li><p>Provide regular, constructive feedback.</p></li><li><p>Create opportunities for you to grow.</p></li><li><p>Remove blockers you can&#8217;t clear alone.</p></li><li><p>Advocate for your work when recognition or promotions are on the line.</p></li></ul><p>If they are not doing these things, speak up. 
It is not disrespectful, it is necessary.</p><div><hr></div><h2>How to give feedback to your manager</h2><p>This is where most people freeze. But here is the truth: managers need feedback too.</p><ul><li><p><strong>Start with curiosity.</strong> Instead of &#8220;You don&#8217;t support me,&#8221; try: <em>&#8220;I noticed we don&#8217;t often do feedback sessions. Is that intentional? How do you prefer to share feedback?&#8221;</em></p></li><li><p><strong>Frame it around impact.</strong> <em>&#8220;When I don&#8217;t get feedback after presentations or product demos, I am not sure if I met expectations. It would help me improve faster if you shared your thoughts.&#8221;</em></p></li><li><p><strong>Be specific.</strong> Concrete requests are easier to act on.</p></li><li><p><strong>Follow up.</strong> If things don&#8217;t improve, ask again: <em>&#8220;I wanted to check if the changes we discussed are working for you as well.&#8221;</em></p></li><li><p><strong>Escalate respectfully.</strong> If nothing changes, involve their manager with facts, not emotions.</p></li></ul><p>Giving feedback to your manager is not rebellion &#8212; it is leadership.</p><div><hr></div><h3>How I manage sync-ups and alignment with my manager</h3><p>A strong manager&#8211;report relationship doesn&#8217;t just happen in performance reviews. It is built in the small, consistent ways you communicate. Here is what works for me:</p><p><strong>1. 
Biweekly 1:1s with structure</strong><br>I keep a simple structure for my one-on-ones so that both of us walk away with clarity:</p><ul><li><p><strong>20%</strong>: What I have accomplished</p></li><li><p><strong>20%</strong>: What I can do better</p></li><li><p><strong>20%</strong>: Where I fell short</p></li><li><p><strong>40%</strong>: Action items</p><ul><li><p><strong>50% on me</strong></p></li><li><p><strong>50% on my manager</strong></p></li></ul></li></ul><p>This balance makes the conversation honest, forward-looking, and accountable on both sides. It is not just me reporting status. It is us aligning and committing to actions together.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y19a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y19a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 424w, https://substackcdn.com/image/fetch/$s_!Y19a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 848w, https://substackcdn.com/image/fetch/$s_!Y19a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!Y19a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y19a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png" width="1456" height="731" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:210936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/171711519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y19a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 424w, https://substackcdn.com/image/fetch/$s_!Y19a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 848w, https://substackcdn.com/image/fetch/$s_!Y19a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y19a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d75fa2-f745-4a15-b2e5-8444c3fd307a_2604x1308.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>2. Dedicated Slack channels for visibility</strong><br>I use Slack heavily to reduce back-and-forth and increase transparency. For each project, I create dedicated channels that include my manager and my team. 
We use them for:</p><ul><li><p>Design discussions</p></li><li><p>Deliverables and sign-offs</p></li><li><p>Roadblocks and clarifications</p></li></ul><p>This ensures my manager always has visibility into the work, without needing to chase me for updates. One principle I live by: <strong>your manager should already know the state of your work before they ask.</strong></p><p>This combination of structured sync-ups and transparent async updates creates trust, alignment, and accountability.</p>]]></content:encoded></item><item><title><![CDATA[My Journey Understanding an Inherited PostgreSQL System]]></title><description><![CDATA[Why Your Database Ignores Indexes: A Journey Through Planner Mysteries and Design Choices]]></description><link>https://www.rahuldhar.me/p/understanding-postgres-query-planner</link><guid isPermaLink="false">https://www.rahuldhar.me/p/understanding-postgres-query-planner</guid><dc:creator><![CDATA[Rahul Dhar]]></dc:creator><pubDate>Tue, 19 Aug 2025 19:17:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/39d95d03-25ce-4492-9cc0-edb4880fa7f3_1920x1281.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>The Puzzle 
That Started It All</strong></h2><p>On a recent project, I was puzzled by what seemed like an inconsistent indexing setup. It bothered me, so I decided to dig deeper. What I discovered taught me that "wrong" isn&#8217;t always wrong. It is often about trade-offs I hadn&#8217;t considered.</p><h2><strong>The Starting Point: A Confusing Primary Key</strong></h2><p>The first thing that caught my attention was the primary key design:</p><pre><code><code>CREATE TABLE locations (
    location_id TEXT NOT NULL,      -- e.g., 'A01020304' (hierarchical location code)
    warehouse_id TEXT NOT NULL,     -- e.g., '25' 
    -- ... other columns
    CONSTRAINT pk_locations_location_id_warehouse_id 
        PRIMARY KEY (location_id, warehouse_id)
);
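
-- A sample row, to make the composite key concrete (values are
-- hypothetical, matching the column comments above):
-- INSERT INTO locations (location_id, warehouse_id)
-- VALUES ('A01020304', '25');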
</code></code></pre><p>This struck me as backwards. In most multi-tenant systems I've worked with, you'd cluster by <code>warehouse_id</code> first for tenant isolation. But here, <code>location_id</code> comes first. I wondered if this was a mistake or if there was reasoning I wasn't seeing.</p><h2><strong>Understanding the Location ID Structure</strong></h2><p>As I dug deeper, I discovered that these location IDs weren't just random strings. They followed a specific hierarchical structure that was key to understanding the whole system. Each warehouse had thousands (sometimes millions) of storage locations, and every location was encoded into a structured string ID.</p><p>For example, a location ID like <code>A01020304</code> breaks down as:</p><ul><li><p><strong>Aisle</strong>: <code>A01</code> (an uppercase letter + 2 digits)</p></li><li><p><strong>Bay</strong>: <code>02</code> (2 digits)</p></li><li><p><strong>Level</strong>: <code>03</code> (2 digits)</p></li><li><p><strong>Position</strong>: <code>04</code> (2 digits)</p></li></ul><p>When someone searches for <code>A01%</code>, they're looking for all locations in aisle A01 across any warehouse. When they search for <code>H1%</code>, they want all locations in aisles starting with H1. This spatial organization is crucial for warehouse operations: workers need to find adjacent locations, analyze storage density by aisle, or audit entire sections of the warehouse.</p><h2><strong>My First Breakthrough: The Migration History</strong></h2><p>Curious about the primary key design, I dug through the database migration history. What I found surprised me:</p><p><strong>Original Design </strong><em>(October 2023)</em></p><pre><code><code>PRIMARY KEY (warehouse_id, location_id)  -- Warehouse first</code></code></pre><p><strong>Changed to </strong><em>(February 2025)</em></p><pre><code><code>PRIMARY KEY (location_id, warehouse_id)  -- Location first</code></code></pre><p>So it wasn't a mistake! 
The previous team had deliberately switched the primary key ordering. This made me even more curious. What drove that decision? And more importantly, I started noticing something concerning in my testing: <strong>PostgreSQL kept avoiding this index, even when it seemed like it would be faster</strong>.</p><h2><strong>My Testing Adventure: When the Planner Surprised Me</strong></h2><p>This is where my investigation got really interesting. I decided to systematically test different query patterns to understand when PostgreSQL was making good choices versus poor ones. What I found challenged everything I thought I knew about query optimization.</p><h3><strong>Test 1: Location Pattern Query</strong></h3><pre><code><code>EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT location_id, warehouse_id
FROM locations
WHERE location_id LIKE 'H1%';
</code></code></pre><p><strong>Result</strong>: 477ms with parallel sequential scan for 92,689 rows (~4% of table)</p><p><em>PostgreSQL chose sequential scan, which makes sense for large result sets.</em></p><h3><strong>Test 2: Warehouse-Specific Query</strong></h3><pre><code><code>EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT location_id, warehouse_id
FROM locations
WHERE warehouse_id = '25';
</code></code></pre><p><strong>Natural planner choice</strong>: 567ms (sequential scan) <strong>Forced index usage</strong>: 282ms (<strong>50% faster!</strong>)</p><pre><code><code>SET enable_seqscan = false;
-- Same query runs in 282ms with bitmap index scan
RESET enable_seqscan;
</code></code></pre><h3><strong>Test 3: Combined Query</strong></h3><pre><code><code>EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT location_id, warehouse_id
FROM locations
WHERE warehouse_id = '25' AND location_id LIKE 'H1000%';
</code></code></pre><p><strong>Natural planner choice</strong>: 501ms (sequential scan) <strong>Forced index usage</strong>: 354ms (<strong>29% faster</strong>)</p><h3><strong>Test 4: LIMIT with OFFSET Queries</strong></h3><pre><code><code>EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT location_id, warehouse_id
FROM locations
WHERE location_id LIKE 'H1%' AND warehouse_id = '09' 
LIMIT 100 OFFSET 100;
</code></code></pre><p><strong>Natural planner choice</strong>: 436ms (sequential scan) <strong>Forced index usage</strong>: 204ms (<strong>53% faster</strong>)</p><pre><code><code>-- Natural planner result (sequential scan)
Limit  (cost=6560.89..12121.78 rows=100 width=12) (actual time=412.845..436.060 rows=0 loops=1)
  -&gt;  Gather
        -&gt;  Parallel Seq Scan on locations
              Filter: (((location_id)::text ~~ 'H1%'::text) AND ((warehouse_id)::text = '09'::text))
              Rows Removed by Filter: 746758

-- Forced index result  
Limit  (cost=9743.66..19486.90 rows=100 width=12) (actual time=204.298..204.300 rows=0 loops=1)
  -&gt;  Index Only Scan using pk_locations_location_id_warehouse_id on locations
        Index Cond: (warehouse_id = '09'::text)
        Filter: ((location_id)::text ~~ 'H1%'::text)
        Rows Removed by Filter: 30265
</code></code></pre><p><strong>Key insight</strong>: Even with LIMIT + OFFSET, the index scan won significantly. The index could filter by warehouse_id first (30k rows examined) vs. sequential scan examining 746k rows across workers.</p><h3><strong>Test 5: Simple LIMIT Queries (When the Planner Gets It Right)</strong></h3><pre><code><code>EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT location_id, warehouse_id
FROM locations
WHERE location_id LIKE 'H1%' 
LIMIT 100;
</code></code></pre><p><strong>Natural planner choice</strong>: 0.579ms (sequential scan) <strong>Forced index usage</strong>: 321ms (<strong>550x slower!</strong>)</p><pre><code><code>-- Sequential scan result
Limit  (cost=0.00..91.49 rows=100 width=12) (actual time=0.035..0.553 rows=100 loops=1)
  -&gt;  Seq Scan on locations
        Filter: ((location_id)::text ~~ 'H1%'::text)
        Rows Removed by Filter: 3060
        
-- Forced index scan result  
Limit  (cost=0.43..224.78 rows=100 width=12) (actual time=321.386..321.501 rows=100 loops=1)
  -&gt;  Index Only Scan using pk_locations_location_id_warehouse_id on locations
        Filter: ((location_id)::text ~~ 'H1%'::text)
        Rows Removed by Filter: 433527
</code></code></pre><p><strong>Here, the planner was absolutely correct!</strong> Sequential scan found 100 results after examining only 3,160 rows, while the index scan had to examine 433,627 rows to find the same 100 results.</p><p><strong>Why LIMIT changes everything:</strong></p><ul><li><p>Sequential scan can <strong>stop as soon as it finds enough rows</strong></p></li><li><p>H1% patterns are distributed throughout the data</p></li><li><p>Index scan must traverse scattered index pages to find matches</p></li><li><p>For small result sets, "first 100 found" beats "optimally ordered access"</p></li></ul><h2><strong>Summary: When Indexes Help vs. Hurt</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KKH1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KKH1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 424w, https://substackcdn.com/image/fetch/$s_!KKH1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 848w, https://substackcdn.com/image/fetch/$s_!KKH1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KKH1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KKH1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png" width="1456" height="451" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225394,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rahuldhar47.substack.com/i/171376154?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KKH1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 424w, https://substackcdn.com/image/fetch/$s_!KKH1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 848w, 
https://substackcdn.com/image/fetch/$s_!KKH1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 1272w, https://substackcdn.com/image/fetch/$s_!KKH1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd87ebaef-5fda-41d8-b5ae-cb84fc69de2c_2372x734.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>The Pattern</strong>:</p><ul><li><p><strong>Large scans</strong>: Sequential wins or is 
reasonable</p></li><li><p><strong>Medium selectivity with filtering</strong>: Index wins significantly (even with LIMIT + OFFSET)</p></li><li><p><strong>Simple LIMIT queries</strong>: Sequential wins dramatically (can stop early)</p></li></ul><h2><strong>Understanding Why PostgreSQL Made These Choices</strong></h2><p>The more I tested, the more I realized there was a pattern to the planner's behavior. <strong>PostgreSQL's cost-based optimizer sometimes chose suboptimal plans, but sometimes it was exactly right</strong>. I needed to understand the logic behind these decisions.</p><h3><strong>1. The Cost Model Mystery</strong></h3><p>I discovered that PostgreSQL uses abstract "cost units" to compare execution strategies. Looking at my system's configuration, I found these parameters:</p><pre><code><code>-- My system's cost parameters (defaults)
SHOW random_page_cost;        -- 4.0 (random reads cost 4x sequential!)
SHOW effective_cache_size;    -- 1502984kB (~1.5GB in my case)
seq_page_cost = 1.0          -- Cost to read a page sequentially  
cpu_tuple_cost = 0.01        -- Cost to process one row
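
-- My simplified sketch of the arithmetic (an assumption, not the
-- planner's exact formula):
--   seq scan cost   ~= pages_read * seq_page_cost + rows * cpu_tuple_cost
--   index scan cost ~= (index pages + heap pages fetched) * random_page_cost + ...
-- With random_page_cost = 4.0, every scattered heap fetch is charged 4x,
-- which is how the index plan priced out at roughly double the sequential plan.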
</code></code></pre><p>For my warehouse query (<code>warehouse_id = '25'</code> with 6.5% selectivity), the planner calculated:</p><ul><li><p><strong>Sequential scan cost</strong>: ~67,000 units (read entire table sequentially, low per-page cost)</p></li><li><p><strong>Index scan cost</strong>: ~130,000 units (random index + heap access, 4x page cost penalty)</p></li></ul><p>This explained why PostgreSQL kept choosing sequential scans! <strong>The planner assumed random I/O was 4x more expensive than sequential I/O</strong>. But when I forced the index usage, it was actually 50% faster. The cost model seemed to be wrong about modern storage characteristics.</p><h3><strong>2. The Index Scattering Problem</strong></h3><p>I realized that with the <code>(location_id, warehouse_id)</code> primary key structure I inherited:</p><ul><li><p>All warehouse '25' locations were <strong>scattered throughout the index</strong></p></li><li><p>To find them, the index scan had to jump between many different location_id ranges (A01..., B01..., H1...)</p></li><li><p>This created the "random" access pattern that the cost model was heavily penalizing</p></li></ul><h3><strong>3. Statistics vs. Reality</strong></h3><p>Interestingly, PostgreSQL's row count estimates were quite accurate (144k estimated vs 145k actual rows). The problem wasn't with statistics - <strong>it was with cost estimation</strong>. 
The planner didn't account for:</p><ul><li><p>How efficiently modern SSDs handle the "random" access patterns</p></li><li><p>How much of the data was already cached in the 1.5GB buffer pool</p></li><li><p>How well bitmap heap scans actually perform on this hardware</p></li></ul><h2><strong>My Analysis: Was This Bad Design?</strong></h2><p>The big question that kept bugging me: <strong>When the planner consistently chooses slower execution plans, does that mean the database design is wrong?</strong></p><h3><strong>My Conclusion: It's All About Trade-offs</strong></h3><p>As I analyzed the system deeper, I realized <strong>the primary key design created a fundamental trade-off:</strong></p><ol><li><p><strong>Location-first ordering </strong><code>(location_id, warehouse_id)</code> optimized for:</p><ul><li><p>Global operations spanning all warehouses.</p></li><li><p>Spatial analysis across the entire location hierarchy</p></li></ul></li><li><p><strong>But it penalized warehouse-specific queries</strong> because:</p><ul><li><p>Warehouse data became scattered throughout the index</p></li><li><p>This forced "random" access patterns that the cost model heavily penalized</p></li></ul></li></ol><h3><strong>Understanding the Previous Team's Logic</strong></h3><p>Looking back at the migration history, I realized this was a <strong>completely deliberate decision</strong>:</p><p><strong>October 2023</strong>: <code>PRIMARY KEY (warehouse_id, location_id)</code> (conventional multi-tenant design) </p><p><strong>February 2025</strong>: <code>PRIMARY KEY (location_id, warehouse_id)</code> (location-first optimization)</p><p>The team had switched from the "obvious" warehouse-first design to location-first. 
This told me:</p><ul><li><p>Cross-warehouse location queries were more critical to the business than warehouse isolation</p></li><li><p>They had probably measured the performance impact and decided the trade-off was worth it</p></li><li><p>The 50% penalty for warehouse queries was acceptable given their workload priorities</p></li></ul><h3><strong>Issue with the query planner: Outdated cost assumptions</strong></h3><p>The more I studied this, the more I realized <strong>the problem wasn't the index design - it was PostgreSQL's cost model being outdated for modern hardware:</strong></p><ul><li><p>The cost model assumed random I/O was 4x more expensive than sequential</p></li><li><p>Modern SSDs and the 1.5GB buffer pool made this assumption much less relevant</p></li><li><p>Bitmap heap scans performed better than the planner estimated on this hardware</p></li></ul><h2><strong>What I Learned and How I'd Optimize This System</strong></h2><h3><strong>1. Query Planner "Mistakes" Are Often Normal</strong></h3><p>My investigation taught me that query planners making suboptimal choices is more common than I expected, especially:</p><ul><li><p>On modern hardware (SSDs, large RAM)</p></li><li><p>With index designs that optimize for specific business patterns</p></li><li><p>When cost models haven't kept up with hardware evolution</p></li></ul><h3><strong>2. The Power of Testing Assumptions</strong></h3><p>I developed a methodical approach to test when the planner might be wrong, <strong>but learned to be careful with LIMIT queries</strong>:</p><pre><code><code>-- My testing approach for the warehouse queries
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM locations WHERE warehouse_id = '25';

-- Force index usage to compare
SET enable_seqscan = false;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM locations WHERE warehouse_id = '25';
RESET enable_seqscan;

-- But I learned LIMIT queries often favor sequential scans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM locations WHERE location_id LIKE 'H1%' LIMIT 100;</code></code></pre><h3><strong>3. How I'd Optimize: Supplemental Indexes</strong></h3><p>Based on my analysis, if I were optimizing this system, I'd add a supporting index to get the best of both worlds:</p><pre><code><code>-- Add this to optimize warehouse-specific queries
CREATE INDEX CONCURRENTLY idx_locations_warehouse_location 
ON locations (warehouse_id, location_id);
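
-- Then refresh statistics and confirm warehouse-specific queries
-- pick up the new index (a sketch; plan output will vary):
ANALYZE locations;
EXPLAIN (ANALYZE, BUFFERS)
SELECT location_id, warehouse_id FROM locations WHERE warehouse_id = '25';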
</code></code></pre><p>This would preserve the benefits of the location-first primary key while eliminating the 50% performance penalty for warehouse queries.</p><h3><strong>4. Cost Model Tuning Experiments I'd Try</strong></h3><p>I realized this system would benefit from tuning PostgreSQL's cost model to reflect modern hardware realities:</p><pre><code><code>-- Current settings that explain the planner behavior
SHOW random_page_cost;        -- 4.0 (assumes spinning disks!)
SHOW effective_cache_size;    -- 1502984kB (~1.5GB in our case)

-- Experiments I'd try:
SET random_page_cost = 1.1;   -- Reflect SSD performance characteristics
SET effective_cache_size = '4GB';  -- If more memory is available
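
-- Note: these SETs only affect the current session. To persist them
-- cluster-wide (assuming superuser access), something like:
-- ALTER SYSTEM SET random_page_cost = 1.1;
-- SELECT pg_reload_conf();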
</code></code></pre><p>Lowering <code>random_page_cost</code> would make index scans more attractive to the planner, since it would reduce the penalty for the "random" I/O operations that modern SSDs handle efficiently.</p><p><strong>References supporting these tuning approaches:</strong></p><ul><li><p><a href="https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server">PostgreSQL Wiki: Tuning Your PostgreSQL Server</a> - Recommends <code>random_page_cost = 1.1</code> for SSDs</p></li><li><p><a href="https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-RANDOM-PAGE-COST">PostgreSQL Documentation: Runtime Configuration</a> - Official guidance on cost parameters</p></li><li><p><a href="https://postgresqlco.nf/doc/en/param/random_page_cost/">PostgreSQL Performance Tuning Guide</a> - SSD optimization recommendations</p></li></ul><h2><strong>My Key Discoveries</strong></h2><ol><li><p><strong>Query planners aren't perfect, but they're not always wrong</strong> - PostgreSQL made poor choices for most of my tests but excellent choices for simple LIMIT queries</p></li><li><p><strong>Context matters enormously</strong> - Simple LIMIT queries favoured sequential scans (0.5ms vs 321ms), but LIMIT + OFFSET + filtering favoured indexes (204ms vs 436ms)</p></li><li><p><strong>Index design creates trade-offs</strong> - the <code>(location_id, warehouse_id)</code> design optimized cross-warehouse patterns but penalized warehouse isolation</p></li><li><p><strong>Cost models lag hardware</strong> - PostgreSQL's assumptions about I/O costs didn't match modern SSD and RAM characteristics in my testing</p></li><li><p><strong>Testing assumptions is crucial</strong> - when queries seemed slow, testing alternative execution strategies with <code>SET enable_seqscan = false</code> revealed significant performance gaps</p></li><li><p><strong>LIMIT queries favour sequential scans</strong> - when I only needed the first N results, sequential access often won 
dramatically</p></li></ol><p><strong>My Bottom Line</strong>: This inherited system wasn't "wrong" - PostgreSQL's query planner was being conservative based on outdated I/O cost assumptions. The primary key design reflected deliberate business optimization choices, and the performance gaps I observed could be addressed with supplemental indexes if needed.</p><div><hr></div><p><em>Testing performed on PostgreSQL 15+ with ~2.24M rows. Results may vary based on hardware, configuration, and data distribution.</em></p>]]></content:encoded></item></channel></rss>