<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>docker — saint</title>
    <link>https://avys.group.lt/saint/tag:docker</link>
    <description>scrapbook of a sysadmin</description>
    <pubDate>Sun, 03 May 2026 23:30:32 +0000</pubDate>
    <item>
      <title>How a Single Docker Tag Cost Us 51 Minutes of Downtime</title>
      <link>https://avys.group.lt/saint/how-a-single-docker-tag-cost-us-51-minutes-of-downtime</link>
      <description>Rebuilt a Docker image with the same tag. Kubernetes cached the old broken image. Pods crashed for 51 minutes. The fix? One line: imagePullPolicy: Always. Here&#39;s the full story.</description>
      <content:encoded><![CDATA[<h2 id="a-tale-of-kubernetes-image-caching-and-what-we-learned">A Tale of Kubernetes Image Caching and What We Learned</h2>

<p><strong>TL;DR</strong>: Rebuilt a Docker image with the same tag. Kubernetes cached the old broken image. Pods crashed for 51 minutes. The fix? One line: <code>imagePullPolicy: Always</code>. Here&#39;s the full story.</p>

<hr>

<h2 id="the-setup">The Setup</h2>

<p>It was a Sunday morning. We were upgrading BookWyrm (a federated social reading platform) from v0.8.1 to v0.8.2 on our Kubernetes cluster. The plan was simple:</p>
<ol><li>Update the version tag</li>
<li>Trigger the GitHub Actions workflow</li>
<li>Wait for the build</li>
<li>Deploy</li>
<li>Celebrate</li></ol>

<p>What could go wrong?</p>

<hr>

<h2 id="everything-goes-wrong">Everything Goes Wrong</h2>

<h3 id="9-20-am-the-deployment">9:20 AM: The Deployment</h3>

<p>I triggered the workflow. GitHub Actions spun up, built the Docker image, pushed it to the registry, and deployed to Kubernetes.</p>

<pre><code class="language-bash">✓ Build complete
✓ Image pushed: release-0.8.2
✓ Deployment applied
</code></pre>

<p>Looking good! I watched the pods start rolling out.</p>

<h3 id="9-24-am-the-crash">9:24 AM: The Crash</h3>

<pre><code>NAME                   READY   STATUS
web-86f4676f8b-zwgfs   0/1     CrashLoopBackOff
</code></pre>

<p>Every. Single. Pod. Crashed.</p>

<p>I pulled the logs:</p>

<pre><code class="language-python">ModuleNotFoundError: No module named &#39;bookwyrm&#39;
</code></pre>
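
<p>(If you need to do the same, <code>kubectl logs</code> with the <code>--previous</code> flag shows output from a pod&#39;s last crashed container.)</p>

<pre><code class="language-bash"># Logs from the previous (crashed) container instance
kubectl logs web-86f4676f8b-zwgfs --previous
</code></pre>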

<p>The <strong>entire BookWyrm application was missing</strong> from the container.</p>

<h3 id="the-investigation">The Investigation</h3>

<p>I dove into the Dockerfile. We had accidentally used the upstream <code>bookwyrm/Dockerfile</code> instead of our custom one. That Dockerfile only copied <code>requirements.txt</code> – not the actual application code.</p>

<pre><code class="language-dockerfile"># The broken Dockerfile
FROM python:3.10
COPY requirements.txt .
RUN pip install -r requirements.txt
# ... but WHERE&#39;S THE CODE? 😱
</code></pre>

<p>Classic. Easy fix!</p>

<hr>

<h2 id="the-first-fix-that-wasn-t">The First “Fix” (That Wasn&#39;t)</h2>

<h3 id="10-37-am-the-quick-fix">10:37 AM: The Quick Fix</h3>

<p>I created a fix commit that switched to the correct Dockerfile:</p>

<pre><code class="language-dockerfile"># The correct Dockerfile
FROM python:3.10
RUN git clone https://github.com/bookwyrm-social/bookwyrm .
RUN git checkout v0.8.2
RUN pip install -r requirements.txt
# Now we have the code!
</code></pre>

<p>I committed the changes... and forgot to push to GitHub.</p>

<p>Then I triggered the workflow again.</p>

<p>Naturally, GitHub Actions built from the old code (because I hadn&#39;t pushed). The broken image was rebuilt and redeployed.</p>

<p>Pods still crashing. Facepalm moment #1.</p>

<h3 id="10-55-am-actually-pushed-this-time">10:55 AM: Actually Pushed This Time</h3>

<p>I realized my mistake, pushed the commits, and triggered the workflow again.</p>

<p>This time the build actually used the fixed Dockerfile. I watched it clone BookWyrm, install dependencies, everything. The build logs looked perfect:</p>

<pre><code>#9 [ 5/10] RUN git clone https://github.com/bookwyrm-social/bookwyrm .
#9 0.153 Cloning into &#39;.&#39;...
#9 DONE 5.1s
</code></pre>

<p>Success! The image was built correctly and pushed.</p>

<p>I watched the pods roll out... and they crashed again.</p>

<pre><code class="language-python">ModuleNotFoundError: No module named &#39;bookwyrm&#39;
</code></pre>

<p><strong>The exact same error.</strong></p>

<p>This made no sense. The image was built correctly. I verified the build logs. The code was definitely in the image. What was happening?</p>

<hr>

<h2 id="the-real-problem">The Real Problem</h2>

<p>I checked what image the pods were actually running:</p>

<pre><code class="language-bash">kubectl get pod web-c98d458c4-x5p6z -o jsonpath=&#39;{.status.containerStatuses[0].imageID}&#39;
</code></pre>

<pre><code>ghcr.io/nycterent/ziurkes/bookwyrm@sha256:934ea0399adad...
</code></pre>

<p>Then I checked what digest we just pushed:</p>

<pre><code>release-0.8.2: digest: sha256:0a2242691956c24c687cc05d...
</code></pre>

<p><strong>Different digests.</strong> The pods were running the OLD image!</p>
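
<p>If you ever need to confirm this kind of drift, you can ask the registry directly what digest a tag currently points at. A minimal sketch, assuming <code>buildx</code> or <code>skopeo</code> is available locally:</p>

<pre><code class="language-bash"># Digest the registry currently serves for the tag (compare with the imageID above)
docker buildx imagetools inspect ghcr.io/nycterent/ziurkes/bookwyrm:release-0.8.2

# Same check with skopeo, if you prefer
skopeo inspect docker://ghcr.io/nycterent/ziurkes/bookwyrm:release-0.8.2 | jq -r &#39;.Digest&#39;
</code></pre>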

<h3 id="the-kubernetes-image-cache-trap">The Kubernetes Image Cache Trap</h3>

<p>Here&#39;s what I didn&#39;t know (but definitely know now):</p>

<p>When you specify an image in Kubernetes without <code>:latest</code>:</p>

<pre><code class="language-yaml">image: myregistry.com/app:v1.0.0
</code></pre>

<p>Kubernetes defaults to <code>imagePullPolicy: IfNotPresent</code>. This means:</p>
<ul><li>If the image tag exists locally on the node → <strong>use cached version</strong></li>
<li>If the image tag doesn&#39;t exist → pull from registry</li></ul>
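
<p>You can see which policy a pod actually ended up with, since Kubernetes fills in the default at pod creation. A quick check (assuming <code>kubectl</code> access to the namespace):</p>

<pre><code class="language-bash"># Prints &#34;IfNotPresent&#34; for a non-:latest tag unless the manifest set something else
kubectl get pod web-86f4676f8b-zwgfs -o jsonpath=&#39;{.spec.containers[0].imagePullPolicy}&#39;
</code></pre>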

<p>We rebuilt the image with the <strong>same tag</strong> (<code>release-0.8.2</code>). The node already had an image with that tag (the broken one). So Kubernetes said “great, I already have <code>release-0.8.2</code>” and used the cached broken image.</p>

<p>Even when I ran <code>kubectl rollout restart</code>, it created new pods... which immediately used the same cached image.</p>

<h3 id="why-this-happens">Why This Happens</h3>

<p>This behavior makes sense for immutable tags. If <code>release-0.8.2</code> is supposed to be immutable, there&#39;s no reason to re-pull it every time.</p>

<p>But we had <strong>mutated</strong> the tag by rebuilding it with the same name.</p>

<h3 id="but-wait-what-s-the-real-root-cause">But Wait – What&#39;s the REAL Root Cause?</h3>

<p>At this point, you might think “Ah, the root cause is image caching!”</p>

<p><strong>Not quite.</strong></p>

<p>The image caching is what <em>broke</em>. But the root cause is <strong>why could this happen in the first place?</strong></p>

<p>Root cause analysis isn&#39;t about what failed—it&#39;s about <strong>what we can change</strong> to prevent it from happening again.</p>

<p>The actual root causes:</p>
<ol><li><strong>No deployment validation</strong> – Nothing checked if our image contained application code</li>
<li><strong>No image management policy</strong> – We had no rules about tag reuse or <code>imagePullPolicy</code></li>
<li><strong>No process guardrails</strong> – Our workflow let us deploy untested changes to production</li>
<li><strong>No automated testing</strong> – No smoke tests, no staging environment, no safety net</li></ol>

<p>The wrong Dockerfile and the image caching were <em>symptoms</em>. The root cause was <strong>missing processes that would have caught these mistakes</strong>.</p>

<hr>

<h2 id="the-solution">The Solution</h2>

<p>The fix ended up being multi-part:</p>

<h3 id="1-migrate-to-harbor-registry">1. Migrate to Harbor Registry</h3>

<p>We consolidated all images into our Harbor registry instead of splitting them between GitHub Container Registry and Harbor. This gave us better control over image management.</p>

<h3 id="2-add-imagepullpolicy-always">2. Add imagePullPolicy: Always</h3>

<p>The critical fix in every deployment:</p>

<pre><code class="language-yaml">spec:
  containers:
    - name: web
      image: uostas/ziurkes/bookwyrm:release-0.8.2
      imagePullPolicy: Always  # ← This one line
</code></pre>

<p>With <code>imagePullPolicy: Always</code>, Kubernetes pulls the image every time, regardless of what&#39;s cached.</p>
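
<p>If you need to roll that out in a hurry, a strategic-merge patch saves editing manifests by hand; a sketch using the deployment and container names from above:</p>

<pre><code class="language-bash"># Add imagePullPolicy to the pod template; changing the template triggers a new rollout
kubectl patch deployment web --type strategic -p \
  &#39;{&#34;spec&#34;:{&#34;template&#34;:{&#34;spec&#34;:{&#34;containers&#34;:[{&#34;name&#34;:&#34;web&#34;,&#34;imagePullPolicy&#34;:&#34;Always&#34;}]}}}}&#39;
</code></pre>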

<h3 id="3-update-imagepullsecrets">3. Update imagePullSecrets</h3>

<p>Since we moved to Harbor, we needed to update the registry credentials:</p>

<pre><code class="language-yaml">imagePullSecrets:
  - name: uostas-registry  # Harbor credentials
</code></pre>
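
<p>The secret itself is a standard <code>docker-registry</code> secret; a sketch with placeholder Harbor credentials (the hostname and robot account below are made up):</p>

<pre><code class="language-bash"># Credentials for pulling from Harbor - server, user and token are placeholders
kubectl create secret docker-registry uostas-registry \
  --docker-server=harbor.example.com \
  --docker-username=robot-deployer \
  --docker-password=&#39;&lt;robot-account-token&gt;&#39;
</code></pre>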

<p>We deployed these changes and... 🎉</p>

<pre><code>NAME                     READY   STATUS    RESTARTS   AGE
web-5cd76dfd5b-qv4ln     1/1     Running   0          51s
celery-worker-...        1/1     Running   0          73s
celery-beat-...          1/1     Running   0          74s
flower-...               1/1     Running   0          65s
</code></pre>

<p>All pods healthy! Service restored!</p>

<hr>

<h2 id="lessons-learned">Lessons Learned</h2>

<h3 id="1-build-process-validation-prevention-detection">1. Build Process Validation (Prevention &gt; Detection)</h3>

<p><strong>The Real Lesson</strong>: We had no validation that our images contained working code.</p>

<p><strong>What we should have had:</strong></p>

<pre><code class="language-dockerfile"># In Dockerfile - fail build if app code missing
RUN test -f /app/bookwyrm/__init__.py || \
    (echo &#34;ERROR: BookWyrm code not found!&#34; &amp;&amp; exit 1)
</code></pre>

<pre><code class="language-yaml"># In deployment - fail pod startup if app broken
livenessProbe:
  exec:
    command: [&#34;python&#34;, &#34;-c&#34;, &#34;import bookwyrm&#34;]
</code></pre>

<p>If we&#39;d had these, the broken image would never have reached production.</p>
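
<p>A third layer worth adding: a CI smoke test that runs the freshly built image before it is pushed. This is a sketch of the idea, not our pipeline at the time; the image path is reused from the deployment above:</p>

<pre><code class="language-bash">IMAGE=uostas/ziurkes/bookwyrm:release-0.8.2

docker build -t &#34;$IMAGE&#34; .
# Refuse to push unless the application code is actually importable inside the image
docker run --rm &#34;$IMAGE&#34; python -c &#34;import bookwyrm&#34; \
  || { echo &#34;Smoke test failed: bookwyrm module missing&#34;; exit 1; }
docker push &#34;$IMAGE&#34;
</code></pre>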

<h3 id="2-image-management-policy-not-just-best-practices">2. Image Management Policy (Not Just Best Practices)</h3>

<p><strong>The Real Lesson</strong>: “Best practices” aren&#39;t enough – you need enforced policies.</p>

<p><strong>What we implemented:</strong></p>
<ul><li>✅ <strong>Required</strong>: <code>imagePullPolicy: Always</code> in all deployments</li>
<li>✅ <strong>Required</strong>: Images must go to Harbor registry (not ghcr.io)</li>
<li>✅ <strong>Recommended</strong>: Include git SHA in tags: <code>release-0.8.2-a1b2c3d</code></li>
<li>✅ <strong>Alternative</strong>: Pin to digest: <code>image@sha256:abc123...</code></li></ul>

<p>These aren&#39;t suggestions – they&#39;re now requirements in our deployment YAMLs.</p>
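
<p>In practice the last two options look something like this; a sketch, assuming a typical build script (the digest placeholder would come from your push logs):</p>

<pre><code class="language-bash"># Unique, traceable tag: release version plus short git SHA
TAG=&#34;release-0.8.2-$(git rev-parse --short HEAD)&#34;
docker build -t &#34;uostas/ziurkes/bookwyrm:${TAG}&#34; .
docker push &#34;uostas/ziurkes/bookwyrm:${TAG}&#34;

# Or pin the deployment to an exact digest instead of a mutable tag
kubectl set image deployment/web \
  web=uostas/ziurkes/bookwyrm@sha256:&lt;digest-from-the-push-logs&gt;
</code></pre>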

<h3 id="3-deployment-guardrails-make-mistakes-impossible">3. Deployment Guardrails (Make Mistakes Impossible)</h3>

<p><strong>The Real Lesson</strong>: Manual processes need automated checks.</p>

<p><strong>What we added:</strong></p>

<pre><code class="language-bash"># Pre-deployment checks (automated)
- Commits pushed to remote? ✅
- CI build passed? ✅
- Image exists at expected digest? ✅
- Staging environment healthy? ✅
</code></pre>

<p>Can&#39;t deploy to production without passing all checks.</p>
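
<p>A rough sketch of the first two checks as a script (registry path reused from above; the CI and staging checks depend on your tooling, so they are left as a comment):</p>

<pre><code class="language-bash">#!/usr/bin/env bash
set -euo pipefail

IMAGE=&#34;uostas/ziurkes/bookwyrm:release-0.8.2&#34;

# 1. Commits pushed? Local HEAD must match the upstream branch
git fetch --quiet
[ &#34;$(git rev-parse HEAD)&#34; = &#34;$(git rev-parse @{u})&#34; ] \
  || { echo &#34;Unpushed commits - aborting&#34;; exit 1; }

# 2. Image exists in the registry at the expected tag?
docker manifest inspect &#34;$IMAGE&#34; &gt; /dev/null \
  || { echo &#34;Image not found in registry - aborting&#34;; exit 1; }

# 3. CI build passed / staging healthy - query your CI and monitoring here

echo &#34;Pre-deployment checks passed&#34;
</code></pre>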

<h3 id="4-the-five-whys-actually-works">4. The “Five Whys” Actually Works</h3>

<p><strong>The incident:</strong></p>
<ul><li>Pods crashed → Why? Missing code</li>
<li>Missing code → Why? Wrong Dockerfile</li>
<li>Wrong Dockerfile → Why? Unclear which to use</li>
<li>Unclear → Why? Inadequate documentation</li>
<li>Inadequate docs → Why? <strong>No review process for critical changes</strong></li></ul>

<p>The root cause wasn&#39;t “wrong Dockerfile” – it was <strong>no process to prevent deploying wrong Dockerfiles</strong>.</p>

<h3 id="5-root-cause-vs-proximate-cause">5. Root Cause vs. Proximate Cause</h3>

<p><strong>Proximate causes</strong> (what broke):</p>
<ul><li>Used wrong Dockerfile</li>
<li>Reused image tag</li>
<li>Forgot to push commits</li></ul>

<p><strong>Root causes</strong> (what we can change):</p>
<ul><li>No validation of build artifacts</li>
<li>No image management policy</li>
<li>No deployment guardrails</li></ul>

<p><strong>Fix the proximate causes</strong>: You solve this incident.
<strong>Fix the root causes</strong>: You prevent the whole class of incidents.</p>

<hr>

<h2 id="the-cost">The Cost</h2>

<ul><li><strong>Downtime</strong>: 51 minutes (9:24 – 10:15 AM)</li>
<li><strong>Total investigation time</strong>: ~70 minutes</li>
<li><strong>Number of failed deployment attempts</strong>: 3</li>
<li><strong>Lesson learned</strong>: Priceless</li></ul>

<p>But seriously – this was a production outage for a social platform people rely on. 51 minutes of “sorry, we&#39;re down” is not acceptable.</p>

<hr>

<h2 id="prevention-checklist">Prevention Checklist</h2>

<p>Here&#39;s what we now do before every deployment:</p>

<h3 id="pre-deployment">Pre-Deployment</h3>
<ul><li>[ ] Changes committed <strong>and pushed</strong> to remote</li>
<li>[ ] CI build passed successfully</li>
<li>[ ] Image tag is unique (includes git SHA or build number)</li>
<li>[ ] Or: <code>imagePullPolicy: Always</code> is set</li>
<li>[ ] Smoke tests verify app code exists in image</li></ul>

<h3 id="during-deployment">During Deployment</h3>
<ul><li>[ ] Watch pod status (<code>kubectl get pods -w</code>)</li>
<li>[ ] Check logs immediately if crashes occur</li>
<li>[ ] Verify image digest matches what was built</li></ul>

<h3 id="post-deployment">Post-Deployment</h3>
<ul><li>[ ] All pods healthy</li>
<li>[ ] Health endpoints responding</li>
<li>[ ] Run database migrations if needed</li>
<li>[ ] Check error tracking (Sentry) for issues</li></ul>

<hr>

<h2 id="the-technical-details">The Technical Details</h2>

<p>For those who want to reproduce this behavior (in a safe environment!):</p>

<pre><code class="language-bash"># Build image v1
docker build -t myapp:v1.0.0 .
docker push myregistry.com/myapp:v1.0.0

# Deploy to Kubernetes
kubectl apply -f deployment.yaml
# Pods start with image from registry

# Now rebuild THE SAME TAG with different code
docker build -t myapp:v1.0.0 .  # Different code!
docker push myregistry.com/myapp:v1.0.0

# Try to redeploy
kubectl rollout restart deployment/myapp

# Pods will use CACHED image (old v1.0.0), not new one
# Because imagePullPolicy defaults to IfNotPresent
</code></pre>

<p>Fix it:</p>

<pre><code class="language-yaml">spec:
  template:
    spec:
      containers:
        - name: myapp
          image: myregistry.com/myapp:v1.0.0
          imagePullPolicy: Always  # Now it works!
</code></pre>

<hr>

<h2 id="resources">Resources</h2>
<ul><li><a href="https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy" rel="nofollow">Kubernetes Image Pull Policy Docs</a></li>
<li><a href="https://docs.docker.com/develop/dev-best-practices/" rel="nofollow">Docker Image Tagging Best Practices</a></li>
<li><a href="https://vsupalov.com/docker-latest-tag/" rel="nofollow">Why You Shouldn&#39;t Use :latest Tag</a></li></ul>

<hr>

<h2 id="conclusion">Conclusion</h2>

<p>A single line – <code>imagePullPolicy: Always</code> – would have prevented 51 minutes of downtime.</p>

<p>The silver lining? We learned this lesson in a relatively low-stakes environment, documented it thoroughly, and now have processes to prevent it from happening again.</p>

<p>And hopefully, by sharing this story, we&#39;ve saved someone else from the same headache.</p>

<p><strong>The next time you rebuild a Docker image with the same tag, remember this story. And add that one line.</strong></p>

<hr>

<p><em>Have you encountered similar Kubernetes caching issues? How did you solve them? Drop a comment on Mastodon.</em></p>

<hr>

<h2 id="update-migration-complete">Update: Migration Complete ✅</h2>

<p>After all pods came up healthy, we still needed to run database migrations for BookWyrm v0.8.2. Migration 0220 took about 10 minutes to complete (it was a large data migration). Once finished, the service was fully operational.</p>
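
<p>If you need to run those migrations in-cluster, one way is to exec into the web deployment; a sketch, assuming the standard Django layout BookWyrm ships with:</p>

<pre><code class="language-bash"># Run outstanding Django migrations inside a pod of the web deployment
kubectl exec deploy/web -- python manage.py migrate
</code></pre>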

<p><strong>Final timeline</strong>: 70 minutes from first crash to fully operational service.</p>

<hr>

<p><strong>Tags</strong>: <a href="/saint/tag:kubernetes" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">kubernetes</span></a> <a href="/saint/tag:docker" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">docker</span></a> <a href="/saint/tag:devops" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">devops</span></a> <a href="/saint/tag:incident" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">incident</span></a>-response <a href="/saint/tag:lessons" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">lessons</span></a>-learned <a href="/saint/tag:image" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">image</span></a>-caching <a href="/saint/tag:imagepullpolicy" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">imagepullpolicy</span></a> <a href="/saint/tag:bookwyrm" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">bookwyrm</span></a> <a href="/saint/tag:harbor" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">harbor</span></a>-registry <a href="/saint/tag:troubleshooting" class="hashtag" rel="nofollow"><span>#</span><span class="p-category">troubleshooting</span></a></p>

<hr>

<p><em>This post is based on a real production incident on 2025-11-16. Names and some details have been preserved because documenting failures helps everyone learn.</em></p>

<p>Comment in the Fediverse <a href="https://avys.group.lt/@/saint@river.group.lt" class="u-url mention" rel="nofollow">@<span>saint@river.group.lt</span></a></p>
]]></content:encoded>
      <guid>https://avys.group.lt/saint/how-a-single-docker-tag-cost-us-51-minutes-of-downtime</guid>
      <pubDate>Mon, 17 Nov 2025 06:50:19 +0000</pubDate>
    </item>
  </channel>
</rss>