A Tale of Kubernetes Image Caching and What We Learned

TL;DR: Rebuilt a Docker image with the same tag. Kubernetes kept using the old, broken image from the node's cache. Pods crashed for 51 minutes. The fix? One line: imagePullPolicy: Always. Here's the full story.


The Setup

It was a Sunday morning. We were upgrading BookWyrm (a federated social reading platform) from v0.8.1 to v0.8.2 on our Kubernetes cluster. The plan was simple:

  1. Update the version tag
  2. Trigger the GitHub Actions workflow
  3. Wait for the build
  4. Deploy
  5. Celebrate

What could go wrong?


Everything Goes Wrong

9:20 AM: The Deployment

I triggered the workflow. GitHub Actions spun up, built the Docker image, pushed it to the registry, and deployed to Kubernetes.

✓ Build complete
✓ Image pushed: release-0.8.2
✓ Deployment applied

Looking good! I watched the pods start rolling out.

9:24 AM: The Crash

NAME                   READY   STATUS
web-86f4676f8b-zwgfs   0/1     CrashLoopBackOff

Every. Single. Pod. Crashed.

I pulled the logs:

ModuleNotFoundError: No module named 'bookwyrm'

The entire BookWyrm application was missing from the container.

The Investigation

I dove into the Dockerfile. We had accidentally used the upstream bookwyrm/Dockerfile instead of our custom one. That Dockerfile only copied requirements.txt – not the actual application code.

# The broken Dockerfile
FROM python:3.10
COPY requirements.txt .
RUN pip install -r requirements.txt
# ... but WHERE'S THE CODE? 😱

Classic. Easy fix!


The First “Fix” (That Wasn't)

10:37 AM: The Quick Fix

I created a fix commit that switched to the correct Dockerfile:

# The correct Dockerfile
FROM python:3.10
WORKDIR /app
RUN git clone https://github.com/bookwyrm-social/bookwyrm .
RUN git checkout v0.8.2
RUN pip install -r requirements.txt
# Now we have the code!

I committed the changes... and forgot to push to GitHub.

Then I triggered the workflow again.

Naturally, GitHub Actions built from the old code (because I hadn't pushed). The broken image was rebuilt and redeployed.

Pods still crashing. Facepalm moment #1.

10:55 AM: Actually Pushed This Time

I realized my mistake, pushed the commits, and triggered the workflow again.

This time the build actually used the fixed Dockerfile. I watched it clone BookWyrm, install dependencies, everything. The build logs looked perfect:

#9 [ 5/10] RUN git clone https://github.com/bookwyrm-social/bookwyrm .
#9 0.153 Cloning into '.'...
#9 DONE 5.1s

Success! The image was built correctly and pushed.

I watched the pods roll out... and they crashed again.

ModuleNotFoundError: No module named 'bookwyrm'

The exact same error.

This made no sense. The image was built correctly. I verified the build logs. The code was definitely in the image. What was happening?


The Real Problem

I checked what image the pods were actually running:

kubectl get pod web-c98d458c4-x5p6z -o jsonpath='{.status.containerStatuses[0].imageID}'
ghcr.io/nycterent/ziurkes/bookwyrm@sha256:934ea0399adad...

Then I checked what digest we just pushed:

release-0.8.2: digest: sha256:0a2242691956c24c687cc05d...

Different digests. The pods were running the OLD image!

The Kubernetes Image Cache Trap

Here's what I didn't know (but definitely know now):

When you specify an image in Kubernetes without :latest:

image: myregistry.com/app:v1.0.0

Kubernetes defaults to imagePullPolicy: IfNotPresent. This means:

  • If the image tag exists locally on the node → use cached version
  • If the image tag doesn't exist → pull from registry

We rebuilt the image with the same tag (release-0.8.2). The node already had an image with that tag (the broken one). So Kubernetes said “great, I already have release-0.8.2” and used the cached broken image.

Even when I ran kubectl rollout restart, it created new pods... which immediately used the same cached image.
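
Kubernetes fills in the default when the object is created, so you can check what policy a deployment is actually running with (the name web is from this incident; substitute your own):

kubectl get deployment web \
  -o jsonpath='{.spec.template.spec.containers[0].imagePullPolicy}'
# → IfNotPresent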

Why This Happens

This behavior makes sense for immutable tags. If release-0.8.2 is supposed to be immutable, there's no reason to re-pull it every time.

But we had mutated the tag by rebuilding it with the same name.
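
One guard against that kind of mutation (and an option we come back to below) is to reference the image by digest instead of by tag. A minimal sketch with a placeholder digest:

spec:
  containers:
    - name: web
      # A digest reference is immutable: even if someone rebuilds the tag,
      # this pod spec still points at exactly the bytes that were tested
      image: myregistry.com/myapp@sha256:<digest-from-your-push>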

But Wait – What's the REAL Root Cause?

At this point, you might think “Ah, the root cause is image caching!”

Not quite.

The image caching is what broke. But the root cause is the answer to a different question: why could this happen in the first place?

Root cause analysis isn't about what failed—it's about what we can change to prevent it from happening again.

The actual root causes:

  1. No deployment validation – Nothing checked if our image contained application code
  2. No image management policy – We had no rules about tag reuse or imagePullPolicy
  3. No process guardrails – Our workflow let us deploy untested changes to production
  4. No automated testing – No smoke tests, no staging environment, no safety net

The wrong Dockerfile and the image caching were symptoms. The root cause was missing processes that would have caught these mistakes.


The Solution

The fix ended up being multi-part:

1. Migrate to Harbor Registry

We consolidated all images into our Harbor registry instead of splitting them between GitHub Container Registry and Harbor. This gave us better control over image management.

2. Add imagePullPolicy: Always

The critical fix in every deployment:

spec:
  containers:
    - name: web
      image: uostas/ziurkes/bookwyrm:release-0.8.2
      imagePullPolicy: Always  # ← This one line

With imagePullPolicy: Always, the kubelet checks the registry every time a pod starts and pulls the image whenever the digest behind the tag has changed, regardless of what's cached on the node. Unchanged layers are still reused, so the overhead is small.
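
If you're mid-incident and don't want to edit YAML first, a one-off patch does the same thing (assuming a deployment and container both named web, as in our setup):

kubectl patch deployment web -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"web","imagePullPolicy":"Always"}]}}}}'
# Changing the pod template also triggers a fresh rollout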

3. Update imagePullSecrets

Since we moved to Harbor, we needed to update the registry credentials:

imagePullSecrets:
  - name: uostas-registry  # Harbor credentials
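
If you need to (re)create that secret, the stock kubectl helper is enough; everything in angle brackets is a placeholder for your own Harbor host, robot account, and namespace:

kubectl create secret docker-registry uostas-registry \
  --docker-server=<harbor-host> \
  --docker-username=<robot-account> \
  --docker-password=<robot-token> \
  --namespace=<namespace>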

We deployed these changes and... 🎉

NAME                     READY   STATUS    RESTARTS   AGE
web-5cd76dfd5b-qv4ln     1/1     Running   0          51s
celery-worker-...        1/1     Running   0          73s
celery-beat-...          1/1     Running   0          74s
flower-...               1/1     Running   0          65s

All pods healthy! Service restored!


Lessons Learned

1. Build Process Validation (Prevention > Detection)

The Real Lesson: We had no validation that our images contained working code.

What we should have had:

# In Dockerfile - fail build if app code missing
RUN test -f /app/bookwyrm/__init__.py || \
    (echo "ERROR: BookWyrm code not found!" && exit 1)
# In the deployment - restart the container if the app can't even be imported
livenessProbe:
  exec:
    command: ["python", "-c", "import bookwyrm"]

If we'd had these, the broken image would never have reached production.
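
Even without cluster-side probes, a throwaway container makes a cheap pre-deployment smoke test (assuming the image doesn't set an entrypoint that overrides the command):

# Exits non-zero if the application package is missing, stopping the pipeline right here
docker run --rm <registry>/bookwyrm:release-0.8.2 python -c "import bookwyrm; print('ok')"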

2. Image Management Policy (Not Just Best Practices)

The Real Lesson: “Best practices” aren't enough – you need enforced policies.

What we implemented:

  • Required: imagePullPolicy: Always in all deployments
  • Required: Images must go to Harbor registry (not ghcr.io)
  • Recommended: Include git SHA in tags: release-0.8.2-a1b2c3d
  • Alternative: Pin to digest: image@sha256:abc123...

These aren't suggestions – they're now requirements in our deployment YAMLs.
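
As a sketch of what the unique-tag requirement looks like in a GitHub Actions job (the exact variables and paths below are illustrative, not our verbatim workflow):

# bash step in CI: append the commit SHA so no two builds ever share a tag
IMAGE=uostas/ziurkes/bookwyrm
TAG="release-0.8.2-${GITHUB_SHA::7}"
docker build -t "$IMAGE:$TAG" .
docker push "$IMAGE:$TAG"
# A tag that never repeats means IfNotPresent can never hand you a stale image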

3. Deployment Guardrails (Make Mistakes Impossible)

The Real Lesson: Manual processes need automated checks.

What we added:

# Pre-deployment checks (automated)
- Commits pushed to remote? ✅
- CI build passed? ✅
- Image exists at expected digest? ✅
- Staging environment healthy? ✅

Can't deploy to production without passing all checks.
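
A rough sketch of the first two checks as a script (IMAGE and TAG are expected from the environment; the error messages are illustrative):

#!/usr/bin/env bash
set -euo pipefail
: "${IMAGE:?set IMAGE}" "${TAG:?set TAG}"

# 1. Commits pushed to remote?
git fetch origin --quiet
if [ -n "$(git log @{u}..HEAD --oneline)" ]; then
  echo "ERROR: unpushed commits – CI would build stale code" >&2
  exit 1
fi

# 2. Image exists in the registry at the expected tag? (needs a recent Docker CLI)
if ! docker manifest inspect "$IMAGE:$TAG" > /dev/null 2>&1; then
  echo "ERROR: $IMAGE:$TAG not found in registry" >&2
  exit 1
fi

# CI status and staging health checks live in the pipeline itself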

4. The “Five Whys” Actually Works

The incident:

  • Pods crashed → Why? Missing code
  • Missing code → Why? Wrong Dockerfile
  • Wrong Dockerfile → Why? Unclear which to use
  • Unclear → Why? Inadequate documentation
  • Inadequate docs → Why? No review process for critical changes

The root cause wasn't “wrong Dockerfile” – it was no process to prevent deploying wrong Dockerfiles.

5. Root Cause vs. Proximate Cause

Proximate causes (what broke):

  • Used wrong Dockerfile
  • Reused image tag
  • Forgot to push commits

Root causes (what we can change):

  • No validation of build artifacts
  • No image management policy
  • No deployment guardrails

Fix the proximate causes: You solve this incident. Fix the root causes: You prevent the whole class of incidents.


The Cost

  • Downtime: 51 minutes (9:24 – 10:15 AM)
  • Total investigation time: ~70 minutes
  • Number of failed deployment attempts: 3
  • Lesson learned: Priceless

But seriously – this was a production outage for a social platform people rely on. 51 minutes of “sorry, we're down” is not acceptable.


Prevention Checklist

Here's what we now do before every deployment:

Pre-Deployment

  • [ ] Changes committed and pushed to remote
  • [ ] CI build passed successfully
  • [ ] Image tag is unique (includes git SHA or build number)
  • [ ] Or: imagePullPolicy: Always is set
  • [ ] Smoke tests verify app code exists in image

During Deployment

  • [ ] Watch pod status (kubectl get pods -w)
  • [ ] Check logs immediately if crashes occur
  • [ ] Verify image digest matches what was built

Post-Deployment

  • [ ] All pods healthy (see the commands below)
  • [ ] Health endpoints responding
  • [ ] Run database migrations if needed
  • [ ] Check error tracking (Sentry) for issues
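
The first two checks boil down to a couple of commands (the deployment name and health URL are placeholders):

kubectl rollout status deployment/web      # blocks until the rollout is healthy or times out
kubectl get pods                           # eyeball READY and RESTARTS
curl -fsS https://<your-host>/<health-endpoint>   # exits non-zero on HTTP errors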

The Technical Details

For those who want to reproduce this behavior (in a safe environment!):

# Build image v1
docker build -t myregistry.com/myapp:v1.0.0 .
docker push myregistry.com/myapp:v1.0.0

# Deploy to Kubernetes
kubectl apply -f deployment.yaml
# Pods start with the image pulled from the registry

# Now rebuild THE SAME TAG with different code
docker build -t myregistry.com/myapp:v1.0.0 .  # Different code!
docker push myregistry.com/myapp:v1.0.0

# Try to redeploy
kubectl rollout restart deployment/myapp

# Pods will use the CACHED image (the old v1.0.0), not the new one,
# because imagePullPolicy defaults to IfNotPresent
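
You can catch this in the act from the pod events: on a node that already had the tag cached, the kubelet reports something like the line in the comment below instead of pulling.

kubectl describe pod <pod-name>
#   Normal  Pulled  ...  Container image "myregistry.com/myapp:v1.0.0" already present on machine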

Fix it:

spec:
  template:
    spec:
      containers:
        - name: myapp
          image: myregistry.com/myapp:v1.0.0
          imagePullPolicy: Always  # Now it works!
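
To confirm the fix actually took, compare the digest the pod is running against the digest your push reported (exactly the check we did during the incident):

kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].imageID}'
# Should now match the sha256 digest printed by `docker push`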


Conclusion

A single line – imagePullPolicy: Always – would have prevented 51 minutes of downtime.

The silver lining? We learned this lesson in a relatively low-stakes environment, documented it thoroughly, and now have processes to prevent it from happening again.

And hopefully, by sharing this story, we've saved someone else from the same headache.

The next time you rebuild a Docker image with the same tag, remember this story. And add that one line.


Have you encountered similar Kubernetes caching issues? How did you solve them? Drop a comment on Mastodon.


Update: Migration Complete ✅

After all pods came up healthy, we still needed to run database migrations for BookWyrm v0.8.2. Migration 0220 took about 10 minutes to complete (it was a large data migration). Once finished, the service was fully operational.

Final timeline: 70 minutes from first crash to fully operational service.


Tags: #kubernetes #docker #devops #incident-response #lessons-learned #image-caching #imagepullpolicy #bookwyrm #harbor-registry #troubleshooting


This post is based on a real production incident on 2025-11-16. Names and some details have been preserved because documenting failures helps everyone learn.

Comment in the Fediverse @saint@river.group.lt