Zero-Downtime Castopod Upgrade on Kubernetes

Upgrading a production podcast platform without dropping a single listener connection.

The Challenge

Our Castopod instance at kastaspuods.lt needed an upgrade from v1.13.7 to v1.13.8. Requirements:

  - Zero downtime – listeners are actively streaming podcasts
  - No data loss – the database contains all podcast metadata and analytics
  - Include our bug fix – v1.13.8 contains a fix we contributed for federated comments

The Strategy

1. Backup First, Always

Before touching anything, we ran a full backup using Borgmatic:

kubectl exec -n kastaspuods deploy/borgmatic -- borgmatic --stats

Result: 435MB database dumped, compressed to 199MB, shipped to Hetzner Storage Box.
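The dump-then-ship pipeline is driven by borgmatic's MySQL/MariaDB hook. A minimal sketch of the relevant config, assuming borgmatic's flat layout (version 1.8+); the repository URL, hostnames, and retention values below are placeholders, not our actual settings:

```yaml
# Sketch of a borgmatic config (flat layout, borgmatic >= 1.8).
# Repository URL and credentials are placeholders.
repositories:
  - path: ssh://uXXXXXX@uXXXXXX.your-storagebox.de:23/./castopod
    label: hetzner

# Dump the database before each archive is created
mysql_databases:
  - name: castopod
    hostname: mariadb

keep_daily: 7
keep_weekly: 4
```

With a hook like this, the single `borgmatic --stats` command above covers the dump, the Borg archive, and the off-site ship in one step.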

2. Pin Your Versions

Our deployment was using castopod/castopod:latest – a ticking time bomb. We changed to:

image: castopod/castopod:1.13.8

Explicit versions mean reproducible deployments and controlled upgrades.
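Tags can technically be re-pushed by the image publisher, so for even stricter reproducibility you can pin the digest alongside the tag. A sketch (the digest is a placeholder – take the real value from your registry or from `kubectl get pod -o jsonpath` on a running pod):

```yaml
containers:
  - name: castopod
    # Tag for readability, digest for immutability.
    # The sha256 value is a placeholder, not the real digest.
    image: castopod/castopod:1.13.8@sha256:<digest-from-registry>
    imagePullPolicy: IfNotPresent
```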

3. Rolling Update Strategy

The key to zero downtime is Kubernetes' RollingUpdate strategy:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

What this means:

  - maxUnavailable: 0 – never terminate an old pod until a new one is Ready
  - maxSurge: 1 – allow one extra pod during the rollout
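The two knobs translate directly into a pod budget. A quick back-of-the-envelope check for our 2-replica deployment:

```shell
# Pod budget implied by the RollingUpdate settings above
replicas=2
max_surge=1        # maxSurge: 1
max_unavailable=0  # maxUnavailable: 0

echo "peak pods during rollout: $((replicas + max_surge))"
echo "minimum ready pods:       $((replicas - max_unavailable))"
```

Peak 3, minimum 2 – the cluster briefly runs one extra pod but never dips below full serving capacity.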

With 2 replicas, the rollout proceeds:

  1. Spin up 1 new pod (now 3 total)
  2. Wait for the new pod to be Ready
  3. Terminate 1 old pod (back to 2)
  4. Repeat until all pods run the new image
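Step 2 is only as trustworthy as the readiness probe: with maxUnavailable: 0, Kubernetes keeps traffic on old pods until the new one actually answers. A sketch of the probe we'd pair with this strategy – the path and timings are illustrative, not Castopod's documented health endpoint:

```yaml
containers:
  - name: castopod
    ports:
      - containerPort: 8000
    # A pod only counts as Ready (and an old pod only terminates)
    # once this probe succeeds; path and timings are illustrative.
    readinessProbe:
      httpGet:
        path: /
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 3
```

Without a readiness probe, Kubernetes considers a pod Ready as soon as its container starts, which can route listeners to an app that is still warming up.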

4. Apply and Watch

kubectl apply -f app-deployment.yaml
kubectl rollout status deployment/app --timeout=180s

Total rollout time: ~90 seconds. Zero dropped connections.

5. Post-Upgrade Verification

CodeIgniter handles most post-upgrade tasks automatically. We verified:

kubectl exec deploy/app -- php spark migrate:status
kubectl exec deploy/app -- php spark cache:clear
kubectl exec deploy/redis -- redis-cli flushall

The Result

Metric        Value
Downtime      0 seconds
Rollout time  ~90 seconds
Data loss     None
Backup size   199MB compressed

Lessons Learned

  1. Backup before everything – Takes 60 seconds, saves hours of panic
  2. Pin versions explicitly – latest is not a version strategy
  3. Use maxUnavailable: 0 – the single most important setting for zero-downtime rollouts
  4. Keep YAML in sync with the cluster – our YAML said 1 replica, the cluster had 2
  5. Check upstream releases – Our bug report was fixed, no patching needed

The Bug That Got Fixed

We had reported Issue #577 – federated comments from Mastodon showed “Jan 1, 1970” due to a column mismatch in a UNION query. We patched it manually, reported upstream, and v1.13.8 includes the official fix.

Architecture

Traffic: Ingress -> Nginx (S3 proxy) -> Castopod:8000
                                              |
                                    MariaDB + Redis

Backup: Borgmatic -> mysqldump -> Borg -> Hetzner

kastaspuods.lt is a Lithuanian podcast hosting platform running on Kubernetes.