saint

scrapbook of a sysadmin

Date: November 21, 2025 Author: Infrastructure Team @ River.group.lt Tags: Infrastructure, Open Source, Vendor Lock-in, Lessons Learned


TL;DR

Broadcom's acquisition of VMware (and Bitnami) resulted in the deprecation of free container images, affecting thousands of production deployments worldwide. Our Mastodon instance at river.group.lt was impacted, but we turned this crisis into an opportunity to build more resilient infrastructure. Here's what happened and what we learned.


The Wake-Up Call

On November 21st, 2025, while upgrading our Mastodon instance from v4.5.1 to v4.5.2, we discovered something concerning: several Elasticsearch pods were stuck in CrashLoopBackOff. The error was cryptic:

/bin/bash: line 1: sysctl: command not found

This wasn't a configuration issue or a bug in our deployment. This was the canary in the coal mine for a much larger industry-wide problem.

What Actually Happened

The Bitnami Story

If you've deployed anything on Kubernetes in the past few years, you've probably used Bitnami Helm charts. They were convenient, well-maintained, and free. The PostgreSQL chart, Redis chart, Elasticsearch chart—all trusted by thousands of organizations.

Then came the acquisitions: – May 2019: VMware acquired Bitnami – November 2023: Broadcom acquired VMware – August 28, 2025: Bitnami stopped publishing free Debian-based container images – September 29, 2025: All images moved to a read-only “legacy” repository

The new pricing? $50,000 to $72,000 per year for “Bitnami Secure” subscriptions.

Our Impact

Our entire Elasticsearch cluster was running on Bitnami images: – 4 Elasticsearch pods failing to start – Search functionality degraded – Running on unmaintained images with no security updates – Init containers expecting tools that no longer existed in the slimmed-down legacy images

But we weren't alone. This affected: – Major Kubernetes distributions – Thousands of Helm chart deployments – Production instances worldwide

The Detective Work

The debugging journey was educational:

  1. Pod events → Init container crashes
  2. Container logs → Missing sysctl command in debian:stable-slim
  3. Web research → Discovered the Bitnami deprecation
  4. Community investigation → Found Mastodon's response (new official chart)
  5. System verification → Realized our node already had correct kernel settings

The init container was trying to set vm.max_map_count=262144 for Elasticsearch, but: – The container image no longer included the required tools – Our node already had the correct settings – The init container was solving a problem that didn't exist

Classic case of inherited configuration outliving its purpose.
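If you suspect the same situation, a quick way to check what the kernel is actually set to (the pod name below is a placeholder):

# on the node itself
sysctl vm.max_map_count

# or from inside any pod on that node – this sysctl is not namespaced,
# so the value you see is the host's; Elasticsearch wants at least 262144
kubectl exec -n mastodon <elasticsearch-pod> -- cat /proc/sys/vm/max_map_count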

The Fix (and the Plan)

We took a two-phase approach:

Phase 1: Immediate Stabilization

What we did right away: 1. Disabled the unnecessary init container 2. Scaled down to single-node Elasticsearch (appropriate for our size) 3. Cleared old cluster state by deleting persistent volumes 4. Rebuilt the search index from scratch

Result: All systems operational within 2 hours, search functionality restored.
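For reference, the index rebuild itself boils down to a single tootctl invocation against the web deployment (the deployment name is the one used elsewhere on this blog):

# rebuild the search index from scratch
kubectl exec -n mastodon deployment/river-mastodon-web -- tootctl search deploy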

Phase 2: Strategic Migration

We didn't just patch the problem—we planned a proper solution:

Created comprehensive migration plan (MIGRATION-TO-NEW-CHART.md): – Migrate to official Mastodon Helm chart (removes all Bitnami dependencies) – Deploy OpenSearch instead of Elasticsearch (Apache 2.0 licensed) – Keep our existing DragonflyDB (we were already ahead of the curve!) – Timeline: Phased approach over next quarter

The new Mastodon chart removes bundled dependencies entirely, expecting you to provide your own: – PostgreSQL → CloudNativePG or managed service – Redis → DragonflyDB, Valkey, or managed service – Elasticsearch → OpenSearch or Elastic's official operator

This is actually better architecture—no magic, full control, and proper separation of concerns.

What We Learned

1. Vendor Lock-in Happens Gradually

We didn't consciously choose vendor lock-in. We just used convenient, well-maintained Helm charts. Before we knew it: – PostgreSQL: Bitnami – Redis: Bitnami – Elasticsearch: Bitnami

One vendor decision affected our entire stack.

New rule: Diversify dependency sources. Use official images where possible.

2. “Open Source” Doesn't Mean “Free Forever”

Recent examples of this pattern: – HashiCorp → IBM (Terraform moved to BSL license) – Redis → Redis Labs (licensing restrictions added) – Elasticsearch → Elastic NV (moved to SSPL) – Bitnami → Broadcom (deprecated free tier)

The pattern: Company acquisition → Business model change → Service monetization

New rule: For critical infrastructure, always have a migration plan ready.

3. Community Signals are Early Warnings

The Mastodon community started discussing this in August 2025. The official chart team had already removed Bitnami dependencies months before our incident. We could have been proactive instead of reactive.

New rule: Subscribe to community channels for critical dependencies. Monitor GitHub issues, Reddit discussions, and release notes.

4. Version Pinning Isn't Optional

We were using elasticsearch:8 instead of elasticsearch:8.18.0. When the vendor deprecated tags, we had no control over what :8 meant anymore.

New rule: Always pin to specific versions. Use image digests for critical services.
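A small sketch of digest pinning, assuming Elastic's official registry – resolve the digest behind a tag once, then reference it in your manifests:

# resolve the digest behind a tag
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.18.0
docker inspect --format '{{index .RepoDigests 0}}' \
  docker.elastic.co/elasticsearch/elasticsearch:8.18.0

# then pin the workload to it, e.g.
#   image: docker.elastic.co/elasticsearch/elasticsearch@sha256:<digest>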

5. Init Containers Need Regular Audits

Our init container was setting kernel parameters that: – Were already set on the host – May have been necessary years ago – Nobody had questioned recently

New rule: Audit init containers quarterly. Verify they're still necessary.
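One way to pull all of them up for review – list every init container image in the cluster:

kubectl get pods -A -o json | jq -r '
  .items[]
  | select(.spec.initContainers != null)
  | .metadata.namespace + "/" + .metadata.name + ": "
    + ([.spec.initContainers[].image] | join(", "))'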

The Bigger Picture

This incident is part of a broader trend in the cloud-native ecosystem:

The Consolidation Era: – Big Tech acquiring open-source companies – Monetization pressure from private equity – Shift from “community-first” to “enterprise-first”

The Community Response: – OpenTofu (Terraform fork) – Valkey (Redis fork) – OpenSearch (Elasticsearch fork) – New Mastodon chart (Bitnami-free)

The open-source community is resilient. When a vendor tries to close the garden, the community forks and continues.

Our Action Plan

Immediate (Done ✅)

  • [x] Fixed Elasticsearch crashes
  • [x] Restored search functionality
  • [x] Documented everything
  • [x] Created migration plan

Short-term

  • [ ] Add monitoring alerts for pod failures
  • [ ] Pin all container image versions
  • [ ] Deploy OpenSearch for testing

Long-term

  • [ ] Migrate to official Mastodon chart
  • [ ] Consider CloudNativePG for PostgreSQL
  • [ ] Regular dependency health audits

What You Should Do

If you're running infrastructure on Kubernetes:

1. Audit Your Dependencies

# Find all Bitnami images
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[].spec.containers[].image' | \
  grep bitnami | sort -u

2. Check Your Helm Charts

# List Helm releases whose rendered manifests reference Bitnami images
helm list --all-namespaces -o json | jq -r '.[] | "\(.name) \(.namespace)"' | \
  while read -r name ns; do
    helm get manifest "$name" -n "$ns" | grep -q bitnami && echo "$name ($ns)"
  done

3. Create Migration Plans

Don't panic-migrate. Create proper plans: – Document current state – Research alternatives – Test migrations in non-production – Schedule maintenance windows – Have rollback procedures ready

4. Learn from Our Mistakes

We've documented everything: – Migration plan: Step-by-step guide to official Mastodon chart – Retrospective: What went wrong and why – Lessons learned: Patterns to avoid vendor lock-in

Resources

If you're dealing with similar issues:

Bitnami Alternatives: – PostgreSQL: Official images, CloudNativePG – Redis: DragonflyDB, Valkey – Elasticsearch: OpenSearch, ECK

Mastodon Resources: – New Official Chart – Migration Guide

Community Discussion: – Bitnami Deprecation Issue – Reddit Discussion

Closing Thoughts

This incident reminded us of an important principle: Infrastructure should be boring. We want our database to just work, our cache to be reliable, and our search to be fast. We don't want vendor drama.

The irony? Bitnami made things “boring” by providing convenient, pre-packaged solutions. But convenience can become dependency. Dependency can become lock-in. And lock-in can become a crisis when business models change.

The path forward is clear: 1. Use official images where possible 2. Diversify dependency sources 3. Pin versions explicitly 4. Monitor community signals 5. Always have a Plan B

Our Mastodon instance at river.group.lt is now healthier than before. All pods are green, search is working, and we have a clear migration path to even better infrastructure.

Sometimes a crisis is just the push you need to build something more resilient.


Discussion

We'd love to hear your experiences: – Have you been affected by the Bitnami deprecation? – What alternatives are you using? – What lessons have you learned about vendor dependencies?


About the Author: This post is from the infrastructure team maintaining river.group.lt, a Mastodon instance running the glitch-soc fork. We believe in transparent operations and sharing knowledge with the community.

License: This post and associated migration documentation are published under CC BY-SA 4.0. Feel free to adapt for your own use.

Updates: – 2025-11-21: Initial publication – Search index rebuild completed successfully – All systems operational


P.S. – If you're running a Mastodon instance and need help with migration planning, reach out. We've documented everything and we're happy to help.

Comment in the Fediverse @saint@river.group.lt

A Tale of Kubernetes Image Caching and What We Learned

TL;DR: Rebuilt a Docker image with the same tag. Kubernetes cached the old broken image. Pods crashed for 51 minutes. The fix? One line: imagePullPolicy: Always. Here's the full story.


The Setup

It was a Sunday morning. We were upgrading BookWyrm (a federated social reading platform) from v0.8.1 to v0.8.2 on our Kubernetes cluster. The plan was simple:

  1. Update the version tag
  2. Trigger the GitHub Actions workflow
  3. Wait for the build
  4. Deploy
  5. Celebrate

What could go wrong?


Everything Goes Wrong

9:20 AM: The Deployment

I triggered the workflow. GitHub Actions spun up, built the Docker image, pushed it to the registry, and deployed to Kubernetes.

✓ Build complete
✓ Image pushed: release-0.8.2
✓ Deployment applied

Looking good! I watched the pods start rolling out.

9:24 AM: The Crash

NAME                   READY   STATUS
web-86f4676f8b-zwgfs   0/1     CrashLoopBackOff

Every. Single. Pod. Crashed.

I pulled the logs:

ModuleNotFoundError: No module named 'bookwyrm'

The entire BookWyrm application was missing from the container.

The Investigation

I dove into the Dockerfile. We had accidentally used the upstream bookwyrm/Dockerfile instead of our custom one. That Dockerfile only copied requirements.txt – not the actual application code.

# The broken Dockerfile
FROM python:3.10
COPY requirements.txt .
RUN pip install -r requirements.txt
# ... but WHERE'S THE CODE? 😱

Classic. Easy fix!


The First “Fix” (That Wasn't)

10:37 AM: The Quick Fix

I created a fix commit that switched to the correct Dockerfile:

# The correct Dockerfile
FROM python:3.10
WORKDIR /app
RUN git clone --branch v0.8.2 https://github.com/bookwyrm-social/bookwyrm .
RUN pip install -r requirements.txt
# Now we have the code!

I committed the changes... and forgot to push to GitHub.

Then I triggered the workflow again.

Naturally, GitHub Actions built from the old code (because I hadn't pushed). The broken image was rebuilt and redeployed.

Pods still crashing. Facepalm moment #1.

10:55 AM: Actually Pushed This Time

I realized my mistake, pushed the commits, and triggered the workflow again.

This time the build actually used the fixed Dockerfile. I watched it clone BookWyrm, install dependencies, everything. The build logs looked perfect:

#9 [ 5/10] RUN git clone https://github.com/bookwyrm-social/bookwyrm .
#9 0.153 Cloning into '.'...
#9 DONE 5.1s

Success! The image was built correctly and pushed.

I watched the pods roll out... and they crashed again.

ModuleNotFoundError: No module named 'bookwyrm'

The exact same error.

This made no sense. The image was built correctly. I verified the build logs. The code was definitely in the image. What was happening?


The Real Problem

I checked what image the pods were actually running:

kubectl get pod web-c98d458c4-x5p6z -o jsonpath='{.status.containerStatuses[0].imageID}'
ghcr.io/nycterent/ziurkes/bookwyrm@sha256:934ea0399adad...

Then I checked what digest we just pushed:

release-0.8.2: digest: sha256:0a2242691956c24c687cc05d...

Different digests. The pods were running the OLD image!

The Kubernetes Image Cache Trap

Here's what I didn't know (but definitely know now):

When you specify an image in Kubernetes without :latest:

image: myregistry.com/app:v1.0.0

Kubernetes defaults to imagePullPolicy: IfNotPresent. This means:

  • If the image tag exists locally on the node → use cached version
  • If the image tag doesn't exist → pull from registry

We rebuilt the image with the same tag (release-0.8.2). The node already had an image with that tag (the broken one). So Kubernetes said “great, I already have release-0.8.2” and used the cached broken image.

Even when I ran kubectl rollout restart, it created new pods... which immediately used the same cached image.

Why This Happens

This behavior makes sense for immutable tags. If release-0.8.2 is supposed to be immutable, there's no reason to re-pull it every time.

But we had mutated the tag by rebuilding it with the same name.

But Wait – What's the REAL Root Cause?

At this point, you might think “Ah, the root cause is image caching!”

Not quite.

The image caching is what broke. But the root cause is: why could this happen in the first place?

Root cause analysis isn't about what failed—it's about what we can change to prevent it from happening again.

The actual root causes:

  1. No deployment validation – Nothing checked if our image contained application code
  2. No image management policy – We had no rules about tag reuse or imagePullPolicy
  3. No process guardrails – Our workflow let us deploy untested changes to production
  4. No automated testing – No smoke tests, no staging environment, no safety net

The wrong Dockerfile and the image caching were symptoms. The root cause was missing processes that would have caught these mistakes.


The Solution

The fix ended up being multi-part:

1. Migrate to Harbor Registry

We consolidated all images into our Harbor registry instead of splitting them between GitHub Container Registry and Harbor. This gave us better control over image management.

2. Add imagePullPolicy: Always

The critical fix in every deployment:

spec:
  containers:
    - name: web
      image: uostas/ziurkes/bookwyrm:release-0.8.2
      imagePullPolicy: Always  # ← This one line

With imagePullPolicy: Always, Kubernetes pulls the image every time, regardless of what's cached.

3. Update imagePullSecrets

Since we moved to Harbor, we needed to update the registry credentials:

imagePullSecrets:
  - name: uostas-registry  # Harbor credentials

We deployed these changes and... 🎉

NAME                     READY   STATUS    RESTARTS   AGE
web-5cd76dfd5b-qv4ln     1/1     Running   0          51s
celery-worker-...        1/1     Running   0          73s
celery-beat-...          1/1     Running   0          74s
flower-...               1/1     Running   0          65s

All pods healthy! Service restored!


Lessons Learned

1. Build Process Validation (Prevention > Detection)

The Real Lesson: We had no validation that our images contained working code.

What we should have had:

# In Dockerfile - fail build if app code missing
RUN test -f /app/bookwyrm/__init__.py || \
    (echo "ERROR: BookWyrm code not found!" && exit 1)
# In deployment - fail pod startup if app broken
livenessProbe:
  exec:
    command: ["python", "-c", "import bookwyrm"]

If we'd had these, the broken image would never have reached production.

2. Image Management Policy (Not Just Best Practices)

The Real Lesson: “Best practices” aren't enough – you need enforced policies.

What we implemented:

  • Required: imagePullPolicy: Always in all deployments
  • Required: Images must go to Harbor registry (not ghcr.io)
  • Recommended: Include git SHA in tags: release-0.8.2-a1b2c3d
  • Alternative: Pin to digest: image@sha256:abc123...

These aren't suggestions – they're now requirements in our deployment YAMLs.

3. Deployment Guardrails (Make Mistakes Impossible)

The Real Lesson: Manual processes need automated checks.

What we added:

# Pre-deployment checks (automated)
- Commits pushed to remote? ✅
- CI build passed? ✅
- Image exists at expected digest? ✅
- Staging environment healthy? ✅

Can't deploy to production without passing all checks.
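A rough sketch of what such a gate can look like as a script – tool choices and names here are illustrative, covering the "pushed" and "image exists" checks:

#!/bin/bash
# pre-deploy gate – a sketch under assumptions, not our exact tooling
set -euo pipefail
IMAGE="${1:?usage: predeploy.sh <registry/image:tag>}"

# commits pushed to remote?
[ -z "$(git status --porcelain)" ] || { echo "✗ uncommitted changes"; exit 1; }
git fetch -q origin
[ "$(git rev-parse HEAD)" = "$(git rev-parse '@{u}')" ] || { echo "✗ HEAD not pushed"; exit 1; }

# image exists in the registry?
docker manifest inspect "$IMAGE" >/dev/null || { echo "✗ $IMAGE not in registry"; exit 1; }

echo "✓ gate passed for $IMAGE"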

4. The “Five Whys” Actually Works

The incident: – Pods crashed → Why? Missing code – Missing code → Why? Wrong Dockerfile – Wrong Dockerfile → Why? Unclear which to use – Unclear → Why? Inadequate documentation – Inadequate docs → Why? No review process for critical changes

The root cause wasn't “wrong Dockerfile” – it was no process to prevent deploying wrong Dockerfiles.

5. Root Cause vs. Proximate Cause

Proximate causes (what broke): – Used wrong Dockerfile – Reused image tag – Forgot to push commits

Root causes (what we can change): – No validation of build artifacts – No image management policy – No deployment guardrails

Fix the proximate causes: You solve this incident. Fix the root causes: You prevent the whole class of incidents.


The Cost

Downtime: 51 minutes (9:24 – 10:15 AM) Total investigation time: ~70 minutes Number of failed deployment attempts: 3 Lesson learned: Priceless

But seriously – this was a production outage for a social platform people rely on. 51 minutes of “sorry, we're down” is not acceptable.


Prevention Checklist

Here's what we now do before every deployment:

Pre-Deployment

  • [ ] Changes committed and pushed to remote
  • [ ] CI build passed successfully
  • [ ] Image tag is unique (includes git SHA or build number)
  • [ ] Or: imagePullPolicy: Always is set
  • [ ] Smoke tests verify app code exists in image

During Deployment

  • [ ] Watch pod status (kubectl get pods -w)
  • [ ] Check logs immediately if crashes occur
  • [ ] Verify image digest matches what was built
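For that last item, this prints what each pod is actually running, to compare against the digest that docker push printed (namespace is illustrative):

kubectl get pods -n bookwyrm -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'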

Post-Deployment

  • [ ] All pods healthy
  • [ ] Health endpoints responding
  • [ ] Run database migrations if needed
  • [ ] Check error tracking (Sentry) for issues

The Technical Details

For those who want to reproduce this behavior (in a safe environment!):

# Build image v1
docker build -t myregistry.com/myapp:v1.0.0 .
docker push myregistry.com/myapp:v1.0.0

# Deploy to Kubernetes
kubectl apply -f deployment.yaml
# Pods start with image from registry

# Now rebuild THE SAME TAG with different code
docker build -t myregistry.com/myapp:v1.0.0 .  # Different code!
docker push myregistry.com/myapp:v1.0.0

# Try to redeploy
kubectl rollout restart deployment/myapp

# Pods will use CACHED image (old v1.0.0), not new one
# Because imagePullPolicy defaults to IfNotPresent

Fix it:

spec:
  template:
    spec:
      containers:
        - name: myapp
          image: myregistry.com/myapp:v1.0.0
          imagePullPolicy: Always  # Now it works!



Conclusion

A single line – imagePullPolicy: Always – would have prevented 51 minutes of downtime.

The silver lining? We learned this lesson in a relatively low-stakes environment, documented it thoroughly, and now have processes to prevent it from happening again.

And hopefully, by sharing this story, we've saved someone else from the same headache.

The next time you rebuild a Docker image with the same tag, remember this story. And add that one line.


Have you encountered similar Kubernetes caching issues? How did you solve them? Drop a comment on Mastodon.


Update: Migration Complete ✅

After all pods came up healthy, we still needed to run database migrations for BookWyrm v0.8.2. Migration 0220 took about 10 minutes to complete (it was a large data migration). Once finished, the service was fully operational.

Final timeline: 70 minutes from first crash to fully operational service.


Tags: #kubernetes #docker #devops #incident-response #lessons-learned #image-caching #imagepullpolicy #bookwyrm #harbor-registry #troubleshooting


This post is based on a real production incident on 2025-11-16. Names and some details have been preserved because documenting failures helps everyone learn.

Comment in the Fediverse @saint@river.group.lt

Environment: Kubernetes, Helm, glitch-soc v4.5.1

Problem

Default character limit: 500

Investigation

Checked glitch-soc documentation. Character limits are configurable via MAX_TOOT_CHARS environment variable.

Verified chart template handling:

$ grep -r "extraEnvVars" templates/
templates/configmap-env.yaml:  {{- range $k, $v := .Values.mastodon.extraEnvVars }}
templates/configmap-env.yaml:  {{ $k }}: {{ quote $v }}

Chart iterates over mastodon.extraEnvVars and renders into ConfigMap. Deployments load via envFrom.

Configuration

# values-river.yaml
mastodon:
  extraEnvVars:
    MAX_TOOT_CHARS: "42069"

Pre-deployment Verification

$ helm template river-mastodon . -f values-river.yaml | grep MAX_TOOT_CHARS
MAX_TOOT_CHARS: "42069"

Template renders correctly.

Deployment

$ helm upgrade river-mastodon . -n mastodon -f values-river.yaml
Release "river-mastodon" has been upgraded. Happy Helming!
REVISION: 167

$ kubectl rollout status deployment/river-mastodon-web -n mastodon
deployment "river-mastodon-web" successfully rolled out

$ kubectl rollout status deployment/river-mastodon-sidekiq-all-queues -n mastodon
deployment "river-mastodon-sidekiq-all-queues" successfully rolled out

$ kubectl rollout status deployment/river-mastodon-streaming -n mastodon
deployment "river-mastodon-streaming" successfully rolled out

Post-deployment Verification

$ kubectl exec -n mastodon deployment/river-mastodon-web -- env | grep MAX_TOOT_CHARS
MAX_TOOT_CHARS=42069

$ kubectl get pods -n mastodon | grep river-mastodon-web
river-mastodon-web-67586b449d-r5v2q   1/1   Running   0   32s

Result

Character limit: 500 → 42069 Downtime: 0s Issues: None

Notes for Other Admins

Works with standard Mastodon Helm chart. The extraEnvVars pattern:

  1. Add to values file
  2. Chart renders into ConfigMap
  3. Pods load via envFrom
  4. Rolling update applies change

No chart modifications needed.


Deployed on river.group.lt

Comment in the Fediverse @saint@river.group.lt

Published: November 13, 2025 Author: River Instance Team Reading Time: 8 minutes


The Mission

Today we upgraded our Mastodon instance (river.group.lt) from version 4.5.0 to 4.5.1. While this might sound like a routine patch update, we used it as an opportunity to make our infrastructure more secure and our deployment process more automated. Here's what we learned along the way.


Why Upgrade?

When glitch-soc (our preferred Mastodon variant) released version 4.5.1, we reviewed the changelog and found 10 bug fixes, including:

  • Better keyboard navigation in the Alt text modal
  • Fixed issues with quote posts appearing as “unquotable”
  • Improved filter application in detailed views
  • Build fixes for ARM64 architecture

More importantly: no database migrations, no breaking changes, and no new features that could introduce instability. This is what we call a “safe upgrade” – the perfect candidate for improving our processes while updating.


The Starting Point

Our Mastodon setup isn't quite standard. We run:

  • glitch-soc variant (Mastodon fork with extra features)
  • Custom Docker images with Sentry monitoring baked in
  • Kubernetes deployment via Helm charts
  • AMD64 architecture (important for cross-platform builds)

This means we can't just pull the latest official image – we need to rebuild our custom images with each new version.


The Problem We Solved

Before this upgrade, our build process looked like this:

# Find Harbor registry credentials (where?)
# Copy-paste username and password
docker login registry.example.com
# Enter credentials manually
# Update version in 4 different files
# Hope they all match
./build.sh
# Wait for builds to complete
# Manually verify everything worked

The issues: – Credentials stored in shell history (security risk) – Manual steps prone to typos – No automation = easy to forget steps – Credentials sitting in ~/.docker/config.json unencrypted

We knew we could do better.


The Solution: Infisical Integration

Infisical is a secrets management platform – think of it as a secure vault for credentials that your applications can access automatically. Instead of storing Harbor registry credentials on our laptop, we:

  1. Stored credentials in Infisical (one-time setup)
  2. Updated our build script to fetch credentials automatically
  3. Automated the Docker login process

Now our build script looks like this:

#!/bin/bash
set -e

VERSION="v4.5.1"
REGISTRY="registry.example.com/library"
PROJECT_ID="<your-infisical-project-id>"

echo "🔑 Logging in to Harbor registry..."
# Fetch credentials from Infisical
HARBOR_USERNAME=$(infisical secrets get \
  --domain https://secrets.example.com/api \
  --projectId ${PROJECT_ID} \
  --env prod HARBOR_USERNAME \
  --silent -o json | jq -r '.[0].secretValue')

HARBOR_PASSWORD=$(infisical secrets get \
  --domain https://secrets.example.com/api \
  --projectId ${PROJECT_ID} \
  --env prod HARBOR_PASSWORD \
  --silent -o json | jq -r '.[0].secretValue')

# Automatic login
echo "${HARBOR_PASSWORD}" | docker login ${REGISTRY} \
  --username "${HARBOR_USERNAME}" --password-stdin

# Build and push images...

Note: Code examples use placeholder values. Replace registry.example.com, secrets.example.com, and <your-infisical-project-id> with your actual infrastructure endpoints.

The benefits: – ✅ No credentials in shell history – ✅ No manual copy-pasting – ✅ Audit trail of when credentials were accessed – ✅ Easy credential rotation – ✅ Works the same on any machine with Infisical access


The Upgrade Process

With our improved automation in place, the actual upgrade was straightforward:

Step 1: Research

We used AI assistance to research the glitch-soc v4.5.1 release: – Confirmed it was a patch release (low risk) – Verified no database migrations required – Reviewed all 10 bug fixes – Checked for breaking changes (none found)

Lesson: Always research before executing. 15 minutes of reading can prevent hours of rollback.

Step 2: Update Version References

We needed to update the version in exactly 4 places:

  1. docker-assets/build.sh – Build script version variable
  2. docker-assets/Dockerfile.mastodon-sentry – Base image version
  3. docker-assets/Dockerfile.streaming-sentry – Streaming image version
  4. values-river.yaml – Helm values for both image tags

Lesson: Keep a checklist of version locations. It's easy to miss one.

Step 3: Build Custom Images

cd docker-assets
./build.sh

The script now: – Fetches credentials from Infisical ✓ – Logs into Harbor registry ✓ – Builds both images with --platform linux/amd64 ✓ – Pushes to registry ✓ – Provides clear success/failure messages ✓

Build time: ~5 seconds (thanks to Docker layer caching!)

Step 4: Deploy to Kubernetes

cd ..
helm upgrade river-mastodon . -n mastodon -f values-river.yaml

Helm performed a rolling update: – Old pods kept running while new ones started – New pods pulled v4.5.1 images – Old pods terminated once new ones were healthy – Zero downtime for our users

Step 5: Verify

kubectl exec -n mastodon deployment/river-mastodon-web -- tootctl version
# Output: 4.5.1+glitch

All three pod types (web, streaming, sidekiq) now running the new version. Success! 🎉


What We Learned

1. Automation Compounds Over Time

The Infisical integration took about 60 minutes to implement. The actual version bump took 30 minutes. That might seem like overkill for a “simple” upgrade.

But here's the math: – Manual process: 5 minutes per build to manage credentials – Automated process: 0 minutes – Builds per year: ~20 upgrades and tests – Time saved annually: 100 minutes – Payback period: 12 builds (~6 months)

Plus, we eliminated a security risk. The real value isn't just time – it's confidence and safety.

2. Separate Upstream from Custom

We keep the upstream Helm chart (Chart.yaml) completely untouched. Our customizations live in: – Custom Dockerfiles (add Sentry) – Values overrides (values-river.yaml) – Build scripts

Why this matters: We can pull upstream chart updates without conflicts. Our changes are additive, not modifications.

3. Test Incrementally

We didn't just run the full build and hope it worked. We tested:

  1. ✓ Credential retrieval from Infisical
  2. ✓ JSON parsing with jq
  3. ✓ Docker login with retrieved credentials
  4. ✓ Image builds
  5. ✓ Image pushes to registry
  6. ✓ Kubernetes deployment
  7. ✓ Running version verification

Each step validated before moving forward. When something broke (initial credential permissions), we caught it immediately.

4. Documentation Is for Future You

We wrote a comprehensive retrospective covering: – What went well – What we learned – What we'd do differently next time – Troubleshooting guides for common issues

In 6 months when we upgrade to v4.6.0, we'll thank ourselves for this documentation.

5. Version Numbers Tell a Story

Understanding semantic versioning helps assess risk:

  • v4.5.0 → v4.5.1 = Patch release (bug fixes only, low risk)
  • v4.5.x → v4.6.0 = Minor release (new features, moderate risk)
  • v4.x.x → v5.0.0 = Major release (breaking changes, high risk)

This informed our decision to proceed quickly with minimal testing.


What We'd Do Differently Next Time

Despite the success, we identified improvements:

High Priority

1. Validate credentials before building

Currently, we discover authentication failures during the image push (after building). Better:

# Test login BEFORE building
if ! docker login ...; then
  echo "❌ Auth failed"
  exit 1
fi

2. Initialize Infisical project config

Running infisical init in the project directory creates a .infisical.json file, eliminating the need for --projectId flags in every command.

3. Add version consistency checks

A simple script to verify all 4 files have matching versions before building would catch human errors.
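A sketch of such a check, using the four files listed earlier (adjust paths and the expected string to your layout):

# fail fast if any of the version-bearing files disagrees
EXPECTED="v4.5.1"
for f in docker-assets/build.sh \
         docker-assets/Dockerfile.mastodon-sentry \
         docker-assets/Dockerfile.streaming-sentry \
         values-river.yaml; do
  grep -q "$EXPECTED" "$f" || { echo "✗ $f does not reference $EXPECTED"; exit 1; }
done
echo "✓ all files reference $EXPECTED"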

Medium Priority

4. Automated deployment verification

Replace manual kubectl checks with a script that: – Waits for pods to be ready – Extracts running version – Compares to expected version – Reports success/failure
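A sketch of that verification, using the deployment names from this post:

EXPECTED="4.5.1"
kubectl rollout status deployment/river-mastodon-web -n mastodon --timeout=300s
RUNNING=$(kubectl exec -n mastodon deployment/river-mastodon-web -- tootctl version)
case "$RUNNING" in
  "$EXPECTED"*) echo "✓ running $RUNNING" ;;
  *) echo "✗ expected $EXPECTED, got $RUNNING"; exit 1 ;;
esac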

5. Dry-run mode for build script

Test the script logic without actually building or pushing images. Useful for testing changes to the script itself.


The Impact

Before this session: – Manual credential management – 5+ minutes per build for login – Credentials in shell history (security risk) – No audit trail

After this session: – Automated credential retrieval – 0 minutes per build for login – Credentials never exposed (security improvement) – Full audit trail in Infisical – Repeatable process documented

Plus: We're running Mastodon v4.5.1 with 10 bug fixes, making our instance more stable for our users.


Lessons for Other Mastodon Admins

If you run a Mastodon instance, here's what we learned that might help you:

For Small Instances

Even if you're running standard Mastodon without customizations:

  1. Document your upgrade process – Your future self will thank you
  2. Test in staging first – If you don't have staging, test with dry-run/simulation
  3. Always check release notes – 5 minutes of reading prevents hours of debugging
  4. Use semantic versioning to assess risk – Patch releases are usually safe

For Custom Deployments

If you run custom images like we do:

  1. Separate upstream from custom – Keep modifications isolated and additive
  2. Automate credential management – Shell history is not secure storage
  3. Use Docker layer caching – Speeds up builds dramatically
  4. Platform flags matter – use --platform linux/amd64 if deploying to a different architecture
  5. Verify the running version – Don't assume deployment worked, check it

For Kubernetes Deployments

If you deploy to Kubernetes:

  1. Rolling updates are your friend – Zero downtime is achievable
  2. Helm revisions enable easy rollback – helm rollback is simple and fast (see the sketch after this list)
  3. Verify pod image versions – Check what's actually running, not just deployed
  4. Monitor during rollout – Watch pod status, don't just fire and forget
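A minimal rollback sketch with our release name (the target revision number is illustrative):

helm history river-mastodon -n mastodon
helm rollback river-mastodon 165 -n mastodon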

The Numbers

Session Duration: 90 minutes total – Research: 15 minutes – Version updates: 10 minutes – Infisical integration: 60 minutes – Build & deploy: 5 minutes

Deployment Stats: – Downtime: 0 seconds (rolling update) – Pods affected: 3 (web, streaming, sidekiq) – Helm revision: 166 – Rollback complexity: Low (single command)

Lines of code changed: 18 lines across 4 files Lines of documentation written: 629 lines (retrospective) Security improvements: 1 major (credential management)


Final Thoughts

What started as a simple patch upgrade turned into a significant infrastructure improvement. The version bump was almost trivial – the real work was automating away manual steps and eliminating security risks.

This is what good ops work looks like: using routine maintenance as an opportunity to make systems better. The 60 minutes we spent on Infisical integration will pay dividends on every future build. The documentation we wrote will help the next person (or future us) upgrade with confidence.

Mastodon v4.5.1 is running smoothly, our build process is more secure, and we learned lessons that will make the next upgrade even smoother.


Resources

For Mastodon Admins: – Mastodon Upgrade Documentation – glitch-soc Releases

For Infrastructure: – Infisical (Secrets Management) – Docker Build Best Practices – Helm Upgrade Documentation

Our Instance: river.group.lt – Live Mastodon instance – Running glitch-soc v4.5.1+glitch – Kubernetes + Helm deployment – Custom images with Sentry monitoring


Questions?

If you're running a Mastodon instance and have questions about: – Upgrading glitch-soc variants – Custom Docker image workflows – Kubernetes deployments – Secrets management with Infisical – Zero-downtime upgrades

Feel free to reach out! We're happy to share what we've learned.


Tags: #mastodon #glitch-soc #kubernetes #devops #infrastructure #security #automation


This blog post is part of our infrastructure documentation series. We believe in sharing knowledge to help others running similar systems. All technical details are from our actual upgrade session on November 13, 2025.

Comment in the Fediverse @saint@river.group.lt

We recently upgraded our Mastodon instance from v4.4.4 to v4.5.0, bumping the Helm chart from 6.5.3 to 6.6.0. While the upgrade itself was straightforward, we encountered an interesting challenge that's worth sharing.

What's New in Mastodon v4.5.0?

The v4.5.0 release brings some exciting features:

  • Quote Posts – Full support for authoring and displaying quotes
  • 🔄 Dynamic Reply Fetching – Better conversation threading in the web UI
  • 🚫 Username Blocking – Server-wide username filtering
  • 🎨 Custom Emoji Overhaul – Complete rendering system rewrite
  • 📊 Enhanced Moderation Tools – Improved admin/moderator interface
  • Performance Improvements – Optimized database queries

The Architecture Gotcha

Everything seemed perfect during the upgrade process. We:

  1. Merged the upstream chart cleanly
  2. Updated our custom configurations (LibreTranslate, Sentry integration)
  3. Built new Docker images with Sentry monitoring
  4. Pushed to our registry

But when we deployed, pods started crashing with a cryptic error:

exec /usr/local/bundle/bin/bundle: exec format error

The Root Cause

The issue? Architecture mismatch. We built our Docker images on an ARM64 Mac (Apple Silicon), but our Kubernetes cluster runs on AMD64 (x86_64) nodes. The images were perfectly valid—just for the wrong architecture!

The Fix

The solution was simple but important:

docker build --platform linux/amd64 \
  -f Dockerfile.mastodon-sentry \
  -t registry.example.com/mastodon-sentry:v4.5.0 \
  . --push

By explicitly specifying --platform linux/amd64, Docker builds images compatible with our cluster architecture, even when building on ARM64 hardware.

We updated our build script to always include this flag, preventing future issues:

# Build for AMD64 (cluster architecture)
docker build --platform linux/amd64 \
  -f Dockerfile.mastodon-sentry \
  -t ${REGISTRY}/mastodon-sentry:${VERSION} \
  . --push

Deployment Results

After rebuilding with the correct architecture:

  • ✅ All pods running healthy (web, streaming, sidekiq)
  • ✅ Elasticsearch cluster rolled out successfully
  • ✅ PostgreSQL remained stable
  • ✅ Zero data loss, minimal downtime
  • ✅ All customizations preserved (LibreTranslate, Sentry, custom log levels)

Deployment Stats: – Total time: ~2 hours (including troubleshooting) – Downtime: Minimal (rolling update)

Lessons Learned

  1. Always specify target platform when building images for deployment, especially in cross-architecture development environments
  2. Build scripts should be architecture-aware to prevent silent failures
  3. Test deployments catch issues early – the error appeared immediately during pod startup
  4. Keep customizations isolated – Our values-river.yaml approach made the upgrade smooth

What's Next?

We still need to reindex Elasticsearch for the new search features:

kubectl exec -n mastodon deployment/river-mastodon-web -- \
  tootctl search deploy

This will update search indices for all accounts, statuses, and tags to take advantage of v4.5.0's improved search capabilities.

Key Takeaway

Modern development often happens on ARM64 Macs while production runs on AMD64 servers. Docker's multi-platform build support makes this seamless—but only if you remember to use it! A simple --platform flag saved us from a much longer debugging session.
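If you regularly build on Apple Silicon, another option is to publish a multi-arch image so either node type can pull it – a sketch, assuming a buildx builder with both platforms enabled:

# build and push a manifest covering both architectures
docker buildx build --platform linux/amd64,linux/arm64 \
  -f Dockerfile.mastodon-sentry \
  -t ${REGISTRY}/mastodon-sentry:${VERSION} \
  --push .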

Happy federating! 🐘


Resources: – Mastodon v4.5.0 Release Notes – Docker Multi-Platform Builds – Mastodon Helm Chart


Our instance runs on Kubernetes with custom Sentry monitoring, LibreTranslate integration, and an alternative Redis implementation. All chart templates remain identical to upstream, with customizations isolated in our values file.

Comment in the Fediverse @saint@river.group.lt

Monitoring best practices

There are things that really irk my day, and monitoring alerts that are not actionable (i.e. purely informational) are on that list, so I'm writing this blog post to improve the signal.

  • Any alert you receive from a monitoring system should be actionable, and that action must not be snoozable or otherwise dismissible.

You get an alert – you work, bitch.

Comment in the Fediverse @saint@river.group.lt

How to use Borgmatic to backup PostgreSQL in Kubernetes

There are many goodies on the internets, but not much good documentation to make using them frictionless. Here is a short writeup on backing up a Mastodon instance database running on Kubernetes with Borgmatic, but it could be used for any generic database and/or file path supported by Borg backup and Borgmatic.

For the destination/target repository I am using Borgbase, but it should work with any Borg-compatible ssh repo.

There are some issues that I might solve in the future – namely, storing sensitive information in a safer way – but for the moment I just wanted to make backups work. If you do it, please let me know and I will update my setup.

First things first, we need: – Kubernetes – Mastodon deployment in its own namespace – Setup SSH keys – Setup Healthchecks – Setup Borgbase – Install Borgmatic via Helm Chart, using our values

Kubernetes

You should be running something already.

Mastodon deployment

You should have deployed Mastodon already in its own namespace. Instructions should work for any generic deployment that uses PostgreSQL or any other supported database.

Setup SSH keys

You need to generate an ssh key and save both parts – private and public. The private part goes into our Borgmatic setup, the public one into Borgbase.

ssh-keygen -t ed25519 -f /tmp/id_ed25519

This can be done on any computer that has the ssh-keygen utility. /tmp/id_ed25519 will contain the private part and /tmp/id_ed25519.pub the public. You don't need the files themselves, just their contents.

Setup Healthchecks

This is optional, but do it if you do not use Borgbase – get a https://healthchecks.io/ account; it is free for a small number of checks. Borgbase has the same functionality built in, so you can skip this step if you chose Borgbase.

Setup Borgbase

Again – you can use whatever repository server supports Borg, but I found that Borgbase has all the features I need at a cheap price. Add the public part of the ssh key under Keys and make sure the repository is tied to that key. Save the repository address.

Install Borgmatic via Helm Chart, using our values

I have used this chart – https://charts.gabe565.com/charts/borgmatic/ – it is great stuff, I just had some issues figuring out how it works. I am kind of a slow learner – I make a lot of assumptions that turn out to be wrong most of the time. The trick was to find good values for values.yaml:

#
# IMPORTANT NOTE
#
# This chart inherits from our common library chart. You can check the default values/options here:
# https://github.com/bjw-s/helm-charts/blob/main/charts/library/common/values.yaml
#

image:
  # -- image repository
  repository: ghcr.io/borgmatic-collective/borgmatic
  # -- image pull policy
  pullPolicy: IfNotPresent
  # -- image tag
  tag: 1.7.14

controller:
  # -- Set the controller type. Valid options are `deployment` or `cronjob`.
  type: deployment
  cronjob:
    # -- Only used when `controller.type: cronjob`. Sets the backup CronJob time.
    schedule: 0 * * * *
    # -- Only used when `controller.type: cronjob`. Sets the CronJob backoffLimit.
    backoffLimit: 0

# -- environment variables. [[ref]](https://borgbackup.readthedocs.io/en/stable/usage/general.html#environment-variables)
# @default -- See [values.yaml](./values.yaml)
env:
  # -- Borg host ID used in archive names
  # @default -- Deployment namespace
  BORG_HOST_ID: ""
  BORG_PASSPHRASE: ---GENERATE THIS WITH pwgen OR SIMILAR TOOL---
  PGPASSWORD: ---PostgresSQL db password---
  # MYSQL_PWD:
  # MONGODB_PASSWORD:

persistence:
  # -- Configure persistence settings for the chart under this key.
  # @default -- See [values.yaml](./values.yaml)
  data:
    enabled: false
    retain: true
    # storageClass: ""
    # accessMode: ReadWriteOnce
    # size: 1Gi
    subPath:
      - path: borg-repository
        mountPath: /mnt/borg-repository
      - path: config
        mountPath: /root/.config/borg
      - path: cache
        mountPath: /root/.cache/borg
  # -- Configure SSH credentials for the chart under this key.
  # @default -- See [values.yaml](./values.yaml)
  ssh:
    name: borgmatic-ssh 
    enabled: true
    type: configMap
    mountPath: /root/.ssh/
    readOnly: false
    defaultMode: 0600

configMaps:
  # -- Configure Borgmatic container under this key.
  # @default -- See [values.yaml](./values.yaml)
  ssh:
    enabled: true
    data:
      id_ed25519: |
        -----BEGIN OPENSSH PRIVATE KEY-----
        --- PASTE YOUR PRIVATE KEY HERE---
        -----END OPENSSH PRIVATE KEY-----
      known_hosts: |
        --- paste the output of ssh-keyscan borg-repository-address ---
      
  config:
    enabled: true
    data:
      # -- Crontab
      crontab.txt: |-
        0 1 * * * PATH=$PATH:/usr/bin /usr/local/bin/borgmatic --stats -v 0 2>&1
      # -- Borgmatic config. [[ref]](https://torsion.org/borgmatic/docs/reference/configuration)
      # @default -- See [values.yaml](./values.yaml)
      config.yaml: |
        location:
          # List of source directories to backup.
          source_directories:
            - /etc  # any directory you want; we only care about the db dump here

          # Paths of local or remote repositories to backup to.
          repositories:
            - ---BORG REPOSITORY URL---

        retention:
          # Retention policy for how many backups to keep.
          keep_daily: 7
          keep_weekly: 4
          keep_monthly: 6

        consistency:
          # List of checks to run to validate your backups.
          checks:
            - name: repository
            - name: archives
              frequency: 2 weeks

        hooks:


          # Databases to dump and include in backups.
          postgresql_databases:
            - name: mastodon_production
              hostname: hostname-of-mastodon-postgresql-db
              username: mastodon

          # Third-party services to notify you if backups aren't happening.
          healthchecks: --- healthcheck url ---

Then install the chart, initialize the repository and run the first backup:

helm install borgmatic gabe565/borgmatic -f values.yaml -n mastodon
kubectl rollout status deployment.apps/borgmatic -n mastodon
kubectl exec -i deployment.apps/borgmatic -n mastodon -- borgmatic init --encryption repokey-blake2
kubectl exec -it deployment.apps/borgmatic -n mastodon -- borgmatic create --stats

All of these commands should be completed without errors
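To double-check that archives are actually landing in the repository, you can list them from the Borgmatic pod:

kubectl exec -it deployment.apps/borgmatic -n mastodon -- borgmatic list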

Comment in the Fediverse @saint@river.group.lt

Caddyfile for running Lemmy

How to follow Lemmy community from Mastodon

Spent a couple of hours.. I wanted to follow a Lemmy community on the group.lt instance from Mastodon. Here is a working config for Caddy (Caddyfile):

group.lt {
	reverse_proxy	http://lemmy_lemmy-ui_1:1234
	tls saint@ghost.lt

	@lemmy {
		path	/api/*
		path	/pictrs/*
		path	/feeds/*
		path	/nodeinfo/*
		path	/.well-known/*
	}

	@lemmy-hdr {
		header Accept application/*
	}

	@lemmy-post {
		method POST
	}

	handle @lemmy {
		reverse_proxy	http://lemmy_lemmy_1:8536
	}

	handle @lemmy-hdr {
		reverse_proxy	http://lemmy_lemmy_1:8536
	}

	handle @lemmy-post {
		reverse_proxy	http://lemmy_lemmy_1:8536
	}
}

The key point here was

@lemmy-hdr {
	header Accept application/*
}

I have taken a hint from lemmy.coupou.fr

and from some nginx conf for lemmy

Comment in the Fediverse @saint@river.group.lt

Reclaiming space in synapse postgresql database

Follow the fat elephant

I received an alert from Grafana that my Synapse directory is almost full, which was kinda strange, as I had given it a 100GB partition just a couple of weeks ago. So I put on a hat, picked up some cider and something to smoke, and went on an adventure.

From the old times I knew that a PostgreSQL database can be shrunk using vacuumdb. Entered the container and boom – after 15 or so minutes it finished and reclaimed 100MB of space. Hmmm... Interesting – which table eats the space? Google, link

SELECT
    relname AS "relation",
    pg_size_pretty (
        pg_total_relation_size (C .oid)
    ) AS "total_size"
FROM
    pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C .relnamespace)
WHERE
    nspname NOT IN (
        'pg_catalog',
        'information_schema'
    )
AND C .relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY
    pg_total_relation_size (C .oid) DESC
LIMIT 5;
      relation      | total_size
--------------------+------------
 state_groups_state | 65 GB
 event_json         | 1197 MB
 event_edges        | 619 MB
 events             | 595 MB
 event_auth         | 528 MB

Alright!!! Google: state_groups_state, link, and found a compression tool.

git clone, crap a short docker-compose.yml and build the tool.

root@instance-20211112-2005:/opt/synapse-compress-state# cat docker-compose.yaml
---
version: "3.5"
services:
  synapse-compress:
      build:
        context: rust-synapse-compress-state/
      command: synapse_auto_compressor -p postgresql://user:pass@dbhost/dbname -c 500 -n 100
      networks:
          - synapse

networks:
        synapse:
                name: synapse

let's crap some more:

root@instance-20211112-2005:/opt/synapse# cat /opt/synapse-compress-state/run.sh
#!/bin/bash

cd /opt/synapse-compress-state/

docker-compose up

put it into crontab:

@daily /opt/synapse-compress-state/run.sh > /dev/null

Later I googled more and found some people smarter than me: shrink synapse database – that really helped, especially the reindexing.
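For completeness, a sketch of what the reindexing step can look like – the database and user names (both "synapse") are assumptions, and VACUUM FULL takes exclusive locks, so pick a quiet hour:

# run against the synapse postgres instance
psql -U synapse -d synapse -c 'REINDEX DATABASE synapse;'
psql -U synapse -d synapse -c 'VACUUM FULL state_groups_state;'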

Comment in the Fediverse @saint@river.group.lt

How to send mail using php mail() function in docker way

Using ssmtp to send email to the relay container

Sometimes we want to run an old dockerized PHP site that we do not want to touch, or the programmer is gone and nobody cares to make changes to use an email relay host such as Mailgun, Gmail or anything else. On a Linux VM or bare-metal server it is quite an easy task – you run the web server and the mail server two-in-one, and the mail server takes care of mail routing.

In a dockerized environment you usually want to run as few services as possible in each container, so sending mail using PHP's mail() function becomes tricky.

Let's create a docker-compose.yml file, containing all required containers:

caddy – web server

php – the PHP server for the app – the trick here is to use msmtp as sendmail so mail gets sent to a remote server (our mail container)

mail – smtp relay server, we will use postfix

the source code is here: docker-php-mail-example

the main thing is to use ssmtp on the php container and send the mail to the mail container
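A minimal sketch of the idea with msmtp (the post mentions both ssmtp and msmtp; all names and addresses below are illustrative, not taken from the linked repo) – point msmtp at the mail container and point PHP's sendmail at msmtp:

# inside the php image/container
cat > /etc/msmtprc <<'EOF'
account default
host mail
port 25
from noreply@example.com
auth off
tls off
EOF

# make PHP's mail() use msmtp (conf.d path as used by the official php image)
echo 'sendmail_path = "/usr/bin/msmtp -t"' > /usr/local/etc/php/conf.d/zz-mail.ini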

enjoy!

Comment in the Fediverse @saint@river.group.lt