From Zero to Production EKS Platform in Two Sessions


It’s 11 PM on a Sunday and I can’t stop building. What started as “let’s deploy a kids game” turned into the most productive 48 hours of my 20-year engineering career.

The Challenge

I had a kids games app — a React + Bun monorepo with a 3D pool game and a puzzle game. It ran locally with Docker Compose. I wanted to deploy it properly. Not “throw it on a VPS” properly. Production-grade, scalable, reusable infrastructure that I could clone for every future project.

The kind of platform that normally takes a DevOps team months to build.

What We Built

In two evening sessions, working with an AI agent (Claude), we went from zero AWS infrastructure to a fully operational platform:

Session 1: The Platform

  • VPC with public/private subnets across 2 availability zones
  • EKS cluster with On-Demand platform nodes and Spot workload nodes
  • GitOps pipeline: push to main → GitHub Actions builds Docker images → pushes to ECR → updates infra repo → ArgoCD auto-deploys
  • SSL certificates via Let’s Encrypt (cert-manager)
  • Secrets management via AWS Secrets Manager + External Secrets Operator
  • MongoDB on EKS with daily backups to S3
  • Network policies and Pod Security Admission
  • DNS on Route53 with automatic subdomain routing
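
The GitOps pipeline above can be sketched as a GitHub Actions workflow. This is an illustrative outline, not our actual file: the image name, infra-repo script, and registry variable are placeholders, and the AWS credential setup is elided.

```yaml
# Illustrative CI workflow (names, paths, and the update script are placeholders;
# AWS credential configuration is omitted for brevity).
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to ECR
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        run: |
          docker build -t "$ECR_REGISTRY/kids-games:$GITHUB_SHA" .
          docker push "$ECR_REGISTRY/kids-games:$GITHUB_SHA"
      - name: Bump image tag in infra repo
        run: |
          # Commit the new tag to the infra repo; ArgoCD picks it up from there.
          ./scripts/update-image-tag.sh kids-games "$GITHUB_SHA"
```

The key design point: CI never talks to the cluster. It only writes the new image tag to the infra repo, and ArgoCD reconciles the cluster against that repo.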

The kids game was live at game.kidsgamesapp.com before midnight.

Session 2: Observability + LLM Integration

The next evening, we kept going:

  • Prometheus + Grafana — full metrics, custom dashboards, alert rules
  • Loki + Promtail — centralized log aggregation from every pod
  • Langfuse — LLM call tracing for our AI photo-to-cartoon feature
  • RDS PostgreSQL for Langfuse’s database
  • Custom Grafana dashboards — Cluster Overview and LLM Operations, deployed as code via GitOps
  • Langfuse SDK integration — every AI image transform is traced with model, latency, and status
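
On the app side, the SDK integration boils down to wrapping each image transform in a recorder that captures model, latency, and status. Here is a minimal sketch of that pattern in TypeScript; the traced helper, TraceRecord type, and in-memory store are illustrative stand-ins, not the actual Langfuse SDK API.

```typescript
// Illustrative tracing wrapper (not the real Langfuse API): records the
// model, latency, and success/error status of each wrapped async call.
type TraceRecord = {
  name: string;
  model: string;
  latencyMs: number;
  status: "success" | "error";
};

const records: TraceRecord[] = [];

// Wrap any async operation (e.g. an AI image transform) and record its outcome.
async function traced<T>(
  name: string,
  model: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    records.push({ name, model, latencyMs: Date.now() - start, status: "success" });
    return result;
  } catch (err) {
    records.push({ name, model, latencyMs: Date.now() - start, status: "error" });
    throw err; // the caller still sees the failure
  }
}
```

In production the record is shipped to Langfuse rather than kept in memory, so every photo-to-cartoon call shows up in the LLM Operations dashboard.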

The Architecture

User → Route53 → NLB → ingress-nginx → app pods
                                      → Grafana
                                      → Langfuse

CI/CD: git push → GitHub Actions → ECR → infra repo → ArgoCD → live

Everything runs on 3 nodes: 1 On-Demand for platform services, 2 Spot instances for workloads. Total cost: ~$200/month for a production-grade platform with full observability.
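
The On-Demand/Spot split is essentially a one-attribute difference per node group in Terraform. A minimal sketch, assuming the cluster, IAM role, and VPC module are defined elsewhere; instance types and sizes here are examples, not our exact values:

```hcl
# Illustrative Spot workload node group (cluster, role, and subnets are
# assumed to be defined elsewhere in the Terraform config).
resource "aws_eks_node_group" "workloads" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workloads-spot"
  node_role_arn   = aws_iam_role.nodes.arn
  subnet_ids      = module.vpc.private_subnets
  capacity_type   = "SPOT"          # the platform node group uses "ON_DEMAND"
  instance_types  = ["t3.large", "t3a.large"]

  scaling_config {
    desired_size = 2
    min_size     = 1
    max_size     = 4
  }
}
```

Keeping platform services (ingress, ArgoCD, monitoring) on On-Demand capacity means a Spot interruption can only evict workload pods, which Kubernetes reschedules automatically.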

Deploying a New App

The best part — adding a new app to this platform takes 5 steps:

  1. Add a Dockerfile to the app repo
  2. Copy the CI workflow
  3. Add ECR repo via Terraform
  4. Add ArgoCD application manifest
  5. Push — it deploys automatically
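
Step 4’s ArgoCD manifest looks roughly like this; the repo URL, paths, and namespace are placeholders:

```yaml
# Illustrative ArgoCD Application (repo URL, path, and namespace are placeholders).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kids-games
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra-repo.git
    targetRevision: main
    path: apps/kids-games
  destination:
    server: https://kubernetes.default.svc
    namespace: kids-games
  syncPolicy:
    automated:
      prune: true      # delete resources removed from the repo
      selfHeal: true   # revert manual drift back to the repo state
```

With automated sync enabled, step 5 really is just a push: ArgoCD notices the new image tag in the infra repo and rolls it out.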

We proved this works by deploying this very blog you’re reading as another app on the same cluster.

What I Learned

I’ve been building software for 20 years. I’ve led teams that spent months setting up infrastructure like this. The tools haven’t changed — Terraform, Kubernetes, Prometheus, and ArgoCD are the same ones my teams have always used. What changed is the velocity.

The AI didn’t replace my engineering judgment. It replaced the hours of typing, debugging YAML indentation, looking up Helm chart values, and waiting for Stack Overflow answers. I still made every architectural decision. I still reviewed every change. But instead of context-switching between 15 browser tabs, I stayed in flow.

The platform is real. The observability is real. The CI/CD is real. This isn’t a demo. It’s running in production right now, serving actual users.

What’s Next

  • Marketing site deployment (proving the multi-app pattern)
  • CloudFront CDN for static asset caching
  • More games and features for the kids platform
  • A custom mobile notification app for monitoring alerts

The infrastructure is built to grow. Every future project starts with git clone and a few config changes.


This post was auto-generated from our engineering session logs and published via the same GitOps pipeline it describes. It’s 11 PM and I still can’t stop building.