From Zero to Production EKS Platform in Two Sessions


It’s 11 PM on a Sunday and I can’t stop building. What started as “let’s deploy a kids game” turned into the most productive 48 hours of my 20-year engineering career.

The Challenge

I had a kids games app — a React + Bun monorepo with a 3D pool game and a puzzle game. It ran locally with Docker Compose. I wanted to deploy it properly. Not “throw it on a VPS” properly. Production-grade, scalable, reusable infrastructure that I could clone for every future project.

The kind of platform that normally takes a DevOps team months to build.

What We Built

In two evening sessions, working with an AI agent (Claude), we went from zero AWS infrastructure to a fully operational platform:

Session 1: The Platform

  • VPC with public/private subnets across 2 availability zones
  • EKS cluster with On-Demand platform nodes and Spot workload nodes
  • GitOps pipeline: push to main → GitHub Actions builds Docker images → pushes to ECR → updates infra repo → ArgoCD auto-deploys
  • SSL certificates via Let’s Encrypt (cert-manager)
  • Secrets management via AWS Secrets Manager + External Secrets Operator
  • MongoDB on EKS with daily backups to S3
  • Network policies and Pod Security Admission
  • DNS on Route53 with automatic subdomain routing
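
The GitOps pipeline above can be sketched as a GitHub Actions workflow. This is an illustrative outline, not our actual file: the image name, infra-repo script, and registry variable are placeholders, and the AWS credential setup is elided.

```yaml
# Illustrative CI workflow (names, paths, and the update script are placeholders;
# AWS credential configuration is omitted for brevity).
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to ECR
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        run: |
          docker build -t "$ECR_REGISTRY/kids-games:$GITHUB_SHA" .
          docker push "$ECR_REGISTRY/kids-games:$GITHUB_SHA"
      - name: Bump image tag in infra repo
        run: |
          # Commit the new tag to the infra repo; ArgoCD picks it up from there.
          ./scripts/update-image-tag.sh kids-games "$GITHUB_SHA"
```

The key design point: CI never talks to the cluster. It only writes the new image tag to the infra repo, and ArgoCD reconciles the cluster against that repo.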

The kids game was live at game.kidsgamesapp.com before midnight.

Session 2: Observability + LLM Integration

The next evening, we kept going:

  • Prometheus + Grafana — full metrics, custom dashboards, alert rules
  • Loki + Promtail — centralized log aggregation from every pod
  • Langfuse — LLM call tracing for our AI photo-to-cartoon feature
  • RDS PostgreSQL for Langfuse’s database
  • Custom Grafana dashboards — Cluster Overview and LLM Operations, deployed as code via GitOps
  • Langfuse SDK integration — every AI image transform is traced with model, latency, and status
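
On the app side, the SDK integration boils down to wrapping each image transform in a recorder that captures model, latency, and status. Here is a minimal sketch of that pattern in TypeScript; the traced helper, TraceRecord type, and in-memory store are illustrative stand-ins, not the actual Langfuse SDK API.

```typescript
// Illustrative tracing wrapper (not the real Langfuse API): records the
// model, latency, and success/error status of each wrapped async call.
type TraceRecord = {
  name: string;
  model: string;
  latencyMs: number;
  status: "success" | "error";
};

const records: TraceRecord[] = [];

// Wrap any async operation (e.g. an AI image transform) and record its outcome.
async function traced<T>(
  name: string,
  model: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    records.push({ name, model, latencyMs: Date.now() - start, status: "success" });
    return result;
  } catch (err) {
    records.push({ name, model, latencyMs: Date.now() - start, status: "error" });
    throw err; // the caller still sees the failure
  }
}
```

In production the record is shipped to Langfuse rather than kept in memory, so every photo-to-cartoon call shows up in the LLM Operations dashboard.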

The Architecture

User → Route53 → NLB → ingress-nginx → app pods
                                      → Grafana
                                      → Langfuse

CI/CD: git push → GitHub Actions → ECR → infra repo → ArgoCD → live

Everything runs on 3 nodes: 1 On-Demand for platform services, 2 Spot instances for workloads. Total cost: ~$200/month for a production-grade platform with full observability.
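
The On-Demand/Spot split is essentially a one-attribute difference per node group in Terraform. A minimal sketch, assuming the cluster, IAM role, and VPC module are defined elsewhere; instance types and sizes here are examples, not our exact values:

```hcl
# Illustrative Spot workload node group (cluster, role, and subnets are
# assumed to be defined elsewhere in the Terraform config).
resource "aws_eks_node_group" "workloads" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workloads-spot"
  node_role_arn   = aws_iam_role.nodes.arn
  subnet_ids      = module.vpc.private_subnets
  capacity_type   = "SPOT"          # the platform node group uses "ON_DEMAND"
  instance_types  = ["t3.large", "t3a.large"]

  scaling_config {
    desired_size = 2
    min_size     = 1
    max_size     = 4
  }
}
```

Keeping platform services (ingress, ArgoCD, monitoring) on On-Demand capacity means a Spot interruption can only evict workload pods, which Kubernetes reschedules automatically.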

Deploying a New App

The best part — adding a new app to this platform takes 5 steps:

  1. Add a Dockerfile to the app repo
  2. Copy the CI workflow
  3. Add ECR repo via Terraform
  4. Add ArgoCD application manifest
  5. Push — it deploys automatically
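
Step 4’s ArgoCD manifest looks roughly like this; the repo URL, paths, and namespace are placeholders:

```yaml
# Illustrative ArgoCD Application (repo URL, path, and namespace are placeholders).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kids-games
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra-repo.git
    targetRevision: main
    path: apps/kids-games
  destination:
    server: https://kubernetes.default.svc
    namespace: kids-games
  syncPolicy:
    automated:
      prune: true      # delete resources removed from the repo
      selfHeal: true   # revert manual drift back to the repo state
```

With automated sync enabled, step 5 really is just a push: ArgoCD notices the new image tag in the infra repo and rolls it out.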

We proved this works by deploying this very blog you’re reading as another app on the same cluster.

What I Learned

I’ve been building software for 20 years. I’ve led teams that spent months setting up infrastructure like this. The tools haven’t changed — Terraform, Kubernetes, Prometheus, and ArgoCD are the same ones my teams have always used. What changed is the velocity.

The AI didn’t replace my engineering judgment. It replaced the hours of typing, debugging YAML indentation, looking up Helm chart values, and waiting for Stack Overflow answers. I still made every architectural decision. I still reviewed every change. But instead of context-switching between 15 browser tabs, I stayed in flow.

The platform is real. The observability is real. The CI/CD is real. This isn’t a demo. It’s running in production right now, serving actual users.

What’s Next

  • Marketing site deployment (proving the multi-app pattern)
  • CloudFront CDN for static asset caching
  • More games and features for the kids platform
  • A custom mobile notification app for monitoring alerts

The infrastructure is built to grow. Every future project starts with git clone and a few config changes.


This post was auto-generated from our engineering session logs and published via the same GitOps pipeline it describes. It’s 11 PM and I still can’t stop building.