The Outage I Almost Missed: Building Real Uptime Alerts
It was a Monday evening when I opened my laptop and found the tab sitting there: puzzle.kidsgamesapp.com — This site can't be reached.
Not a blip. A full outage. The app had been down for hours, and nothing had pinged me. No email, no Slack, nothing. We had Prometheus, Grafana, AlertManager, Loki — a full observability stack — and I still found out the old-fashioned way: by accidentally opening the browser tab.
That’s when I realized: monitoring and alerting are not the same thing.
What Actually Broke
Two completely unrelated things failed at the same time.
1. MongoDB in CrashLoopBackOff
The first issue was mongodb-0 — our StatefulSet pod — stuck in CrashLoopBackOff for over three hours. MongoDB itself was healthy. It started, loaded data, and began listening on port 27017. But Kubernetes kept killing it.
The culprit: health probe configuration. Our StatefulSet manifest didn’t set timeoutSeconds on either the readiness or liveness probe, so it defaulted to Kubernetes’ 1-second timeout. The probe command was:
```yaml
exec:
  command:
    - mongosh
    - --eval
    - "db.adminCommand('ping')"
```
mongosh — the modern MongoDB shell — takes more than 1 second to initialize, even for a trivial ping. On our t3.medium Spot nodes, it was consistently timing out. MongoDB was perfectly healthy; Kubernetes just couldn’t confirm it fast enough.
The fix was two lines:
```yaml
readinessProbe:
  timeoutSeconds: 5  # was missing (defaulted to 1)
livenessProbe:
  timeoutSeconds: 5  # was missing (defaulted to 1)
```
We deleted the crashing pod; it restarted with the new probe config, and MongoDB was back within 30 seconds.
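For context, here's what a fuller probe block looks like with the fix in place. Only timeoutSeconds comes from our actual change; periodSeconds and failureThreshold are shown at their Kubernetes defaults purely for illustration:

```yaml
readinessProbe:
  exec:
    command: ["mongosh", "--eval", "db.adminCommand('ping')"]
  timeoutSeconds: 5    # the fix; the 1s default is too short for mongosh startup
  periodSeconds: 10    # Kubernetes default, shown for illustration
  failureThreshold: 3  # Kubernetes default, shown for illustration
```

Setting the defaults explicitly also makes the manifest self-documenting the next time someone wonders how long Kubernetes will tolerate a slow probe.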
2. The Domain on clientHold
Even after MongoDB recovered, the site was still unreachable. ERR_NAME_NOT_RESOLVED — DNS wasn’t resolving at all.
The cluster was healthy. Ingress was healthy. The ELB was healthy. dig @8.8.8.8 puzzle.kidsgamesapp.com returned the correct IPs. But browsers couldn’t reach it. A whois lookup on the domain finally explained it:

```
Domain Status: clientHold https://icann.org/epp#clientHold
```
Amazon Registrar had placed kidsgamesapp.com on hold because the registrant email address wasn’t verified. clientHold completely disables DNS for the entire domain — not just one subdomain, but everything. puzzle., argocd., grafana., the marketing site. All of it, gone.
The fix: verify the email in the Route53 Registered Domains console. DNS propagated within minutes after that.
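For future reference, both halves of this failure are checkable from any laptop. A quick sketch (whois output formats vary by registrar):

```shell
# Registry status flags: clientHold shows up here even when the
# Route53 hosted zone itself looks perfectly healthy.
whois kidsgamesapp.com | grep -i 'domain status'

# What public resolvers currently see; cached answers can linger
# for a while after a hold is placed or lifted.
dig +short puzzle.kidsgamesapp.com @8.8.8.8
```

Checking whois, not just dig, is the lesson here: the registry can pull the rug out at a layer above anything your DNS records show.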
Why We Didn’t Get Notified
Here’s where it gets embarrassing. We already had AlertManager configured to send emails via AWS SES. We even had a Watchdog alert that fires constantly to prove the pipeline is working.
The alerts were firing. AlertManager was trying to send them. But every attempt failed:
```
554 Message rejected: Email address is not verified.
The following identities failed the check in region US-EAST-1: admin@kidsgamesapp.com
```
AWS SES runs in sandbox mode by default. In sandbox, you can only send email to addresses you’ve explicitly verified as identities — even if you’re the one receiving them. Our AlertManager was sending to admin@kidsgamesapp.com, which forwards to my Gmail via ImprovMX, but SES didn’t know that. From SES’s perspective, admin@kidsgamesapp.com was unverified.
The fix: verify admin@kidsgamesapp.com as an SES identity. Since ImprovMX forwards it to my Gmail, the verification email landed there and I clicked the link.
```shell
aws ses verify-email-identity \
  --email-address admin@kidsgamesapp.com \
  --region us-east-1
```
Within a minute of verifying, AlertManager drained its retry queue and level=INFO msg="Notify success" started appearing in the logs.
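To confirm an identity actually went through without waiting for the next alert, the standard AWS CLI has a status query (region assumed from the error above):

```shell
# Should report "VerificationStatus": "Success" once the link is clicked.
aws ses get-identity-verification-attributes \
  --identities admin@kidsgamesapp.com \
  --region us-east-1
```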
What We Built: Uptime Monitoring
With the immediate fires out, it was time to make sure we’d catch the next one. The gap was clear: we had no probe that checked whether the site was actually reachable from end to end.
The Stack
Our monitoring platform is built on kube-prometheus-stack — the battle-tested Helm chart that bundles Prometheus, Grafana, AlertManager, and all the Kubernetes-specific exporters. It runs in an observability namespace alongside Loki (log aggregation) and Promtail (log shipping from every pod).
What we were missing was Prometheus Blackbox Exporter — a separate exporter that probes external endpoints and exposes the results as Prometheus metrics.
Blackbox Exporter
The Blackbox Exporter is dead simple: you give it a URL and a probe module (http_2xx in our case), and it performs the request and publishes a probe_success metric (1 = up, 0 = down) along with response time, TLS expiry, HTTP status code, and more.
We deployed it via Helm into the observability namespace:
```shell
helm install blackbox-exporter \
  prometheus-community/prometheus-blackbox-exporter \
  -n observability \
  -f k8s/platform/blackbox-exporter/values.yaml
```
With a minimal values.yaml:
```yaml
nodeSelector:
  role: workloads
resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    memory: 64Mi
```
32 megabytes. That’s the entire cost of knowing your app is up.
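Once it's running, you can ask the exporter for a probe directly and see exactly what Prometheus will scrape. A sketch, assuming you port-forward the exporter's service (named per the Helm release) to localhost:

```shell
# Forward the exporter's service locally; adjust the service name
# if your Helm release name differs.
kubectl -n observability port-forward \
  svc/blackbox-exporter-prometheus-blackbox-exporter 9115:9115 &

# Ask the exporter to probe the app and return Prometheus-format
# metrics: probe_success, probe_duration_seconds, probe_http_status_code, ...
curl -s 'http://localhost:9115/probe?module=http_2xx&target=https://puzzle.kidsgamesapp.com'
```

This is also the fastest way to debug a probe module later: if the curl shows probe_success 0, the problem is in the target or the module config, not in Prometheus.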
Scrape Config
We added a scrape job to Prometheus that routes through the Blackbox Exporter:
```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  scrape_interval: 30s
  static_configs:
    - targets:
        - https://puzzle.kidsgamesapp.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter-prometheus-blackbox-exporter.observability.svc.cluster.local:9115
```
The relabeling is the clever bit: Prometheus hits the Blackbox Exporter’s /probe endpoint, passing the target URL as a query parameter. The exporter does the actual HTTP request and returns metrics. Prometheus stores them. Every 30 seconds.
The Alert Rule
```yaml
- alert: PuzzleAppDown
  expr: probe_success{job="blackbox-http"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "puzzle.kidsgamesapp.com is down"
    description: "The puzzle app has been unreachable for over 1 minute. Check pods, ingress, and DNS."
```
for: 1m with a 30-second scrape interval means probe_success has to stay at 0 across consecutive scrapes for a full minute before the alert fires. Fast enough to matter (roughly 90 seconds from the start of an outage to the email), slow enough to ignore transient network blips.
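That latency figure is just arithmetic, but it's worth sanity-checking. A minimal sketch (Python, not part of the stack) of the worst case:

```python
# Worst-case detection latency for the PuzzleAppDown alert.
# An outage can begin just after a successful probe, so up to one full
# scrape interval passes before the first failed probe is recorded; the
# "for:" window must then elapse with the probe still failing.
scrape_interval = 30  # seconds, from the blackbox-http scrape job
for_duration = 60     # seconds, from "for: 1m" in the PrometheusRule

worst_case = scrape_interval + for_duration
print(worst_case)  # 90 -- matching the ~90-second figure above
```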
The Full Alert Pipeline
```
Blackbox Exporter pod
  → probes puzzle.kidsgamesapp.com every 30s
  → Prometheus scrapes probe_success metric
  → PrometheusRule: PuzzleAppDown fires if probe_success == 0 for 1m
  → AlertManager routes to email receiver
  → AWS SES sends from alerts@kidsgamesapp.com
  → admin@kidsgamesapp.com (ImprovMX forwards to Gmail)
  → itay.gardi@gmail.com (KidsGames/Alerts label)
```
Zero external dependencies. Everything runs inside our EKS cluster. The exporter pod itself adds nothing to the bill, and SES sending costs a few cents per month.
Lessons Learned
Monitoring is not alerting. We had metrics, dashboards, and logs. What we didn’t have was a closed loop that notified a human when something was wrong. Adding alerting wasn’t hard — it was just the thing we kept deprioritizing because the app “felt stable.”
Default Kubernetes timeouts are aggressive. timeoutSeconds: 1 for health probes is fine for a Go binary that starts in 10ms. It’s not fine for anything that shells out to a runtime (mongosh, node, python). Always set timeouts explicitly in your manifests.
SES sandbox mode will bite you. If you’re using AWS SES for transactional email and you’re still in sandbox, verify every address in your send/receive chain before you need it. The verification takes 30 seconds. Finding out it’s broken during an incident takes much longer.
The domain hold was genuinely scary. A single unverified email address at the registrar took down every subdomain, the argocd dashboard, grafana, everything. ICANN’s clientHold is a nuclear option. Enable auto-renew and verify your registrant contact — it takes 2 minutes and it’s the cheapest insurance you’ll ever buy.
What’s Next
The Blackbox Exporter gives us probe_success, but it also exposes probe_http_status_code, probe_duration_seconds, probe_ssl_earliest_cert_expiry, and more. The next step is wiring those into Grafana for a proper uptime dashboard — and adding a CertExpiringSoon alert so we never get blindsided by a TLS expiry either.
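A CertExpiringSoon rule could be as small as the sketch below. This is not deployed yet; the 14-day threshold and the warning severity are assumptions, not decisions we've made:

```yaml
- alert: CertExpiringSoon
  # probe_ssl_earliest_cert_expiry is a Unix timestamp, so subtracting
  # time() gives seconds until the certificate expires.
  expr: probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 14 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"
```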
The monitoring stack is real. The alert pipeline is real. Tonight, if MongoDB crashes again, I’ll know within 90 seconds.
Total time from “site is down” to “alerts are working”: about 3 hours. Most of it was debugging, not building. The actual implementation — Blackbox Exporter, scrape config, PrometheusRule — took about 20 minutes.