The Outage I Almost Missed: Building Real Uptime Alerts


It was a Monday evening when I opened my laptop and found the tab sitting there: puzzle.kidsgamesapp.com — This site can't be reached.

Not a blip. A full outage. The app had been down for hours, and nothing had pinged me. No email, no Slack, nothing. We had Prometheus, Grafana, AlertManager, Loki — a full observability stack — and I still found out the old-fashioned way: by accidentally opening the browser tab.

That’s when I realized: monitoring and alerting are not the same thing.

What Actually Broke

Two completely unrelated things failed at the same time.

1. MongoDB in CrashLoopBackOff

The first issue was mongodb-0 — our StatefulSet pod — stuck in CrashLoopBackOff for over three hours. MongoDB itself was healthy. It started, loaded data, and began listening on port 27017. But Kubernetes kept killing it.

The culprit: health probe configuration. Our StatefulSet manifest didn’t set timeoutSeconds on either the readiness or liveness probe, so it defaulted to Kubernetes’ 1-second timeout. The probe command was:

exec:
  command:
    - mongosh
    - --eval
    - "db.adminCommand('ping')"

mongosh — the modern MongoDB shell — takes more than 1 second to initialize, even for a trivial ping. On our t3.medium Spot nodes, it was consistently timing out. MongoDB was perfectly healthy; Kubernetes just couldn’t confirm it fast enough.

The fix was two lines:

readinessProbe:
  timeoutSeconds: 5   # was missing (defaulted to 1)
livenessProbe:
  timeoutSeconds: 5   # was missing (defaulted to 1)

We deleted the crashing pod, it restarted with the new probe config, and MongoDB was back within 30 seconds.
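For reference, the probe section of the StatefulSet now looks roughly like this. The mongosh command is from our manifest; the delay and period values here are illustrative, not our exact numbers:

readinessProbe:
  exec:
    command:
      - mongosh
      - --eval
      - "db.adminCommand('ping')"
  periodSeconds: 10         # illustrative
  timeoutSeconds: 5         # the fix: the default of 1s is too short for mongosh
livenessProbe:
  exec:
    command:
      - mongosh
      - --eval
      - "db.adminCommand('ping')"
  initialDelaySeconds: 30   # illustrative: give mongod time to start
  periodSeconds: 10
  timeoutSeconds: 5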

2. The Domain on clientHold

Even after MongoDB recovered, the site was still unreachable. ERR_NAME_NOT_RESOLVED — DNS wasn’t resolving at all.

The cluster was healthy. Ingress was healthy. The ELB was healthy. dig @8.8.8.8 puzzle.kidsgamesapp.com returned the correct IPs. But browsers couldn’t reach it. The answer was sitting in the whois output:

Domain Status: clientHold https://icann.org/epp#clientHold

Amazon Registrar had placed kidsgamesapp.com on hold because the registrant email address wasn’t verified. clientHold completely disables DNS for the entire domain — not just one subdomain, but everything. puzzle., argocd., grafana., the marketing site. All of it, gone.

The fix: verify the email in the Route53 Registered Domains console. DNS propagated within minutes after that.

Why We Didn’t Get Notified

Here’s where it gets embarrassing. We already had AlertManager configured to send emails via AWS SES. We even had a Watchdog alert that fires constantly to prove the pipeline is working.

The alerts were firing. AlertManager was trying to send them. But every attempt failed:

554 Message rejected: Email address is not verified.
The following identities failed the check in region US-EAST-1: admin@kidsgamesapp.com

AWS SES runs in sandbox mode by default. In sandbox, you can only send email to addresses you’ve explicitly verified as identities — even if you’re the one receiving them. Our AlertManager was sending to admin@kidsgamesapp.com, which forwards to my Gmail via ImprovMX, but SES didn’t know that. From SES’s perspective, admin@kidsgamesapp.com was unverified.

The fix: verify admin@kidsgamesapp.com as an SES identity. Since ImprovMX forwards it to my Gmail, the verification email landed there and I clicked the link.

aws ses verify-email-identity \
  --email-address admin@kidsgamesapp.com \
  --region us-east-1

Within a minute of verifying, AlertManager drained its retry queue and level=INFO msg="Notify success" started appearing in the logs.
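For context, the AlertManager side of this is just an email receiver pointed at SES’s SMTP endpoint. Roughly like this; the smarthost is the standard SES endpoint for us-east-1, and the credentials and receiver name are placeholders, not our exact config:

route:
  receiver: email
receivers:
  - name: email
    email_configs:
      - to: admin@kidsgamesapp.com
        from: alerts@kidsgamesapp.com
        smarthost: email-smtp.us-east-1.amazonaws.com:587   # SES SMTP endpoint
        auth_username: <SES SMTP username>                   # placeholder
        auth_password: <SES SMTP password>                    # placeholder
        require_tls: true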

What We Built: Uptime Monitoring

With the immediate fires out, it was time to make sure we’d catch the next one. The gap was clear: we had no probe that checked whether the site was actually reachable from end to end.

The Stack

Our monitoring platform is built on kube-prometheus-stack — the battle-tested Helm chart that bundles Prometheus, Grafana, AlertManager, and all the Kubernetes-specific exporters. It runs in an observability namespace alongside Loki (log aggregation) and Promtail (log shipping from every pod).

What we were missing was Prometheus Blackbox Exporter — a separate exporter that probes external endpoints and exposes the results as Prometheus metrics.

Blackbox Exporter

The Blackbox Exporter is dead simple: you give it a target URL and a probe module, it performs the check, and it publishes a probe_success metric (1 = up, 0 = down) along with response time, TLS certificate expiry, HTTP status code, and more.
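A module defines how the check is performed. The http_2xx module we use ships with the chart’s defaults; in the exporter’s config it looks roughly like this (a sketch of the defaults, not our exact values):

config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        preferred_ip_protocol: ip4   # illustrative; defaults also accept IPv6
        # valid_status_codes defaults to any 2xx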

We deployed it via Helm into the observability namespace:

helm install blackbox-exporter \
  prometheus-community/prometheus-blackbox-exporter \
  -n observability \
  -f k8s/platform/blackbox-exporter/values.yaml

With a minimal values.yaml:

nodeSelector:
  role: workloads
resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    memory: 64Mi

32 megabytes. That’s the entire cost of knowing your app is up.

Scrape Config

We added a scrape job to Prometheus that routes through the Blackbox Exporter:

- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  scrape_interval: 30s
  static_configs:
    - targets:
        - https://puzzle.kidsgamesapp.com
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter-prometheus-blackbox-exporter.observability.svc.cluster.local:9115

The relabeling is the clever bit: the target URL is moved into the target query parameter, the instance label keeps the original URL so alerts stay readable, and __address__ is rewritten so Prometheus actually scrapes the exporter’s /probe endpoint. The exporter makes the real HTTP request and returns metrics; Prometheus stores them. Every 30 seconds.
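With kube-prometheus-stack, an extra scrape job like this is typically injected through the chart’s additionalScrapeConfigs value. How you wire it depends on your setup; a sketch:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: blackbox-http
        metrics_path: /probe
        params:
          module: [http_2xx]
        # ...rest of the job as shown above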

The Alert Rule

- alert: PuzzleAppDown
  expr: probe_success{job="blackbox-http"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "puzzle.kidsgamesapp.com is down"
    description: "The puzzle app has been unreachable for over 1 minute. Check pods, ingress, and DNS."

for: 1m with a 30-second scrape interval means the alert fires after 2 consecutive failed probes. Fast enough to matter (~90 seconds total from outage to email), slow enough to ignore transient network blips.
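The rule itself lives in a PrometheusRule resource so the operator picks it up. Roughly like this; the metadata name and the release label are assumptions (the label has to match whatever your Prometheus ruleSelector expects):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: uptime-alerts                  # illustrative name
  namespace: observability
  labels:
    release: kube-prometheus-stack     # assumed; must match the operator's ruleSelector
spec:
  groups:
    - name: uptime
      rules:
        - alert: PuzzleAppDown
          expr: probe_success{job="blackbox-http"} == 0
          for: 1m
          labels:
            severity: critical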

The Full Alert Pipeline

Blackbox Exporter pod
  → probes puzzle.kidsgamesapp.com every 30s
  → Prometheus scrapes probe_success metric
  → PrometheusRule: PuzzleAppDown fires if probe_success == 0 for 1m
  → AlertManager routes to email receiver
  → AWS SES sends from alerts@kidsgamesapp.com
  → admin@kidsgamesapp.com (ImprovMX forwards to Gmail)
  → itay.gardi@gmail.com (KidsGames/Alerts label)

No external uptime service to pay for: everything runs inside our EKS cluster. The exporter pod is effectively free, and SES sending adds a few cents per month.

Lessons Learned

Monitoring is not alerting. We had metrics, dashboards, and logs. What we didn’t have was a closed loop that notified a human when something was wrong. Adding alerting wasn’t hard — it was just the thing we kept deprioritizing because the app “felt stable.”

Default Kubernetes timeouts are aggressive. timeoutSeconds: 1 for health probes is fine for a Go binary that answers in 10ms. It’s not fine for anything that has to spin up a runtime to respond (mongosh, node, python). Always set probe timeouts explicitly in your manifests.

SES sandbox mode will bite you. If you’re using AWS SES for transactional email and you’re still in sandbox, verify every address in your send/receive chain before you need it. The verification takes 30 seconds. Finding out it’s broken during an incident takes much longer.

The domain hold was genuinely scary. A single unverified email address at the registrar took down every subdomain, the argocd dashboard, grafana, everything. ICANN’s clientHold is a nuclear option. Enable auto-renew and verify your registrant contact — it takes 2 minutes and it’s the cheapest insurance you’ll ever buy.

What’s Next

The Blackbox Exporter gives us probe_success, but it also exposes probe_http_status_code, probe_duration_seconds, probe_ssl_earliest_cert_expiry, and more. The next step is wiring those into Grafana for a proper uptime dashboard — and adding a CertExpiringSoon alert so we never get blindsided by a TLS expiry either.
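A starting point for that cert alert might look like the rule below. The 14-day threshold is a guess at a sensible default, not something we’ve shipped yet:

- alert: CertExpiringSoon
  expr: probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 14 * 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"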

The monitoring stack is real. The alert pipeline is real. Tonight, if MongoDB crashes again, I’ll know within 90 seconds.


Total time from “site is down” to “alerts are working”: about 3 hours. Most of it was debugging, not building. The actual implementation — Blackbox Exporter, scrape config, PrometheusRule — took about 20 minutes.