Building a Cron Scheduler Microservice: Recurring Jobs Without the Chaos


The puzzle server had no cron jobs when we started. Then we needed to warm the puzzle pool on a schedule. Then health checks. Then the question became: where does recurring work live?

Adding setInterval to the puzzle server works until it doesn’t. Two replicas means two intervals firing. A crash loses the timer. There’s no visibility into what ran, when, or whether it succeeded. Cron logic scattered across app servers is maintenance debt in waiting.

So we built a scheduler: a standalone microservice that owns all recurring jobs across the platform.

What It Does

The scheduler is a Bun/TypeScript service that runs on port 3004. It stores jobs in MongoDB, ticks every 30 seconds, and executes any job whose cron expression is due. Every execution gets a JobRun record with status, duration, and HTTP response details. There’s a full admin UI for creating, editing, triggering, and monitoring jobs.

That’s the whole thing. It’s not complex. But having it as a separate process with its own database records and its own observability changes what’s possible: you can schedule anything that has an HTTP endpoint, from any other service, without touching app code.

The Data Model

A Job has the fields you’d expect:

import type { ObjectId } from "mongodb";

interface IJob {
  _id: ObjectId;           // assigned by MongoDB; claimJob filters on it
  name: string;
  cronExpression: string;  // e.g. "*/5 * * * *"
  url: string;             // HTTP endpoint to call
  method: "GET" | "POST" | "PUT" | "DELETE";
  headers?: Record<string, string>;
  body?: string;
  timeout: number;         // ms
  retries: number;
  enabled: boolean;
  lastRunAt?: Date;
  nextRunAt?: Date;
}

A JobRun records each execution:

interface IJobRun {
  jobId: ObjectId;
  status: "success" | "failed" | "timeout";
  startedAt: Date;
  durationMs: number;
  httpStatus?: number;
  responseBody?: string;
  error?: string;
}

Job runs get a 30-day TTL index — MongoDB deletes them automatically, no cleanup cron needed:

JobRunSchema.index(
  { startedAt: 1 },
  { expireAfterSeconds: 30 * 24 * 60 * 60 }
);

The Tick Loop

The worker is a setInterval loop that runs every 30 seconds. On each tick, it finds all enabled jobs where nextRunAt is in the past, runs them, and updates their timestamps.

The critical part is preventing duplicate execution when you eventually run multiple replicas. We use an atomic findOneAndUpdate with returnDocument: "before" that filters on the expected nextRunAt. If the update matches, we get the pre-update document back and we won the race; if another worker already advanced nextRunAt, the filter misses and we get null:

async function claimJob(job: IJob): Promise<IJob | null> {
  const claimed = await Job.findOneAndUpdate(
    {
      _id: job._id,
      nextRunAt: job.nextRunAt, // only update if unchanged
    },
    {
      $set: {
        nextRunAt: getNextRun(job.cronExpression),
        lastRunAt: new Date(),
      },
    },
    { returnDocument: "before" }
  );
  return claimed;
}

If two workers tick simultaneously, only one gets back the document with the expected nextRunAt. The other gets null and skips. No distributed locks, no Redis, no coordination. MongoDB does the work.
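
claimJob leans on a getNextRun helper that isn't shown above. A minimal version, assuming the cron-parser package (an assumption; any cron library with a next-occurrence API would do):

import parser from "cron-parser";

// Hypothetical helper: next fire time for a standard 5-field cron expression.
function getNextRun(cronExpression: string): Date {
  return parser.parseExpression(cronExpression).next().toDate();
}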

The tick loop itself:

async function tick() {
  const start = Date.now();
  try {
    const now = new Date();
    const due = await Job.find({ enabled: true, nextRunAt: { $lte: now } });
    for (const job of due) {
      const claimed = await claimJob(job);
      if (claimed) await runJob(claimed);
    }
  } finally {
    // prom-client's startTimer() observes elapsed seconds; this histogram is
    // in milliseconds, so time the tick manually.
    tickDurationMs.observe(Date.now() - start);
  }
}
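
runJob itself isn't shown here. A sketch of the shape it might take, assuming the global fetch Bun provides with AbortSignal.timeout enforcing the per-job timeout (retry handling omitted):

// Hypothetical sketch of runJob: one HTTP call, one JobRun record.
async function runJob(job: IJob): Promise<void> {
  const startedAt = new Date();
  try {
    const res = await fetch(job.url, {
      method: job.method,
      headers: job.headers,
      body: job.method === "GET" ? undefined : job.body,
      signal: AbortSignal.timeout(job.timeout), // abort after job.timeout ms
    });
    await JobRun.create({
      jobId: job._id,
      status: res.ok ? "success" : "failed",
      startedAt,
      durationMs: Date.now() - startedAt.getTime(),
      httpStatus: res.status,
      responseBody: (await res.text()).slice(0, 4096), // cap what we store
    });
  } catch (err) {
    await JobRun.create({
      jobId: job._id,
      status: (err as Error).name === "TimeoutError" ? "timeout" : "failed",
      startedAt,
      durationMs: Date.now() - startedAt.getTime(),
      error: String(err),
    });
  }
}

Because runJob catches everything and records the failure as a JobRun, a misbehaving endpoint can't abort the tick loop for the jobs behind it.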

Stuck Job Recovery

If the process crashes mid-execution, a job can end up with lastRunAt set but no JobRun ever created: claimJob has already advanced nextRunAt, so the missed execution would otherwise be skipped silently. On startup, the service looks for jobs where lastRunAt is recent but no corresponding run record exists, and makes them due again:

async function recoverStuckJobs() {
  const cutoff = new Date(Date.now() - 5 * 60 * 1000); // 5 min ago
  // lastRunAt is set by claimJob, so a recent claim with no matching JobRun
  // means the process died mid-execution. Make those jobs due again.
  const claimed = await Job.find({ lastRunAt: { $gt: cutoff } });
  for (const job of claimed) {
    const ran = await JobRun.exists({ jobId: job._id, startedAt: { $gte: job.lastRunAt } });
    if (!ran) await Job.updateOne({ _id: job._id }, { $set: { nextRunAt: new Date() } });
  }
}

This runs at startup before the tick loop starts. Simple, not perfect, but good enough for a kids' game platform where jobs aren't financial transactions.
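
Wiring it together at boot looks roughly like this (a sketch; start() and the MONGO_URL variable are assumptions, not the service's actual entry point):

import mongoose from "mongoose";

async function start() {
  await mongoose.connect(process.env.MONGO_URL!); // hypothetical env var
  await recoverStuckJobs();  // clean up anything a crash left behind
  await tick();              // run once immediately
  setInterval(tick, 30_000); // then tick every 30 seconds
}

start();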

The Admin UI

The admin is a vanilla JS SPA (no framework) served as static files. It shows the jobs list, lets you create and edit jobs, shows execution history per job, and supports manual triggers.

Avoiding React here was deliberate. The admin is a single-session tool used maybe once a week. A full framework adds build steps, bundle size, and maintenance overhead for something with six views. A few hundred lines of vanilla JS does the job.

The one feature that made it genuinely useful: live updating. Without polling, you’d have to refresh manually to see whether a triggered job succeeded.

function startPolling() {
  if (pollInterval) return;
  pollInterval = setInterval(async () => {
    if (!apiKey || modalOpen) return;
    if (currentView === "dashboard") await loadDashboard();
    else if (currentView === "jobs") await loadJobs();
    else if (currentView === "detail" && currentDetailId) {
      await loadJobDetail(currentDetailId);
    }
  }, 10_000);
  liveDot.classList.add("live-active");
}

The live dot in the nav pulses while polling is active. Modals pause polling so a mid-edit refresh doesn’t clobber the form. API keys are persisted in localStorage so you don’t re-enter them every session.
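
The key handling is only a few lines (a sketch; the storage key name is invented):

// Prompt once, persist in localStorage, reuse on the next visit.
let apiKey = localStorage.getItem("schedulerApiKey");
if (!apiKey) {
  apiKey = prompt("Scheduler API key:");
  if (apiKey) localStorage.setItem("schedulerApiKey", apiKey);
}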

The countdown to next run was a small touch that turned out to be the most satisfying part — a data-ts attribute on each cell, updated every second by a client-side interval:

function tickCountdowns() {
  document.querySelectorAll(".countdown").forEach((el) => {
    const ts = Number(el.dataset.ts);
    const diff = ts - Date.now();
    el.textContent = diff <= 0 ? "now" : formatDuration(diff);
  });
}
setInterval(tickCountdowns, 1000);

No WebSockets, no server-sent events. Polling plus client-side arithmetic is good enough.

Prometheus Metrics

The service exports four metrics at /api/metrics (no auth required — Prometheus scrapes it):

import { Counter, Gauge, Histogram } from "prom-client";

export const jobRunsTotal = new Counter({
  name: "scheduler_job_runs_total",
  help: "Total job executions",
  labelNames: ["job_name", "status"],
});

export const jobDurationMs = new Histogram({
  name: "scheduler_job_duration_ms",
  help: "Job execution duration",
  labelNames: ["job_name"],
  buckets: [50, 100, 250, 500, 1000, 2500, 5000, 10000],
});

export const tickDurationMs = new Histogram({
  name: "scheduler_tick_duration_ms",
  help: "Worker tick processing duration",
  buckets: [1, 5, 10, 25, 50, 100],
});

export const activeJobs = new Gauge({
  name: "scheduler_active_jobs",
  help: "Number of enabled jobs",
});
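
Serving them is a single unauthenticated route. A sketch, assuming an Express-style app object (the actual server wiring isn't shown here):

import { register } from "prom-client";

app.get("/api/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType); // Prometheus text format
  res.send(await register.metrics());
});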

The worker instruments every run: jobRunsTotal.inc({ job_name, status }) on completion, jobDurationMs.observe() with actual execution time. A ServiceMonitor in the infra repo wires it into the existing Prometheus stack:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kids-games-scheduler
  namespace: kids-games
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kids-games-scheduler
  endpoints:
    - port: http
      path: /api/metrics
      interval: 30s

The Deployment Problem

The scheduler landed in services/scheduler/ inside the existing monorepo. It needed its own ECR repository, its own K8s Deployment, its own ArgoCD Application, its own ExternalSecret, its own ServiceMonitor. That’s five new resources for one new service.

The first deployment failure: CreateContainerConfigError. The pod couldn’t start because scheduler-secrets didn’t exist as a K8s secret. The ExternalSecret was applied, but the secret in AWS Secrets Manager hadn’t been created yet. Order matters: create the secret in AWS → ExternalSecret syncs → pod can start.

The second failure was subtler. The ArgoCD Application manifest was committed to the infra repo and pushed, but the Application object never appeared in the cluster. ArgoCD only watches apps it already knows about — a new argocd-X-app.yaml in the repo isn’t auto-discovered. It needs a one-time manual apply:

kubectl apply -f k8s/apps/kids-games/argocd-scheduler-app.yaml

After that, ArgoCD self-heals and syncs on every infra push. But the bootstrapping step is manual, and it’s easy to miss.

The Lockfile Problem

Adding prom-client to services/scheduler/package.json broke CI — not the scheduler build, but the auth service and puzzle server builds.

Bun’s --frozen-lockfile validates the lockfile against every workspace manifest in the Docker build context. The auth service Dockerfile was installing dependencies with --frozen-lockfile, but it only copied its own package.json. When bun.lock referenced a scheduler dependency whose manifest wasn’t in the build context, bun saw a mismatch and refused to install.

The fix: every Dockerfile in the monorepo must copy every workspace’s package.json before running bun install:

# services/auth/Dockerfile
COPY package.json .
COPY apps/puzzle/client/package.json apps/puzzle/client/package.json
COPY apps/puzzle/server/package.json apps/puzzle/server/package.json
COPY apps/puzzle/shared/package.json apps/puzzle/shared/package.json
COPY services/auth/package.json services/auth/package.json
COPY services/scheduler/package.json services/scheduler/package.json  # NEW
COPY bun.lock .
RUN bun install --frozen-lockfile

Every Dockerfile. Even if that service doesn’t use the new package. The lockfile doesn’t know about your service boundaries — it knows about the workspace.

Accessing It in Production

The scheduler runs as a ClusterIP service — no ingress, not reachable from the public internet. To get to the admin UI:

kubectl port-forward -n kids-games deploy/kids-games-scheduler 3014:3004
# open http://localhost:3014/admin

The API key lives in AWS Secrets Manager under scheduler/production. The admin prompts for it once and stores it in localStorage.

The first job we created in production: a health check that hits puzzle.kidsgamesapp.com/api/health every five minutes. Within an hour, the run history showed consistent 200s with sub-100ms response times. Not because anything was wrong — because now we’d know if it were.
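
In IJob terms, that first job is just this (the URL and schedule are from above; the name, timeout, and retries shown here are hypothetical):

const healthCheck: Partial<IJob> = {
  name: "puzzle-health-check",   // hypothetical name
  cronExpression: "*/5 * * * *", // every five minutes
  url: "https://puzzle.kidsgamesapp.com/api/health",
  method: "GET",
  timeout: 5_000,                // hypothetical
  retries: 0,                    // hypothetical
  enabled: true,
};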


Recurring tasks are infrastructure, not application logic. Pulling them into a dedicated service means the puzzle server doesn’t know they exist, the scheduler doesn’t know what the tasks do, and adding a new job is four fields in a form. That’s the right division of labor.