Shipping Google One Tap Across Three Repos, Dark Behind a Feature Flag


The plan looked simple on paper: add Google One Tap sign-in to the marketing site, let parents save favorite stories and coloring pages, ship it. A few days later the feature was running in production — but the marketing UX was deliberately invisible to users. Backend live, frontend dark, one build flag away from launch.

The interesting part wasn’t the feature. The interesting part was every place reality refused to match the plan: a cookie that browsers silently rejected, an admission webhook that vetoed our nginx config, a cluster DNS cache that stalled cert issuance for five minutes, and a quiet ExternalSecret that nothing was actually deploying.

Here’s what shipped, and what went sideways along the way.

Why a separate user-service

The first decision was the cleanest: don’t put favorites in the puzzle server.

Favorites read like a puzzle concern, but they aren’t. They’re an account concern that happens to point at puzzle content today. Tomorrow they’ll point at match-game levels. The day after that, they’ll be joined by subscription state and entitlement claims. Every one of those is cross-cutting between games and orthogonal to the gameplay loop.

So we stood up a new microservice — services/user/ on port 3005 — alongside the existing auth, scheduler, and AI services. It owns one MongoDB collection today (favorites), it’ll own three or four more before the year is out, and it speaks the same JWT cookie language as everything else in the cluster. Every game becomes a thin client of it.

The cost of this decision was a couple of hours of boilerplate: Bun entrypoint, Mongoose connection, Dockerfile, Helm values, ArgoCD application, ECR repo, ingress. The benefit is that the next account-shaped feature already has a home.
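For a sense of how small the service is today, the entire data model is roughly one schema. Here is a minimal sketch of the favorites collection in Mongoose (field names are illustrative, not copied from the repo):

// services/user/ (illustrative sketch of the favorites model, not the actual file)
import { Schema, model } from "mongoose";

const favoriteSchema = new Schema(
  {
    userId: { type: String, required: true, index: true }, // subject claim from the JWT cookie
    kind: { type: String, enum: ["story", "coloring-page"], required: true },
    slug: { type: String, required: true },
    locale: { type: String, required: true },
  },
  { timestamps: true }
);

// One favorite per (user, kind, slug): re-hearting the same story is a no-op.
favoriteSchema.index({ userId: 1, kind: 1, slug: 1 }, { unique: true });

export const Favorite = model("Favorite", favoriteSchema);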

The cookie browsers silently rejected

Sign-in worked. The user signed in with Google, the AccountChip rendered their name in the navbar, everything looked great. Then they clicked the heart on a story:

POST http://localhost:4321/api/favorites → 401 Unauthorized

No cookie in the request. The auth-service had clearly issued one — we could see the Set-Cookie header in the network tab. So why wasn’t the browser sending it back?

Set-Cookie: token=<jwt>; Domain=.kidsgamesapp.com; HttpOnly; SameSite=Lax

There’s the bug. The cookie’s Domain attribute says .kidsgamesapp.com — which is correct for production where everything sits at puzzle.kidsgamesapp.com and kidsgamesapp.com and the freshly-deployed auth.kidsgamesapp.com. But in dev the response is coming from localhost, and browsers silently reject cookies whose declared domain doesn’t match the response origin. Strict and quiet — no console warning, no failed request, just a cookie that never persists.

The fix was small. Auth-service’s buildSetCookie now omits the Domain attribute entirely when COOKIE_DOMAIN is the empty string, which we set in the local .env:

function buildSetCookie(token: string): string {
  const parts = [`token=${token}`];
  if (COOKIE_DOMAIN) parts.push(`Domain=${COOKIE_DOMAIN}`);
  parts.push("Path=/", `Max-Age=${maxAge}`, "HttpOnly", "SameSite=Lax");
  if (process.env.NODE_ENV !== "development") parts.push("Secure");
  return parts.join("; ");
}

Production behavior is unchanged: COOKIE_DOMAIN is unset there, the ?? fallback gives .kidsgamesapp.com, and the if (COOKIE_DOMAIN) guard still emits the attribute. Local dev gets a host-only cookie scoped to the marketing dev origin; a small Vite proxy in astro.config.mjs routes calls to the three local services through that one origin, so no cross-port cookie scoping is needed at all.
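The proxy itself is a few lines of config. A sketch of the shape; only the user-service's port 3005 comes from the real setup, and the auth-service port and path mapping are placeholders:

// astro.config.mjs (dev-only proxy sketch; ports and paths other than 3005 are illustrative)
import { defineConfig } from "astro/config";

export default defineConfig({
  vite: {
    server: {
      proxy: {
        // Heart clicks stay on the localhost:4321 origin and get forwarded to user-service
        "/api/favorites": "http://localhost:3005",
        // Sign-in and logout forwarded to auth-service (port is a placeholder)
        "/auth": "http://localhost:3001",
      },
    },
  },
});

Because every request now leaves from and returns to the same localhost:4321 origin, the host-only cookie rides along on every call and the Domain attribute never has to enter the picture in dev.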

nginx-ingress disabled snippet annotations

Auth-service was supposed to be carefully exposed. Only three paths should be reachable from the public internet — /health, /auth/google, /auth/logout — and everything else should 404 at the ingress before it ever reached the application. Defense in depth.

The nginx-ingress controller has a feature for exactly this: a configuration-snippet annotation that lets you inject arbitrary nginx config into the generated server block. We wrote it:

annotations:
  nginx.ingress.kubernetes.io/configuration-snippet: |
    if ($request_uri !~ ^/(health|auth/google|auth/logout)(\?.*)?$) {
      return 404;
    }

ArgoCD picked up the change. The sync failed:

admission webhook "validate.nginx.ingress.kubernetes.io" denied the request:
nginx.ingress.kubernetes.io/configuration-snippet annotation cannot be used.
Snippet directives are disabled by the Ingress administrator.

This cluster has allow-snippet-annotations: false set on the ingress controller — a standard hardening for shared clusters because snippet directives are functionally a code-injection vector. Reasonable policy. It just happened to take out our nice defense-in-depth strategy.

The right fix turned out to be the boring one. Auth-service already has an in-process middleware that requires x-api-key on every route except a small EXEMPT list. We added /auth/google and /auth/logout to that list, dropped the configuration-snippet annotation entirely, and the application layer became the single load-bearing gate. A request to https://auth.kidsgamesapp.com/auth/login from the public internet now returns 401 Invalid or missing API key — the same outcome the nginx filter would have produced, just enforced one layer in.
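The middleware is unremarkable, which is the point. A rough sketch of its shape; the real one lives in auth-service, and the function name and env var here are illustrative:

// Sketch of the in-process API-key gate (names are illustrative, not the repo's)
const EXEMPT = new Set(["/health", "/auth/google", "/auth/logout"]);

function apiKeyGate(req: Request): Response | null {
  const path = new URL(req.url).pathname;
  if (EXEMPT.has(path)) return null; // public routes pass straight through
  if (req.headers.get("x-api-key") === process.env.INTERNAL_API_KEY) return null;
  return new Response(JSON.stringify({ error: "Invalid or missing API key" }), {
    status: 401,
    headers: { "content-type": "application/json" },
  });
}

Everything not in EXEMPT, including /auth/login, needs the internal API key whether the request comes from inside the cluster or from the public internet.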

The lesson, in retrospect: layered defenses are good until one of the layers is something the platform vetoes. The application-layer check was always going to be the actual security boundary. The ingress filter was decoration.

CoreDNS negative cache stalled cert-manager

Adding a new subdomain on this cluster is more manual than I expected. There’s no ExternalDNS — the Route53 records live in the terraform repo as a literal list of subdomain strings inside module.route53 in main.tf. To add user.kidsgamesapp.com and auth.kidsgamesapp.com, we edited the list, ran tofu apply -target=module.route53, and watched two new CNAME records pop into the hosted zone.

Public DNS resolved within seconds. dig +short user.kidsgamesapp.com @8.8.8.8 returned the load balancer hostname immediately. cert-manager started the ACME HTTP-01 challenge for the new TLS certificate, presented the response on the temporary HTTP solver ingress, and… stalled.

Waiting for HTTP-01 challenge propagation: failed to perform self check
GET 'http://user.kidsgamesapp.com/.well-known/acme-challenge/...':
dial tcp: lookup user.kidsgamesapp.com on 172.20.0.10:53: no such host

cert-manager does its self-check from inside the cluster. The cluster’s CoreDNS resolver had cached the negative lookup of user.kidsgamesapp.com from before we added the record, and that negative cache was sticky for the full TTL window. Public DNS knew about the new subdomain. CoreDNS still didn’t. cert-manager kept getting no such host and refusing to ask Let’s Encrypt to issue.

The fix was to wait. After about five minutes the negative cache expired, the next self-check succeeded, the certificate was issued, and the ingress flipped from cm-acme-http-solver to a fully TLS-terminated kids-games-user-service ingress. We could have restarted the CoreDNS pods to drop the cache sooner, but by the time we had diagnosed the problem, doing nothing for five minutes was the cheaper option.

ExternalSecret with no ArgoCD app

The auth-service got a new required env var, GOOGLE_OAUTH_CLIENT_ID. We added it to the AWS Secrets Manager entry. We added it to the existing kids-games-secrets ExternalSecret manifest in the platform repo. We merged the PR. We waited.

Then the new auth-service pod rolled out and crashed at boot:

[auth] Missing required env vars: GOOGLE_OAUTH_CLIENT_ID

Five minutes later, still missing. Ten minutes later, still missing. The ExternalSecret manifest had been merged to the platform repo’s main branch. The External Secrets Operator was running. AWS Secrets Manager had the value. None of it was actually flowing into the cluster.

The reason, once we started checking, was uncomfortable: the external-secret.yaml manifest in the platform repo had no ArgoCD application configured to deploy it. It was a raw Kubernetes manifest sitting in a directory that nothing watched. Whoever had originally created the cluster had bootstrapped it with kubectl apply -f, and that was the last time the file had been read. The ESO controller had been quietly reconciling against the same stale kids-games-secrets resource for weeks.

The immediate fix: kubectl apply -f k8s/apps/kids-games/external-secret.yaml from a laptop. ESO reconciled within seconds, the auth-service pod's next restart picked up the new env var, and the crashloop resolved.

The lesson here was bigger than the bug. There’s a whole class of “ArgoCD-adjacent but not actually deployed by ArgoCD” manifests in this repo. Each one is a landmine waiting for the next person who assumes git-merge means deployed. The follow-up action — and it’s documented in memory now — is to stand up a small ArgoCD app whose only job is to deploy raw manifests from k8s/apps/kids-games/.

Concurrent CI race on the infra repo

The deploy pipeline for this monorepo pushes to two repos. The application code lives in kids-games, the deployment manifests live in aws-platform. When CI builds a new image, it pushes to ECR, then clones aws-platform, sed-edits the relevant *-values.yaml to bump the image tag, commits with a [bot] author, and pushes. ArgoCD picks up the new tag and rolls out the deploy.

This works fine for one deploy. It works fine for ten deploys in series. It does not work fine when two deploys land within the same minute:

remote: error: cannot lock ref 'refs/heads/main':
is at 5b4cb064... but expected 605bad54...
error: failed to push some refs to 'aws-platform'

The second pipeline cloned aws-platform/main while the first pipeline was still running. By the time the second one tried to push its commit, main had already advanced. Git refused, pipeline failed, the deploy was stuck halfway — image in ECR, values.yaml not updated, ArgoCD seeing nothing new to sync.

The hot fix is gh run rerun --failed. The second pipeline re-runs, this time clones a fresh main, applies its sed-edit on top of the current state, pushes successfully. We did this once during this rollout and it worked.

The proper fix — not yet shipped — is a retry-on-conflict loop in the workflow. Pull, rebase the bot commit onto the new tip, push again, repeat up to N times. Three lines of bash. It’s on the follow-up list.

Shipping dark — feature flag with build-time tree-shaking

The whole point of this project was to land One Tap now and ship it to users later, alongside an in-flight redesign of the stories and coloring pages that the marketing team wanted to launch as a single coordinated release. So the entire UX needed to be invisible to production users until one switch flipped.

What makes the build-flag pattern pleasant in Astro is dead-code elimination on import.meta.env.PUBLIC_* literals. At build time, Vite replaces every reference to import.meta.env.PUBLIC_FEATURE_ONETAP with the literal string value passed in as a build arg. With the flag set to "false", every conditional like:

{FEATURE_ONETAP && (
  <Fragment>
    <script src="https://accounts.google.com/gsi/client" async defer />
    <HeartButton kind="story" slug={story.slug} locale={locale} />
  </Fragment>
)}

becomes false && (...) at compile time, which the bundler tree-shakes out completely. Not present-but-hidden. Not loaded-but-disabled. Not in the bundle at all.
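One detail worth spelling out: the injected value is a string, and the string "false" is truthy, so the flag has to be coerced to a boolean before those conditionals use it. Presumably a single line of frontmatter does the coercion (shown here as a sketch, not the repo's exact code):

// Component frontmatter: compare against "true" so the injected literal
// folds down to a compile-time boolean the bundler can eliminate.
const FEATURE_ONETAP = import.meta.env.PUBLIC_FEATURE_ONETAP === "true";

With the build arg set to "false", that line compiles to "false" === "true", the minifier folds it to false, and every guarded branch drops out of the output.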

We verified this on the production build. With PUBLIC_FEATURE_ONETAP=false, across all 4,767 generated HTML files:

  • Zero <script src="https://accounts.google.com/gsi/client"> tags
  • Zero class="heart-btn" instances on detail pages
  • Zero exchangeIdToken references in any bundled JS
  • The /my-favorites route exists but renders only a “Coming soon” stub
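Those checks are easy to script. A throwaway verification pass along these lines (hypothetical, not something that lives in the repo's CI) greps the built output for the three artifacts:

// scripts/check-dark.ts (hypothetical sketch of the dark-build check)
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const DIST = "dist";
const needles = [
  "accounts.google.com/gsi/client", // the One Tap loader
  "heart-btn",                      // the favorite-button class
  "exchangeIdToken",                // the client-side token exchange call
];

const files = readdirSync(DIST, { encoding: "utf8", recursive: true })
  .filter((name) => /\.(html|m?js)$/.test(name))
  .map((name) => join(DIST, name));

const leaks = files.filter((file) => {
  const text = readFileSync(file, "utf8");
  return needles.some((needle) => text.includes(needle));
});

if (leaks.length > 0) {
  console.error("One Tap artifacts leaked into the dark build:", leaks);
  process.exit(1);
}
console.log(`Checked ${files.length} files; no One Tap artifacts found.`);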

Flipping the flag to true is one line in the deploy workflow. CI rebuilds the marketing image with the new arg, ArgoCD picks it up, the feature lights up in production within five minutes. Rollback is the inverse — flip to false, push, five minutes later the bundle is dark again. No database migration to roll back, no service redeploy, no traffic-routing changes. The riskiest production change becomes the lowest-blast-radius operation we have.

That last property is the one I keep coming back to. The whole “ship dark” pattern is worth the small cost of plumbing a flag through the bundle for exactly this reason: it lets the launch decision happen weeks after the code lands, in calmer circumstances, with a one-line PR.

What I’d do differently

A few things, in order of how much pain they would have saved:

Stand up the raw-manifests ArgoCD app first. The crash on missing GOOGLE_OAUTH_CLIENT_ID was avoidable. Knowing now that several manifests in the platform repo are silently un-watched, every future “I added it to the ExternalSecret” assumption gets a footnote. Better to fix the deployment topology once.

Don’t try nginx snippet annotations on a hardened cluster. Application-layer enforcement was always going to be the load-bearing gate. The ingress filter was never going to add real security on a cluster that explicitly disables snippet injection. Knowing the cluster’s policy ahead of time would have saved a sync cycle.

Build the feature flag from day one. Adding PUBLIC_FEATURE_ONETAP after the rest of the work was done was cheap because the surface area was small — five files. On a deeper feature with more entry points, retrofitting the flag would be tedious. From now on, anything that needs a coordinated launch gets the flag plumbed at the same time as the first component lands.

Commit to the proxy approach for local dev upfront. We started with absolute URLs to local services on different ports, hit the cookie-domain wall, switched to a Vite proxy mid-implementation. That second decision was the right one but the switching cost was real. For any feature that involves cookies across multiple local services, just start with the proxy.

The feature itself is sitting there now, ready to flip. The real product was never the One Tap button — it was the deploy pipeline that knew how to put it in production safely without anyone noticing.