The Bootstrap Betrayal

A blank page in production — traced through browser forensics, cluster logs, and git archaeology to a Helm migration that silently downgraded a load balancer timeout from 3600s to 30s.

Act I · Investigation

The Void Stares Back

Odin

https://www.fenrirledger.com/ledger is fucked

Odin reports /ledger is blank in production. Chrome browser automation confirms: the page loads its title but renders nothing. The DOM contains a single empty <main> element — React never hydrated. Console reveals Error: Connection closed from a Next.js RSC chunk. Network traffic shows all static assets return 200, but analytics.fenrirledger.com/script.js returns 503. The RSC streaming connection is being killed mid-flight.
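
The exact tooling isn't shown in the transcript; a minimal reproduction along these lines, assuming Playwright, would surface the same three signals (the empty <main>, the console error, and the failing script request):

// Sketch of the browser forensics described above (assumes Playwright; not the
// exact tooling used in the investigation).
import { chromium } from 'playwright';

async function inspectLedger(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Surface console errors such as "Error: Connection closed" from the RSC chunk.
  page.on('console', (msg) => {
    if (msg.type() === 'error') console.log('console error:', msg.text());
  });

  // Flag non-2xx responses, e.g. the 503 from analytics.fenrirledger.com/script.js.
  page.on('response', (res) => {
    if (res.status() >= 400) console.log(res.status(), res.url());
  });

  await page.goto('https://www.fenrirledger.com/ledger', { waitUntil: 'networkidle' });

  // A hydrated page would have children inside <main>; the broken page does not.
  const mainHtml = await page.locator('main').innerHTML();
  console.log('<main> is empty:', mainHtml.trim() === '');

  await browser.close();
}

inspectLedger().catch(console.error);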

Act II · Infrastructure

The Cluster Speaks

Odin

(continued investigation)

GKE cluster status reveals app pods are only 17 minutes old — a fresh deployment just landed. Pod logs expose a cascade of requireAuthz: access denied warnings with reason household_mismatch. The client sends session.user.sub (a Google numeric ID) as the household identifier, but Firestore now stores a UUID. Every sync/push call returns 403. Two bugs surface: the blank page and broken cloud sync.
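
The failing call corresponds roughly to a push like the one below; the endpoint, payload shape, and field names are assumptions for illustration, not the actual useCloudSync code:

// Illustrative client-side push (names and endpoint are assumptions, not the
// real useCloudSync.ts implementation).
interface SyncPushPayload {
  householdId: string; // the client sends session.user.sub, a Google numeric ID
  entries: unknown[];
}

async function pushLedger(session: { user: { sub: string } }, entries: unknown[]): Promise<void> {
  const payload: SyncPushPayload = {
    householdId: session.user.sub, // e.g. "103245718359284716204"
    entries,
  };

  const res = await fetch('/api/sync/push', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });

  // Firestore now keys the household by a UUID, so requireAuthz rejects this
  // with reason household_mismatch and the call returns 403.
  if (res.status === 403) {
    console.warn('sync/push rejected: household_mismatch');
  }
}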

Act III · Deep Dive

The Code Trail

Odin

(continued investigation)

Tracing through useCloudSync.ts:211, AuthContext.tsx:94, and authz.ts:106 reveals the full picture. The client was always designed to use session.user.sub as the household key — it is the localStorage partition key. But the server's requireAuthz compares this against firestoreUser.householdId, which is now a UUID from the household join feature. The IDOR guard rejects the mismatch even though the user IS who they claim to be.
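
Stripped to its essentials, the pre-fix guard behaves something like the sketch below; the function shape and helper types are assumptions, and only the compared fields and the rejection reason come from the trace above:

// Simplified sketch of the pre-fix guard in authz.ts (shape and type names are
// assumptions; the comparison and the household_mismatch reason are from the
// investigation above).
interface AuthzRequirement { householdId: string } // taken from the client request
interface FirestoreUser { householdId: string }    // now a UUID after the household join feature

function requireAuthz(requirement: AuthzRequirement, firestoreUser: FirestoreUser): void {
  // The client still sends session.user.sub (a Google numeric ID), so for a
  // legitimate owner this comparison is "googleSub !== uuid" and always fails.
  if (requirement.householdId !== firestoreUser.householdId) {
    throw new Error('access denied: household_mismatch'); // surfaces as a 403
  }
}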

For the blank page: the Explore agent digs through ingress annotations, Helm values, Dockerfile, and middleware. The smoking gun is values.yaml:102, which sets timeoutSec: 30.

Act IV · Root Cause

Bootstrap's Betrayal

Odin

Was it something bootstrap fucked up?

Git archaeology confirms: yes, bootstrap fucked it. The deleted raw manifests in infrastructure/k8s/app/ had timeoutSec: 3600 (1 hour). But the Helm chart created in PR #788 hardcoded timeoutSec: 30; the one-hour value was never carried over. For months the old manifests were still active on the cluster, masking the discrepancy. PR #1243 (today) deleted the raw manifests and redeployed via Helm, silently applying the 30-second timeout. The GCP Load Balancer now kills the RSC stream before React can hydrate.

-  timeoutSec: 3600  # deleted k8s/app manifests
+  timeoutSec: 30    # Helm values (PR #788)

Act V · Fix

The Two-Sword Fix

Odin

fix both hkr

Two surgical fixes applied in a single PR:

Fix 1 — Restore timeout: values.yaml:102 changed from 30 to 3600. The GCP Load Balancer will no longer kill RSC streams.

Fix 2 — Accept Google sub: authz.ts now treats the user's own Google sub as a valid self-reference for the household check. IDOR protection remains intact — the server still uses firestoreUser.householdId for all Firestore operations.

A new test covers the Google sub compat path — 22/22 pass. PR #1269 merged via HKR. Deploy rolling out.

-    if (requirement.householdId !== firestoreUser.householdId) {
+    const isOwnSub = requirement.householdId === googleSub;
+    if (!isOwnSub && requirement.householdId !== firestoreUser.householdId) {
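
Set in the context of the earlier Act III sketch, the patched guard reads roughly like this; the extra googleSub parameter and the surrounding shape are assumptions, and only the two changed lines come from the diff above:

// Patched version of the simplified guard from Act III (shape and names are
// assumptions; the isOwnSub logic mirrors the diff above).
interface AuthzRequirement { householdId: string }
interface FirestoreUser { householdId: string }

function requireAuthz(
  requirement: AuthzRequirement,
  firestoreUser: FirestoreUser,
  googleSub: string, // the caller's own session.user.sub
): void {
  // A user referencing their own Google sub is a valid self-reference. Firestore
  // operations still key off firestoreUser.householdId, so IDOR protection holds.
  const isOwnSub = requirement.householdId === googleSub;
  if (!isOwnSub && requirement.householdId !== firestoreUser.householdId) {
    throw new Error('access denied: household_mismatch');
  }
}

// The new compat-path test essentially asserts that this call no longer throws:
// requireAuthz({ householdId: sub }, { householdId: someUuid }, sub)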

Bugs Fixed

BackendConfig timeoutSec: 30 → 3600 (GCP LB killed RSC stream); requireAuthz household_mismatch (client sends Google sub, server expects UUID)

Files changed:
infrastructure/helm/fenrir-app/values.yaml
development/frontend/src/lib/auth/authz.ts
development/frontend/src/__tests__/auth/authz.test.ts