This issue chronicles a thread where a single New Relic alert unraveled into a cluster-wide capacity story. We recommend reading it with a coffee in hand. If you are Vishal, the Bottom Line section on Spread 08 is for you.
Fern's first message was four words: “any idea what's causing these?” That was the starting gun for 22 messages, three days, and a trail that would end at the cluster's bin-packing strategy.
This issue is presented as a magazine. It summarizes only what mattered. Full thread receipts are linked in the colophon.
The shape of the bug was almost boring: when a bliss-event pod was killed, there was a small window where Kubernetes still routed traffic to it before yanking it from the service endpoints. Inside that window, nginx returned its HTML 502 page. Apollo, downstream, dutifully tried to parse <html> as JSON and cried foul.
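For readers who want the mechanics in manifest form: the race is between endpoint removal and SIGTERM, and the canonical patch is a preStop sleep that keeps the dying pod serving until the endpoints converge. A minimal sketch, with an assumed name and placeholder image; the fix itself resurfaces in Boo's recommendations at the end of this issue.

```yaml
# A minimal sketch, not the shipped manifest: hold the pod open through
# the endpoint-propagation window so nginx never routes to a corpse.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bliss-event                     # assumed name
spec:
  selector:
    matchLabels: { app: bliss-event }
  template:
    metadata:
      labels: { app: bliss-event }
    spec:
      terminationGracePeriodSeconds: 45   # must exceed the preStop sleep
      containers:
        - name: app
          image: bliss-event:latest       # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]  # keep serving while endpoints update
```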
Stitch — the federation gateway — caught most of it (256 errors in six hours), because every guest-site query passed through it on the way to event. Event itself saw 146 in the same window. The affected operations read like a roll call of the guest experience: GuestList, GetEventCTAButtonsData, GetGuestSiteSchedule, GetEventSession.
This was not a code bug. It was infrastructure telling a story through error messages.
Fern asked the obvious follow-up: if pods are getting killed, why? Memory looked fine. What wasn't fine was CPU. The HPA target sat exactly at 80%, pods were running hot at ~400m against a 500m request, and peaks were hitting a full 1.25 cores with no CPU limit in sight. Node.js had noticed: event-loop-blocked log lines were running 50–130 per hour for an entire day.
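For orientation, here is the shape Fern was looking at, as a minimal sketch of an autoscaling/v2 HPA. The 80% target, the 500m request, and the absent CPU limit are from the thread; the names and replica bounds are illustrative assumptions.

```yaml
# Sketch of the pre-incident shape; names and bounds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bliss-event                # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bliss-event
  minReplicas: 24                  # illustrative; the thread saw counts of 24-36
  maxReplicas: 36
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # the 80% target from the thread
```

The detail that matters: utilization is measured against the request, not the limit, so ~400m on a 500m request is already sitting on the 80% target, and with no limit in place nothing caps the 1.25-core bursts that shove it past the threshold.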
That alone wasn't the cliff — it was merely the diving board. The cliff was that the cluster couldn't schedule replacements. FailedScheduling events were flooding the log: 66 of 85 nodes reporting "Insufficient cpu". There was literally nowhere to put a new pod. When liveness probes timed out, the replacements sat pending. Meanwhile, cpunew nodes kept politely going NodeNotReady — twenty of them in a day — taking their pods with them.
Scaling activity told the final part of the story: three different ReplicaSets cycling within hours, replica counts bouncing 24 → 27 → 31 → 36 → 31 → 28. The thrash itself was what opened the window for the 502s. Every termination created another moment where HTML could pretend to be JSON.
The cpunew pool had all the headroom on paper — 69 of a possible 85 nodes. But scaleDownUtilizationThreshold sat at 0.70, meaning any node quietly doing 69% of its job qualified for eviction. With a 10-minute unneeded timer and a 10-minute post-add cooldown, the autoscaler had caught itself in a stable loop: remove, pressure, re-add, remove, pressure, re-add.
Every cycle produced the same tiny window. Every window produced another HTML page pretending to be JSON.
Boo's first recommendation had been to raise the threshold. Fern read it and replied, simply: this is wrong, we should reduce this, not increase it.
She was right. The threshold is the bar a node must sit below to become eviction-eligible. Raising it doesn't protect nodes — it condemns more of them. Lowering it to 0.5 is what actually shields most of that 45–66%-utilized band from being yanked away mid-burst.
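In knob form, this is a sketch using the upstream cluster-autoscaler flag names; on AKS the same keys are set through the cluster autoscaler profile, and 0.5 happens to be the upstream default.

```yaml
# Sketch of the three knobs in play, written as upstream cluster-autoscaler
# args; only the threshold changes, the two timers stay at 10 minutes.
command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.5  # was 0.70; nodes BELOW this become eviction-eligible
  - --scale-down-unneeded-time=10m          # how long a node must sit below it first
  - --scale-down-delay-after-add=10m        # cooldown after any scale-up
```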
Policy: Boo acknowledges corrections in-line and louder than the original mistake. This page exists because of that policy.
| SKU | Nodes | vCPU bought | Usable vCPU | Efficiency | DaemonSet pods |
|---|---|---|---|---|---|
| D2s_v3 (today) | 69 | 138 | 86.9 | 63% | 759 |
| D4s_v3 (recommended) | 35 | 140 | 110.6 | 79% | 385 |
| D8s_v3 | 18 | 144 | 127.1 | 88% | 198 |
Every node carries ~640m CPU of daemonset overhead. On a 2-vCPU node, that overhead is 34% of what we bought. On a 4-vCPU node, it shrinks to 21%. We are paying the same vCPU bill either way — but smaller nodes burn more of it on per-node duplication. The recommendation: move the pool to D4s_v3. Same spend, +24 usable CPU, half as many nodes that can go NotReady, and six bliss-event pods per node instead of two.
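The pods-per-node claim falls straight out of the table: usable vCPU per node, divided by the 500m request each bliss-event pod makes.

```latex
% Usable vCPU per node (from the table) over the 500m per-pod request:
\left\lfloor \frac{86.9 / 69}{0.5} \right\rfloor = 2 \text{ pods per D2s\_v3 node},
\qquad
\left\lfloor \frac{110.6 / 35}{0.5} \right\rfloor = 6 \text{ pods per D4s\_v3 node}
```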
Projected cost impact: +3–9% on node spend. For the stability it buys, that is a bargain.
The biggest drop: JSON parse errors on event_service fell from 956 to 407 in matched windows. That number is almost a direct proxy for guest-site query reliability — the thing the cluster was quietly stealing from users.
A footnote with teeth: the "after" window includes a separate bliss-travel liveness incident (Apr 16, 04:09–14:30 UTC) which would have inflated pod kills. The true improvement is better than the raw numbers show.
New findings surfaced on Apr 16:
① The ceremony api HPA is thrashing. Replica count bouncing 16→24→20→18→24→21→24→22→21→18→21→18→24 in under an hour. No behavior block: the HPA is running on Kubernetes defaults — no stabilization window, 100%-per-period scale-down allowed. 200+ Unhealthy events on api pods in three hours.
② event_service is calling ceremony's api through the public URL. SERVICE_API=https://withjoy.com/services/api/, with a trailing slash that the code helpfully doubles up when it appends /graphql. Every call traverses CloudFront → nginx → api pod, despite both services living inside the same AKS cluster. It should be http://api.ceremony.svc.cluster.local:9000.
③ bliss-event pods are dying from event-loop blocks independent of any churn. 31 “EventLoop blocked” log lines in a single 10-minute bucket, all correlating with liveness probe timeouts.
Boo's next recommendations — now on Fern's docket: add a behavior block with a 5-minute stabilization window to the api HPA, add a preStop: sleep 15 to both api and bliss-event deployments, switch SERVICE_API to the in-cluster URL, and widen bliss-event's liveness probe.
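In manifest form, these are sketches of each change's shape, not the shipped values: the 5-minute window, the 15-second sleep, and the in-cluster URL come from the thread, while the probe numbers, paths, ports, and scale-down policy are placeholder assumptions.

```yaml
# 1. api HPA: add a behavior block so scale-down stops free-falling.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # the 5-minute window
    policies:
      - type: Percent
        value: 25                     # placeholder; the default allows 100% per period
        periodSeconds: 60

# 2. api and bliss-event deployments: drain politely before dying.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "15"]        # keep serving while endpoints update

# 3. event_service: reach ceremony's api over cluster DNS, not CloudFront.
env:
  - name: SERVICE_API
    value: http://api.ceremony.svc.cluster.local:9000   # note: no trailing slash

# 4. bliss-event: a liveness probe that tolerates brief event-loop stalls.
livenessProbe:
  httpGet:
    path: /healthz                    # assumed path and port
    port: 3000
  timeoutSeconds: 5                   # placeholder widening; exact values were
  periodSeconds: 15                   # still open in the thread
  failureThreshold: 4
```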
Published by The Boo. Automatically re-rendered when the source thread grows by ten messages or more.
The Boo is a magazine about Joy's internal threads, drafted by Boo, edited by whoever catches the corrections. Typography: Fraunces (display) and Inter (text), served via Google Fonts. Monospace: JetBrains Mono. Printed on virtual paper.