The Boo · A Joy Boo Magazine
Issue № 001 · April 2026
The cluster that couldn't stop churning
How a polite New Relic ping led Fern and Boo down a three-day rabbit hole of HTML-as-JSON, CPU starvation, and an autoscaler at war with itself — and what finally made the noise go away.
Thread
#e-alerts-event_service
Dates
Apr 14 → 16, 2026
Messages
22
Reading Time
6 min
In this issue: Ten spreads, one root cause, one embarrassing correction
Publisher's note

This issue chronicles a thread where a single New Relic alert unraveled into a cluster-wide capacity story. We recommend reading it with a coffee in hand. If you are Vishal, the Bottom Line section on Spread 08 is for you.

01 · The Alert

A familiar orange square in a noisy channel.

On April 14 at 21:29 UTC, New Relic fired what looked like another baseline error spike. The facet was unmistakable: Unexpected token '<', "<html>"… is not valid JSON. Fern pinged Boo. Boo dug in.
What the alert actually said: event_service is receiving HTML where it expects JSON — and it has been, quietly, for at least a week.

Fern's first message was four words: “any idea what's causing these?” That was the starting gun for 22 messages, three days, and a trail that would end at the cluster's bin-packing strategy.

“Some spike in unexpected errors.”
— NR alert, politely understating the situation

This issue is presented as a magazine. It summarizes only what mattered. Full thread receipts are linked in the colophon.

02 · First diagnosis

HTML masquerading as JSON.

A ServerParseError is never really about parsing. It's about who sent what, and why they sent the wrong thing.
The ingress was returning a 502 page while the pods were still listed as alive.

The shape of the bug was almost boring: when a bliss-event pod was killed, there was a small window where Kubernetes still routed traffic to it before yanking it from the service endpoints. Inside that window, nginx returned its HTML 502 page. Apollo, downstream, dutifully tried to parse <html> as JSON and cried foul.

Stitch, the federation gateway, caught most of it (256 errors in six hours), because every guest-site query passed through it on the way to event. Event itself saw 146 in the same window. The affected operations read like a roll call of the guest experience: GuestList, GetEventCTAButtonsData, GetGuestSiteSchedule, GetEventSession.

This was not a code bug. It was infrastructure telling a story through error messages.

03 · Pods are choking

Not OOM. It's starvation.

Fern asked the obvious follow-up: if pods are getting killed, why. Memory looked fine. What wasn't fine was CPU. The HPA target sat exactly at 80%, pods were running hot at ~400m against a 500m request, and peaks were hitting a full 1.25 cores with no CPU limit in sight. Node.js had noticed: event-loop-blocked log lines were running 50–130 per hour for an entire day.

That alone wasn't the cliff — it was merely the diving board. The cliff was that the cluster couldn't schedule replacements. FailedScheduling events were flooding the log: “66 Insufficient cpu” out of 85 nodes. There was literally nowhere to put a new pod. When liveness probes timed out, the replacements sat pending. Meanwhile, cpunew nodes kept politely going NodeNotReady — twenty of them in a day — taking their pods with them.

The cluster was at capacity, and it didn't know it.

Scaling activity told the final part of the story: three different ReplicaSets cycling within hours, replica counts bouncing 24 → 27 → 31 → 36 → 31 → 28. The thrash itself was what opened the window for the 502s. Every termination created another moment where HTML could pretend to be JSON.
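
For readers who want the configuration on the page, here is a minimal sketch of the shape Boo found, assuming the workload is a Deployment named bliss-event; the 80% target and the 500m request are the thread's numbers, the replica bounds are illustrative.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bliss-event                  # name assumed from the pod prefix in the thread
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bliss-event
  minReplicas: 24                    # illustrative; observed replica counts ranged 24–36
  maxReplicas: 36
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80     # the 80% target from the thread
# On the Deployment's container, the request the HPA measures against:
#   resources:
#     requests:
#       cpu: 500m                    # actual usage ~400m steady, 1.25 cores at peak
#     # no CPU limit set, so peaks ride straight into node contention

Note the arithmetic: 80% of a 500m request is 400m, exactly where the pods were already sitting, so the HPA hovered at its own trigger point instead of settling.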

04 · The autoscaler, itself

An autoscaler at war with itself.

Fern suspected the autoscaler was involved. It wasn't just involved — it was the antagonist.
938
Scale-down events / 24h
318
Scale-up events / 24h
0.70
Scale-down threshold

The cpunew pool had all the headroom on paper: 69 of a possible 85 nodes. But scaleDownUtilizationThreshold sat at 0.70, meaning any node quietly doing 69% of its job qualified for eviction. With a 10-minute unneeded timer and a 10-minute post-add cooldown, the autoscaler had caught itself in a stable loop: remove, pressure, re-add, remove, pressure, re-add.

“I caught it actively removing nodes with bliss-event pods on them — draining a node at 66% util that had a bliss-event pod.”

Every cycle produced the same tiny window. Every window produced another HTML page pretending to be JSON.
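
For reference, the three knobs above as they would read on the cluster's autoscaler profile. A sketch, assuming the camelCase field names AKS exposes on its managed autoscaler profile; the values are the thread's.

autoScalerProfile:
  scaleDownUtilizationThreshold: "0.70"   # any node below 70% utilization is an eviction candidate
  scaleDownUnneededTime: "10m"            # how long a node must look unneeded before it is drained
  scaleDownDelayAfterAdd: "10m"           # cooldown after a scale-up before scale-down resumes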

05 · The correction

Correction: Boo had it backwards.

Corrigendum
A magazine without a corrections section is a magazine lying to you. Here is ours — printed large, on purpose.
Boo recommended raising the scaleDownUtilizationThreshold from 0.7 to 0.85 to “stop removing nodes that are actually needed during bursts.”

Fern read it and replied, simply: this is wrong, we should reduce this, not increase it.

She was right. The threshold is the bar a node must sit below to become eviction-eligible. Raising it doesn't protect nodes; it condemns more of them. Lowering it to 0.5 is what actually shields the nodes running above 50% utilization, like the 66%-utilized one caught mid-drain, from being yanked away mid-burst.

Fern: one. Boo: zero.
(Thanks, Fern.)

Policy: Boo acknowledges corrections in-line and louder than the original mistake. This page exists because of that policy.

06 · The hidden tax

The node-size audit.

Fern's second question was structural: is our instance class even right? Short answer: no. D2s_v3 is too small — and the math makes the case loudly.
SKU · Nodes · vCPU bought · Usable CPU (cores) · Efficiency · DaemonSet pods
D2s_v3 (today) · 69 · 138 · 86.9 · 63% · 759
D4s_v3 (recommended) · 35 · 140 · 110.6 · 79% · 385
D8s_v3 · 18 · 144 · 127.1 · 88% · 198

Every node carries ~640m CPU of daemonset overhead. On a 2-vCPU node, that overhead is 34% of what we bought. On a 4-vCPU node, it shrinks to 21%. We are paying the same vCPU bill either way — but smaller nodes burn more of it on per-node duplication. The recommendation: move the pool to D4s_v3. Same spend, +24 usable CPU, half as many nodes that can go NotReady, and six bliss-event pods per node instead of two.
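
The pods-per-node claim falls straight out of the table and the 500m request from Spread 03:

D2s_v3: 86.9 usable cores ÷ 69 nodes ≈ 1.26 cores per node → room for two 500m bliss-event pods
D4s_v3: 110.6 usable cores ÷ 35 nodes ≈ 3.16 cores per node → room for six 500m bliss-event pods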

We are losing 34% of purchased compute to per-node overhead.

07 · The fix ships

Fern pushes the change set.

Six hours after the correction, Fern shipped a configuration change to the prod-alpha autoscaler. Four knobs moved. All of them mattered.

Projected cost impact: +3–9% on node spend. For the stability it buys, that is a bargain.
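
Only one of the four knobs is spelled out in the thread, the one from the correction on Spread 05. In the same profile notation as Spread 04, a sketch:

autoScalerProfile:
  scaleDownUtilizationThreshold: "0.50"   # down from 0.70: a node doing more than half its job is no longer an eviction candidate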

08 · The bottom line
For Vishal: the bottom line

Twenty-four hours later, the needle moved.

We compared the 22 hours before Fern's change to the 22 hours after. Every signal we care about is down by double digits.
−57%

The biggest drop: JSON parse errors on event_service fell from 956 to 407 in matched windows. That number is almost a direct proxy for guest-site query reliability — the thing the cluster was quietly stealing from users.

Scale-down events
1,231 −34%
FailedScheduling
3,308 −47%
JSON parse errors
407 −57%
bliss-event pod kills
649 −22%

A footnote with teeth: the "after" window includes a separate bliss-travel liveness incident (Apr 16, 04:09–14:30 UTC), which inflated the pod-kill count. The true improvement is better than the raw numbers show.

Four knobs. Three days. Fifty-seven percent fewer guests getting HTML where they expected JSON.

09 · The plot thickens

“I'm not convinced.”

Good leaders don't take a 57% win and walk away. Fern pushed back: node churn explains most of it, but not all of it. She was right again.

New findings surfaced on Apr 16:

The ceremony api HPA is thrashing: replica count bouncing 16→24→20→18→24→21→24→22→21→18→21→18→24 in under an hour. There is no behavior block; the HPA is running on Kubernetes defaults, meaning no stabilization window and scale-down of up to 100% per period. 200+ Unhealthy events on api pods in three hours.

event_service is calling ceremony's api through the public URL. SERVICE_API=https://withjoy.com/services/api/, with a trailing slash that the code helpfully doubles up when it appends /graphql. Every call traverses CloudFront → nginx → api pod, despite both services living inside the same AKS cluster. It should be http://api.ceremony.svc.cluster.local:9000.

bliss-event pods are dying from event-loop blocks independent of any churn. 31 “EventLoop blocked” log lines in a single 10-minute bucket, all correlating with liveness probe timeouts.

The autoscaler story is real. It is also not the whole story.

Boo's next recommendations — now on Fern's docket: add a behavior block with a 5-minute stabilization window to the api HPA, add a preStop: sleep 15 to both api and bliss-event deployments, switch SERVICE_API to the in-cluster URL, and widen bliss-event's liveness probe.
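
In manifest form, those recommendations sketch out roughly like this. The 5-minute window, the 15-second sleep, and the in-cluster URL come from the thread; the resource names and the api HPA's bounds and target are assumptions, and the liveness-probe numbers are left out because the thread doesn't pin them.

# 1. Give ceremony's api HPA a behavior block so scale-down waits out the noise.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                               # name assumed
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 16                         # illustrative; observed counts bounced between 16 and 24
  maxReplicas: 24
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80          # illustrative; the thread doesn't state api's target
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300     # the 5-minute window Boo recommended
---
# 2. Container-spec fragment for both api and bliss-event: hold terminating pods open long
#    enough for endpoint removal to propagate, closing the 502 window from Spread 02.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "15"]
---
# 3. Container-spec fragment for event_service: talk to ceremony inside the cluster rather
#    than out through CloudFront and nginx.
env:
  - name: SERVICE_API
    value: "http://api.ceremony.svc.cluster.local:9000"   # no trailing slash, so /graphql isn't doubled
# 4. bliss-event's liveness probe also gets widened; the thread doesn't pin the new values.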

10 · Open loops

What we still owe the cluster.

Things the thread did not resolve. Ordered by who should be nudging whom.

Published by The Boo. Automatically re-rendered when the source thread grows by ten messages or more.

Colophon & credits.

The Boo is a magazine about Joy's internal threads, drafted by Boo, edited by whoever catches the corrections. Typography: Fraunces (display) and Inter (text), served via Google Fonts. Monospace: JetBrains Mono. Printed on virtual paper.

Source
Thread in #e-alerts-event_service
Published
April 16, 2026
Issue №
001
Participants
Fern (Platform)
Boo (me)
New Relic (alerter)
Classification
Investigation · Customer-impact
Status
Active — open loops remain