This issue chronicles a thread where a single New Relic alert unraveled into a cluster-wide capacity story. We recommend reading it with a coffee in hand. If you are Vishal, the Bottom Line section on Spread 08 is for you.
Fern's first message was four words: “any idea what's causing these?” That was the starting gun for 22 messages, three days, and a trail that would end at the cluster's bin-packing strategy.
This issue is presented as a magazine. It summarizes only what mattered. Full thread receipts are linked in the colophon.
The shape of the bug was almost boring: when a bliss-event pod was killed, there was a small window where Kubernetes still routed traffic to it before yanking it from the service endpoints. Inside that window, nginx returned its HTML 502 page. Apollo, downstream, dutifully tried to parse <html> as JSON and cried foul.
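For readers who want the mechanics in manifest form: the race is between endpoint removal and SIGTERM, and the canonical patch is a preStop sleep that keeps the dying pod serving until the endpoints converge. A minimal sketch, with an assumed name and placeholder image; the fix itself resurfaces in Boo's recommendations at the end of this issue.

```yaml
# A minimal sketch, not the shipped manifest: hold the pod open through
# the endpoint-propagation window so nginx never routes to a corpse.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bliss-event                     # assumed name
spec:
  selector:
    matchLabels: { app: bliss-event }
  template:
    metadata:
      labels: { app: bliss-event }
    spec:
      terminationGracePeriodSeconds: 45   # must exceed the preStop sleep
      containers:
        - name: app
          image: bliss-event:latest       # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]  # keep serving while endpoints update
```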
Stitch — the federation gateway — caught most of it (256 errors in six hours), because every guest-site query passed through it on the way to event. Event itself saw 146 in the same window. The affected operations read like a roll call of the guest experience: GuestList, GetEventCTAButtonsData, GetGuestSiteSchedule, GetEventSession.
This was not a code bug. It was infrastructure telling a story through error messages.
Fern asked the obvious follow-up: if pods are getting killed, why? Memory looked fine. What wasn't fine was CPU. The HPA target sat exactly at 80%, pods were running hot at ~400m against a 500m request, and peaks were hitting a full 1.25 cores with no CPU limit in sight. Node.js had noticed: event-loop-blocked log lines were running 50–130 per hour for an entire day.
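For orientation, here is the shape Fern was looking at, as a minimal sketch of an autoscaling/v2 HPA. The 80% target, the 500m request, and the absent CPU limit are from the thread; the names and replica bounds are illustrative assumptions.

```yaml
# Sketch of the pre-incident shape; names and bounds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bliss-event                # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bliss-event
  minReplicas: 24                  # illustrative; the thread saw counts of 24-36
  maxReplicas: 36
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # the 80% target from the thread
```

The detail that matters: utilization is measured against the request, not the limit, so ~400m on a 500m request is already sitting on the 80% target, and with no limit in place nothing caps the 1.25-core bursts that shove it past the threshold.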
That alone wasn't the cliff — it was merely the diving board. The cliff was that the cluster couldn't schedule replacements. FailedScheduling events were flooding the log: 66 of 85 nodes reporting "Insufficient cpu". There was literally nowhere to put a new pod. When liveness probes timed out, the replacements sat pending. Meanwhile, cpunew nodes kept politely going NodeNotReady — twenty of them in a day — taking their pods with them.
Scaling activity told the final part of the story: three different ReplicaSets cycling within hours, replica counts bouncing 24 → 27 → 31 → 36 → 31 → 28. The thrash itself was what opened the window for the 502s. Every termination created another moment where HTML could pretend to be JSON.
The cpunew pool had all the headroom on paper — 69 of a possible 85 nodes. But scaleDownUtilizationThreshold sat at 0.70, meaning any node quietly doing 69% of its job qualified for eviction. With a 10-minute unneeded timer and a 10-minute post-add cooldown, the autoscaler had caught itself in a stable loop: remove, pressure, re-add, remove, pressure, re-add.
Every cycle produced the same tiny window. Every window produced another HTML page pretending to be JSON.
Boo's first recommendation had been to raise the threshold. Fern read it and replied, simply: this is wrong, we should reduce this, not increase it.
She was right. The threshold is the bar a node must sit below to become eviction-eligible. Raising it doesn't protect nodes — it condemns more of them. Lowering it to 0.5 is what actually shields most of that 45–66%-utilized band from being yanked away mid-burst.
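In knob form, this is a sketch using the upstream cluster-autoscaler flag names; on AKS the same keys are set through the cluster autoscaler profile, and 0.5 happens to be the upstream default.

```yaml
# Sketch of the three knobs in play, written as upstream cluster-autoscaler
# args; only the threshold changes, the two timers stay at 10 minutes.
command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.5  # was 0.70; nodes BELOW this become eviction-eligible
  - --scale-down-unneeded-time=10m          # how long a node must sit below it first
  - --scale-down-delay-after-add=10m        # cooldown after any scale-up
```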
Policy: Boo acknowledges corrections in-line and louder than the original mistake. This page exists because of that policy.
| SKU | Nodes | vCPU bought | Usable vCPU | Efficiency | DaemonSet pods |
|---|---|---|---|---|---|
| D2s_v3 (today) | 69 | 138 | 86.9 | 63% | 759 |
| D4s_v3 (recommended) | 35 | 140 | 110.6 | 79% | 385 |
| D8s_v3 | 18 | 144 | 127.1 | 88% | 198 |
Every node carries ~640m CPU of daemonset overhead. On a 2-vCPU node, that overhead is 34% of what we bought. On a 4-vCPU node, it shrinks to 21%. We are paying the same vCPU bill either way — but smaller nodes burn more of it on per-node duplication. The recommendation: move the pool to D4s_v3. Same spend, +24 usable CPU, half as many nodes that can go NotReady, and six bliss-event pods per node instead of two.
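The pods-per-node claim falls straight out of the table: usable vCPU per node, divided by the 500m request each bliss-event pod makes.

```latex
% Usable vCPU per node (from the table) over the 500m per-pod request:
\left\lfloor \frac{86.9 / 69}{0.5} \right\rfloor = 2 \text{ pods per D2s\_v3 node},
\qquad
\left\lfloor \frac{110.6 / 35}{0.5} \right\rfloor = 6 \text{ pods per D4s\_v3 node}
```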
Projected cost impact: +3–9% on node spend. For the stability it buys, that is a bargain.
The biggest drop: JSON parse errors on event_service fell from 956 to 407 in matched windows. That number is almost a direct proxy for guest-site query reliability — the thing the cluster was quietly stealing from users.
A footnote with teeth: the "after" window includes a separate bliss-travel liveness incident (Apr 16, 04:09–14:30 UTC) which would have inflated pod kills. The true improvement is better than the raw numbers show.
New findings surfaced on Apr 16:
① The ceremony api HPA is thrashing. Replica count bouncing 16→24→20→18→24→21→24→22→21→18→21→18→24 in under an hour. No behavior block: the HPA is running on Kubernetes defaults — no stabilization window, 100%-per-period scale-down allowed. 200+ Unhealthy events on api pods in three hours.
② event_service is calling ceremony's api through the public URL. SERVICE_API=https://withjoy.com/services/api/, with a trailing slash that the code helpfully doubles up when it appends /graphql. Every call traverses CloudFront → nginx → api pod, despite both services living inside the same AKS cluster. It should be http://api.ceremony.svc.cluster.local:9000.
③ bliss-event pods are dying from event-loop blocks independent of any churn. 31 “EventLoop blocked” log lines in a single 10-minute bucket, all correlating with liveness probe timeouts.
Boo's next recommendations — now on Fern's docket: add a behavior block with a 5-minute stabilization window to the api HPA, add a preStop: sleep 15 to both api and bliss-event deployments, switch SERVICE_API to the in-cluster URL, and widen bliss-event's liveness probe.
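In manifest form, these are sketches of each change's shape, not the shipped values: the 5-minute window, the 15-second sleep, and the in-cluster URL come from the thread, while the probe numbers, paths, ports, and scale-down policy are placeholder assumptions.

```yaml
# 1. api HPA: add a behavior block so scale-down stops free-falling.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # the 5-minute window
    policies:
      - type: Percent
        value: 25                     # placeholder; the default allows 100% per period
        periodSeconds: 60

# 2. api and bliss-event deployments: drain politely before dying.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "15"]        # keep serving while endpoints update

# 3. event_service: reach ceremony's api over cluster DNS, not CloudFront.
env:
  - name: SERVICE_API
    value: http://api.ceremony.svc.cluster.local:9000   # note: no trailing slash

# 4. bliss-event: a liveness probe that tolerates brief event-loop stalls.
livenessProbe:
  httpGet:
    path: /healthz                    # assumed path and port
    port: 3000
  timeoutSeconds: 5                   # placeholder widening; exact values were
  periodSeconds: 15                   # still open in the thread
  failureThreshold: 4
```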
Published by The Boo. Automatically re-rendered when the source thread grows by ten messages or more.
The Boo is a magazine about Joy's internal threads, drafted by Boo, edited by whoever catches the corrections. Typography: Fraunces (display) and Inter (text), served via Google Fonts. Monospace: JetBrains Mono. Printed on virtual paper.