On-call rotations that do not burn out a four-person team

Twelve retainers, four engineers, no 3am heroics. How we run on-call without it eating the rest of the work.

Mathias Korsgaard Founding engineer / 2025-12-04 · 6 min · operations · team

We are a small team. We cannot do follow-the-sun. We cannot afford a dedicated SRE. We also cannot tell a customer running a fleet dispatch system that we will get to it in the morning. The way we square this is mostly process, very little technology.

§ 1.0 — Tiered response, in writing

Every retainer has a written tier:

  • Tier A — system down or data at risk. 30-minute response, 24/7.
  • Tier B — degraded but functional. 2-hour response, 08:00–22:00 CET.
  • Tier C — questions, change requests. Next business day.

Roughly 70% of pages are Tier C and roughly 5% are Tier A, which leaves about a quarter in Tier B.
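Because the tiers are written down, they can also be expressed as a small function that turns a page into a respond-by time. The following is a minimal sketch, not our actual tooling. The names `respond_by` and `next_business_day` are hypothetical, and several details are assumptions that the tier definitions do not spell out: CET is treated as a fixed UTC+1 offset, a Tier B page outside the window starts its clock at the next 08:00, business days are Monday to Friday with no holidays, and a Tier C response is due by 17:00.

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: CET as a fixed UTC+1 offset; daylight saving time is ignored.
CET = timezone(timedelta(hours=1), "CET")

# Tier B window from the written tier; the 17:00 Tier C cutoff is an assumption.
WINDOW_OPEN, WINDOW_CLOSE = time(8, 0), time(22, 0)
BUSINESS_DAY_END = time(17, 0)


def next_business_day(day: date) -> date:
    """Next Monday-to-Friday date after `day` (public holidays are not modelled)."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        nxt += timedelta(days=1)
    return nxt


def respond_by(tier: str, paged_at: datetime) -> datetime:
    """Latest acceptable first response for a page, expressed in CET.

    `paged_at` should be timezone-aware.
    """
    local = paged_at.astimezone(CET)
    if tier == "A":
        # 30-minute response, around the clock.
        return local + timedelta(minutes=30)
    if tier == "B":
        # 2-hour response; outside 08:00-22:00 the clock starts at the next 08:00.
        if local.time() < WINDOW_OPEN:
            start = datetime.combine(local.date(), WINDOW_OPEN, tzinfo=CET)
        elif local.time() >= WINDOW_CLOSE:
            tomorrow = local.date() + timedelta(days=1)
            start = datetime.combine(tomorrow, WINDOW_OPEN, tzinfo=CET)
        else:
            start = local
        return start + timedelta(hours=2)
    if tier == "C":
        # Next business day, by the assumed end-of-day cutoff.
        due = next_business_day(local.date())
        return datetime.combine(due, BUSINESS_DAY_END, tzinfo=CET)
    raise ValueError(f"unknown tier: {tier!r}")
```

Under these assumptions, a Tier B page at 23:15 CET is due at 10:00 CET the next morning, while a Tier A page at the same time is due at 23:45.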

§ 2.0 — The engineer on-call is not also building

The on-call engineer for the week has no scheduled feature work. They run audits, write documentation, do code review, and respond to whatever comes in. This is the single biggest change we made and it cut burnout more than any tooling improvement.

§ 3.0 — Pages must end in a writeup

Every page, even a Tier C, ends with a one-paragraph postmortem in a shared log. The writeup takes ten minutes. After eighteen months we can search this log and see that 40% of all our Tier A pages have come from three specific subsystems on two specific clients.
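The kind of question the log answers can be sketched in a few lines. This is an illustration only: the `Writeup` record and its fields are a hypothetical shape for a log entry, not the format of our real log, and `top_sources` is a made-up helper name.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Writeup:
    """Hypothetical shape of one postmortem entry in the shared log."""
    tier: str        # "A", "B" or "C"
    client: str
    subsystem: str
    summary: str     # the one-paragraph postmortem


def top_sources(log, tier="A", n=3):
    """Return the n most frequent (client, subsystem) pairs for a tier,
    plus the share of that tier's pages they account for."""
    counts = Counter((w.client, w.subsystem) for w in log if w.tier == tier)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total
    return top, share
```

Calling `top_sources(log)` on a list of `Writeup` entries gives the three worst (client, subsystem) pairs for Tier A and the fraction of Tier A pages they caused, which is the number worth watching over time.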

§ 4.0 — Saying no on the way in

We do not sign retainers for systems we have not audited. The audit is non-negotiable and is the moment we surface the rollback drills, the single-author critical paths, the backups that have never been restored. By the time the retainer starts, we know what is going to wake us up.