
Or: why the people you've never heard of are often the reason you still have customers
There's a specific kind of person in modern software companies whose job most leaders don't fully understand until something breaks at 2am and the company's revenue starts leaking out of a hole in production. That person is the Site Reliability Engineer.
If you've ever wondered who actually keeps the website up, the payments flowing, and the app not-on-fire — and why your bill for them keeps growing — this post is for you. Let's walk through what an SRE is, where the role came from, what they do all day, and what happens to a business that needs one but doesn't have one.
Where SRE came from: Google, 2003
In 2003, Google had a problem most companies would envy: it was growing so fast that no amount of hiring traditional system administrators could keep up. So they tried something unusual.
They handed the job of "keeping things running" to a software engineer named Ben Treynor Sloss, and told him to design an operations team the way an engineer would build a product. The result became known as Site Reliability Engineering. Treynor Sloss later described it with a line that has become legendary inside the industry: "SRE is what happens when you ask a software engineer to design an operations function."
Instead of hiring more people to click more buttons, Google built software that did the clicking, and hired engineers to build that software. That single mental shift is the whole origin story.
You can read the full version of the philosophy in Google's own SRE book, which they released for free.
A normal day in the life of an SRE
There isn't really one. But a typical week tends to include a mix of the following.
Watching the dashboards. SREs care intensely about something called an SLO — a Service Level Objective. In plain English: a promise about how reliable a service should be. For example, "the checkout page will work correctly 99.9% of the time." The actual measurement (called an SLI) is tracked continuously. If the service starts missing that promise, alarms go off.
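To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests); the function names and the request counts are hypothetical, chosen only for illustration, and a real system would pull these numbers from its monitoring stack.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the measured fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        # Assumption: with no traffic there were no observed failures.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the promised SLO."""
    return sli >= slo_target


SLO_TARGET = 0.999  # "checkout works correctly 99.9% of the time"

# Hypothetical counts for one measurement window.
sli = availability_sli(good_requests=999_412, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli, SLO_TARGET)}")
```

The SLO is the fixed promise and the SLI is the live measurement. The alarm is, in essence, the moment `meets_slo` turns false.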
Spending the error budget wisely. If you promise 99.9% reliability, that leaves 0.1% room for things to go wrong. That's the error budget. It's the SRE's currency. They use it to negotiate with product teams: "If we ship this risky new feature now, we'll probably burn the rest of our budget for the month. Are we sure?" When the budget runs out, new releases pause until reliability is restored. It's a brilliantly simple way of making reliability a shared business decision instead of a fight.
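The arithmetic behind an error budget is simple enough to sketch in a few lines of Python. This example assumes a time-based budget over a 30-day window; both of those choices, along with the function names and the downtime figure, are illustrative assumptions rather than a standard.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Total downtime the SLO permits within the window."""
    return window * (1.0 - slo_target)


def releases_allowed(budget: timedelta, downtime_so_far: timedelta) -> bool:
    """Risky releases continue only while some budget is left."""
    return downtime_so_far < budget


window = timedelta(days=30)          # assumed measurement window
budget = error_budget(0.999, window)  # roughly 43 minutes for 99.9%

downtime_so_far = timedelta(minutes=30)  # hypothetical outage time this window
remaining = budget - downtime_so_far

print(f"Budget: {budget}, remaining: {remaining}")
print(f"Releases allowed: {releases_allowed(budget, downtime_so_far)}")
```

At 99.9% over 30 days the budget works out to about 43 minutes. That is why a single long outage can freeze releases for the rest of the month.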
Killing toil. Toil is the SRE word for repetitive manual work that scales linearly as your business grows — restarting a server every Tuesday, manually approving a deployment, rotating a credential by hand. Left alone, toil eats the team. SREs find toil and automate it out of existence.
Running incidents. When production breaks, the SRE is the incident commander. They coordinate the response, communicate status, decide when to roll back, and afterwards they write a blameless postmortem — a document that explains what happened, why, and what will change so it doesn't happen again. The "blameless" part is non-negotiable: the goal is to fix systems, not punish people.
Capacity planning, on-call rotations, chaos testing, paging policies, runbooks. All the unglamorous infrastructure work that turns a fragile system into a boring one. And boring, in this profession, is the highest compliment.
The toolset
The SRE shelf in 2026 is well-stocked, and most of it is open source. A few you'll hear repeatedly:
- Prometheus for collecting metrics, and Grafana for visualising them. The default duo for cloud-native monitoring.
- Kubernetes for orchestrating containers — essentially the operating system of the modern data centre.
- Terraform for "infrastructure as code" — describing servers, networks, and databases in text files so the whole environment can be rebuilt from scratch on demand.
- PagerDuty, Opsgenie, or Rootly for incident response and on-call paging.
- OpenTelemetry, Loki, Tempo, Datadog, New Relic, Honeycomb for tracing and observability — the practice of asking your system, after the fact, "what on earth were you doing at 3:14am?"
- GitHub Actions, ArgoCD, Spinnaker for shipping code safely and rolling it back when it misbehaves.
The toolset evolves quickly, but the principle behind it doesn't: if a human is doing it by hand more than twice, write code that does it instead.
What an SRE actually knows
This is where the role surprises people. SREs are software engineers first. They write code, they read code, they review pull requests. But they also carry a second toolbox most engineers don't: a deep understanding of how complex systems fail.
A good SRE will fluently discuss distributed systems, networking, Linux internals, database internals, queueing theory, statistical reasoning about reliability, security fundamentals, and the dark arts of debugging production at 4am with incomplete information. They know how to read a flame graph, how to interpret a P99 latency chart, and why "it's slow" is never a sufficient bug report.
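A short Python sketch shows why a P99 chart says more than "it's slow" or an average. It uses the nearest-rank method for percentiles, which is one common definition among several, and the latency samples are made up for illustration.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: most requests are fast,
# but 15 out of 1000 hit a slow path.
latencies = [80.0] * 985 + [2000.0] * 15

print(f"mean = {mean(latencies):.1f} ms")
print(f"P50  = {percentile(latencies, 50):.0f} ms")
print(f"P99  = {percentile(latencies, 99):.0f} ms")
```

In this made-up data the median request is fast and the mean looks acceptable, yet more than one request in a hundred takes two seconds. Only the tail percentile shows what those users actually experience.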
They are also, often, unusually good communicators. When the site is down, the SRE is the person writing the status updates that the CEO is reading.
Which companies and systems need SREs
Honestly? More than most realise.
The strongest case for hiring SREs exists when any of these are true:
- Downtime is expensive. E-commerce, payments, banking, ad tech, exchanges, SaaS platforms with paying customers: anywhere a minute of outage is a measurable number on a P&L. Industry surveys commonly put large-enterprise downtime at well over a million dollars an hour, and some segments reportedly measure it in millions per minute.
- The system is complex. Microservices, multiple cloud regions, real-time data, mobile + web + API surfaces. Complexity hides failure modes that no one engineer can hold in their head.
- The system is regulated. Healthcare, finance, telecoms, infrastructure. Auditors want evidence of disciplined operations, and SRE practices generate that evidence as a side effect.
- You serve customers who pay for uptime. Anyone with an SLA in their contract.
If you're a five-person startup pre-launch, you don't need a dedicated SRE — your engineers wear the hat part-time. The moment a serious outage would cost you a customer or a contract, the conversation changes.
SRE vs DevOps: the question I get asked the most
This deserves its own paragraph because executives are routinely told these are the same role with different titles. They aren't.
The cleanest way I've seen it expressed: DevOps is a philosophy. SRE is a specific implementation of that philosophy. Or as Google likes to put it, "class SRE implements DevOps."
In practice:
- DevOps is a cultural movement that argues developers and operations should not be separate tribes. Its focus is on shipping software faster and more reliably — pipelines, automation, shared ownership, feedback loops. It is broad and tool-agnostic.
- SRE is a concrete engineering practice with specific rituals: SLOs, error budgets, blameless postmortems, toil budgets, on-call rotations. It is focused on production reliability after the code is shipped.
Imagine a city. DevOps designs the road network so cars can move quickly from one place to another. SRE makes sure the roads stay open during a storm, decides where the traffic lights go, and runs the emergency services when there's a crash.
In 2026, the line is blurrier than it used to be. Many companies use "DevOps Engineer" as the job title and ask the person to do SRE work. The newer role of Platform Engineer sits somewhere in between. The titles matter less than understanding which actual practices your business needs.
What happens to a company that needs SREs and doesn't have them
This is the part executives tend to find most useful, so let's be concrete.
The reliability tax is paid by your product engineers, inefficiently. Without an SRE function, your most expensive software engineers spend their evenings fighting fires instead of building features. Velocity drops. Morale drops. The best ones quietly start interviewing.
Outages get longer and louder. Without practiced incident response, your first 30 minutes of an outage are spent figuring out who should be on the call. Mean time to recovery balloons. Customers find out about your problems before you do, on Twitter.
No one knows what "reliable enough" means. Engineering chases nines because it feels safe; product chases features because it feels urgent. Without an error budget framework, every release feels like a fight and every postmortem feels like a blame game.
Capacity planning is a surprise. You learn how much traffic your system can handle the first time it can't.
Compliance and audit become painful. Auditors ask for evidence of change management, incident response, and capacity planning. Without an SRE-style discipline, this evidence is reconstructed retroactively, badly, the week before the audit.
The bill for cloud infrastructure grows mysteriously. Without someone whose job is to right-size, decommission, and automate, costs can drift 20–40% above where they need to be.
None of these are hypothetical. They are the daily experience of engineering organisations that crossed the complexity threshold before staffing the role.
The takeaway
Site Reliability Engineering isn't a fashionable job title or a fancy synonym for "ops." It is a specific, hard-earned discipline invented because the old way of running large systems didn't scale. The companies that adopt it well end up with a quiet superpower: their systems get more reliable as they get larger, instead of less.
If your business runs on software and your customers notice when it stops, you need this function. Whether you call the people "SREs," "Platform Engineers," "DevOps," or something invented next year matters far less than whether the work itself is being done.
When it is, you won't notice them. That's the point.