Volkan Gülen
26.03.25 · 6 min read
Infrastructure Distributed Systems HA

High Availability isn't a feature — it's a mindset

Pull up your architecture diagram. Somewhere in it, there’s a component that could take everything down right now. Probably more than one.

The interesting part? Every time you fix one, another surfaces somewhere else. This post is about that pattern: what a Single Point of Failure actually is, how it moves, and where it ends up when you’ve “fixed everything.”

Spoiler: it ends up outside your system entirely.


What’s a SPOF, Anyway?

A Single Point of Failure (SPOF) is any component whose failure takes the whole system down. It doesn’t need to fail often — just once, at exactly the wrong time.

The test is simple: “What happens if X disappears right now?”

If the answer involves users seeing errors, your system being unreachable, or you reaching for the incident template — that’s a SPOF.

Now let’s watch one move.
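The "what if X disappears" test can even be automated: model the system as a dependency graph, delete each node in turn, and check whether users can still reach what they need. A minimal sketch — the component names and topology are illustrative, not from any real system:

```python
# Find SPOFs by deleting each node and re-checking reachability.
# Component names here are illustrative placeholders.

def reachable(graph, start, target, removed):
    """Walk the graph from start toward target, skipping the removed node."""
    seen, queue = {start}, [start]
    while queue:
        node = queue.pop()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def find_spofs(graph, start, target):
    """Every node (other than the endpoints) whose removal cuts the path."""
    nodes = set(graph) | {n for deps in graph.values() for n in deps}
    return sorted(
        n for n in nodes
        if n not in (start, target)
        and not reachable(graph, start, target, removed=n)
    )

# Level-2-style topology: one LB, two app servers, one DB.
graph = {
    "user": ["lb"],
    "lb": ["app1", "app2"],
    "app1": ["db"],
    "app2": ["db"],
}
print(find_spofs(graph, "user", "db"))  # -> ['lb']
```

Either app server can vanish and the path survives, so only the load balancer shows up (the database, being the target itself, is trivially a SPOF too).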


Level 1: The Honest Starting Point

Most systems start here. One app server, one database, total transparency about the risk:

graph LR
    User([User]) --> App[App Server]
    App --> DB[(Database)]

    classDef spof fill:#fca5a5,stroke:#dc2626,color:#1a1a1a
    class App,DB spof

App server dies? Down. Database dies? Down. At least there’s no pretending otherwise.


Level 2: “I’ll Just Add More App Servers”

Smart move. But traffic needs somewhere to land, so you add a load balancer:

graph LR
    User([User]) --> LB[Load Balancer]
    LB --> App1[App Server]
    LB --> App2[App Server]
    App1 & App2 --> DB[(Database)]

    classDef spof fill:#fca5a5,stroke:#dc2626,color:#1a1a1a
    class LB,DB spof

The app layer is now redundant — a node dies, the LB routes around it. Progress.

But look what happened: the SPOF didn’t disappear. It moved. The load balancer is now a single point of failure, and so is the database. You fixed one and revealed two.


Level 3: Replicate the Database

A primary-replica setup handles the DB:

graph LR
    User([User]) --> LB[Load Balancer]
    LB --> App1[App Server]
    LB --> App2[App Server]
    App1 & App2 --> DBP[(DB Primary)]
    DBP -.->|replication| DBR[(DB Replica)]

    classDef spof fill:#fca5a5,stroke:#dc2626,color:#1a1a1a
    class LB spof

Primary fails? Replica promotes. Data survives.

The load balancer is now your only SPOF. It’s doing important work. It’s also alone.
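One caveat on "replica promotes": promotion shouldn't fire on a single missed heartbeat, or a network blip turns into an unnecessary failover. A toy decision function makes the idea concrete — assuming the actual promotion, plus fencing and split-brain protection, is handled by real tooling (Patroni, repmgr, or a managed database service):

```python
# Toy failover decision: promote only after `threshold` consecutive
# failed health checks on the primary. Real failover tooling also
# fences the old primary, which this sketch deliberately ignores.

def decide(health_history, threshold=3):
    """health_history: list of booleans, one per check, oldest first."""
    consecutive = 0
    for ok in health_history:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= threshold:
            return "promote"
    return "hold"

print(decide([True, False, True, False, False]))  # -> hold
print(decide([False, False, False]))              # -> promote
```

Intermittent failures reset the counter; only a sustained outage triggers promotion.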


Level 4: Redundant Load Balancers

Pair your load balancers — one active, one on hot standby, with a floating IP that switches over on failure:

graph LR
    User([User]) --> LB1[LB Active]
    User --> LB2[LB Standby]
    LB1 & LB2 --> App1[App Server]
    LB1 & LB2 --> App2[App Server]
    App1 & App2 --> DBP[(DB Primary)]
    DBP -.->|replication| DBR[(DB Replica)]

The request path is now redundant end-to-end.

…within a single datacenter. Which is itself a box in a building with one power grid, one cooling system, and one fiber uplink.

graph LR
    subgraph AZ[Single Availability Zone — SPOF]
        LB1[LB Active] & LB2[LB Standby] --> App1[App Server] & App2[App Server]
        App1 & App2 --> DBP[(DB Primary)]
        DBP -.->|replication| DBR[(DB Replica)]
    end
    User([User]) --> LB1 & LB2

    classDef spof fill:#fef3c7,stroke:#d97706,color:#1a1a1a
    class AZ spof

The SPOF moved again. It’s the whole building now.


Level 5: Multiple Availability Zones

Distribute across two geographically separate AZs, with DB replication between them:

graph LR
    User([User])

    subgraph AZA[Availability Zone A]
        LBA[LB x2] --> AppA[App x2]
        AppA --> DBAP[(DB Primary)]
    end

    subgraph AZB[Availability Zone B]
        LBB[LB x2] --> AppB[App x2]
        AppB --> DBBR[(DB Replica)]
    end

    User --> LBA & LBB
    DBAP -.->|replication| DBBR

AZ-A goes down? AZ-B takes over. Users barely notice. This is solid.

So you’re done, right?


Level 6: Meet Your Real SPOF

Here’s what the diagram above is missing:

graph LR
    User([User])

    subgraph AZA[Availability Zone A]
        LBA[LB x2] --> AppA[App x2]
        AppA --> DBAP[(DB Primary)]
    end

    subgraph AZB[Availability Zone B]
        LBB[LB x2] --> AppB[App x2]
        AppB --> DBBR[(DB Replica)]
    end

    User --> LBA & LBB
    DBAP -.->|replication| DBBR
    AppA & AppB --> ExtAPI[Payment API]

    classDef spof fill:#fca5a5,stroke:#dc2626,color:#1a1a1a
    class ExtAPI spof

Your infra is beautifully redundant. And it’s entirely dependent on an external service that you didn’t build, don’t operate, and may not even have a signed agreement with.

This is where the pattern terminates — outside the boundary of your system entirely.


The SLA Trap

Every external service advertises an SLA on its pricing page. “99.9% uptime!” Great marketing. Largely irrelevant to you.

Unless you have a specific contractual SLA with a vendor, their uptime number is not your availability guarantee. If they’re down and you have no contract, you have no recourse — just a support ticket and an incident report.

The rule: treat any external dependency without a signed SLA as unreliable by design. Not because it probably will fail, but because you have no protection if it does.

For every external call in your system, ask: “What does my code do if this never responds?”

If the answer is “it also never responds” — you’ve found your SPOF.

Circuit breakers, fallbacks, cached responses, graceful degradation — these aren’t nice-to-haves. They’re what separates “the payment provider had an incident” from “we had an incident because the payment provider did.”
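A circuit breaker is the classic shape of that protection: after repeated failures, stop calling the dependency for a cooldown window and serve the fallback immediately instead of waiting on timeouts. A minimal sketch — the thresholds are arbitrary and the payment call and fallback are placeholders:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; while open, skip the
    call entirely and return the fallback. After `reset_after` seconds,
    let one trial call through (the "half-open" state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't touch fn
            self.opened_at = None      # cooldown elapsed: half-open trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip the breaker
            return fallback()
        self.failures = 0              # success closes the circuit
        return result
```

With something like this in front of the payment API, its outage degrades your checkout (“we’ll confirm your payment by email”) instead of taking it down with it.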


A Few Things Worth Knowing

HA has a cost. Every redundant component means more infrastructure, more configuration, more surface area for things to go wrong in new and interesting ways. Not every component needs five nines. Right-size your availability targets: a background job queue and a checkout flow don’t need the same treatment.

Health checks lie. An endpoint that returns 200 OK while the DB connection pool is exhausted is not healthy — it’s in denial. Deep health checks verify that the service can actually do its job, not just that the process is alive.
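The difference is easy to express in code. A deep check runs a small probe against each dependency the service actually needs — the dependency names and probes below are illustrative, and each probe should enforce its own timeout so the health check itself can't hang:

```python
# Deep health check: the endpoint is healthy only if every dependency
# probe passes. Dependency names and probes are illustrative.

def deep_health(checks):
    """checks: mapping of dependency name -> zero-arg callable returning
    True when the dependency is usable. A raised exception counts as
    a failed check rather than crashing the endpoint."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    status = "healthy" if all(results.values()) else "unhealthy"
    return {"status": status, "checks": results}

report = deep_health({
    "db_pool": lambda: True,   # e.g. run SELECT 1 within a short timeout
    "disk": lambda: True,      # e.g. free space above a threshold
})
```

A shallow “200 OK” check would pass with an exhausted connection pool; this one reports `unhealthy` the moment the `db_pool` probe fails, so the load balancer stops routing traffic to a node that can't serve it.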

The human layer is part of the architecture. A system that can survive failures but has no runbooks, no on-call rotation, and no practiced incident response is not highly available. The ops side matters as much as the infra side.


The Takeaway

Your system is only as available as its least-protected dependency.

SPOFs don’t disappear — they migrate. Every fix reveals the next one, until you’ve chased it all the way to the edge of your system. What’s waiting there is usually something you don’t control.

Find the dependency. Decide what happens when it’s gone. Then build that path — before you need it.