Fault Tolerance takes many forms, because there are many fault domains (also called failure domains) to consider.
When we’re discussing server cluster design, particularly in the cloud, we’re usually limiting the conversation to node fault tolerance (for your apps) and control plane fault tolerance (for the Raft database and management APIs). We’ll assume the infrastructure below and the apps you’re running on top are out of scope for this discussion (and need their own FT solutions).
Because Swarm has a built-in Raft log, and Kubernetes uses etcd as its datastore (which is also a Raft implementation), both follow the same rule: you need a minimum of three running copies before one can fail and the cluster still has quorum. So here’s a quick guide to Raft consensus at the different levels of infrastructure fault domains.
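To make the quorum arithmetic concrete, here’s a minimal Python sketch of the general Raft math (the function names are mine, not from Swarm or Kubernetes): a cluster of n members needs a strict majority healthy, so three is the smallest size that survives one failure.

```python
# Raft quorum math: n members need a strict majority (n // 2 + 1)
# healthy to keep quorum. Generic Raft arithmetic, not tied to any
# one orchestrator.

def majority(n: int) -> int:
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    return n - majority(n)

for n in range(1, 8):
    print(f"{n} members: majority={majority(n)}, can lose {failures_tolerated(n)}")
```

Running this shows why even counts are wasteful: four members tolerate one failure, the same as three, but add another machine that can break.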
Region Fault Tolerance: First, let me get this out of the way. You should not design a single Swarm or Kubernetes cluster across cloud regions or across datacenters where the latency averages above 10ms and/or there is network address translation (NAT) or external firewalls between nodes. None of the official Docker or Kubernetes design guides recommend this setup, for many reasons that I could write a whole article on. Connecting multiple clusters into one management plane is known as federation, and it is something you should only consider after you’re really good at running the clusters by themselves. Federation is a topic for another day.
Zone Fault Tolerance: Also called Availability Zones, these are datacenters in the same city (usually) that have 10ms latency (or less) and don’t have NAT between them. All the major cloud providers do this, and the key here is you must use regions with at least three zones, and use exactly three zones for your control plane. If you only use two zones, then you can’t guarantee Raft quorum if one of the zones fails. I recommend drawing this out: place your Raft nodes in different zones and then take a zone down. Is a majority of Raft nodes still healthy? You’ll soon see you can’t do this with two zones. The same rule applies to racks and nodes: the counts have to be odd.
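The pen-and-paper exercise above can also be sketched in a few lines of Python. This is a hypothetical helper of my own (the placements are just lists of manager counts per zone); it checks whether quorum survives the worst-case single-zone outage.

```python
# Sketch: given a count of managers per zone, does Raft quorum survive
# losing the zone that holds the most managers?

def survives_zone_loss(per_zone: list[int]) -> bool:
    total = sum(per_zone)
    needed = total // 2 + 1        # Raft strict majority
    worst = max(per_zone)          # the biggest zone goes dark
    return total - worst >= needed

print(survives_zone_loss([2, 1]))     # 3 managers, two zones: False
print(survives_zone_loss([1, 1, 1]))  # 3 managers, three zones: True
```

With two zones, one of them necessarily holds a majority of the managers, so losing that zone always breaks quorum; spreading one manager per zone across three zones is the smallest layout that survives.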
Rack Fault Tolerance: In many datacenters, the top-of-rack switch is often not fault tolerant, so you would take steps to ensure not all control plane nodes are in the same rack. This isn’t normally a concern in the cloud, but if you do control which rack nodes are in, then the same rules as zones apply here. Ideally, you want the control plane nodes spread across three racks.
Node Fault Tolerance: This follows the same rules as above, but it’s the one most talked about in blogs and documentation. For Raft to have consensus, a majority of control plane nodes (Kubernetes Masters or Swarm Managers) must be healthy and reachable. If you have three, then two must be healthy. If you have five, then three must be healthy. Spread these out across three racks and three zones so there’s at least one in each fault domain.
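Putting the node and zone rules together, here’s one last sketch (the manager and zone names are hypothetical): given an assignment of managers to zones, it verifies that losing any single zone still leaves a Raft majority standing.

```python
# Sketch: verify that a manager-to-zone assignment keeps Raft quorum
# through the loss of any one zone.

def quorum_survives_any_zone(assignment: dict[str, str]) -> bool:
    total = len(assignment)
    needed = total // 2 + 1        # Raft strict majority
    return all(
        sum(1 for z in assignment.values() if z != down) >= needed
        for down in set(assignment.values())
    )

# Five managers spread 2/2/1 across three zones: worst zone loss
# still leaves 3 of 5 healthy, which is a majority.
five = {"m1": "a", "m2": "a", "m3": "b", "m4": "b", "m5": "c"}
print(quorum_survives_any_zone(five))   # True

# Five managers crammed into two zones: losing zone "a" leaves 2 of 5.
bad = {"m1": "a", "m2": "a", "m3": "a", "m4": "b", "m5": "b"}
print(quorum_survives_any_zone(bad))    # False
```

The same check works for racks: swap zone names for rack names and the arithmetic is identical.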