Fault Tolerance
As a system, netidx depends on fault tolerance strategies in the subscriber, publisher, and resolver server to minimize downtime caused by a failure. Before I talk about the specific strategies used by each component I want to give a short taxonomy of faults as I think of them, so we can be clear about what I'm actually talking about.
- Hang: Where a component of the system is not 'dead', e.g. the
process is still running, but is no longer responding, or is so slow
it may as well not be responding. IMHO this is the worst kind of
failure. It can happen at many different layers, e.g.
- You can simulate a hang by sending SIGSTOP to a unix process. It isn't dead, but it also won't do anything (see the sketch after this list).
- A machine with a broken network card, such that most packets are rejected due to checksum errors. It's still on the network, but its effective bandwidth is a tiny fraction of what it should be.
- A software bug causing a deadlock
- A malfunctioning IO device
- Crash: A process or the machine it's running on crashes cleanly and completely.
- Bug: A semantic bug in the system that causes an effective end to service.
- Misconfiguration: An error in the configuration of the system that
causes it not to work. e.g.
- Resolver server addresses that are routeable by some clients and not others
- Wrong Kerberos SPNs
- Misconfigured Kerberos
- Bad TLS certificates
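Returning to the hang case for a moment, here is a small, self-contained Rust sketch (unix only, plain std, no netidx code) of the SIGSTOP experiment mentioned above: after SIGSTOP the child process still exists, but it stops responding entirely until it receives SIGCONT.

```rust
use std::process::Command;
use std::{thread, time::Duration};

fn main() -> std::io::Result<()> {
    // Spawn a child that "responds" by printing a tick once a second.
    let mut child = Command::new("sh")
        .args(["-c", "while true; do echo tick; sleep 1; done"])
        .spawn()?;
    let pid = child.id().to_string();
    thread::sleep(Duration::from_secs(3)); // a few ticks arrive

    // SIGSTOP: the process is not dead, but it no longer does anything.
    Command::new("kill").args(["-STOP", pid.as_str()]).status()?;
    thread::sleep(Duration::from_secs(3)); // silence: a hang, not a crash

    // Resume it, then clean up.
    Command::new("kill").args(["-CONT", pid.as_str()]).status()?;
    child.kill()?;
    child.wait()?;
    Ok(())
}
```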
Subscriber & Publisher
- Hang: Most hang situations are solved by heartbeats. Publisher sends
  a heartbeat to every subscriber that is connected to it every 5
  seconds. Subscriber disconnects if it doesn't receive at least 1
  message every 100 seconds. Once a hang is detected it is dealt with
  by disconnecting, and it essentially becomes a crash.
  The hang case that heartbeats don't solve is when data is flowing,
  but not fast enough. This could have multiple causes, e.g. the
  subscriber is too slow, the publisher is too slow, or the link
  between them is too slow. Whatever the cause, the publisher can
  handle this condition by providing a timeout to its `flush`
  function. This will cause any subscriber that can't consume the
  flushed batch within the specified timeout to be disconnected (see
  the sketch after this list).
- Crash: Subscriber allows the library user to decide how to deal with
  a publisher crash. If the lower level `subscribe` function is used,
  then on being disconnected unexpectedly by the publisher all
  subscriptions are notified and marked as dead. The library user is
  free to retry. The library user could also use `durable_subscribe`,
  which will diligently keep trying to resubscribe, with linear
  backoff, until it is successful. Regardless of whether you retry
  manually or use `durable_subscribe`, each retry will go through the
  entire process again, so it will eventually try all the publishers
  publishing a value, and it will pick up any new publishers that
  appear in the resolver server.
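The hang handling above reduces to two timers plus a bounded flush. The sketch below models that policy in plain Rust (std only): 5 second heartbeats, a 100 second staleness cutoff, and disconnecting a subscriber that cannot drain a flushed batch in time. It is an illustration of the described behavior, not netidx code; the constants, function names, and the `try_send_batch` callback are all invented for the example.

```rust
use std::time::{Duration, Instant};

// Publisher side: send a heartbeat every 5 seconds so an idle but
// healthy connection never looks dead to the subscriber.
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(5);

// Subscriber side: if nothing at all arrives for 100 seconds the
// connection is considered hung and dropped, turning the hang into
// an ordinary crash/reconnect.
const STALE_CUTOFF: Duration = Duration::from_secs(100);

/// Subscriber-side liveness check, run periodically.
fn connection_is_hung(last_message_received: Instant) -> bool {
    last_message_received.elapsed() > STALE_CUTOFF
}

/// Publisher-side model of a bounded flush: keep trying to hand a
/// batch to a subscriber, and if it isn't consumed within `timeout`,
/// give up on that subscriber instead of letting it stall the rest.
fn flush_with_timeout<F>(mut try_send_batch: F, timeout: Duration) -> Result<(), &'static str>
where
    F: FnMut() -> bool, // returns true once the batch has been fully consumed
{
    let start = Instant::now();
    while !try_send_batch() {
        if start.elapsed() > timeout {
            // This is the point where the slow subscriber would be disconnected.
            return Err("subscriber too slow, disconnecting");
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    Ok(())
}

fn main() {
    println!("publisher heartbeat interval: {:?}", HEARTBEAT_INTERVAL);
    // A connection that just received a message is not hung.
    assert!(!connection_is_hung(Instant::now()));
    // A subscriber that never drains its batch is cut off after the timeout.
    let res = flush_with_timeout(|| false, Duration::from_millis(50));
    assert!(res.is_err());
}
```

In netidx itself this policy surfaces simply as the timeout passed to the publisher's `flush` function described above.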
Resolver
- Hang: Resolver clients deal with a resolver server hang using a dynamically computed timeout based on the number of requests in the batch. The rule is: for reads the timeout is the larger of 15 seconds and 50 microseconds per simple operation in the batch (longer for complex read ops like list matching); for writes it is the larger of 15 seconds and 100 microseconds per simple operation in the batch. That timeout is a timeout to get an answer, not to finish the batch. Since the resolver server breaks large batches up into smaller ones, and answers each micro batch when it's done, the timeout should not usually be hit if the resolver is just very busy, since it will be sending back something periodically for each micro batch. The intent is for the timeout to trigger only if the resolver is really hanging (the sketch after this list models the timeout computation and the retry logic).
- Crash: Resolver clients deal with crashes differently depending on
whether they are read or write connections.
- Read Connections (Subscriber): Abandon the current connection, wait a random
time between 1 and 12 seconds, and then go through the whole
connection process again. That roughly entails taking the list of
all servers, permuting it, and then connecting to each server in
the list until one of them answers, says a proper hello, and
successfully authenticates (if authentication is on). For each batch a
resolver client will do this abandon and reconnect dance 3 times,
and then it will give up and return an error for that
batch. Subsequent batches will start over from the beginning. In a
nutshell read clients will,
- try every server 3 times in a random order
- only give up on a batch if every server is down or unable to answer
- remain in a good state to try new batches even if previous batches have failed
- Write Connections (Publishers): Since write connections are
responsible for replicating their data out to each resolver server
they don't include some of the retry logic used in the read
client. They do try to replicate each batch 3 times separated by a
1-12 second pause to each server in the cluster. If after 3 tries
they still can't write to one of the servers then it is marked as
degraded. The write client will try to replicate to a degraded
server again at each heartbeat interval. In a nutshell write
clients,
- try 3 times to write to each server
- try failed servers again every 1/2 `writer_ttl`
- never fail a batch, just log an error and keep trying
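To tie the resolver client rules together, here is a small model of the timeout computation and the read-side retry loop in plain Rust (std only, with a toy xorshift generator standing in for a real random source). It is a sketch of the policy described above, not netidx code; all the names here are invented for the illustration.

```rust
use std::time::Duration;

/// The dynamically computed per-batch timeout: for reads, the larger
/// of 15 seconds and 50 microseconds per simple op (complex read ops
/// like list matching get longer); for writes, the larger of 15
/// seconds and 100 microseconds per op. This is the time allowed to
/// get an answer, not to finish the whole batch.
fn batch_timeout(ops_in_batch: u64, write: bool) -> Duration {
    let per_op_us = if write { 100 } else { 50 };
    let computed = Duration::from_micros(per_op_us * ops_in_batch);
    computed.max(Duration::from_secs(15))
}

/// A tiny pseudo random source, just enough to model "wait a random
/// time between 1 and 12 seconds" and "permute the server list".
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        // xorshift64 (seed must be nonzero)
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
    fn shuffle<T>(&mut self, xs: &mut [T]) {
        // Fisher-Yates shuffle
        for i in (1..xs.len()).rev() {
            let j = (self.next() % (i as u64 + 1)) as usize;
            xs.swap(i, j);
        }
    }
    fn backoff(&mut self) -> Duration {
        Duration::from_secs(1 + self.next() % 12)
    }
}

/// Read-side model: try every server in a random order, up to 3
/// rounds, and only fail the batch if no server could answer. A
/// failed batch does not poison the client; the next batch starts
/// the whole process over again.
fn resolve_batch<S, F>(servers: &[S], rng: &mut Rng, mut try_server: F) -> Result<(), &'static str>
where
    S: Clone,
    F: FnMut(&S) -> bool, // true if the server answered, authenticated, and replied in time
{
    for round in 0..3 {
        let mut order: Vec<S> = servers.to_vec();
        rng.shuffle(&mut order);
        for server in &order {
            if try_server(server) {
                return Ok(());
            }
        }
        if round < 2 {
            // Every server failed this round: wait 1-12 seconds and try again.
            std::thread::sleep(rng.backoff());
        }
    }
    Err("every resolver server was down or unable to answer")
}

fn main() {
    assert_eq!(batch_timeout(10, false), Duration::from_secs(15)); // minimum applies
    assert_eq!(batch_timeout(1_000_000, true), Duration::from_secs(100)); // 100us * 1e6 ops
    let mut rng = Rng(0x9e3779b97f4a7c15);
    let servers = ["resolver-a", "resolver-b"]; // hypothetical server names
    // Pretend only resolver-b is healthy: the batch still resolves.
    let ok = resolve_batch(&servers, &mut rng, |s| *s == "resolver-b");
    assert!(ok.is_ok());
}
```

The write side differs as described above: rather than failing a batch after three rounds, it marks the unreachable server as degraded, logs an error, and keeps retrying it every 1/2 `writer_ttl`.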
One important consequence of the write client behavior is that in the
event all the resolver servers crash, when they come back up
publishers will republish everything after a maximum of 1/2
`writer_ttl` has elapsed.