by Kragen Sitaker; see also Publications

These are the few thoughts I jotted down in September, 2004, on how to begin to handle decentralized application-layer multicast. I think I came up with a protocol that satisfies the end-to-end principle almost to a fault. Gilles pointed out today that 3GHz of Xeon costs about as much as 100kbps of bandwidth, so maybe the routers here are too dumb. But maybe not.

A web cache has to have a bit of intelligence about dealing with HTTP, but not a lot about the caching itself, since caching is only an optimization. This gives caches a lot of freedom to be dumb or smart and still work reasonably well.

If you allow yourself to require smart clients, you can build a similarly dumb publish-subscribe infrastructure. It works like this.

The protocol contains two messages: publish(topic, message), and subscribe(topic). Topics and messages are arbitrary strings; they are %-encoded so they can’t contain newlines and so that topics can’t contain spaces. This lets us encode subscribe as “topic\n” and publish as “topic message\n”, which also allows us to pack several messages together into a single UDP packet.
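
Here is a minimal sketch of that wire format in Python; the helper names and the exact quoting rules (which characters stay literal) are illustrative guesses rather than anything the protocol above pins down.

    from urllib.parse import quote, unquote

    def encode_subscribe(topic):
        # topics are fully %-encoded, so they contain no spaces or newlines
        return quote(topic, safe='') + '\n'

    def encode_publish(topic, message):
        # messages keep literal spaces but %-encode newlines (and '%' itself)
        return quote(topic, safe='') + ' ' + quote(message, safe=' ') + '\n'

    def decode_line(line):
        topic, sep, message = line.partition(' ')
        if not sep:
            return ('subscribe', unquote(topic))
        return ('publish', unquote(topic), unquote(message))

    # several messages packed together into a single UDP payload:
    packet = (encode_subscribe('chat/geeks') +
              encode_publish('chat/geeks', 'hello world')).encode('ascii')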

When a message server receives a subscribe message from some UDP endpoint, it adds that endpoint to its list of subscribers for that topic, if it’s not already there, and plans to keep the endpoint there for at least N seconds. When a server receives a publish message for some topic, it forwards that message to each of the subscribers it has listed for that topic.
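
A server along these lines might look like the sketch below, which assumes the encoding above, a UDP socket, and lazy expiry of stale subscriptions; the port number and the value of N are placeholders.

    import socket, time

    N = 300                          # seconds a subscription stays alive
    subscribers = {}                 # topic -> {endpoint: expiry time}

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', 4207))            # placeholder port

    while True:
        data, endpoint = sock.recvfrom(65535)
        now = time.time()
        for line in data.decode('ascii', 'replace').splitlines():
            topic, sep, message = line.partition(' ')
            subs = subscribers.setdefault(topic, {})
            if not sep:                            # subscribe
                subs[endpoint] = now + N
            else:                                  # publish
                for ep, expiry in list(subs.items()):
                    if expiry < now:
                        del subs[ep]               # lazily drop expired subscribers
                    else:
                        sock.sendto((line + '\n').encode('ascii'), ep)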

Functions usually provided on the server side can be provided by clients as follows:

  • If a client wants to stay subscribed to some topic, it resends its subscribe message every N/M seconds, where M is a reliability factor such as 2 or 3. (A client sketch follows this list.)
  • If a client wants to increase the probability that a message it has sent will be received by each subscriber, it resends the message.
  • If a client wants to unsubscribe from some topic, it stops sending its subscription messages and ignores further messages on that topic.
  • If a client wants to protect its traffic from being read by unauthorized eyes, it encrypts it with a public key, then distributes the private key to the authorized readers.
  • If a client wants to conceal the times at which it is sending, it sends garbage data at other times as well.
  • If a client wants to avoid receiving traffic from unauthorized senders, it distributes the public key only to authorized senders.
  • And, of course, every listener must handle inadvertent retransmissions safely, since they can occur even at the UDP layer.
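
Putting the first and last items together, a subscriber might look something like this; the server address, N, M, and the crude content-based duplicate suppression are all placeholders.

    import socket, time
    from urllib.parse import quote, unquote

    N, M = 300, 3
    SERVER = ('pubsub.example.org', 4207)      # placeholder server
    topic = quote('chat/geeks', safe='')

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    seen = set()                               # for dropping retransmitted messages
    next_keepalive = 0

    while True:
        now = time.time()
        if now >= next_keepalive:              # renew the subscription every N/M seconds
            sock.sendto((topic + '\n').encode('ascii'), SERVER)
            next_keepalive = now + N / M
        try:
            data, _ = sock.recvfrom(65535)
        except socket.timeout:
            continue
        for line in data.decode('ascii', 'replace').splitlines():
            t, sep, message = line.partition(' ')
            if t == topic and sep and message not in seen:
                seen.add(message)
                print(unquote(message))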

N is a protocol design parameter. The optimal value is the one that minimizes wasted network traffic for the desired level of reliability; wasted network traffic comes from obsolete subscriptions, which linger on the server after the client stops renewing (roughly traffic flow * (N − (subscription length mod N))), and from keepalives (M/N messages per subscription per second).

(A slightly different design gives the server a fixed-size FIFO subscriber table and has it send estimated subscription lifetimes to the clients in response to the subscribe messages; this holds N/M constant for a fixed number of clients, and shrinks N as the number of clients grows.)

Clearly the design as it stands has some flaws:

  • a single point of failure and performance bottleneck
  • vulnerable to denial-of-service attacks on the server
  • subscriptions can easily be forged, resulting in denial-of-service attacks on unwitting clients.

Here’s how to fix some of that.

Each client has a relatively complete list of all the world’s servers. When it wants to publish or subscribe on some topic, it picks a subset of the list that is likely to overlap with the subset any other client picks for that same topic, but in which every server is equally likely to appear. This solves the single-point-of-failure, performance-bottleneck, and server-DoS problems.

Selecting the subsets is a bit tricky. I will call three extreme approaches S, R, and CH.

S. Single. Number the servers 0 to C-1, hash the topic string mod C, and use that server. This retains the single-point-of-failure and denial-of-service problems of the single-server approach, adding the problem that clients with different lists of servers will miss one another, while ameliorating the performance bottleneck.
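
In code, S is just a hash and a modulus; SHA-1 is an arbitrary choice here.

    import hashlib

    def s_server(topic, servers):
        # servers: the client's list of server addresses, numbered 0 to C-1
        digest = hashlib.sha1(topic.encode()).digest()
        return servers[int.from_bytes(digest, 'big') % len(servers)]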

R. Random. Select a random subset of all the servers whose size, when squared, is comparable to the total number of servers (larger for more reliability); by the birthday paradox, two subsets chosen this way are likely to share at least one server.

This approach is extremely robust against denial-of-service attacks, since an attacker must essentially DoS the entire network to shut down a single topic, and it allows the network bandwidth to scale as the square root of the number of servers. However, it requires two to six orders of magnitude more resources than S to get reasonable reliability.

CH. Consistent Hashing. Hash each server’s address into a thousand bit strings, so that each server occupies a thousand points in the hash space. Hash the topic into a bit string. Sort the list of bit strings. Take the K bit strings in the sorted list right before the topic (wrapping around from the beginning to the end if necessary) and use the corresponding servers. As long as one of your K matches one of my K, your message will reach me.
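
Here is one way to sketch the CH selection, reading “a thousand bit strings” as a thousand points per server on a ring; SHA-1 and the helper names are arbitrary, and this version walks backwards until it has K distinct servers rather than K raw points.

    import bisect, hashlib

    POINTS_PER_SERVER = 1000

    def _h(s):
        return hashlib.sha1(s.encode()).digest()

    def build_ring(servers):
        # each server contributes a thousand (hash, server) points, kept sorted
        return sorted((_h('%s/%d' % (srv, i)), srv)
                      for srv in servers
                      for i in range(POINTS_PER_SERVER))

    def ch_servers(topic, ring, k):
        chosen, seen = [], set()
        i = bisect.bisect_left(ring, (_h(topic),))   # where the topic falls on the ring
        for step in range(len(ring)):                # walk backwards, wrapping around
            _, srv = ring[(i - 1 - step) % len(ring)]
            if srv not in seen:
                seen.add(srv)
                chosen.append(srv)
                if len(chosen) == k:
                    break
        return chosen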

This approach is more robust against denial-of-service attacks than S by a factor of K (it requires K times as much bandwidth to swamp it), takes K times as much legitimate effort as S, tolerates inconsistent server lists handed to different clients, and eliminates single points of failure and performance bottlenecks (bandwidth scales linearly as new servers come online).

We can combine R and CH to get a variant that’s more robust against denial-of-service attacks.

RCH. Random consistent hashing. Compute K servers as above, but for either publishing or subscribing (not both), use only a random subset of K’ of them. This allows K to be larger for the same amount of legitimate work and quality of service, but an attacker must still flood all K servers to prevent legitimate traffic from getting through.
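
On top of the CH sketch above, the only new piece RCH needs is the random sampling on one side; here it is for the subscribe-to-K’ split, with the publisher still using all K. K’ = 3 is a placeholder.

    import random

    def rch_subscribe_servers(k_servers, k_prime=3):
        # k_servers: the K servers CH picks for this topic; subscribe at only K' of them
        return random.sample(k_servers, k_prime)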

If you publish to all K but subscribe only to K’, the publisher must send K messages each time a message is sent. But if you subscribe to all K but publish only to K’, the subscribers must renew K subscriptions. In either case, each subscriber receives each message K’ times. Which of the two is worse depends on the traffic on the channel. (There’s no way to tell which K’ servers a subscriber has subscribed to, so you still have to flood all K servers to block traffic even to a single subscriber; and there’s no way to tell which K’ servers the next message will be published to, so you still have to flood all K servers to block traffic even from a single publisher.)

For a realistic network, we might have 100000 servers, K’=3 and K=1000.

In the subscribe-to-K’ scenario, this allows our network to handle 33000 times the bandwidth of a single server, provide 99.9% reliable delivery if the servers are independent and 90% reliable, and require 1000 servers’ worth of traffic (1% of the network; say, 1000 megabits, about $1000/hour) to block access to a single topic; but it can only handle 100 times as many subscribers as a single server could.

In the publish-to-K’ scenario, the network can handle 100 times the bandwidth of a single server, still provides 99.9% reliable delivery, and still requires 1000 servers’ worth of traffic to block access to a single topic, but it can handle 33000 times as many subscribers as a single server.
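
The figures in the last two paragraphs appear to come from simple arithmetic on those parameters:

    servers, k, k_prime = 100000, 1000, 3
    print(1 - 0.1 ** k_prime)     # 0.999: delivery via 3 independent 90%-reliable servers
    print(servers / k_prime)      # ~33333, the "33000 times" figure
    print(servers / k)            # 100, the "100 times" figure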

RCH introduces a new coordination problem, though. Everyone must use the same K, which is not true with CH.

The last problem, that forged subscriptions can flood an innocent bystander, can be dealt with as follows.

First, use a three-way handshake in the subscribe protocol: send a magic cookie to the aspiring subscriber, and require them to send it back in a subscribe-confirmation request. This will prevent a forged subscribe request from resulting in many packets of response unless the attacker is snooping the network link of the victim.
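
A sketch of that cookie step, with the server keeping a small table of pending subscriptions; a stateless HMAC-style cookie would also work, and the function names are placeholders.

    import os

    pending = {}                            # (endpoint, topic) -> cookie

    def on_subscribe_request(endpoint, topic):
        cookie = os.urandom(16).hex()
        pending[(endpoint, topic)] = cookie
        return cookie                       # send this back to the claimed endpoint only

    def on_subscribe_confirm(endpoint, topic, cookie):
        if pending.get((endpoint, topic)) == cookie:
            del pending[(endpoint, topic)]
            return True                     # now it is safe to add the endpoint as a subscriber
        return False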

Second, require proof-of-work, such as partial secure hash collisions, as early as possible in the subscribe protocol. For example, the initial subscribe packet could contain a string whose SHA-1 hash begins with the client’s IP address and the server’s IP address. This isn’t as strong as one might like, though. You could require an arbitrarily large amount of work in the subscribe-confirm step.
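
Here is one way the check might look, assuming the required prefix is only the first BITS bits of the two addresses concatenated; BITS is a tunable work factor that the paragraph above does not specify.

    import hashlib, socket

    def pow_ok(payload, client_ip, server_ip, bits=20):
        # payload: the proof-of-work string from the subscribe packet, as bytes
        target = socket.inet_aton(client_ip) + socket.inet_aton(server_ip)
        digest = hashlib.sha1(payload).digest()
        nbytes, rem = divmod(bits, 8)
        if digest[:nbytes] != target[:nbytes]:
            return False
        if rem and (digest[nbytes] >> (8 - rem)) != (target[nbytes] >> (8 - rem)):
            return False
        return True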

Third, don’t provide amplification; request packets should be no smaller and no less numerous than their responses.

Distributing the server list is still a difficulty, because the server list on each client is effectively the trust root: if you can get a client to use a server list of your choosing, then you can arbitrarily distort its view of the world.

If an attacker can pack the list with servers under their own control, they can redirect an arbitrary amount of the world’s traffic through their servers. Perhaps this in itself will be punishment enough.

If an attacker can cause the server lists held by particular clients to diverge sufficiently, those clients will not be able to communicate.

If an attacker can empty your server list, that will cut you off from the world.