July 06 2017 | Company
Making Gossip More Robust with Lifeguard
Today we are proud to announce the first publication from HashiCorp Research, titled "Lifeguard: SWIM-ing with Situational Awareness". The paper details a number of novel improvements we have introduced to Serf, Consul, and Nomad to make their underlying gossip protocol more robust. Collectively called Lifeguard, these extensions reduce the false positives produced by the failure detector by 50x and allow us to detect true failures faster.
Distributed systems such as BitTorrent, Apache Cassandra, Microsoft Orleans, and HashiCorp Consul commonly use gossip protocols. These protocols are typically embedded to provide features such as cluster membership (who is in the cluster), failure detection (which members are alive), and event broadcast. Their peer-to-peer nature often makes them much more scalable and reliable than centralized approaches to the same problem. However, because they rely on a reduced amount of communication, they are sensitive to slow processing of their messages.
Many of our tools build on work from the academic community, and with HashiCorp Research we hope to contribute back. Our focus is on novel work and whitepapers about the algorithms and system designs we use in practice. Lifeguard is our first published work, and its improvements were driven by the needs of users operating our tools in production environments.
Read on to learn more about Lifeguard.