Presentation

Lifeguard: Failure Detection in the Era of Gray Failures

The new Lifeguard extensions in HashiCorp's Serf, Consul, and Nomad massively reduce the number of false positives, making failure detection more robust, faster, and more reliable.

Detecting a failure deep inside a large-scale distributed system is hard. It's hard to do it reliably and quickly. It's especially hard to avoid false positives—for example, a node acting as if it has failed, when it's just running slowly for a moment, like a zombie node—not quite dead.

Jon Currey, director of research at HashiCorp, describes the company's solution to the problem, Lifeguard. In this fascinating talk, he explains how his team got to this point, and the unusual engineering disciplines they learned from.

Transcript

Hey everybody. Thank you very much for being here. I realize I stand between you and lunch so, great responsibility, I will try to both entertain and educate you as far as I can.

So hopefully this is a familiar picture to many of you—it doesn't really matter whether it's a cluster or just some large group of instances that you need to manage. One of the things you have to do is give your services a way to discover one another and keep those services highly available, so you need to do some failure detection. I guess from the demographic here there's a chance that you're using HashiCorp technology to do this, maybe not. We like to play nicely, so you don't have to use all of our tools, but Consul is one possible solution for doing this. So when you decide who you're going to put your trust in for your failure detection solution, there are a number of criteria that you use to evaluate the technology you're going to use.

We appreciate that you have many choices in this area, and at HashiCorp, when we were looking to understand the best algorithm to use inside of Consul, we had our own criteria for evaluating the right technology. These are the general ways that you would evaluate any technology:

How does it perform?

Performance in this context includes several dimensions:

- What is the latency—if there's a failure, how quickly are we going to detect it?
- How quickly will we propagate that failure to everybody in the system?
- What's the error rate like: are there going to be false positives or false negatives? How many?

Also in this area we have the consistency model. We actually offer Consul as a strong consistency model: if you need everybody to be on the same page, it's possible to go to the Consul servers and get one definitive view of the health of the system's nodes. Or, if you don't need that, you can go with a weaker model—you can use Serf and get something that's fresher.

Efficiency concerns

For efficiency, what is the overhead of all of this: how many messages are you going to send on the network, and how much CPU and memory will you use on the instances where you're running the agent?

Reliability concerns

Reliability in this particular case is a sensitive subject because what's the point of having a failure detection mechanism if it is not reliable? We need to know when there's a failure and we can't have this infinite regression of turtles all the way down. The buck stops somewhere and you want to know that this thing is a reliable system.

Failure model

But there's also a special criterion when you're thinking about failure detection—you need to think about the failure model. You can compare this to the threat model in security: the model defines the domain, the things that are going to be visible to you—what are the semantics, what are the ways that things can fail? So whether you realize it or not, you're always shopping based on the failure model of your failure detection system.

Failure modes

So luckily there are many failure modes to choose from. This is a nice diagram that Stefan Hollander from the Technical University of Vienna put together. He teaches a great course. (By the way, for all of the things that I reference, I'll give you my Twitter handle at the end, and I've just tweeted the links to all of these things, so don't scramble to write down URLs—you can go find all of this afterwards.) Many people have referenced Stefan's writing about this.

There's a hierarchy of failure modes here, and we're starting at the outside with a very general, arbitrary failure. In fact the definition of a Byzantine failure is that it is absolutely arbitrary: it could fail in any way, it doesn't have to stop, and it covers all sorts of horrible things—malicious and collusion situations where people are actually trying to bring your system down, injecting confusing, malformed, or conflicting messages. So it's the most general thing that we could try to recover from.

Then with each ring we go in, we're making some simplifying assumptions, reducing the scope and making it a more tractable problem: how can we actually implement a failure detector that can find this particular class of failure? And right in the middle you have fail stop—it sits inside crash, which means it has to have crashed, but it crashes nicely (and I'll come back to this in another couple of slides).

Byzantine fault tolerance (BFT)

Byzantine failures were on the outside of that—the most general thing. (By the way, they come from a related problem, the Byzantine generals problem: these are the generals who have to send messages to each other to coordinate an attack, but one of them might be a bad actor—malicious, a spy for the other side.) And there is a way to make a protocol that enforces consistency here—by thinking of it as a consensus problem, not surprisingly.

Leslie Lamport posed the Byzantine generals problem—this is the same person who gave us the Paxos consensus protocol, so great—and this thing is very general: we can handle arbitrary faults. So you'd think this is something people would aspire to; we should be seeing it out there in the marketplace. Well, it turns out it's not widely deployed, and I think you can see this for yourself—there's no open source project out there that everyone is jumping on to use this thing. Okay, well, maybe you need to be Google or Microsoft. (Leslie Lamport works at Microsoft, and I'm going to give you the citation later for a paper from Microsoft where they say that it is not being used in a production system inside of Microsoft—in fact, I think they mean generally in the world, but certainly they would know whether it's used at Microsoft.)

And then Google, who leaped on Paxos and made it famous, are also asserting that BFT is not ready for primetime. There are sensible reasons for this. When you go and look at the BFT class of systems, they're complicated protocols—even where they've been simplified and made more tractable, they're still complicated protocols.

To tolerate F failures, you typically need 3F+1 instances, so there's a lot of overhead here, and as a consequence of the way they scale—the number of messages, the number of instances that can be brought in—it's a tough thing to do. Which pretty much explains why this stuff is not widely adopted: you could say it's probably overly pessimistic for the data center. We don't typically have Byzantine generals in the data center. You want to do something more lightweight; you want to defend against accidental failures, not somebody being malicious, unless it's something super sensitive.
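As a back-of-the-envelope illustration of that 3F+1 sizing rule—just the arithmetic, not a parameter of any particular BFT implementation or HashiCorp tool—here's a minimal sketch in Go:

```go
package main

import "fmt"

// bftClusterSize returns the minimum number of replicas a classic BFT
// protocol needs to tolerate f simultaneous Byzantine faults (3f+1),
// alongside the quorum size (2f+1) that each decision must gather.
// This is the general bound from the literature, nothing tool-specific.
func bftClusterSize(f int) (replicas, quorum int) {
	return 3*f + 1, 2*f + 1
}

func main() {
	for f := 1; f <= 3; f++ {
		replicas, quorum := bftClusterSize(f)
		fmt.Printf("tolerate %d faults: %d replicas, quorum of %d\n", f, replicas, quorum)
	}
}
```

Tolerating even a single Byzantine fault already requires four replicas, which is part of why the overhead is hard to justify for ordinary, accidental data center failures.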

It turns out Bitcoin actually is a great candidate for BFT, because there everybody is going to try to game the system and it has to be absolutely rock-solid in a fully distributed manner. If you go to any of the conferences working on Bitcoin, you'll see that Byzantine fault tolerance is front and center.

The dominant models: Crash and fail-stop

So what you do see when you go and look at the literature and the systems is that everyone has really come down to either the crash failure model, which is sometimes called "crash stop," or the fail stop failure model.

The crash just says the thing stops and it no longer will send messages. The only promise you have to make is that this process that we were monitoring disappears.

Fail stop is stronger. It requires that it stops in the same way, but this is a model where people are trying to make things actually fail on purpose in a consistent manner because they would like it to leave the system so that the remaining nodes—the healthy nodes—should be able to recover the global state of the system. So clean up after yourself, be a thoughtful colleague as you go and crash out in flames.

But these are all simplifying assumptions. The crash failure one seems pretty reasonable though, right? And you'll see many papers, many concrete implementations asserting that crash failure is their model, they can't go further and promise fail stop, but the crash failure model hopefully is good enough.

The SWIM protocol

One of the many protocols that use this model is SWIM. SWIM is the protocol that underlies Consul, and we evaluated a lot of technologies before picking it. It was first presented at the IEEE DSN conference in 2002, and the paper says very clearly that crash failure is the expected model in this domain and therefore that is the model they support. So SWIM is on board with doing things the realistic way.

When we evaluated SWIM for our own use and for your use indirectly, there were a bunch of criteria. We knew that we needed something that scales, we wanted it to be robust both to network problems and local node problems. There was no point having an unreliable failure detection system and we wanted it to be easy to deploy and manage, and it turns out SWIM has some really nice properties. The way it's architected really helps with this stuff.

It's a peer-to-peer system. Consul has servers, but not for SWIM—the SWIM part is down in the agent, and a Consul server also happens to have the agent running, so from that perspective it's just another member of the group. This is completely symmetric: there are no special nodes to administer, no special nodes to worry about recovering when they fail. And it uses randomized communication, which is really nice. It means the chance of a correlated failure is much lower: in each round of communication, each node picks a random other member of the group to communicate with, and this gives real strength to the patterns of communication. We'll come back to that a little bit later.

Within that communication it's using gossip. It's not the case that one node has to talk to all the other nodes, or even 10% of the other nodes, within a certain amount of time; information is passed from one node to another. It's the epidemic or viral model—the contagion model. Information spreads through the network by hops, and there's been a lot of theoretical work done on this.

Now to be clear, this is probabilistic, so you can have that perfect storm where once in a while somebody doesn't get a message for quite a long time. For that we have extended SWIM: we put in a periodic direct sync between nodes, which encourages this to converge a lot faster and gives a hard upper limit. In the general case there's actually a simulator on our website where you can see how quickly it converges—even on the order of thousands to ten thousand nodes the convergence is very fast. So SWIM is mostly excellent. We've been using SWIM for, I think, five years now, and as the community has grown, the scale that people are using it at has grown.
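To make the epidemic intuition concrete, here is a toy gossip simulation—an illustrative sketch only, not the simulator on the HashiCorp website and not memberlist's actual dissemination logic. Each round, every node that has heard the rumor passes it to a few random peers, and the number of rounds needed to reach everyone grows slowly with cluster size:

```go
// Toy gossip simulation: each round, every node that already knows the rumor
// tells `fanout` random peers. This only illustrates the epidemic model.
package gossipsketch

import "math/rand"

// RoundsToConverge returns how many gossip rounds it took for a rumor
// starting at node 0 to reach all n nodes.
func RoundsToConverge(n, fanout int, rng *rand.Rand) int {
	informed := make([]bool, n)
	informed[0] = true // the rumor starts at node 0
	count := 1

	rounds := 0
	for count < n {
		rounds++
		// Snapshot who knew the rumor at the start of the round, so nodes
		// informed during this round only start gossiping next round.
		snapshot := make([]bool, n)
		copy(snapshot, informed)
		for i := 0; i < n; i++ {
			if !snapshot[i] {
				continue
			}
			for f := 0; f < fanout; f++ {
				peer := rng.Intn(n)
				if !informed[peer] {
					informed[peer] = true
					count++
				}
			}
		}
	}
	return rounds
}
```

Running this with a few thousand nodes and a small fanout shows convergence in a handful of rounds, which is the behavior the speaker describes.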

I think early on it was on the order of tens or hundreds of machines, then we started to hear about people doing thousands of machines. Then we took a little survey and found plenty of big users routinely deploying a single Consul group of more than six thousand machines, and just in the last few months I'm now hearing ten thousand machines. So this thing scales really well.

You can use Consul to build availability for your services, but we also use the same SWIM implementation directly in Serf and Nomad, so all three of them share the same library. It's called memberlist, and it's a HashiCorp open-source library. However, while it is generally excellent, there were occasions where users would come to us, sometimes a little bit panicked, and there would be an escalation: "Hey, we really need your help, we've got this weird situation, help us out."

So over a period of time there were a number of debug sessions and a picture started to emerge of what the problem was. Here I'm going to use a distributed denial-of-service attack to motivate the problem, but generally speaking there's a whole class of problems and I'll point that out in a second. So here we've got that same cluster and you have some edge nodes which are responsible for the ingress and egress. Maybe they're running firewalls, load balancers, web servers, whatever the right thing is in this particular environment, and they're getting hammered by a distributed denial of service attack.

Okay, fair enough, we expect some of these nodes to fail—they're overloaded. But the peculiar behavior is that we also see some flapping nodes, and these nodes are not directly under attack. These are nodes in the interior of the cluster that shouldn't be affected. What do I mean by flapping? These paler colored ones here are healthy nodes, but they're being marked as failed by Consul, and then a little while later they come back and are marked healthy again, and then they flap back to failed.

If you went in and looked at the logs for that instance you'd see no problem: there's plenty of CPU, there's plenty of network connectivity, it's not under attack. Even more disturbingly, other healthy nodes think this healthy node is sick. So what is going on here? This is not cool. As I say, the DDoS is a concrete example, but we saw this come up in a few different ways: overloaded web services, video transcode services—it doesn't have to be at the edge, it can be in the interior of your network. Somebody using burstable instances—the AWS t2.micro—where on average the instance isn't allowed to exceed 10% CPU, so you can do a burst of work, but then the budget is depleted, and eventually, if you go to zero budget, you're throttled.

The common thing here is that there's some resource depletion—typically CPU, sometimes network—at some nodes, and it's making other healthy nodes appear unhealthy. So this was the mystery; we dug into it and we have a nice solution. On Thursday I will be in Luxembourg presenting this at DSN, which is very nice—this is the conference where SWIM was originally published, so it's really nice to go back there. We've been in communication with the lead author of SWIM, and he likes this work and is very happy that we've done it, so yay!

I'm not going to go over this in detail—you can watch the talk I gave at HashiConf last year if you want the details. I also recommend you read the white paper, which is freely available on the arXiv website and gives all the details.

The mystery of the flapping nodes

But to get to what I'm going to talk about today, I have to give you a little high-level summary of what was wrong, so here's a very quick description of SWIM. All you have to know is that SWIM is a distributed systems protocol with a number of steps (a sketch of this flow follows below):

- In step one, node A directly probes node B. It's trying to say, "hey, are you alive?" It sends a ping, or probe, and if it gets back a response within a timeout—you're okay? I move on.
- If not (denoted by the X), we move on to step two and say, "okay, I couldn't talk to you directly, let me just see if it's a connectivity problem between you and me. Let's go around." I ask some other nodes to talk to B—that's a ping-req. They make the probe, and if it gets through and the answer comes back to me, okay, I'm still happy.
- If it doesn't come through, then we move on and start gossiping the rumor, "hey, I think that guy's dead," and we're going downhill at that point.
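Here is a rough sketch of that three-step flow from node A's point of view. The type and function names are invented for illustration; the real implementation is HashiCorp's memberlist library, and real SWIM implementations fan the indirect probes out concurrently rather than one at a time:

```go
// Sketch of one SWIM protocol period from the probing node's point of view.
// Illustrative only: these names are not memberlist's actual API.
package swimsketch

import "time"

type Node string

type Cluster interface {
	Ping(target Node, timeout time.Duration) bool               // direct probe
	PingReq(via, target Node, timeout time.Duration) bool       // ask a peer to probe target for us
	RandomPeers(k int, exclude Node) []Node                     // k random members, excluding one
	GossipSuspect(target Node)                                  // start spreading the "I suspect B" rumor
}

func probeOnce(c Cluster, target Node, timeout time.Duration) {
	// Step 1: direct probe.
	if c.Ping(target, timeout) {
		return // target answered, nothing to do
	}

	// Step 2: maybe it's just a connectivity problem between us, so ask a few
	// random peers to probe the target on our behalf (ping-req).
	for _, peer := range c.RandomPeers(3, target) {
		if c.PingReq(peer, target, timeout) {
			return // someone reached the target, still happy
		}
	}

	// Step 3: nobody could reach it; gossip the suspicion. The suspicion
	// mechanism then gives the target time to refute the rumor before it
	// is confirmed dead.
	c.GossipSuspect(target)
}
```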

Interestingly, the 2002 SWIM paper from Cornell University thanks Amazon CTO Werner Vogels. At the time he was at Cornell, and he helped them get 55 computers together—that was large-scale distributed systems at a university in 2002—and it's amazing how well this thing works given that's the only scale they had. Now we have people taking it way past a thousand machines before they really start to hurt with this problem.

But even then, at 55 nodes, they saw that sometimes they would get these false positives, and they tracked it down: it was because sometimes a message isn't processed fast enough. There's slow message processing. So they added another mechanism—the suspicion mechanism; I'm not going to give you the details, but it's a way to give nodes longer to come back, and it works around this.

What we discovered was that, unfortunately, even that remediation—the suspicion mechanism—has the same vulnerability. It still requires some of the messages to be processed in a timely manner. The problem we have here is node A suspecting that node B is dead and gossiping that rumor—saying, hey, I think you're dead. What the suspicion mechanism is good at is getting that rumor to the suspected node so it can refute it, because the epidemic nature of the gossip floods around any slow or dead nodes in between.

If there's a fast way to get the information from A to B it will get there, and if there's a fast way to get it back, it will get back. Unfortunately, if the node that was asking in the first place is itself slow—the message can even be on the machine already, it can have come into the kernel, it can be sitting in the device driver, or the protocol buffers, or even in a Go message queue—it just hasn't been processed yet, and the timeout fires. So this is the exhaust port of their Death Star: a flaw that, unfortunately, they were not able to patch, and we have plenty of users who ran into it at scale.

The flapping node solution: Lifeguard

So generally our approach to fixing this is: we realized that there are all these messages, and we know the protocol we implement, so when we send a message we know whether we should expect a reply. It's just some basic accounting you keep track of—well, I've sent three messages out in the last 200 milliseconds, I really should get three replies back in the next 500 milliseconds. And because of that fully randomized communication, you're very well connected to the whole network. It's possible you've gotten unlucky and talked to three dead guys, but it's much more likely that you yourself are becoming disconnected from the network—unless your whole datacenter's going down, and then you have some other problems to worry about!

And actually it's robust even to more than 50% of the nodes going out. But basically what we're saying is: we have an expectation of how many messages I should receive, and if I'm isolated, even the absence of messages and replies is a signal to me, and that's nice. It means you can be completely isolated, you can be disconnected from the network, and the system says, "I should slow down here, I should give these other people more time to reply to me." And if it's an intermittent problem—you get some packets through, or you're bouncing along with the CPU throttling—you're going to give other people the benefit of the doubt.
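Here is a minimal sketch of the kind of accounting being described—a node keeping a local score of missed replies and stretching its own timeouts accordingly. The names and constants are illustrative; the actual Lifeguard implementation in memberlist differs in its details:

```go
// Sketch of Lifeguard-style "local health" accounting: the node tracks how
// confident it is in its own timeliness and stretches its timeouts when it
// has evidence that it may be the slow one. Illustrative names only.
package lifeguardsketch

import "time"

type LocalHealth struct {
	score int // 0 = healthy; higher = less confident in ourselves
	max   int // cap on how far we will stretch timeouts
}

// OnMissedAck is called when we expected a reply (to a ping, ping-req, etc.)
// and did not get one in time: maybe they are dead, maybe we are slow.
func (lh *LocalHealth) OnMissedAck() {
	if lh.score < lh.max {
		lh.score++
	}
}

// OnAck is called whenever an expected reply does arrive; it gradually
// restores our confidence that we are processing messages on time.
func (lh *LocalHealth) OnAck() {
	if lh.score > 0 {
		lh.score--
	}
}

// ProbeTimeout stretches the base timeout in proportion to our own suspected
// slowness, giving peers the benefit of the doubt.
func (lh *LocalHealth) ProbeTimeout(base time.Duration) time.Duration {
	return base * time.Duration(lh.score+1)
}
```

The key property is that the signal comes from the absence of expected replies, so it still works even when the node is completely isolated from the network.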

This works really nicely. The false positives were the problem—all these nodes being accused of being dead when they weren't—and there are three components to the solution. On the left you have SWIM without any of the components, on the right you have the full Lifeguard solution with all three components, and in the middle the intermediate combinations. Each component has an effect on its own, but when you combine them there's a synergy at each stage of that pipeline. We're knocking down the false positives, and it's really powerful.

Once this was deployed, those support requests stopped, so that's a good thing. Also, something really cool we found is that you can re-tune the parameters of SWIM: now that we've got those bad cases out of the way, we can actually be more aggressive. So there's a knob—basically the median detection latency, on average how long it takes before a healthy node sees that an unhealthy node is dead. You can choose to leave that where it was and be really aggressive about false positives: more than a 50x reduction, down to 1.9% of the original false positives. Or you can balance it: take only a 20x or 25x reduction in false positives, but also a 15x reduction in detection latency in the healthy case.

So this was an unexpected bonus—a nice side effect of the work. We weren't looking for this, but that's cool. We were able to tune the system more aggressively because we had eliminated a class of fault.

All right so mission accomplished, right? We go home and we move on to the next project. Well, maybe not, because why did this happen? How did we get into this situation? You want to understand this fundamentally, and it turns out SWIM is in a line of work where if you go back to 1996, people were worried a lot about multicast and virtual synchrony delivering the messages to everybody simultaneously—it's a different era—this was not the cloud era, but a very important paper came out talking about unreliable failure detectors.

There are going to be these processes monitoring one another, hopefully with some redundancy to that monitoring, to make sure that when somebody who's monitoring somebody else fails, we're still doing the monitoring. There's a whole raft of ways this can be implemented: it's not tied to using heartbeats, or tokens in a ring, or doing the probes that SWIM does—these are all alternatives. This is an abstract model of a failure detector. And again, there are different topologies—rings, hierarchies, the randomization that SWIM uses—but the key point is that each individual failure detector can be unreliable. It can have false positives, and then you overlay a protocol on top of it which filters out the false positives. So SWIM fits into this.

SWIM is exactly an instance of an unreliable failure detector. Within this abstract model there's the concept of a local failure detector module—practically speaking, this is the memberlist part of the Consul agent. We said that we have a group of processes that are monitoring one another. Each of those processes has a failure detector module embedded within it, or co-located with it—it doesn't really matter whether it's the same process, but it's on the same box, hopefully exposed to the same conditions, so that it's a good proxy.

There's a failure detector on each one monitoring some of its peers, and when this work was first published it was all about thinking about the far end. It was asking: is that guy alive, is that guy alive, is that guy alive—sending probes or heartbeats or however you wanted to do it. Over time people realized that the characteristics of the network were a form of interference, so they started to draw fancier models where you could learn a model of the network's behavior over time and then subtract it out—you de-noise, you apply a filter to remove some of the bias you're seeing. Otherwise you might accuse process B falsely because you didn't wait long enough, when there were network problems you could actually have figured out were occurring, because you saw how the heartbeat timings had been degrading over the last 20 seconds or so.
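As a toy illustration of that de-noising idea—in the spirit of adaptive, network-aware failure detectors generally, not any specific published algorithm or anything in Consul—you can derive the heartbeat deadline from recently observed inter-arrival times instead of a fixed constant:

```go
// Toy adaptive timeout: watch recent heartbeat inter-arrival times and derive
// the deadline from them instead of using a fixed constant. Illustrative only.
package adaptivesketch

import "time"

type ArrivalWindow struct {
	intervals []time.Duration // recent gaps between heartbeats
	last      time.Time       // when we last heard from the peer
	size      int             // how many samples to keep (0 = unbounded)
}

// Observe records the arrival of a heartbeat at time now.
func (w *ArrivalWindow) Observe(now time.Time) {
	if !w.last.IsZero() {
		w.intervals = append(w.intervals, now.Sub(w.last))
		if w.size > 0 && len(w.intervals) > w.size {
			w.intervals = w.intervals[1:]
		}
	}
	w.last = now
}

// Deadline returns how long to wait for the next heartbeat: the mean recent
// interval plus a generous margin, so a degrading network stretches the
// timeout instead of producing false positives.
func (w *ArrivalWindow) Deadline() time.Duration {
	if len(w.intervals) == 0 {
		return time.Second // arbitrary default before we have any samples
	}
	var sum time.Duration
	for _, d := range w.intervals {
		sum += d
	}
	mean := sum / time.Duration(len(w.intervals))
	return 4 * mean // the factor of 4 is illustrative, not from any paper
}
```

Note that this still only models the far end and the network in between; it says nothing about the health of the local detector itself, which is exactly the blind spot discussed next.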

Many publications followed up on this work, so we read them all, and not a single one talks about the health of the actual failure detector itself—of this local module. They're all focused on the far end; some of them get into this nice modeling of the network, and in the case of SWIM, with the suspicion mechanism, we're worrying about the peers in between who are forwarding the messages. But the local module seems to be a blind spot.

It's interesting that we've gone on so long without people paying attention to this, and you have to think it's probably just a side effect of adopting the crash failure model. In this world you're either alive or you're dead, so why would you care about degrees of health—the failure detector is either working or it isn't. It's interesting how a whole community can follow the same path for 20 years. So we are now wondering what the scope of this thing, local health, really is, and for the first time worrying about the health of the local detector. Where does this fit in?

I mean, we were reading these papers because surely we were going to bump into somebody who must have done this before, so that we could cite them. Some good news came in parallel with us working on this and publishing it. There's an awesome workshop called HotOS. It's run by the ACM, the Association for Computing Machinery, and it's a really cool venue.

People just write these short workshop papers, but they're talking about the big, painful problems. They're trying to find the next big research topic that the community should move towards, and there's a lot of great work, as you can see. The seed of an idea shows up at this workshop, and last year was no exception. A fantastic paper—I would urge you to go and read the gray failure paper. This is a report from Microsoft, and you'll see both Microsoft Research and Microsoft Azure on it, so this is not just the buffoons over there in the ivory tower. It's research and engineering, and I know that the people on the research side of this have led development teams, so this is the considered opinion of people who have been intimately involved in building one of the largest private and public cloud infrastructures.

Their definition of a gray failure—they say things don't always fail cleanly. There are these gray cases: it can have to do with some degraded hardware, and the thing limps along. It's not dead, but it's not happy. These are exactly the conditions that induce the problem that caused us to build Lifeguard. And they even call out that the usual model is overly simple—they call it fail stop, but the definition they give is actually crash failure and they misname it, so even these people are using the terms a little bit loosely.

There might be a clue here as to what's going on, but they definitely nail the cause and the cause, really, is scale—scale and complexity.

A Google and Microsoft-scale problem

There's a quote I love—it's pretty much the definition of the data center: where the unlikely becomes commonplace. So we're talking about some PCI bus controller malfunction, or a cosmic ray, or whatever it is. These are one-in-ten-million or one-in-a-hundred-million type things—but when you have a room with several hundred thousand machines in it, many of these issues are always going to be going on somewhere in that room. So the good news is: congratulations, we've made it to Google-scale problems. We're now wrestling with the same problems, because we're running on Microsoft's and Google's infrastructure and they can't isolate us from all of these problems. We've got their scale of problems—I guess that's good news.

This actually came from a really good paper, Visigoth Fault Tolerance, from EuroSys. I'm not going to have time to go into it, but I would also encourage you to read this paper. These guys are trying to take Byzantine fault tolerance and make it really usable. They have a particular programming model you have to use, which does limit the applicability. They're also trying to leverage how reliable, in performance terms, modern data center networks are—I think this is something that could work for Microsoft provided they're running it on their latest-gen switches, not so much if you're in a random public cloud, which is why we're not dwelling on it. But this is a great paper; we honestly are still unpacking the thoughts in it, and we may end up implementing some of the ideas eventually.

So now we have the concept of a gray failure, as I mentioned, and it strikes us that local health really is something that could be helpful here. We've applied it to SWIM, and SWIM is the easy case I told you about: with that fully randomized communication you're very well connected to everybody in the datacenter, so you have a lot of good signal about whether you are well-connected or not. Heartbeat failure detectors are more point-to-point.

Then with the rings and the hierarchies, there's going to be some redundancy, and we think this could be applied there too, so this is a future research area for us and hopefully for other people. The criticism could be: it's only going to be robust with SWIM, you're not going to make it work with heartbeat failure detectors. The interesting thing here is that usually you will have multiple heartbeat detectors co-located, because you want that redundancy.

Everyone is checking a few other people, but in the literature they never think about the co-location and never cross a signal between the different detector instances, because they had no need to—the local detector was always assumed healthy. But now that we think about it having poor health: I have five failure detectors co-located, maybe I can look at the correlation. If they all stopped getting their heartbeats, maybe it's me.
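A hypothetical sketch of that thought—not an existing feature of any HashiCorp tool, just the correlation idea spelled out: if every co-located heartbeat detector has gone quiet at once, the node should suspect itself rather than accuse all of its peers.

```go
// Hypothetical sketch of the co-location idea: several heartbeat detectors
// run on the same box, and before accusing all of their peers at once we
// check whether the silence is correlated, which points the finger locally.
package colocationsketch

import "time"

type HeartbeatDetector struct {
	Peer     string
	LastSeen time.Time
}

// SuspectSelf returns true when every local detector has been starved of
// heartbeats for longer than the timeout. It is far more likely that this
// node is partitioned or stalled than that all peers died simultaneously.
func SuspectSelf(detectors []HeartbeatDetector, timeout time.Duration, now time.Time) bool {
	if len(detectors) == 0 {
		return false
	}
	for _, d := range detectors {
		if now.Sub(d.LastSeen) <= timeout {
			return false // at least one peer is still being heard from
		}
	}
	return true
}
```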

Also, you could actually do the SWIM-style randomized probing even if you only send one message from each node every second. That's going to give you some background level of confidence. And if you want network coordinates—in Consul we have Vivaldi, which allows you to say, "route me to the fastest, closest instance"—then that's a value-add to having that background radiation of once-a-second messages between peers.

There was another really cool paper at HotOS, this time from Google. I don't know if they talked to each other first, but Google and Microsoft both showed up at the same conference and started talking about the pain points they're having at scale. The most important thing they brought to the table here—it's common sense, but sometimes you need to be reminded of the obvious—is that there's an inexorable link between availability and performance. Is it dead, or is it just running really, really slowly? We seem to be having trouble with things that are running really, really slowly, so maybe let's model this as a performance problem.

But still, how did we get here? Why is it that in 2017 we've got Google and Microsoft showing up at the same conference saying we've been doing this fail-stop thing for a while now and it's just not working for us—and confusing the terms fail stop and crash stop in their terminology along the way? How did we get here?

I told you about Stefan's course, and the link is on Twitter. He's actually been teaching this course for twenty years; it's based on his PhD thesis, which you can buy as a book. He's doing real-time systems for automotive, he's doing hardware and software co-design, he's trying to build dependable systems. And it turns out if you go to IEEE DSN—Dependable Systems and Networks is what DSN stands for—it turns out that in that community this is a problem that has been treated as an electrical engineering issue.

This is for trains and missiles and chemical plants. There are lives at stake. These people take this stuff incredibly seriously. This is an awesome paper and I encourage everybody to read it. It is just a fantastic, very detailed, very precise enumeration of all the aspects of building a dependable system, from the point of view of engineers with a capital 'E', not software cowboys. And there are some amazing diagrams in here. Basically, not every failure is a stopping failure—there are these erratic services.

Back in 1980 they were already classifying and tackling this taxonomy—it's all laid out there. Google got the message last year, but in that community stopping is a strategy. This fail-stop thing is really hard, but guess what: you actually commit to doing it. You try to make your systems fail—you're failing safe. It's the dead man's switch: the driver has a heart attack, he lets go of the handle, and the train stops. They have to build that; it doesn't just happen. So this is some pretty important stuff that doesn't seem to be present in our community. What can we do?

I have a general prescription, which you can guess from the last couple of slides: I really think we need to start thinking in terms of building dependable systems as a design center. We've taken the terminology—fail stop, crash stop—we've gone over to the IEEE guys and said, "that's a great model, thanks," and then we've done absolutely nothing about adopting their practices.

We're in a different setting—we're not doing hardware and software co-design right now, although if you look at SDN and disaggregation in the data center, people are starting to say, "hmm, if we had a switch with these special hardware qualities..."—so maybe we could actually go in the direction of hardware and software co-design with this stuff. But even without that, concretely, today we have to go beyond the binary and the generic.

I mean, we're seeing it now: is it alive, is it dead, or is it just running slowly? Google pointed the same thing out, but people have been doing this for a while—you have to set the SLAs, you have to set the quality of service, you have to think about the application-level view of the health of your system. Is it delivering at the rate it needs to deliver to be considered a happy, healthy, active member of this population? And if not, we should flip the script. We should say, "okay, I'm not going to wait until I'm sure he's dead—I'm going to kill him, it's time to go"—or the node sees its own poor local health and takes itself out.

So these are my prescriptions. I hope some of this has sparked some thoughts for you all. The references are on Twitter and I'd love to talk to people about this. Thank you.
