Application Upgrades with Consul Service Mesh
See how Consul can be used to implement traffic forwarding for service deployments.
Speaker: Dan Kirkwood, Solutions Engineer, HashiCorp
» Transcript
Good morning. Good afternoon. Good evening. Wherever you are.
Thanks for spending the next 15 or so minutes with me to talk about one of my favorite topics, which is Consul.
Here's what we're going to be covering today. We're not going to spend much time on the basics of what is Consul or what is a service mesh. I have some resources that I'll share with you at the end if you're interested in a more basic introduction to some of these solutions or tools.
» Consul: The Basics
Very quickly, in terms of what Consul is, I like to think of it as a networking automation solution that offers you a consistent experience, no matter which platform or type of application deployment you're doing.
The very common use cases for Consul, the specific things that we're going to be looking at today, fall on the right-hand side of this slide. My focus is things that you can do underneath a service mesh and specifically how that might apply to an application upgrade.
» “HashiCups” Demo
To understand our upgrades, we need to understand, first of all, our application. On the screen now, we have a very simple architecture for my coffee ordering application, called HashiCups. I have it running, and we're going to see it in action in a second.
This is a microservices application, and as my business has grown and evolved, these services are sitting in different locations. I have some that are in AWS, some that are in Microsoft Azure, and these services may or may not even be part of the same deployment environment. I'm using Kubernetes, but I could also have services that live in another orchestration system like Amazon ECS. I might be consuming cloud native services, something like RDS.
The thing to point out here is that microservices are useful. It can be a great architecture to move towards if you want to move quickly, if you want to be able to update and scale services independently, have teams that look after one service only and be able to iterate and improve that to deliver a better experience to your end users. But moving to this architecture introduces new complexity around how I manage this application. It brings in complexity, specifically, around the network. I have to work out, How are these services going to find each other? How are they going to keep state of which service is up and healthy and able to take traffic? If these services scale independently, how does an upstream service know when a downstream service has scaled up or scaled down? Specifically for today, what do we do when we need to swap out something like our payment service?
» Service Networking with Consul
To combat that introduced complexity, especially in the network realm, we have Consul. Consul is used for what we call "service networking." The first thing it can do for us is keep a registry of all of these disparate services that make up my application. They can all register with Consul. If we take a look at the Consul UI now, we have a different view of what my application looks like, where we can see all of the different elements that make up my application.
I get a top-level view of, Are these things up or not? Are they even available? Is my application working? I can jump into any one of these and get a bit more information. I can find out the IP address and port that they're running on, which node. This is running on Kubernetes at the moment, but it might be running on a VM as well. What is the underlying infrastructure that is supporting this app? I can then find out more information about what makes this thing healthy. What are the checks that I'm doing to ensure that this application element is able to take traffic?
All of this is built up into what I would call a service control plane. A wide view of this is, Are all of my application elements healthy or not? This control plane is useful for me because I can see at a glance if things are working or not, but how do applications use this? How do we use this to automate? How might we use this for our application upgrade?
One of the ways that we can put that control plane into action is through the service mesh. With a Consul service mesh, we're inserting a data plane element alongside each one of my application elements. In this case, it's an Envoy proxy. It doesn't need to be. You can bring your own proxy to a Consul service mesh, but Envoy works really nicely out of the box.
Now I've got a proliferation of proxies. I'm definitely not going to be configuring each of these or managing the lifecycle of these proxies on my own. Consul is going to be doing all of that configuration and lifecycle management for me.
The nice thing about this is that Consul can act as a CA (certificate authority). It's going to be serving out certificates to those proxies.
These proxies can now authenticate to each other. You've now got this first check: Should you really be part of this service mesh or not?
Once we've passed that check, I can now encrypt communication between each of my service elements. I can also control my data plane in a much more granular way. You'll notice that my traffic path changed from application-to-application to proxy-to-proxy. This means that I could now do something like block out-of-band communication within my service mesh. I can definitively say that I don't want the frontend to be talking to the payment service, and I want to control whether that's happening at all.
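If you're running on Kubernetes, getting those proxies injected is mostly a matter of chart configuration. As a rough sketch (not the exact values from this demo, and chart versions differ), the official hashicorp/consul Helm chart enables sidecar injection like this:

```yaml
# values.yaml for the hashicorp/consul Helm chart -- a minimal sketch,
# not the demo's actual configuration.
global:
  name: consul
server:
  replicas: 3        # a typical small server cluster
connectInject:
  enabled: true      # run the injector that adds Envoy sidecars to pods
  default: false     # pods opt in individually via an annotation
```

With `default: false`, each workload joins the mesh by setting the `consul.hashicorp.com/connect-inject: "true"` pod annotation, which is exactly what the payments deployment later in this demo does.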
Coming back into Consul, we can see that many of these elements are part of my service mesh. And I've already got some rule definitions in here around how these services should be communicating, which API should be allowed to talk to a database or to another API. Underneath the whole thing I've put a blanket "deny," so anything that is not explicitly allowed is going to be blocked.
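In Consul, these allow/deny rules are called intentions, and on Kubernetes they can be managed as custom resources. As a sketch (the service names here are stand-ins for the HashiCups services), a blanket deny plus one explicit allow might look like:

```yaml
# Catch-all intention: anything not explicitly allowed is denied.
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: deny-all
spec:
  destination:
    name: '*'
  sources:
    - name: '*'
      action: deny
---
# Explicit allow: the payments API may reach its database.
# "payments-api" and "payments-db" are hypothetical names.
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: payments-db
spec:
  destination:
    name: payments-db
  sources:
    - name: payments-api
      action: allow
```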
I've got good architecture. I've got my application up and running. I've got my service mesh, which is working.
Now, with my application, I'm going to buy a coffee for myself. I've just submitted a payment. I'm interacting with the frontend service.
That payment has gone through my payments API and hit my database, and I'm getting a return value that's showing my credit card number.
» The Upgrade
However, we have a problem.
I've been showing off my app. I've been telling people how many new users that we're getting, how many transactions we're getting per month. I'm really excited about how it's going. But security has said, "Whoa, whoa, whoa! Stop!" They've looked into my app and they can see that I'm storing customer data — credit card numbers — in my database in plaintext. They said, "This has to change." I asked them, what should I do?
Luckily, the security team has a service. They are using HashiCorp Vault. They've turned on one of the use cases with Vault, which is encryption as a service. And they said, "All you have to do is integrate with our API and you'll be able to encrypt all that data before it goes into the database."
So this is good. I've got a path. I know how I can update my application to take advantage of this. And I've now got a new feature to roll out.
We're not in waterfall-land here. So I'm not going to look at this feature and say, "They need to think about user management, and a new UI, and some new coffee types. I'm going to bundle them all together and that'll be part of the release for next year." We're not going to do that. We're agile. We're using microservices. I'm going to roll out this feature and affect only the service that manages payments.
» Agile App Update Methods
In an agile world, how do I think about rolling out my updates? I could do something like a blue-green deploy: run parallel versions of my service, switch traffic between them, and switch back if I have a problem. I might do something like a rolling release, gradually updating the nodes that make up that payments service to get the new feature out. Or I might do a canary deployment: pick a subset of nodes, put only a little bit of traffic towards the new service, and see how I go in terms of errors or user experience.
As I think of these different ways of doing my deploy, the blue-green and canary are more deterministic methods. If I'm going to do a rolling release, I don't have a lot of control around how that release is going to look. I don't really know how I would roll back in that case, either.
If I'm thinking deterministically, I'd probably pick either blue-green or canary, but thinking back to the network and microservices, these methods have a network implication. I need some way to be able to steer traffic if I'm going to pick blue-green or canary.
We are using Consul for our service mesh. We have a solution for this problem as well.
To understand that solution quickly, we're going to look at how Consul does service resolution. We'll cover default service resolution first: What happens if I don't configure anything? We have a blue service and a yellow service that are part of my service mesh.
There are a couple of assumptions here. First assumption is that every instance of my yellow service is healthy. Second assumption is that I have allowed communication from blue to yellow. If both of these things are true, Consul will, by default, randomly spread connections from the blue service to the yellow service.
This might be fine for regular operations. I might want to put in a bit more detail around how I'm going to do load balancing, maybe round robin or least request or something like that. This is not especially deterministic and not really that helpful for my application upgrade scenario, but I do have more control.
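That extra control is a resolver setting. As a hedged sketch, using the "yellow" service from the diagram, switching the default random selection to least-request looks something like this:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: yellow            # the downstream service from the diagram
spec:
  loadBalancer:
    policy: least_request # instead of the default random selection
```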
» Service Resolution Options
Moving beyond that default service resolution, what can we do?
We have this Envoy proxy that's forwarding from blue to yellow. What are the levers that we have to pull to influence how it's going to do that traffic forwarding? We've got three of them:
A Router: With a router, I can match against specific Layer 7 attributes. This could be a path, a header, or a specific query. Based on those matches, I can send traffic over to different services or service subsets. (There's a sketch of a router definition just after this list.)
A Splitter: I can look at things that I've matched with my router, or maybe I'm just matching everything. I can take traffic that's coming to a destination service and choose to split it between subsets of that service, or send it to a completely new service altogether.
A Resolver: The resolver determines what I am matching against to create what we call a "subset." With my payments service, I might decide that all instances of the service that come from this particular cluster will be grouped. Maybe that's my new version with the new feature. Maybe I'm going to tag my new version in some way. This gives me the ability to choose what I'm going to match against to create subsets for a particular service. I can then also define what load balancing looks like. Once I've got my subset of V1 and V2, there are multiple nodes within those subsets. How am I going to load balance across those nodes? I can also decide if I want to do something like failover. Do I want to try all healthy nodes in a local datacenter first before we forward across to a backup datacenter?
These are the three ways that we can influence traffic as it moves through the service mesh. They apply at Layer 7, to anything that uses HTTP. I can use this for application upgrades and many other scenarios besides.
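The router is the one lever we won't use in today's demo, so here's a hedged sketch of what one could look like on Kubernetes. The path and subset are illustrative: requests to a beta path get steered to a V2 subset, and everything else falls through to default resolution. Note that these Layer 7 features assume the service's protocol is set to http, typically via a ServiceDefaults config entry.

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceRouter
metadata:
  name: payments-api       # hypothetical service name for this sketch
spec:
  routes:
    - match:
        http:
          pathPrefix: /beta      # L7 match: could also be headers or query
      destination:
        service: payments-api
        serviceSubset: v2        # a subset defined by a ServiceResolver
```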
Let's take a look at how we're going to put this into action with our application today.
Here I have my Kubernetes deployment spec for my payments API. My payments API is already live. I have my V1 in my production app today, so this deployment spells out what I'm going to do for V2. The important thing here is I've written out how I'm going to be integrating with Vault, so the security team is going to be very happy.
I've also got a couple of annotations in here that describe that I'm going to be part of the Consul service mesh, which is good. I'm also going to tag my deployment: I'm defining a service tag for Consul to say that this is a V2.
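The full spec isn't reproduced here, but the mesh-relevant pieces look something like this (image, port, and service names are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
      version: v2
  template:
    metadata:
      labels:
        app: payments-api
        version: v2
      annotations:
        consul.hashicorp.com/connect-inject: "true"  # join the service mesh
        consul.hashicorp.com/service-tags: "v2"      # the tag the resolver matches
    spec:
      containers:
        - name: payments-api
          image: example/payments-api:2.0.0          # hypothetical V2 image
          ports:
            - containerPort: 8080
          readinessProbe:                            # feeds service health
            httpGet:
              path: /health
              port: 8080
```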
And I want to point out a couple of other things when I roll out this new version of the application. First is my splitter. I'm going to tell Consul that I would like to split the top-level service of the payments API between two subsets. I'm going to split between a V1 and a V2. When I first deploy this, we're going to have 100% of our traffic going to V1. So we're not going to affect production traffic on the first deploy.
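As a sketch of that splitter as a Kubernetes custom resource (assuming the service is registered as "payments-api"):

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: payments-api
spec:
  splits:
    - weight: 100        # all production traffic stays on V1 at first
      serviceSubset: v1
    - weight: 0          # V2 is deployed but receives nothing yet
      serviceSubset: v2
```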
Finally, I've got my resolver. Here's where I'm saying, "When I say that I want a V1 and a V2, I want to match against the service tags which Consul is tracking and which I'm defining in my Kubernetes deployment spec."
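A resolver that builds those subsets from the tags could look like this; the filter expressions use Consul's standard filtering syntax:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceResolver
metadata:
  name: payments-api
spec:
  subsets:
    v1:
      filter: 'Service.Tags contains "v1"'   # existing instances tagged v1
    v2:
      filter: 'Service.Tags contains "v2"'   # the new Vault-integrated build
```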
» Testing the Upgrade
Let's roll this out. We're going to see that new instances of the payments API come alive.
That was pretty quick. Consul is keeping track of my deployments into Kubernetes. It knows that there's a new version that's come in from my application. Those new versions aren't yet healthy, because we're waiting for the pod to be available.
The other thing that I'd point out at this stage is that we don't have to touch any of our security constructs to get this done, which is something I like about this approach. I don't need to go in and change rules from an authentication/authorization standpoint to be able to do something like an application upgrade.
My rules, as they apply to the payments API, are consistent. Wherever that payments API app happens to reside, whatever I'm doing in terms of service subsets, I've still got these overarching security policies that govern what's allowed to happen within my service mesh.
I've now got six instances of the payments API, and we've defined that three of them are V1 and three of them are V2.
When I submit this payment, I'm still getting plaintext back.
That's because I haven't yet implemented my service splitting. We're still sending all of our traffic to V1 of the service. Let's make a change to that now. We're going to do a 75/25 split. I can apply the same change. I am using custom resource definitions with Kubernetes. Thinking about how this might play a part within your automated CI/CD pipeline, it's really easy for me to deploy these changes to the Consul configuration, because I can do it all through the Kubernetes API.
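The change is just new weights on the same splitter resource, applied like any other manifest (for example, `kubectl apply -f splitter.yaml`):

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: payments-api
spec:
  splits:
    - weight: 75         # most traffic stays on V1
      serviceSubset: v1
    - weight: 25         # canary share for the Vault-integrated V2
      serviceSubset: v2
```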
Now that I've sent that change through, we should see some of our requests hitting the Vault backend. On my fourth or fifth click, a payment that I've submitted through my payments API service has hit V2. V2 is integrated with Vault. Vault has returned back an encrypted value that's going to be sent into my database. I know that it's working.
At this point I would probably do some testing of my canary deployment: Is this the behavior that I expect? I can also use Consul to take a look at some key metrics around how my application is performing. This is going to surface for me if I'm getting something like 500 errors against my service. Those are the kinds of things that will be surfaced within the topology view of what I'm doing in Consul. They could play a part in how I do testing around this release as well.
Let's say I'm quite happy with that. We're going to cut all traffic away from V1 and send it all to V2. Apply the change, and now every payment is going to be integrated with Vault.
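The final state is the same resource once more, with the weights flipped:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceSplitter
metadata:
  name: payments-api
spec:
  splits:
    - weight: 0          # V1 stays registered but receives no traffic
      serviceSubset: v1
    - weight: 100
      serviceSubset: v2
```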
» A Quick Review
So what happened then? We had our working service mesh. We introduced a new version of our application. We registered that with Consul under the exact same service, but with a V2 tag.
Consul then deployed a data plane element, an Envoy proxy, next to that service. We then defined the splitter and the resolver, and Consul used those to go and configure the upstream Envoy proxy to start doing traffic steering.
We then had a period of both services being live, and I've got a lot of flexibility around how I might split that traffic. We didn't use the HTTP router, but I might use that to take particular headers and route them towards the new service or a particular path for people who want to do something like beta testing. After the period where both services were live, I was able to eventually cut traffic entirely from that V1 service towards V2.
From the application instance point of view, both of these apps are still available, still healthy, still exist within my service mesh. But at this point I could take out the V1 services, and I wouldn't affect traffic from my application in any way.
That's it for the demonstration, and I hope that was interesting for you. It's a very specific use case, but it's just one example of what Consul can do, coming back to that idea of multi-cloud, multi-platform, service-based networking.
If that was interesting for you, here are a few places to learn more:
Definitely head to Consul.io. Everything around Consul is centered on that site.
We have our HashiCorp Learn pages, with two tracks: the Consul traffic management track, and the Consul service mesh track. Both of these have multiple hands-on labs that you can follow at your own pace and get an understanding of how some of these things work.
Finally, we've got webinars around Consul, how it works across multiple datacenters, and how you would do observability within Consul. All of those are on our website as well.
That's it! I want to say thank you very much for spending the time. Great to be with you this afternoon and chat about Consul, and I'll see you on the next one.