Service Mesh Interoperation Between VMware NSX Service Mesh and Consul
See how VMware NSX Service Mesh is able to run Kubernetes workloads and interoperate with HashiCorp Consul Service Mesh running VM workloads.
Much has been discussed about multi-cluster deployments in service meshes, running tightly coupled workloads. In this case, the owners of the workloads are constrained by a higher authority. This authority forms an organizational unit boundary and establishes conventions for network addressing, workload namespacing, identity, and security policies.
The main reason for operators to adopt these conventions is to ease administration. However, very little has been said about service mesh interoperation, where each mesh is in a different, mutually untrusted administrative domain (and hence workloads are loosely coupled). Each mesh can be from the same or different vendors, can have the same or different control and data plane implementations (single or multi-cluster), and can provide the same or different functionality to its customers.
In this talk, you'll see how VMware NSX Service Mesh, running Kubernetes workloads, is able to interoperate with HashiCorp Consul Service Mesh running VM workloads. The service meshes will establish a secure communication channel, synchronize their service registries, and establish an mTLS communication channel between workloads.
Speakers
- Sergio Pozo, Staff Senior Solutions Engineer, VMware
Transcript
Hello, everyone. I hope you are having a great conference and you are learning a lot of new stuff.
Thank you for attending my session today on different service mesh implementations and how we can make those service meshes interoperate. This is a relevant business case, especially when service meshes need to reach a bigger scale; let's say it's a different way of scaling service meshes.
As we go through the presentation today, we're going to show 2 ways of scaling a service mesh. This is one of them.
One of the scenarios where service mesh interoperation is needed is a large company with distributed teams working in different places in the world, each delivering services which need to interact with services created by other teams in other places, where those teams don't talk to each other.
This is not unusual. These different teams often need to deal with different regulations, and in many cases they need to use different pieces of infrastructure because of those regulations.
But also they have different cultures. They might be coming from a different background. They might be using different products. Their journey to automation, their journey to microservices might have been completely different from one team to another.
That means that the decision about which service mesh to use, or the maturity of that decision, might be different as well.
This is the agenda for today:
What are the new old problems?
What is a service mesh?
Consul Connect overview
NSX Service Mesh overview
Service mesh federation
Demo
First of all, because there are so many people here whose backgrounds and skillsets might be completely different, I'm going to try to get everyone more or less on the same page. I'm going to run very quickly through the problems that we have all faced as developers and why we have moved to a microservices architecture.
How real is that? Because like it or not, we are still living, and we are going to be living for many years, in a hybrid infrastructure where monoliths are going to be moving into microservices.
That process means decomposing the monolith and creating more and more microservices over time, and no one really knows how long that is going to take.
Then I'm going to describe why service mesh makes sense.
I'm going to describe Consul’s service mesh product, Consul Connect, and VMware's service mesh product.
And then I'm going to describe the work that we have been doing (VMware, HashiCorp, and a couple of other companies) around interoperating service meshes, which is going to be released as open source, ideally in the next few weeks, for all of you to have a look at, give us feedback, and contribute.
I have a live demo. If things work properly, I'm going to be switching between the slides and the product, which is a better way to illustrate the different concepts we are going to be talking about.
I have here a command line and different products, and let's hope everything works.
The new old problems
So, monoliths. Look at the picture.
What we have in the picture is 3 different monoliths, each composed of different parts (the colors inside each monolith). When the monoliths, which are applications, need to talk to each other and share information (it can be a database, a frontend, a backend), they need network connectivity.
These are the lines that we have between the monoliths.
The problems with monoliths are well known by all of us. They are complex; they are big. They prevent innovation because we are committing to a technology stack for a very long time. There is a problem with availability as well.
If a part of the monolith is not functioning, it's difficult to know why the monolith is failing, and it is very likely that the monolith fails completely.
That's the reason why we decided to move to microservices: to reverse that situation.
With microservices, we have benefits that are the opposite of the problems we mentioned before. We have agility, because we now focus on small pieces of software instead of big ones. We are not committing to a technology stack forever; we can change the technology stack and create new microservices.
Those microservices are going to communicate through APIs. The model is much more elastic, because the monolith scales vertically, which means more hardware or a bigger virtual machine,
while microservices scale horizontally and independently from one another. So we replicate the microservices, and that creates a new problem: stateful applications versus stateless applications.
If we look carefully at this picture and try to think about it at scale, what we are going to find is different pieces of infrastructure in different places of the world, different applications with different levels of maturity, and a hybrid environment where monoliths are being converted into a microservices infrastructure.
There are different degrees of that conversion in different places in the world, across different teams using different technology stacks. But they all share the same thing.
Do you see how many lines we have now? It means that we have solved some problems, but we have created new old problems. We have a lot more connectivity now, and a problem in any of those connections means a problem in the final application, not just in the service itself.
But we need to think about this as an application, as a composition of services. If one of the services doesn't work, maybe the application doesn't work. Or maybe it is resilient and still works.
Think about Amazon. If the shopping cart doesn't work, it is very likely that I cannot finish shopping, and that means there is no business for Amazon. That is a problem in the application itself, although not all the parts of the application have failed.
We also need to pay attention to securing those links. It's not only about paying more attention to network connectivity; it's also about: How do we secure all those links? How do we monitor all those microservices?
There are going to be many teams doing this. How do we ensure that we have a consistent way of monitoring all those microservices? This is a different kind of problem that we didn't have before, and it comes with scale: before, we were monitoring a few things that were very big, and now the problem is inverted.
The way we've traditionally solved this, because developers are early adopters, is with libraries. The first problem we found was: How do I observe and troubleshoot my services? Because I am not yet at scale.
This is my first problem, and we created different libraries, which also have a problem: these libraries were programming-language-dependent. Of course, there are tweaks we can do. For example, in the case of Eureka, we can run a sidecar service if my application is not based on Java.
But we are complicating things that were already solved in the networking world.
The next thing is that I am scaling my infrastructure, so I'm going to need to deal with connectivity and controls. I need new libraries to deal with the classic networking problems: I need to deal with load balancers; I need to find a way to make my application fault-tolerant.
I have different options here. I can use a product that is on the market and interact with it through an API, or I can use a software-based product which replicates the functionality of the traditional hardware product.
Then, as we wanted to move applications into production, we found security and compliance telling us, "You are not going to do it unless you prove you can deal with the compliance regulations, and you can do this big list of things in your applications."
Security and compliance people and operations people have completely lost visibility of what is going on, because everything is packed, let's say, in the application.
The application is now the microservices, but a lot of what we have added is not application logic. It is networking logic and security logic that we need in order to make that dream of different microservices connected together a reality.
And many times that's a nightmare, because of compatibility problems and because of software that is repeated in different microservices. As this model scales, we end up with a really big problem, and that problem has been the seed for the service mesh.
What is a service mesh?
We have tried client libraries, and we have failed with client libraries. The intention of a service mesh, or the rationale behind it, is: How do we reverse that situation of having increasingly complex microservices because of client libraries that keep growing in number and complexity?
We move back to giving operations and security teams the control, the visibility, and the security they are used to having for production environments.
The rationale behind the service mesh is really: Let's move all that logic out of the microservice to a different place. That place cannot be the networking stack again, because we haven't really gotten there with it. Ideally, that place is a transparent layer that sits between the application and the infrastructure.
And that is the concept of a sidecar.
The sidecar extracts all that logic from the microservice and makes it transparent to the developer. However, to manage the sidecar, we need a control plane in the service mesh that is able to tell the sidecar what to do.
Service meshes have 2 things:
The data plane, which is the sidecar, the piece that actually does something
The control plane, which is the brain, the piece of the service mesh that we are going to interact with
We are not going to see the sidecar.
This is the model of all service meshes. The use cases of the service mesh, as you can imagine, are the traditional ones. We are going to be solving the same problems that client libraries were solving:
Discovery: Discover and analyze the relationships and dependencies between services
Visibility: End-to-end topologies, monitoring, tracing, and behavior analytics of services
Control: Increase service resiliency and control over traffic management
Security: Business-level security policies, including securing service-to-service communication
Service meshes are going to provide service discovery mechanisms, ways to observe in a consistent way big networks of microservices, ideally across a variety of infrastructure.
There's going to be control, not only for operations but also for developers, so they can instruct the service mesh on the desired behavior of the connectivity of those services: for example, to simulate traffic or test faults in the application, as sketched below.
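To make that concrete: in an Istio-based mesh (and, as we'll see in the demo, NSX Service Mesh installs Istio in the cluster), a developer could inject an artificial delay into a fraction of requests without touching the application. This is only an illustrative sketch; the service name reviews and the numbers are made up.

```shell
# Illustrative only: inject a 5-second delay into 10% of requests to a
# hypothetical service called "reviews", with no change to application code.
# Assumes an Istio-based mesh.
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-fault
spec:
  hosts:
  - reviews
  http:
  - fault:
      delay:
        percentage:
          value: 10
        fixedDelay: 5s
    route:
    - destination:
        host: reviews
EOF
```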
And it's going to provide a layer of security that is really the main driver for production deployment of service meshes.
How can I make sure that developers do not have the responsibility of securing the service-to-service communication? As a security team, I am accountable for enforcing security in a way that is transparent to the developers.
Consul Connect overview
Consul Connect is an implementation of a service mesh; it is HashiCorp's service mesh. Consul Connect has a client/server architecture, which provides many benefits regarding scalability and dealing with heterogeneous infrastructure.
In this architecture, we have more than 1 server to have some kind of resiliency and high availability. The servers on the control plane are going to talk to the clients in each of the infrastructure components that we are going to have.
There are going to be different clients, and the clients are going to program the sidecar proxies, which are the data plane. Consul provides out-of-the-box integration with Envoy as a proxy, but the model is pluggable, so we can potentially use different proxies in different deployments of Consul, or a different proxy in all of them.
Having that kind of architecture is what gives customers the ability to have different datacenters with different service meshes, using the same implementation in this case.
These datacenters can be physical or virtual; they can even be just virtual machines, not necessarily a cloud-based datacenter. And there is auto-join functionality: as soon as we have 1 cluster of Consul servers and then another one, we can make the new cluster join the existing infrastructure. So scaling is heavily simplified.
The service catalog is synchronized across different infrastructure, and Consul is providing today bidirectional synchronization of the service catalog between Consul servers and Kubernetes.
Consul also provides, very recently (since 1.6, I believe), a mesh gateway, which is what makes that model of connecting different datacenters work.
When we are connecting different datacenters, we cannot make any assumptions about the network address space, and we need to secure the communication between datacenters. The mesh gateway provides exactly that functionality.
Given a single service registry, services in 1 datacenter are going to point to services in other datacenters through the mesh gateway acting as an ingress. But we can also use the mesh gateway as an egress, so we can control how traffic leaves the datacenter.
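Operationally, a mesh gateway is an Envoy instance started and registered through the Consul CLI. A minimal sketch (Consul 1.6+), with placeholder addresses:

```shell
# Minimal sketch: start and register a Consul mesh gateway (Consul 1.6+).
# -address is what local services use to reach the gateway;
# -wan-address is what other datacenters use. Addresses are placeholders.
consul connect envoy -gateway=mesh -register \
  -service "mesh-gateway" \
  -address "10.0.0.10:8443" \
  -wan-address "198.51.100.10:8443"
```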
Let me walk you through the Consul UI. Nic Jackson is doing an in-depth talk about Consul Connect later today, so we don't have time here to go deep into the product.
We have here the service catalog. We have the nodes. In this case we have a VMware node, which has been federated: this is the VMware NSX Service Mesh node, and this is the local node. We have key/value storage, we have ACLs, and we have intentions; I can create intentions here. This is really an inventory.
If I go into the different services, this Consul service is the Consul Connect service. We can see all the attributes of the different services here, and we can keep exploring. We can also use the API, but the API provides more or less the same information as what we have here in the UI.
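Everything the UI shows is also reachable programmatically. Two hedged examples, with hypothetical service names: listing the catalog through the HTTP API, and creating an intention from the CLI.

```shell
# List the service catalog through Consul's HTTP API (same data as the UI):
curl http://127.0.0.1:8500/v1/catalog/services

# Create an intention allowing a hypothetical service "web" to call "db":
consul intention create web db
```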
NSX Service Mesh overview
This is VMware NSX Service Mesh, and the differential value of VMware NSX Service Mesh is that we are providing a single pane of glass to scale service meshes using different clusters of different technologies.
For example, we can deliver all of the service mesh services (policy, telemetry, observability, etc.) across a variety of infrastructure, so the user doesn't need to know where the services are running. The infrastructure becomes a pool of abstract infrastructure, and workloads can be moved across it, so we can have microservices migrating from one cluster to another.
And from the operations perspective and the developer perspective, it is going to be the same thing. Nothing is going to change, which is very consistent with the way VMware has been delivering value: abstracting the hardware or the infrastructure and providing infrastructure-neutral constructs. This is no different.
Another big differentiator of NSX Service Mesh is that, although the current version of the product only deals with services, a strong part of the vision, coming next year, is to extend the service mesh services to users and data as well.
For example, if today we can represent policies using services, next year we're going to be able to represent policies using different constructs, additional constructs, which are users and data sources.
If we zoom in to that image, we have a high-level architecture which more or less represents what I have been saying before. That is, we have service mesh services: discovery, visibility, control, and security across services, users, and data.
Then we have a pool of infrastructure beneath the control plane. The control plane of NSX Service Mesh is a new control plane, and underneath it is the infrastructure we are dealing with.
The way we are providing the mesh services is through these 2. NSX Service Mesh is based on these 2, and it provides all of these functionalities out of the box, in addition to product-specific features.
Now let's think about scalability and service meshes at a larger scale. As I was saying when introducing the talk, we need to think about what happens when we want a model like this, but instead of having 3 different clusters owned by the same company, we also need to interact with a piece of infrastructure which is not owned by the company, or which is owned by the company but is in a different administrative domain.
We have no idea what is going on there, and we will never agree on how we are going to consume the infrastructure or the service mesh. In those cases we need to think about different ways of integrating, and that is going to happen in many situations.
For example, virtual machines. What is going to happen in a serverless infrastructure? What is going to happen if we want to extend the service mesh to SaaS applications, or if we want the services we're developing to interact with services created by a third party we don't have access to? How do we do this?
The answer is interoperation.
Interoperation is a multi-mesh, multi-vendor, multi-product way of extending a service mesh. When 2 meshes are interoperating, we make no assumptions about how the service meshes are implemented.
The only assumption we make is that they implement an interoperation API. That interoperation API is the work we have been doing with HashiCorp and a few other companies. It is the basis of starting to think about interoperating service meshes.
Those service meshes which are interoperating need to begin with something: they need to provide a minimum set of services, and those services need to be available in all of the meshes that are interoperating.
One is identity, another one is service discovery, and the third one is security. So interoperation starts, let's say, with a single service registry across different service meshes and mTLS communication across services in different meshes.
Let me show you a little bit of NSX Service Mesh. This is the user interface of NSX Service Mesh. What we have here are 2 different clusters. I have deployed an app in 1 of the clusters, and I am generating traffic to that app. We can see the service graph of the application.
This also has an inventory, which includes the relationships between the services. So we can see here the different services, the instances. We can zoom in to the services to see more things.
We can navigate the service graph. We also have telemetry information as time-series data, so we can know how the service has been behaving over time: different information like RPS (requests per second), different kinds of latency, etc. We can see how the service has behaved at different moments in time, and we can aggregate data over up to 24 hours.
Then we have a representational view of the inventory as well, which is again a way to see how the services, and the whole mesh, are behaving. We have more features on top of this, but this is possibly not the right place to discuss them all.
Service mesh federation
As I was saying before, the first thing we need to do to make 2 service meshes interoperate is federate services between them. That is, given 2 service meshes, we need to make sure that these 2 service meshes can share control information.
To do that, we create an encrypted control channel between them. The service meshes authenticate each other, and we also have a mechanism to synchronize services between the meshes.
As 1 service mesh creates services, the other service mesh consumes them. How each service mesh implementation gives the user the ability to select which services are going to be federated with which other meshes is implementation-dependent.
Consul has its own mechanism, NSX has its own mechanism, and Istio has its own mechanism. It's a user-experience concern, and it's not part of the spec that we have created.
There are many other things in the spec that we have left to the implementers of the service meshes. There should also be a way to create an encrypted channel between the services across different meshes.
We need to think about interoperation not only between 2 different meshes, but across a variety of meshes. We can have any number of potentially different meshes synchronizing catalogs between them and creating a service-to-service communication between them.
To do this, the use case is normally the same company. There's normally a root CA (certificate authority) somewhere which has signed the certificates of the branches, the different departments or administrative units. We're using certificates which have been signed by the same root CA.
The way this works: On the left, we have NSX Service Mesh. On the right, we have Consul Connect. We have a federation agent which is deployed in each of the meshes.
The federation agent is implementation-neutral. It is one of the parts we are going to be releasing as open source, and it basically runs the mesh-to-mesh control protocol. It also runs the service catalog synchronization between the meshes.
Once a service in 1 mesh wants to connect to a service in another mesh, it can discover that service because there is an entry in the local catalog pointing to the service in the other mesh.
However, how each mesh routes traffic securely (the mesh on the left for exiting traffic, the mesh on the right for entering traffic) is also implementation-specific. Consul has its own implementation, because it is their mesh; NSX Service Mesh has its own implementation; and Istio will have its own implementation as well.
This is what is going to happen. If we want to connect the service foo on the left with the service bar on the right, the federation agent will have synchronized a service entry for bar, from the right-hand side, into the catalog on the left-hand side.
But this entry is really mirroring the real service. The IP address of the service is going to be the ingress of the other mesh. Envoy is going to route the traffic through the egress on the left, and that traffic is going to enter the Consul mesh on the right through the ingress.
Then the Envoy on the right is going to be terminating the tunnel.
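As a rough sketch of that mirroring on the Istio side, the synchronized entry might look like the following ServiceEntry. In practice the federation agent creates it, and the hostname, address, and ports here are made up.

```shell
# Hypothetical sketch of the mirrored entry on the Istio side: the remote
# Consul service "bar" is made resolvable locally, with its endpoint set to
# the Consul mesh's ingress gateway. Hostname, address, and ports are made up.
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: bar-consul
spec:
  hosts:
  - bar.consul.remote      # local name synchronized by the federation agent
  location: MESH_EXTERNAL
  resolution: STATIC
  ports:
  - number: 443
    name: tls
    protocol: TLS
  endpoints:
  - address: 198.51.100.20 # address of the Consul mesh gateway (ingress)
    ports:
      tls: 8443
EOF
```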
Demo
I'm going to show you how this works in that direction, and then in the opposite direction.
This is running in Kubernetes. What NSX Service Mesh does is install Istio in the Kubernetes cluster, but we also install NSX Service Mesh agents in the cluster.
We have access to the cluster through our own agents. We cannot extend the service mesh if we don't have access to the clusters, because then we cannot install our agents in them. And sometimes we cannot touch those clusters at all.
These are the services in the default namespace, and these are the services which are being federated with Consul: the services NSX Service Mesh is exporting to Consul. If we look at Consul here, we see them in the service catalog. Consul has renamed those services, so we can search for them in the catalog easily. And these are the services which NSX Service Mesh is federating with Consul.
We can have services in different namespaces. We would like to communicate with one of the services that Consul is federating, and we're going to do that through one of the services that we have in the cluster.
We are going to execute a call from that service, which is going to call a service that Consul is federating with NSX Service Mesh.
In our implementation, we have used an Istio service entry to point to those entries in the Consul service mesh, the other mesh. We're going to look at the entries in that registry and where the service entries are. We see here in the hosts field that we have 1 service, 1 line, and this is the name of the service.
What we do not know yet is the port of the service. I get the service entry, and if we look at this, we have the IP address of the ingress in Consul, we have the port of the ingress, and we have the name and the port number of the service we are pointing to, so we know which one it is.
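With kubectl, the inspection the demo walks through would look roughly like this (the namespace and the entry name are placeholders):

```shell
# List the service entries the federation agent has synchronized, then
# inspect one to find the remote ingress address and port.
# Namespace and entry name are placeholders.
kubectl get serviceentries -n default
kubectl get serviceentry bar-consul -n default -o yaml
```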
Let me go back to Consul. In the ingress, we have the port, and we have the address of the ingress here. So this service entry is pointing to this ingress. And then I'm going to call that service, creating mTLS between the two.
In Consul we can also have a look at the service that we have connected with. The information we get through the API is the same information we have in the UI, and we can use it to interact with the service as well. We have the same information here: we have the port, and we can get the ingress as well.
Let me go back to this slide, and then we are going to see this in the opposite direction: again, creating the control channel between the meshes, then federating the services. This is exactly the same flow, but in the opposite direction.
Bar in Consul is going to try to connect with foo in NSX Service Mesh. There is a local entry, so the service seems to be local, but it is not, and traffic is routed through a mesh gateway in Consul and goes through the ingress in NSX Service Mesh, which routes the traffic to the Envoy of the service instance.
This is the information of the sidecar in Consul. For this service bar to talk to the other service, we need to use a proxy. The proxy is explicit. We need to call the local address of the proxy, which is a loopback address, and we need to use the port that is used in the proxy.
If we go to Consul, we see the proxy here. We're going to talk with this service at httpbin-nxssm, and we have created a proxy for that, the sidecar.
If I go to the sidecar and look at its upstreams, this sidecar is pointing to httpbin, and this is the local port of the sidecar. This is how I should call the service from Consul. And this is how I am doing it.
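For reference, here is a hedged reconstruction of how the Consul side of this could be set up: a service registration whose sidecar declares the federated service as an upstream bound to a local loopback port. The upstream name follows the renamed service from the demo, and the ports are placeholders.

```shell
# Hedged reconstruction: register a service "bar" whose sidecar declares the
# federated service as an upstream on a local loopback port. Names and ports
# are placeholders following the demo.
cat > bar.hcl <<'EOF'
service {
  name = "bar"
  port = 8080

  connect {
    sidecar_service {
      proxy {
        upstreams = [
          {
            destination_name = "httpbin-nxssm"
            local_bind_port  = 9191
          }
        ]
      }
    }
  }
}
EOF
consul services register bar.hcl

# The proxy is explicit: we call the remote service through the sidecar's
# loopback listener, and the sidecar handles mTLS across the meshes.
curl http://127.0.0.1:9191/
```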
We have now created communication in the opposite direction, and this is more or less how it works.
There is still some work to do, obviously. The solution is not perfect, but we are very proud that it is a good start.
Thank you.