Case Study

Enabling, Integrating, and Automating Consul ACLs at Scale

Learn about the strategies and pitfalls that Kong Cloud learned as they adopted a positive security model with Consul ACL policies and Vault token management.

API gateway SaaS provider Kong Cloud is using Consul, Terraform, and Vault to automate and integrate their management of ACLs and ACL tokens.

In this talk, Kong Cloud engineer Robert Paprocki talks about how Consul ACLs shaped their service networking and security architecture. He'll go in-depth on:

- How they wrote their role-based ACL policies, applying them to various processes based on access requirements.
- How they seamlessly tied token generation and management to their existing Vault installation.
- Their strategy for monitoring and auditing traffic in various clusters to gain confidence in shaping ACL policies on a per-role basis.
- The usefulness of integrating with Vault's many login mechanisms to provide a seamless, automated lifecycle for generating Consul ACL tokens.

Finally, you'll learn about the pitfalls, pain points, and unexpected/undocumented behaviors they learned about along the way. This talk focuses on ACLs in Consul 1.4 and doesn't cover any of the functionality introduced in Consul 1.5-1.6, but its concepts will still be helpful.

Transcript

Is anybody using Consul ACLs in their environment? All right, a handful. Is anybody using Consul, Vault, and Terraform, all 3 of those in their environment, to tie everything together?

We're going to talk about Consul ACLs, but we're also going to tell a bit of a story about how my team at Kong Cloud is using all 3 of the tools that I mentioned, Consul, Terraform, and Vault, to automate and integrate our management of ACLs and ACL tokens. We're going to briefly review Consul ACL functionality before diving into what my team does at Kong.

For those who aren't familiar with Kong, we're an open-source company that builds the most popular API gateway on the planet, and Kong Cloud is the SaaS arm of that.

We'll take a look at what the Kong Cloud architecture and infrastructure look like. We'll dig into how Consul ACLs played a role in shaping and reinforcing our designs. We'll also go over how we shaped our ACL policies based on existing Consul traffic, and we'll take a look at some of the hiccups that we ran into along the way during this migration. Hopefully, if this is something that you're interested in adopting in your environment, you can learn from our mistakes and from our experiences and come away with a little bit of a better experience for it.

My name is Robert Paprocki. As I mentioned, I work as an engineer for Kong on the cloud team. Prior to working on the Cloud Team, I led one of our enterprise development teams known as our Fast Track Team. We're responsible for shipping and rapidly delivering enterprise features for Kong Enterprise.

Before this, I worked at a company called Zenedge (which was acquired by Oracle), doing a global AI and CDN lab for them. And I've also worked at Blizzard and DreamHost in AppSec and security engineering roles as well as systems and SRE work.

An intro to Consul ACLs

As I mentioned, we're going to have a brief intro to Consul ACLs. This is a very brief review for those who aren't familiar with it. If the concept is new to you and it sounds interesting, I would strongly recommend you check out the documentation and the tutorials online. HashiCorp has done a fantastic job of revamping some of these docs, some of the bootstrapping guides. If you're not familiar with it, definitely check it out, play with it in the sandbox environment. It's well worth getting into.

As the name might suggest, Consul ACLs provide access control for Consul. This is a role-based access control that's pretty similar to what you might find in a traditional NIST-defined RBAC model.

With that model, you have the notion of users that are assigned to roles, and permissions that are assigned to roles. And the summation of the relationship between users and roles and those associated permissions gives you a list of what you can and cannot do inside of a system.

Consul ACLs work in a similar way. There's a token, which is just a universally unique identifier (UUID) that serves as a form of identification; rules, which define a security control on a resource in Consul; and policies, which are just groupings of rules. You assign tokens to policies, one or more policies, and the summation of those rules gives you what a request with a given token is or isn't allowed to do inside of your Consul cluster.

Based on the server agent config, Consul lets you provide either a positive security model or a negative security model. Either deny everything and allow specific things, or just deny specific resources based on specific policies.
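The relationship between tokens, policies, and rules can be sketched as a toy model. This is a deliberately simplified illustration, not Consul's actual evaluation algorithm: it handles exact-match rules only, treats an explicit deny as always winning, and the `Policy` class and `is_allowed` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

# Strength ordering used by this toy model: a "write" grant also implies "read".
LEVELS = {"deny": 0, "read": 1, "write": 2}


@dataclass
class Policy:
    name: str
    # Maps (resource, segment) -> disposition, e.g. ("key", "foo") -> "read".
    rules: dict = field(default_factory=dict)


def is_allowed(policies, resource, segment, access, default_policy="deny"):
    """Return True if a token linked to `policies` may perform `access`."""
    matches = [
        p.rules[(resource, segment)]
        for p in policies
        if (resource, segment) in p.rules
    ]
    if not matches:
        # No rule matched: fall back to the cluster-wide default.
        # "deny" gives a positive security model, "allow" a negative one.
        return default_policy == "allow"
    if "deny" in matches:
        # In this sketch an explicit deny always wins.
        return False
    granted = max(LEVELS[m] for m in matches)
    return granted >= LEVELS[access]


prometheus = Policy("prometheus", {("key", "foo"): "read"})
token_policies = [prometheus]  # the policies linked to one token

print(is_allowed(token_policies, "key", "foo", "read"))   # True
print(is_allowed(token_policies, "key", "foo", "write"))  # False
print(is_allowed(token_policies, "key", "bar", "read"))   # False
print(is_allowed(token_policies, "key", "bar", "read", default_policy="allow"))  # True
```

The last two calls show the difference between the two models: the same unmatched request is rejected under a default of deny and accepted under a default of allow.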

There's a handful of features and functionality about ACLs that were introduced in Consul 1.5 and 1.6. We're not going to be getting into those because my team is still on 1.4. And prior to 1.4 the ACL system didn't have a notion of policies. It was just rules tied to individual tokens, so we're not going to talk about that either.

This is going to focus specifically on 1.4. The documentation for the new ACL system is definitely worth checking out if integration with Kubernetes or multiple policy groupings with roles sounds interesting to you. Like I said, definitely check out the documentation. We're not going to have time for it this morning.

The power of HCL

Like many other elements in Consul, ACL rules are defined in HCL. The HCL is fairly simple. It consists of a resource, a segment, and a disposition. Resources can be defined discretely or with a prefix. And policy dispositions provide for explicit allow or deny behavior, and allowances are broken into reads and writes. Almost any Consul resource can be controlled by ACLs, from services to health check registration, events, prepared queries, even the ACL system itself.

Let's take a look at a few brief examples. First, a really simple HCL example that allows for a token associated with this rule to read from a KV store with the key of foo. You can also craft a rule to allow explicit write access on a resource, and you can explicitly deny access to a resource.
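The slides themselves are not reproduced in this transcript, so the following is a reconstruction rather than the speaker's exact rules. It sketches what the three rules for the key foo could look like, held as HCL strings in Python, and how one of them might be wrapped into a JSON body for Consul's policy creation endpoint (`PUT /v1/acl/policy`). The policy name and the `app/` prefix are made up for illustration.

```python
import json

# Rule text in Consul 1.4 ACL syntax: resource, segment, disposition.
READ_FOO = 'key "foo" {\n  policy = "read"\n}\n'
WRITE_FOO = 'key "foo" {\n  policy = "write"\n}\n'
DENY_FOO = 'key "foo" {\n  policy = "deny"\n}\n'

# A prefix rule covers every key that starts with the given segment.
READ_APP_PREFIX = 'key_prefix "app/" {\n  policy = "read"\n}\n'


def policy_payload(name, rules_hcl, description=""):
    """Build the JSON body for creating a policy; nothing is sent here."""
    return json.dumps(
        {"Name": name, "Description": description, "Rules": rules_hcl}
    )


# "example-read-foo" is a hypothetical policy name.
body = policy_payload("example-read-foo", READ_FOO, "read-only access to foo")
print(body)
```

A policy is just a named grouping of rules, so the `Rules` field can hold several stanzas concatenated together.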

Tokens are generated through the Consul API, either through a CLI tool or through hitting the HTTP API directly. It's important to note that when a token is generated, the request generating that token needs a token that has permission to generate tokens.

This lifecycle management problem lies at the heart of our design, and it's what we're going to get into today in a few slides. Once a token has been issued, it can be leveraged by sending it alongside a request, again, either through the CLI, with the token flag, or in an HTTP request header. Tokens can also be assigned to an agent through the config file or through commands on the agent API at runtime.
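As a minimal sketch of sending a token alongside a request, the snippet below builds (but does not send) an HTTP request that carries the token in the `X-Consul-Token` header. The agent address and the all-zero token are placeholders, and `kv_read_request` is a hypothetical helper name.

```python
import urllib.request


def kv_read_request(consul_addr, key, token):
    """Build a KV read request that carries an ACL token in a header."""
    return urllib.request.Request(
        url=f"{consul_addr}/v1/kv/{key}",
        headers={"X-Consul-Token": token},
        method="GET",
    )


# Placeholder address and token, for illustration only.
req = kv_read_request(
    "http://127.0.0.1:8500",
    "foo",
    "00000000-0000-0000-0000-000000000000",
)
print(req.get_method(), req.full_url)

# The CLI equivalent passes the same token with the -token flag:
#   consul kv get -token=<token> foo
```

Calling `urllib.request.urlopen(req)` would actually send it; that step is left out so the example runs without a Consul agent.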

What’s the point of Consul ACLs?

It's important to ask ourselves why. Why do we need Consul ACLs? What does this complexity bring us? What's the value that we're getting? Why would I add more config, more data, ask my clients to do more with my environment, when Consul isn't really secret data?

Well, Consul is data, though. It's data about our services. It's data about machines in our environment. It's config data, and that data needs to be secured. You may be in an environment that has regulatory or compliance restrictions, and access to your data needs to be authenticated and authorized, even in automated systems.

You may have multi-tenant environments where multiple organizational units are sharing the same Consul cluster, and that data needs to be segmented. You may simply want to reduce blast radius and prevent a buggy or an incorrect deployment from making unwanted modifications to your Consul cluster.

And in the case of Kong Cloud, it's actually all 3 of those.

HashiCorp and Kong

We're going to dig into what Kong Cloud is, what our infrastructure and architecture look like, and how Consul ACLs have helped shape and reinforce our designs.

Kong Cloud is, simply put, Kong Enterprise in the cloud: all of the features and functionality that you would expect to find in an enterprise-level API gateway, running in a managed environment and able to connect to any of your services running in any cloud provider. We provide the managed API gateway solution; you provide your services, your applications.

I do want to call out HashiCorp real quick. Kong and HashiCorp tooling integrate very tightly together. As an API gateway, Kong can leverage Consul for upstream service discovery. Kong can leverage Vault for secrets management, and I'm also working on the inverse of that, a Vault plugin backend that allows you to leverage Kong as a secrets engine so that you can have Vault manage your API token lifecycle on your Kong-managed API cluster.

HashiCorp tools and Kong have a lot of the same philosophies, a lot of shared technology, and a love for open source. At Kong Cloud, we bring the infrastructure. Our goal is to run highly performant, highly available Kong clusters, and this means not only running the Kong cluster itself, but all of the infrastructure to support monitoring, logging, analytics, scaling up and down, so that we're aware of what's going on for all of our clusters.

We bring in expertise running and developing for Kong. All of the engineers working on my team have either developed for Kong or run Kong in large enterprise-grade environments. We're experts when it comes to running Kong. Most of our infrastructure is designed around open-source pipelines, from our CI/CD to our monitoring. We're also working on building and integrating some more custom in-house solutions from microservices to handle more complex deployment needs.

Kong is an open-source company, and Kong Cloud is no exception. We love open source. We love learning and studying from the projects we use. We love being able to give back and contribute to communities. At Kong we like to say that we have open source in our DNA, and it shows with our cloud design. And of course this is HashiConf, so I have to talk about HashiCorp. We use almost every HashiCorp tool in our environment, except for Nomad, and we're even looking at adopting that, too.

We use Terraform, Packer, Consul, Vault. I think we even still have some legacy things running the Serf CLI tool. HashiCorp tooling is everywhere. It plays an integral role in our deployments, our day-to-day operations, our config, and secrets management.

Automated, role-based identity provisioning

From Day 1 we set out to build a vendor-agnostic platform. Almost all of our functionality is deployed on generic compute instances using autoscale to manage lifecycles. Each node in our environment has a specific purpose, and we tie that in with IAM. This provides each piece of infrastructure a source of identity without needing to pre-bake credentials or worry about manually handling them or authenticating a node every time it comes up. This gives us a resilient environment where we can expect nodes to simply disappear, to be torn down by a provider, or for us to remove them if we no longer need them.

This is an expected part of our lifecycle, and it doesn't require any human intervention. So this automatic identity provisioning extends itself into our Consul architecture. Each node in the cluster is named based on its role, allowing for easy identification.

And we use Puppet to manage our Consul configuration, and that allows a simple way for us to define role-based services. So, for example, with a few lines of YAML, we can define a Consul service registration running Prometheus. We can define checks on it and all of the arguments that you would expect inside of a Consul config inside of a small chunk of YAML.

So we have an autoscaling, self-healing infrastructure. We need to manage service-to-service communication permissions in a variety of different contexts. We need to restrict access to Consul, we need to restrict access to Vault, and we need to be able to monitor access to internal services. And all those services have a variety of different authentication methods, from legacy API tokens to modern mTLS service mesh designs.

This access restriction needs to be automated, it needs to be reliable, and it needs to be done, like I said, without the need to bake credentials into an AMI or to manually input any authentication data.

A day in the life of a network node

Let's take a deeper look at a day in the life of one of our nodes and see what this looks like for us at Kong Cloud. We'll take an example node in our world that's running Prometheus. When the node comes up, it's been provisioned by an autoscaling group and given an IAM role based on that launch configuration.

Puppet runs in the instance managing the services that are executed through systemd, and eventually this node is going to go away. We assume that it will fail. Either Amazon will tear it down, we'll scale down our cluster, we'll move to another region, but we expect failure and termination as part of our lifecycle design.

Recall that I said that various services and our infrastructure require authentication and authorization to access, and I specifically want to talk about Vault. How do we automate allowing access to a Vault cluster when access to the cluster itself requires authorization?

This is where Terraform comes in. I mentioned previously that we have IAM roles assigned to every node in our infrastructure. Vault provides an integration to log in to authenticate with a Vault server by way of an IAM role. And remember that because every node has an IAM role, we can automatically associate this IAM role with the Vault policy, and then we can log in with a simple CLI command.

We'll take a look at that in just a second, but I do want to point out what some of our Terraform code looks like to associate an IAM role with the Vault policy. This is not particularly novel. I'm sure many of you who are in larger shops are probably familiar with this, but for those who are not, what we're doing is taking a bound IAM principal Amazon Resource Name, which in this case is just the ARN of the IAM role that is associated with Prometheus Node, and we're assigning that to a policy in Vault called Prometheus.
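The Terraform code from the slide is not part of this transcript. As a rough equivalent, and only as a sketch, the same binding can be expressed as the body of a request to Vault's AWS auth role endpoint. The account ID, role ARN, and role name below are invented; `token_policies` is the parameter name in recent Vault versions, while older versions use `policies`.

```python
import json


def aws_auth_role_request(role_name, iam_role_arn, vault_policy):
    """Return (path, JSON body) binding an IAM role ARN to a Vault policy."""
    path = f"/v1/auth/aws/role/{role_name}"
    body = {
        "auth_type": "iam",
        # Only principals with this ARN may log in as this role.
        "bound_iam_principal_arn": [iam_role_arn],
        # Policies attached to the Vault token issued at login.
        "token_policies": [vault_policy],
    }
    return path, json.dumps(body)


# Hypothetical ARN for the IAM role attached to Prometheus nodes.
path, body = aws_auth_role_request(
    "prometheus",
    "arn:aws:iam::123456789012:role/prometheus-node",
    "prometheus",
)
print(path)
print(body)
```

Any node that presents this IAM identity at login then receives a Vault token carrying the prometheus policy, with no credentials baked into the image.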

Incredibly simple automation thanks to HashiCorp tooling

So what does this look like? How can we consume this as a client that wants to log into Vault? It's just vault login. That's it. There are a couple of other things going on, like environment variables that export the hostname and the CA cert and whatnot, but when I first came on Kong Cloud I thought, "Well, this is going to be really complicated. We've got IAM and ARNs and policies and all these different things to consume. This is going to be really hard." And the beauty of HashiCorp tooling is that automating this is incredibly simple thanks to the Vault CLI.

So we've established a paradigm where nodes are given authorization to systems based on their identity, and they're authenticated through Vault. Something really struck me yesterday as I was listening to the keynote talks and some of the breakout sessions: this identity-based authentication is becoming a central tenet of a lot of modern designs. It's encouraging to us as an engineering team that we're following some of the best practices that we're seeing in the rest of the industry.

Adding ACL tokens to the environment

So a node comes up. We have zero-touch control for our access to Vault. How can we extend this to our ACL system? We set out to add ACL tokens to our environment, and we started compiling a list of requirements that was very similar to our requirements for managing Vault tokens.

We needed the same seamless transparent lifecycle that we were integrating with Vault and IAM. We needed to be able to rotate tokens either in maintenance or in response to some sort of unforeseen or unknown incident. We needed to be able to do this without adding extra credentials to a box, but at the same time, generating an ACL token requires an ACL token. So we have to deal with this chicken and egg.

We were sitting one day having a cup of coffee and talking, and one of the engineers sort of joked and said, "Oh, let's just write an auth server that integrates with IAM." And we all kind of got a chuckle out of that and thought, "Oh, that's funny. We're not really going to do that." And we started digging around the documentation for a minute and we realized, Vault does this for us. Vault has a Consul secrets engine that generates Consul ACL tokens.

Tokens are associated with ACL policies based on the Vault config. And multiple endpoints can be configured to allow for multiple different policy groupings for an ACL token.

And since this is a Vault secrets engine, like any other engine, the read endpoint is protected by Vault policies that we've set up. And since this is a Vault secrets engine, the secret, the ACL token that we're generating, is associated with the lease, the Vault token that was created based on that IAM binding. So we don't have to worry about extra TTLs, we don't have to worry about extra lifecycles. We can tie in an ACL token strongly with a Vault token.

You'd think, "All right, this has got to be complicated to set up. We did the IAM and the ARN and the Vault thing, and that was great with the CLI command. But ACL tokens, there's got to be more to that, because it's another system. It's Consul, it's another backend. It's going to be really hard to set this up and consume it." No. vault read. That was it.

By default, the name of the engine is just the word "consul," but because you can mount multiple secrets engines of the same type, we can define multiple Consul clusters to talk to.

A role at the end of this, when we're reading from this, is the mapping of a Vault endpoint to a group of ACL policies that our token is going to be associated with when we read from this endpoint. This gives us granularity and flexibility about who can generate Consul ACL tokens and what those tokens can do. And that's all encompassed by Vault.

Lifecycle of a Consul ACL token

Here's the lifecycle. We have a node that reaches out to our Vault cluster. When it comes up, it runs vault login with the AWS method and gets back a Vault token. When that Vault token comes back, the next step is a vault read from this endpoint based on the node's role. Before that request is responded to, Vault will reach out to the Consul cluster, generate a new ACL token, store the lease information in the Vault cluster, and send the token back to the client. Once that client has the token, it can establish communication with the Consul cluster.

How do we manage all of this? Once again, Terraform to the rescue. Given a list of logical roles—Prometheus, Kong, internal microservices—we generate Vault policies that provide access to the associated Consul secrets engine. This allows individual instances, which are given granular access to Vault endpoints based on their IAM roles, the ability to generate Consul ACL tokens with permissions designed for that role and only for that role.

The Terraform code to manage all this is a little bit arcane. I'm going to show it on the slides briefly, but not dig into it. I will post the slides after the talk and we can chat on it later on, but I do just want to show what this looks like. It's terrible on a slide, but I'm sorry.

Because we have multiple regions and multiple roles inside of each region, we have a 2-dimensional array of different policy endpoints that we want to set up. We use a template file data hack to generate the HCL for the Vault policy path information. That's the chunk inside the middle of this Terraform block. And again, because this is a template file, we're providing Terraform variables into it that just iterate over the list of all of our roles and all of our regions. And we can create the Vault policy simply by reading back that rendered template file.
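The template trick is easier to follow outside of Terraform. The sketch below renders, for one role, a Vault policy with one read-only path stanza per region. It is not the team's Terraform code: the mount naming scheme (`consul-<region>`) and the region names are assumptions made for this example.

```python
def vault_policy_for_role(role, regions, mount_prefix="consul"):
    """Render Vault policy HCL granting read on one creds path per region."""
    stanzas = []
    for region in regions:
        # Hypothetical mount naming: one Consul secrets engine per region.
        stanzas.append(
            f'path "{mount_prefix}-{region}/creds/{role}" {{\n'
            f'  capabilities = ["read"]\n'
            f"}}\n"
        )
    return "\n".join(stanzas)


regions = ["us-west-1", "eu-west-1"]  # example regions
for role in ["prometheus", "kong"]:
    print(f"# Vault policy: {role}")
    print(vault_policy_for_role(role, regions))
```

Each role ends up with access to its own creds endpoint in every region and to nothing else, which is what keeps token generation scoped per role.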

What does this look like? I am nowhere near confident enough to tempt the demo gods, so I don't have a live demo for you, but I do just want to show briefly what this looks like.

When we're on a Prometheus box, we read from our Vault endpoint associated with the Consul cluster. (In a real-life use case, this endpoint is not called "Consul," but something like Consul AWS, US West 1. But to keep the slide simple, the endpoint is just "consul," then "creds," and then the name of the role of this box, which is Prometheus.) We read that back, and we get the lease ID, the Consul ACL token's accessor ID, and the secret ID. And then Consul's ACL endpoint lets us introspect on our token so we can verify that the token we got back is given a list of policies, and that includes the Prometheus ACL policy.
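For readers without the slide, here is a hedged sketch of handling that read from a client's point of view. The response below is a hand-written sample in the general shape of a Vault secret response, not real output, and every ID in it is a placeholder.

```python
import json


def creds_path(mount, role):
    """Vault API path a client reads to obtain a Consul ACL token."""
    return f"/v1/{mount}/creds/{role}"


# Hand-written sample response; all values are placeholders.
SAMPLE_RESPONSE = json.dumps(
    {
        "lease_id": "consul/creds/prometheus/EXAMPLE-LEASE-ID",
        "lease_duration": 3456000,
        "renewable": True,
        "data": {
            "accessor": "11111111-1111-1111-1111-111111111111",
            "token": "22222222-2222-2222-2222-222222222222",
        },
    }
)


def parse_consul_creds(raw):
    """Pull out the lease ID, accessor ID, and secret ID from a response."""
    doc = json.loads(raw)
    return {
        "lease_id": doc["lease_id"],
        "accessor_id": doc["data"].get("accessor"),
        "secret_id": doc["data"]["token"],
    }


print(creds_path("consul", "prometheus"))
print(parse_consul_creds(SAMPLE_RESPONSE)["lease_id"])
```

The secret ID is what gets sent to Consul. Reading `/v1/acl/token/self` on the Consul API with that token is one way to check which policies are attached to it.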

If we tried to read from another endpoint that we don't have an association with based on a Vault policy, we're going to get kicked back with a 403. And this is all handled through Vault. So we're relying on Vault policies to protect access to sensitive endpoints that generate tokens, and keeping in line with our role-based architecture.

Roles on the Consul engine

One other aspect of this is the notion of roles on the Consul engine. This is a distinct endpoint in Vault that clients can read from, and it associates the endpoint with one or more Consul ACL policies.

When we started out with this project, there was no Terraform resource to manage this, so we had to manually glue it together with Bash.

Unfortunately, there still isn't a resource for it. We opened up a pull request, and I talked to the maintainer of the Vault provider for Terraform, and they're really encouraging about this. This is one of the reasons that we love working with HashiCorp tooling. We can give back to the community. We can get feedback from developers. We're really encouraged about how Kong and HashiCorp can work together.

So we have a way to manage our instance permissions. We have a way to generate tokens, and now we need to fill out the contents of our policies. If the policies don't have rules, they're not going to do as much good. Let's take a look at that.

Before we could build our policies, we needed to know what Consul is doing. To effectively build a policy for a logical role in our infrastructure, we had to know what the role was doing, what requests it was sending, what kind of data it wanted, and where that data was flowing across the network.

Consul doesn't have a traffic auditing mechanism, at least not something that is as robust as Vault's. We needed a full capture of all Consul client traffic to understand how requests were being sent around the cluster. And because of some legacy design choices, almost all of the HTTP Consul traffic that we cared about was going to the Consul servers, not the local agent.

Not a great design initially, but it did simplify us setting up a proxy to capture data and audit it so we could shape our policies. So we built our own audit log. We set up OpenResty, which is just NGINX and LuaJIT built on top of that, as a proxy in front of each Consul server agent and captured all of the traffic in the request and the response lifecycle.

We could then ship this data to our central logging infrastructure through Filebeat using OpenResty's non-blocking socket implementation. And this let us keep a very high level of performance inside of our Consul cluster without slowing down and without losing any data, without burdening the node because of unwanted disk I/O, without running into space issues. We could simply ship the data over TCP to a local agent and let the ELK stack handle it for us.

We were also able to leverage Consul's DNS server to grok the hostname of the client. When the traffic came in, we were just seeing the client IP address, and that wasn't particularly useful. Because we can do a reverse look-up on the Consul cluster based on the client IP, we could also grab the hostname, and you can see that this is just a snapshot in Kibana from one of our logs.

We have all of the data about the requests, the header, the URI response data, and then the IP and the hostname. And this made it very easy for us to correlate what host was making a request.

This is a relatively new example, with ACL tokens enabled. You can see that we've taken a SHA-1 and munged that data a little bit so that we're tracking the token associated with the request. This also gives us a stronger correlation and a tie between the ACL token and the Vault token.
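The proxy described here is OpenResty and Lua, and its code is not shown in the talk. Purely as an illustration of the idea, this Python sketch builds an audit record that stores a SHA-1 digest of the token instead of the token itself. The field names, hostname, and token are all hypothetical.

```python
import hashlib
import json


def audit_record(client_ip, hostname, method, uri, status, token=None):
    """Build one audit log entry; the raw ACL token is never stored."""
    token_sha1 = None
    if token:
        # A digest lets entries be correlated without leaking the secret.
        token_sha1 = hashlib.sha1(token.encode("utf-8")).hexdigest()
    return {
        "client_ip": client_ip,
        "client_hostname": hostname,  # e.g. from a reverse DNS lookup
        "method": method,
        "uri": uri,
        "status": status,
        "token_sha1": token_sha1,
    }


record = audit_record(
    "10.0.0.12",
    "prometheus-1.node.consul",  # made-up hostname
    "GET",
    "/v1/health/service/kong",
    200,
    token="22222222-2222-2222-2222-222222222222",
)
print(json.dumps(record, indent=2))
```

Hashing the token the same way on the Vault side of the investigation makes it possible to match a Consul request to the point where its token was issued.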

So we can compare Vault audit log entries, see when an ACL token was generated, get the SHA of that, and go look through and figure out what Consul requests were made from an ACL token and where that came from in the Vault lifecycle.

Once we had a corpus of data to work with, we could start crafting our policies. Our goal was to build a positive model, explicitly allowing specific resources in a given role and denying everything else. We wanted to be as specific as possible, avoiding prefixes and, of course, we wanted to build rules that had zero false negatives. And I can tell you right now that didn't happen. The goal was to set up this with no service interruption. We failed miserably, but failure is going to happen. So we made it.

We did this with a bit of brute force. For each role, we took a sample of traffic, we shaped the rules that we thought would work based on the audit data that we had, and then we ran a replay of the packet capture in a lab environment to see what rules would be triggered and what requests would fail. And then we could rinse and repeat based off of this. Once we felt that the ruleset was ready to deploy, we deployed it in a canary region. Once that was good to go, we deployed it globally.
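The replay described here ran captured traffic against a real lab cluster. A much smaller offline approximation of the same loop is sketched below: it checks captured KV requests against a candidate set of exact-match rules and lists the ones that would be denied. It covers only the KV store, and the record format and function names are assumptions for this example.

```python
READ_METHODS = {"GET", "HEAD"}
KV_PREFIX = "/v1/kv/"


def required_access(method):
    """Map an HTTP method to the access level the request needs."""
    return "read" if method in READ_METHODS else "write"


def replay(records, key_rules):
    """Return captured KV requests that the candidate rules would deny.

    `key_rules` maps an exact key to "read", "write", or "deny";
    anything not listed is denied (a positive security model).
    """
    denied = []
    for rec in records:
        if not rec["uri"].startswith(KV_PREFIX):
            continue  # this sketch only understands KV requests
        key = rec["uri"][len(KV_PREFIX):]
        granted = key_rules.get(key, "deny")
        needed = required_access(rec["method"])
        allowed = granted == "write" or (granted == "read" and needed == "read")
        if not allowed:
            denied.append(rec)
    return denied


captured = [
    {"host": "prometheus-1", "method": "GET", "uri": "/v1/kv/foo"},
    {"host": "prometheus-1", "method": "PUT", "uri": "/v1/kv/foo"},
    {"host": "prometheus-1", "method": "GET", "uri": "/v1/kv/bar"},
    {"host": "prometheus-1", "method": "GET", "uri": "/v1/status/leader"},
]
for rec in replay(captured, {"foo": "read"}):
    print("would be denied:", rec["method"], rec["uri"])
```

Each denied entry is either a rule that is missing from the candidate policy or a request the role should not be making, which is the decision the rinse-and-repeat loop is there to force.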

Where do these policies live? Since they're controlled by making requests to the Consul API, we needed a way to store and modify them through a change review process, just like we would do for any other part of our infrastructure-as-code environment.

Once again, Terraform to the rescue. There might be a theme here. Consul policies are HCL, like I said, and the Terraform plugin provides the capability to define policies as a Terraform resource. This is deployed alongside the rest of our VPC setup. There's no additional tooling that needs to be put in place. Engineers didn't need to learn new workflows. We could just add new Terraform resources into our existing modules and we're good to go.

Like I said, we define our policies as source, and then apply them to Consul through Terraform just by having Terraform read the HCL off of disk. And this screenshot is just copied verbatim from our repo that does the application.

Learning from mistakes

This was a great project to work on. I had a blast with it. We loved it. We were able to do some new work. We were able to interact with the community. We were able to give back. We were able to write some literature based off of this. We were able to give a talk about it.

That does not mean that it went without headaches, and that does not mean that it went perfectly. I want to talk about where we failed, where things went wrong, where things went unexpectedly. Hopefully, you can take this back. You can learn from our mistakes, maybe point out something really silly that I did, and we'll all be better off for it.

The first thing that bit us was bootstrapping the ACL system. Consul needs to bootstrap its Raft environment itself, but then the ACL system also needs a separate bootstrap, and this involves running a CLI command one time on the cluster and then doing rolling restarts of the config.

If you have a live Consul cluster, it’s a total pain, especially when the configs differ about what the ACL system is doing. If one server doesn't have ACL config, and another server has ACL config that says, "Allow all," and another server has ACL config that says, "Deny all," what's going to happen is that all of your requests are going to melt into a puddle and cry. It's really not a pleasant experience, so I don't recommend it.

What I do recommend is that if you are interested in ACLs, just start out from scratch. If you have the opportunity to re-bootstrap a cluster or set up a new environment and you want to use ACLs in the future, just start off from the beginning. It's going to make your life a lot easier.

The Consul documentation about ACL rules is very thorough, and this is one of the things that I really appreciate about HashiCorp tooling. There were a couple of gotchas that we found when shaping our policies, though. Thankfully, like I said, we had a test environment, so we didn't deploy anything too spectacularly broken, but I did spend a couple of afternoons figuring out why things were not working.

Specifically, sessions and consul exec really bit us. This is detailed in the rule documentation for what different Consul resources need as far as ACL definitions. The docs are all there. I was just too lazy to read them and made some bad assumptions. The lesson learned is, "Test your policies." But I think the other lesson learned is, "Just read the documentation."

One day I walked in. We had had everything deployed for a couple of weeks or so. I sat down, grabbed a cup of coffee, got my day started, and then got notification that one of our machines was throwing an abnormally high rate of error logs. OK, it's cloud infrastructure. Things fail; no worries.

Then another machine reported the same thing, and then another machine in another datacenter reported the same thing. And then one of our Prometheus servers just stopped scraping data. I'm pretty sure you can get where I'm going with this. Everything just started falling apart.

We'd had the ACL system in place for a couple of weeks. We were really confused, because we hadn't made any changes. There was no reason to assume that the ACL system was causing this. But as we started to look into the failures, we realized that the root cause of every issue we were seeing that morning was a 403 from a Consul request, resulting from the fact that the ACL token that was present in the request wasn't in the Raft.db.

And this was really confusing, because we verified through tcpdump and through audit logs that the tokens were there. But then when we went and looked in the Consul UI and looked at the API, the tokens weren't there.

That was really confusing. One of the engineers that I worked with, his first thought was, "Did they expire?" I said, "No, that's not possible. Consul tokens and ACL tokens in 1.4 don't have TTLs. There's no way they expired. Something else happened." So we started looking through audit logs. We started looking through Vault audit logs, and we were seeing odd correlations of Vault token deletes at the same time that ACL failures started happening.

I think one of the folks at Bench Accounting mentioned this in their talk yesterday too, but one of the things that we didn't really realize when we were going through this is, because the Consul ACL token is a secret, that secret is associated with a lease. It's removed when the lease is expired.

In our Consul secrets engine, we had had a TTL (time-to-live) of 1 year, I think. And the TTL for the Vault login for each of our node-based roles was only 40 days. So at Day 41, we get in and we've already accounted for the fact that the Vault lease will expire, so a new token had been issued. No issues there. But the old ACL token was broken. So the lesson is, "Pay attention to your TTLs, and understand your lifecycle. Read the documentation." This is all explained in the docs.
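The arithmetic behind this failure is short enough to write down. Because the ACL token's lease hangs off the Vault token that requested it, the ACL token cannot outlive that Vault token, whatever TTL the secrets engine advertises. The sketch below uses the 40-day and 1-year figures from the talk; `effective_lifetime` is a hypothetical helper, not a Vault function.

```python
from datetime import timedelta


def effective_lifetime(vault_token_ttl, consul_secret_ttl):
    """An ACL token dies when its own lease or its parent token expires."""
    return min(vault_token_ttl, consul_secret_ttl)


login_ttl = timedelta(days=40)    # TTL on the Vault login for node roles
secret_ttl = timedelta(days=365)  # TTL configured on the Consul secrets engine

lifetime = effective_lifetime(login_ttl, secret_ttl)
print(f"ACL token is revoked after {lifetime.days} days")
```

So a node has to fetch a fresh ACL token whenever it gets a fresh Vault token, rather than holding on to the one it read at boot.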

One thing I really liked about using Terraform that ended up biting us is the count feature. It's super convenient until it's not. What happens if I no longer need a role? Let's say I'm on Kong Cloud and I decide that I just don't like Kong anymore. We're just going to run Prometheus instead. It'll be fine. I'm sure my manager won't have a problem with that AWS bill.

I don't need this role. I'm just going to delete it. This is going to work out fine. Terraform is going to handle this for me. This is why we have Terraform providers.

But what's going to happen to all of the other elements in this iterative list, specifically the ones declared after Kong? Their indexes are going to change. When the indexes change, the resource gets dropped and recreated based on how the provider is set up. That might not be too bad. For something like AWS security groups, that's really not an issue.

The problem here is that ACL tokens are tied to the policies that they're associated with, and specifically the policies are referenced in the Raft.db by UUID, not by name. So I can delete a role called Prometheus, create a new role called Prometheus, and all of the tokens that were issued under the old role are no longer valid. They're associated with an orphan policy. They won't be valid.
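As a rough illustration of why recreating a policy under the same name doesn't help, here is a toy model in which tokens link to policies by generated ID. The `AclStore` class and its methods are hypothetical and are not Consul's actual storage or API; they only mimic the ID-based linkage described above.

```python
import uuid


class AclStore:
    """Toy model of policies and tokens (hypothetical, not Consul's storage)."""

    def __init__(self):
        self.policies = {}  # policy ID -> policy name
        self.tokens = {}    # token secret -> linked policy ID

    def create_policy(self, name):
        policy_id = str(uuid.uuid4())  # a fresh ID every time, even for a reused name
        self.policies[policy_id] = name
        return policy_id

    def delete_policy(self, policy_id):
        del self.policies[policy_id]

    def issue_token(self, policy_id):
        secret = str(uuid.uuid4())
        self.tokens[secret] = policy_id
        return secret

    def token_has_policy(self, secret):
        # The token is only useful if the policy ID it links to still exists.
        return self.tokens.get(secret) in self.policies


store = AclStore()
old_id = store.create_policy("prometheus")
token = store.issue_token(old_id)
valid_before = store.token_has_policy(token)

# Delete the policy and recreate it under the same name.
store.delete_policy(old_id)
new_id = store.create_policy("prometheus")
valid_after = store.token_has_policy(token)

print(valid_before, valid_after, old_id != new_id)
```

The name is the same before and after, but the token still points at the old ID, so it no longer resolves to any policy.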

As soon as I make this change, it would apply in about 2 seconds. Immediately, any node running in a role that was declared after the deleted role will no longer be able to communicate with Consul.
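The index shift can be simulated without Terraform. The sketch below assumes resources are addressed by their position in a list, the way `count` addresses them, and that the provider drops and recreates a resource whose contents change at a given index. The role list is illustrative: only Kong and Prometheus come from the talk, and the other names are made up.

```python
def plan_by_index(before, after):
    """List (index, old_role, new_role) for every position whose role changes."""
    changes = []
    for i in range(max(len(before), len(after))):
        old = before[i] if i < len(before) else None
        new = after[i] if i < len(after) else None
        if old != new:
            changes.append((i, old, new))
    return changes


def plan_by_name(before, after):
    """Same comparison when each resource is declared explicitly under its own name."""
    return sorted(set(before) ^ set(after))


# Hypothetical role list; "vault" and "cassandra" are placeholders.
roles_before = ["vault", "kong", "prometheus", "cassandra"]
roles_after = ["vault", "prometheus", "cassandra"]  # "kong" removed

index_changes = plan_by_index(roles_before, roles_after)
name_changes = plan_by_name(roles_before, roles_after)

for i, old, new in index_changes:
    print(f"role[{i}]: {old} -> {new}")
print("explicit definitions touched:", name_changes)
```

Addressed by index, removing one role disturbs every role declared after it. With explicit per-role definitions, only the removed role is touched.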

The lesson learned here is, "It's OK to copy and paste." Be explicit with your definitions, and use a bit of meta-tooling, a Perl script or Bash script, if you have to declare the same resource over and over again. It's not worth using an iterator with Consul ACLs with this type of provider, just because of the headache when you try to manage it. If you have infrastructure where things are going up and coming down and you're changing roles, you're going to get bit by this eventually.

There are multiple ways to define a token on an agent. Like I said, we've talked about clients sending requests to an agent with a token in an HTTP header, and that token will be sent along with the RPC request if it's applicable. The agent itself, though, also needs a token in order to be able to register its checks, anything defined in the config file, participate in gossip and things like that. So there are multiple ways to accomplish this: through the CLI, through the API, or at boot time inside of a config file.

And there are multiple ways to define a token inside of a config file. I'm not going to go through all of those here, but there are a few things I want to cover. The first is that the best practice HashiCorp notes for registering services with ACLs is that you should have a service-specific token, which would require service-specific policies for each service, and then you would jot that down in the config inside of the service block when you restart your Consul agent.
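For illustration, a service definition carrying its own token might be generated like this. It is a minimal sketch: the helper `service_definition`, the service name, the port, and the placeholder token value are all hypothetical, and only the idea of a `token` field inside the service block comes from the practice described above.

```python
import json


def service_definition(name, port, token):
    """Build a service definition that carries its own ACL token."""
    return {"service": {"name": name, "port": port, "token": token}}


# Placeholder values; a real token would come from your secrets tooling.
config = service_definition("prometheus", 9090, "<service-specific-token>")
rendered = json.dumps(config, indent=2)
print(rendered)
```

Whatever writes this file has to know the token for each service, which is the requirement the next part of the talk runs into.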

That works well, but unfortunately for us, it is impossible, because we're using Puppet to manage our config. Recall our slide; we have just a small little chunk of YAML where we can declare new services and new checks.

Because Puppet doesn't have a notion of what the node’s Consul ACL token is, there's no way for Puppet to be able to manage this. We have to use what's called the "default token." The default token that you define in the config file will be used in any RPC requests when a client makes a request to the agent that doesn't have a token.

And that works out well. The problem is what happens when that default token is going to be used everywhere: a client sends a request to an agent without a token, the agent uses its default token, and it passes that token along inside of its RPC request.

What happens if you have, say, an agent running with very permissive token policies, like one of your Consul server agents that has the ability to do whatever it wants? And let's say you have some legacy design choices, like all of your clients hitting your Consul server instead of the local agent.

What we did is effectively set up an architecture where you could bypass the entire ACL system because you could make a request to the Consul server agent without presenting a token. And because of our Puppet design choices, the Consul server agent, when it received the request, would run the internal RPC request with its own default token, which had permissions for everything. Whoopsie.
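The bypass can be modeled in a few lines. This is a simplified sketch, assuming a request's own token wins and the agent's default token is the fallback. The `Agent` class, the token names, and the permission strings are hypothetical, and a request with no token and no default is treated as having no permissions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical token -> permissions table.
PERMISSIONS = {
    "server-token": {"kv:read", "kv:write"},  # very permissive
    "node-token": {"kv:read"},                # restricted node role
}


@dataclass
class Agent:
    name: str
    default_token: Optional[str]

    def resolve_token(self, request_token: Optional[str]) -> Optional[str]:
        # A token on the request wins; otherwise fall back to the default token.
        return request_token or self.default_token


def is_allowed(agent: Agent, request_token: Optional[str], action: str) -> bool:
    token = agent.resolve_token(request_token)
    return action in PERMISSIONS.get(token, set())


server_agent = Agent("consul-server", default_token="server-token")
local_agent = Agent("local-agent", default_token="node-token")

# A tokenless request sent straight to the server agent inherits its permissions.
bypass = is_allowed(server_agent, None, "kv:write")
# The same tokenless request through the local agent stays restricted.
restricted = is_allowed(local_agent, None, "kv:write")

print(bypass, restricted)
```

In this model, a client that talks to the permissive server agent without a token gets everything the server's default token allows.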

The lesson learned is to watch out specifically for default tokens. Again, this behavior is explained in the documentation. Just know how the ability to send a request when you're using default tokens is going to impact your design, your security stance.

So what have we done? We built an infrastructure that's designed to be resilient to failure, to expect failure, and we've accounted for that in our identity-based authentication. We've leveraged multiple HashiCorp tools that have complementary concepts to design and build a role-based architecture.

This role-based model allows us to very easily integrate with the HashiCorp stack in conjunction with our cloud providers. We've demonstrated how to manage ACL tokens as part of our cloud provider lifecycle integration, and we've used Terraform to glue everything together. We're contributing back to the Terraform community with our work, and this is at the heart of Kong's philosophy and Terraform's philosophy, all of HashiCorp's philosophy.

It highlights how our teams and our tools really play well together.

Thank you.
