How Paddy Power Betfair Secures 1000+ Daily CI/CD Deployments With Vault
Watch this presentation of Paddy Power Betfair's journey to secure 1000+ daily CI/CD deployments with HashiCorp Vault on OpenStack.
Paddy Power Betfair uses an OpenStack infrastructure that currently handles around 600 hypervisors, which translates to roughly 11,000 VMs split across 2 datacenters. The CI/CD tooling (GitLab, Jenkins, GoCD, Artifactory) backing all of this processes about 1,000 pipeline runs daily across all environments. In this environment, Paddy Power Betfair started an initiative to make its whole infrastructure more secure.
HashiCorp Vault was the best and easiest platform to integrate that they found: it supported Ansible and Chef, had a great API, great documentation, and great support, and it fit their high-availability needs.
Cristian Iaroi and Alexandru Dima will share how Vault is being used to secure their CI/CD pipelines by using short-lived tokens throughout the estate. They are now looking to further enhance the security of their apps with dynamic secrets and automatic certificate provisioning. Additionally, they will be migrating from an on-prem solution to a hybrid cloud solution.
Speakers
- Alexandru Dima, Application Security Architect, Paddy Power Betfair
- Cristian Iaroi, DevOps Engineer, Paddy Power Betfair
Transcript
Cristian Iaroi: Hi, everyone. I’m Cristi Iaroi, DevOps engineer at Paddy Power Betfair, and this is my colleague.
Alexandru Dima: I’m Alex, working in the applications security team, and to be clear from the start, I’m here just to make sure that Cristi’s presentation is secure.
Cristian Iaroi: Perfect, perfect. Thanks, man. We’re going to talk to you about how we at Paddy Power Betfair manage to secure over 1,000 deployments daily. Long story short, it’s with Vault, so it’s still a better ending than Game of Thrones, right?
A merger and a need for a unified platform
For a bit of context about who we are, the company started as Betfair, which is a betting platform with an exchange offering, which means that you can set your own stakes and someone can bet against those stakes.
We also have a standard sportsbook platform, which means you can bet against the house. It covers a wide range of markets, from tennis to greyhound racing to football to even presidential elections.
Then Paddy Power came along and decided to merge with Betfair, bringing a lot of duplicated code with them, as you can imagine. We had to sort that out. We also have a lot of other offerings, such as Sportsbet for the Australian market and TVG and Draft for the American market, and we're looking to expand that offering further.
This merger produced Paddy Power Betfair, and because we hated having to type that huge email address, we decided to ditch that branding and go with Flutter; Paddy Power Betfair is part of Flutter Entertainment. It's easier to enter your login when it's that simple.
We as engineers had to unify these offerings into a single customer platform, so we created this cloud automation framework, the i2 framework. It was built to let you deploy your services in a unified fashion, without caring about the various brands or business logic underneath.
Goals for the unified framework
It was built with the following in mind:
• Infrastructure as code
• Immutability
• Ease of use
First, infrastructure as code. We wanted the developers to be able to define everything, from their VM specs to their networking to their load balancing and everything in between, as code. That was the goal.
Second, we wanted the deployments to be immutable, and we made that possible. We didn't want developers to have that "It works on my machine" argument anymore. (It still happens; we're developers.) But for every single change, whether it was on the networking layer, the software layer, or the hardware layer, we wanted to just redeploy your whole state.
Last but not least, we wanted to make it easy to use. This is still a work in progress, which is good. Every framework should continuously strive to improve on that.
The i2 framework relies on a set of tools. What we want our developers to see is that they push code out to our source control, which is GitLab. From there, it automatically triggers a Jenkins job that hopefully runs some tests: security tests, integration tests, various unit tests. The artifact, or the outcome of that build, ends up in Artifactory.
From there it gets picked up by our internal CD tool, which is GoCD, and gets deployed by our various pipelines into our various environments and ends up in this magical place called production. At least that’s what we wanted our developers to see.
What we didn’t want our developers to see was the underlying layer, which means that we didn’t want them to know about it or interact with it as much as possible. Our i2 framework relies on OpenStack for VM provisioning, on Nuage for networking—configuring the ACLs and various networking things.
We rely on Avi Networks for GSLB (global server load balancing) and on Citrix NetScaler for load balancing. Of course, the developers can choose between Chef and Ansible. Terraform wasn't out yet, so just be patient with that. But what they ended up seeing was YAML: YAML and passwords everywhere, and that's what the security team also saw, and didn't agree with.
Vault for secrets management
We wanted to move away from all of our plaintext secrets and passwords that were stored in GitLab, on our Jenkins server, and on our various Go servers, and move them into this secure place, which is where Vault comes in.
We had to configure Vault in such a way that it would be deployed on the same framework. So we’ve configured it with a load balancer in front, and it’s deployed to 2 datacenters. One is the active datacenter, the active primary, and the other one is the replication datacenter and the disaster recovery one.
Consul for the backend
On the active cluster we have 5 Vault and Consul nodes, and the same approach on the disaster recovery one. At any given point in time, only 1 node actually serves all of the traffic. The load balancer that sits in front takes care of the health checks, the API traffic, the replication traffic, and so on and so forth.
As I was mentioning, we use Consul for our backend, and for that we have custom backup scripts: every time we redeploy the whole cluster, we back it up first. That's for our high-availability needs.
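The backup scripts themselves aren't shown in the talk, but conceptually that step can be as small as the sketch below, assuming the consul CLI is available on the node and the target directory (illustrative here) already exists:

```python
import subprocess
from datetime import datetime, timezone

def backup_consul(snapshot_dir: str = "/var/backups/consul") -> str:
    """Take a point-in-time snapshot of the Consul cluster backing Vault."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = f"{snapshot_dir}/vault-consul-{stamp}.snap"
    # `consul snapshot save` captures the Raft state of the cluster the local agent points at.
    subprocess.run(["consul", "snapshot", "save", target], check=True)
    return target

if __name__ == "__main__":
    print(f"Snapshot written to {backup_consul()}")
```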
We also have a custom failover script on DR. We wanted to make it as easy as possible to roll over to that disaster recovery cluster: in case the primary goes down, triggering a Python script is enough to make the disaster recovery cluster immediately come up and handle the load.
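The failover script isn't shown either. A rough sketch of the idea, assuming Vault Enterprise DR replication, a pre-generated DR operation token supplied through an environment variable, and illustrative addresses (all assumptions on my part, not their actual script):

```python
import os
import requests

PRIMARY = "https://vault-dc1.example.internal:8200"      # illustrative addresses
DR_SECONDARY = "https://vault-dc2.example.internal:8200"

def primary_is_healthy() -> bool:
    """/sys/health returns HTTP 200 only for an initialized, unsealed, active node."""
    try:
        return requests.get(f"{PRIMARY}/v1/sys/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def promote_dr_secondary() -> None:
    """Promote the DR cluster to primary via Vault's DR replication API (Enterprise feature)."""
    token = os.environ["VAULT_DR_OPERATION_TOKEN"]  # generated ahead of time, kept off the primary
    resp = requests.post(
        f"{DR_SECONDARY}/v1/sys/replication/dr/secondary/promote",
        json={"dr_operation_token": token},
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    if not primary_is_healthy():
        promote_dr_secondary()
        print("DR cluster promoted; repoint the load balancer at the secondary datacenter.")
```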
Integration
We wanted to integrate Vault into our i2 framework. As I mentioned a couple of slides ago, we use GoCD as the CD tool, and GoCD uses agents to handle its workloads.
Those agents act as the trusted deployer: they generate the AppRole token and place it inside the VM that was provisioned from OpenStack.
Inside that VM, your application can then authenticate with that AppRole-issued token, read your secrets, populate your config files and properties files, and also use the token at runtime if you need to.
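On the application side, consuming that token can be as simple as this sketch, assuming the deployer drops the token at a well-known path and the application's secrets live in a KV mount named after the application (the paths and secret key here are illustrative):

```python
import hvac

VAULT_ADDR = "https://vault.example.internal:8200"  # illustrative
TOKEN_FILE = "/etc/vault/token"                     # wherever the trusted deployer copies the token

def read_app_secret(path: str, key: str) -> str:
    """Use the AppRole-issued token placed on the VM to read one secret from the app's KV mount."""
    with open(TOKEN_FILE) as fh:
        token = fh.read().strip()

    client = hvac.Client(url=VAULT_ADDR, token=token)
    # The token only carries read access to this application's secrets.
    secret = client.read(path)
    return secret["data"][key]

if __name__ == "__main__":
    # Populate a properties/config file at deploy time, or call this again at runtime.
    print(read_app_secret("myapp/dev1/database", "password"))
```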
The whole integration process wasn't as straightforward as we'd hoped; we did have some issues. Normally you would go about this by using the appropriate Chef or Ansible Vault modules, but those come with some restrictions.
These are very thorough implementations of the API, but they come with requirements such as Chef 13 and Ansible 2.4 at a minimum. They also don't offer a custom AppRole authentication method.
So it was harder to integrate that, but mostly this was because of our legacy stuff. Some of our users still use Chef 11, and the framework itself is written with Ansible 2.0.
We wanted to make this as easy to use as possible for the developers. We didn’t want them to have to switch to this new version of Chef just because they had to use Vault. It was a high priority to use Vault to migrate those secrets away from GitLab.
The details of the secrets protection
What we ended up doing is creating an Ansible lookup and an Ansible module that calls the API and generates the AppRole token and places it on the virtual machine that will be provisioned by OpenStack.
It’s, of course, based on a role that is defined by security. This token will have only read access to your secrets, and every time you redeploy, that token will get revoked and the new token will be created.
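The module itself isn't public, but its core boils down to an AppRole login plus a revocation of the previous deployment's token, roughly as in the sketch below (the address, role credentials, and function names are illustrative):

```python
import requests

VAULT_ADDR = "https://vault.example.internal:8200"  # illustrative

def generate_app_token(role_id: str, secret_id: str) -> str:
    """Log in with AppRole and return the client token that gets copied onto the new VM."""
    resp = requests.post(
        f"{VAULT_ADDR}/v1/auth/approle/login",
        json={"role_id": role_id, "secret_id": secret_id},
        timeout=10,
    )
    resp.raise_for_status()
    # The token only carries the read policies attached to the role defined by security.
    return resp.json()["auth"]["client_token"]

def revoke_previous_token(deployer_token: str, old_token: str) -> None:
    """Revoke the token from the previous deployment so only the fresh one stays valid."""
    requests.post(
        f"{VAULT_ADDR}/v1/auth/token/revoke",
        headers={"X-Vault-Token": deployer_token},
        json={"token": old_token},
        timeout=10,
    ).raise_for_status()
```

The Ansible side then just copies the returned token onto the VM with an ordinary copy task.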
Additionally, we didn’t want the developers to have headaches with this; we didn’t want them to learn a new API, we didn’t want them necessarily to interact with Vault’s API.
We’ve created a custom Chef resource, which means that the developer could just as easily specify the Vault cookbook dependency in their own cookbook, and from there the Vault helper that we’ve created would be able to make use of your token, read your secrets, and populate your template files or your attribute files as well.
For Ansible it’s even simpler. It’s just a lookup. It uses the token present on the box and populates your files.
Of course, none of this would have been an issue had we just used Terraform, right? We’re working on it.
Some big numbers on the deployment
Our i2 framework is deployed into 2 datacenters. We have 2 clusters of OpenStack, which means around 600 hypervisors across those 2 datacenters. That translates to around 11,000 VMs, which go through 6,000 pipelines. Those pipelines run more than 10,000 deployments daily and end up generating more than 100,000 tokens daily, which all get revoked, regenerated, and so on and so forth.
Now I’m going to try to set up a simple web application deployed through our framework and hopefully generate some tokens, and I’m gonna try to show you those tokens. I’m gonna hand you over to Alex to talk about security stuff.
Alexandru Dima: Thanks, Cristi. While we wait for the demo to load, I’m gonna talk about the access control part. After all, who else should talk about this part other than security?
Pulling in security for Vault deployment
Quite unusually in our case, we did not just talk about this part, but actually worked on it. Unusual for security, I mean. Actually, the whole Vault implementation started in a rather pioneering way with our former colleague Florian.
He was demoing Vault at various ad hoc internal events; we liked the idea, started to pitch it, and then got it working. Let's look at what we did for access control.
It had to be automated because, as Cristi told you, we rely on this concept of infrastructure as code. And we had to take the same approach. We used the information that we have in our CMDB.
Usually you don't find that much useful information in the CMDB, but in our case, luckily, we had all the information we needed there to start our implementation. For the initial setup we use a repository to build the access control automation, so we get the data from the CMDB and put it there.
We then create the mounts for each application; we're using the KV secrets engine behind them. Then for each application we create the AppRoles, so we're using the AppRole authentication method, like Cristi told you, to generate the tokens for the applications at deployment.
Then for the human users of Vault we’re using the LDAP authentication method, so we created the initial access policies for these users, the Active Directory users basically. Again, initially based on the information we got from the CMDB.
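A condensed sketch of that initial setup, creating a per-application read policy and mapping it to an Active Directory user through the LDAP auth method (policy shape, paths, and names are all illustrative; the hvac 0.x/1.x-style generic write is used because the helper methods vary between hvac versions):

```python
import os
import hvac

client = hvac.Client(
    url="https://vault.example.internal:8200",   # illustrative
    token=os.environ["VAULT_TOKEN"],
)

def grant_initial_access(app: str, username: str, env: str = "development") -> None:
    """Create a read policy for one application and map it to an AD user via the LDAP auth method."""
    policy_name = f"{app}-{env}"
    policy_hcl = f'path "{app}/{env}/*" {{ capabilities = ["read", "list"] }}'
    client.sys.create_or_update_policy(name=policy_name, policy=policy_hcl)

    # LDAP users are mapped to policies under auth/ldap/users/<username>.
    client.write(f"auth/ldap/users/{username}", policies=policy_name)
```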
The security, step by step
I'll go through each of the steps now one by one. For the repository that we used, like I said, we get the data from the CMDB and write it to YAML files. There is basically one YAML file per application, each named after the application name that we get from the CMDB. And we, of course, git push it.
As you can see on the slide, we get details like the acronym of the application, the owner, the developer lead, and the manager.
The nice part is that these people get access by default: as we pull the information from the CMDB, their access is built into Vault so they can log in.
We also update this information daily. We fetch data daily from the CMDB, and whenever there are changes the access will be updated also.
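A stripped-down sketch of that rendering step, with made-up CMDB field names and repository layout:

```python
import yaml  # PyYAML

def write_app_access_file(cmdb_record: dict, repo_dir: str = "access-control") -> str:
    """Render one application's YAML access file from its CMDB record (field names are illustrative)."""
    app = cmdb_record["acronym"]
    doc = {
        "default_users": [
            cmdb_record["owner"],
            cmdb_record["developer_lead"],
            cmdb_record["manager"],
        ],
    }
    path = f"{repo_dir}/{app}.yaml"
    with open(path, "w") as fh:
        yaml.safe_dump(doc, fh, default_flow_style=False)
    return path

# Example: write_app_access_file({"acronym": "myapp", "owner": "jdoe",
#                                 "developer_lead": "asmith", "manager": "bjones"})
```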
For the mount points, like I said, we're using the KV secrets engine. Under the mounts we create sub-paths named after the availability zones we have defined in the pipeline, so that, using the helpers Cristi told you about, as a developer doing your configuration management you're not concerned with the whole path involved.
You can just insert the name of the key from Vault, and the secret will be read from there. We also left some funny comments in there. Maybe you can see that we're using the hvac Python client for interacting with Vault. We didn't find any specific exception for existing mount points, so we had to be careful not to overwrite the information.
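In practice that check can look roughly like this, listing the mounts up front instead of relying on a specific exception (the mount naming is illustrative):

```python
import os
import hvac

client = hvac.Client(
    url="https://vault.example.internal:8200",   # illustrative
    token=os.environ["VAULT_TOKEN"],
)

def ensure_app_mount(app: str) -> None:
    """Enable a KV mount for the application only if it doesn't already exist.

    hvac raises only a generic error for an already-used path, so listing the
    mounts up front avoids ever disabling and re-enabling a mount, which would
    destroy the secrets stored underneath it.
    """
    mounts = client.sys.list_mounted_secrets_engines()
    existing = mounts.get("data", mounts)   # response shape differs across hvac/Vault versions
    if f"{app}/" not in existing:
        client.sys.enable_secrets_engine(backend_type="kv", path=app)
```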
For the AppRole part, like I said, it's used for the applications' interaction with Vault. We get the name of the application from the CMDB. On each update, we update the policies. So, if there are extra policies added to that role, we update it. We found out along the way that, the way the key-value store works, we cannot just update the policies list, so we basically have to recreate the role whenever there's an update performed on it.
We have different policies for development and production, and different AppRoles. The production ones are more permissive; the development ones are restricted to the development availability zones under the mount point.
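A sketch of that step, writing the role definitions straight to the AppRole paths with hvac's generic write (role and policy names, the TTL, and the environment split are illustrative; an "update" here is simply a full re-post of the role):

```python
import os
import hvac

client = hvac.Client(
    url="https://vault.example.internal:8200",      # illustrative
    token=os.environ["VAULT_TOKEN"],
)

def create_or_update_approles(app: str) -> None:
    """(Re)create the development and production AppRoles for one application."""
    roles = {
        f"{app}-development": [f"{app}-development"],  # scoped to the development availability zones
        f"{app}-production": [f"{app}-production"],    # broader, production-facing policy set
    }
    for role_name, policies in roles.items():
        # Posting the role again replaces its whole policy list, which is why
        # every update is effectively a recreate.
        client.write(
            f"auth/approle/role/{role_name}",
            token_policies=policies,
            token_ttl="2160h",                          # roughly 90 days, matching the demo's token lifetime
        )
```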
User roles
We now have over 3,000 AppRoles, and considering that we have one for development and one for production, that basically means that we’re managing somewhere around 1.5K applications with Vault.
On Vault, the LDAP authentication is used mainly for UI interaction, but there are already some users who use it to generate a token via the LDAP authentication backend and then do custom automation for their needs.
For policy names, we use the application name as a prefix and a timestamp as a suffix. This is a rather interesting approach; I'll detail it further.
When we refresh the policies for these users, we delete the old policies with the old suffix and add the new ones with the new suffix. Of course, as with the applications, we have different ones for production and development.
This slide tries to explain the little hack that we did. Let’s say that for application X we want to replace user B with user C. But we didn’t know how to do this because there is no way to check what users are assigned to a policy.
The policies are assigned to users rather than the other way around, and we cannot query a policy to find out which users are assigned to it. So we add the old policies, identified by the old suffix, to a Python set. Then the new policies, which are generated from the new commits and carry the new timestamps, are added to a different Python set with the new date suffix.
From the Python sets, we just do a difference update against the new policies. Under the LDAP users, we keep only the new policies, and the old ones are replaced.
In this way, user B, who is not supposed to have access anymore, ends up with a reference to a non-existent policy, because it has been removed from under sys/policy, the path where the policies are kept.
In the first part of this slide, we delete the policies from under sys/policy, the ones with the orange timestamp. Then for user A, who remains a valid user, we delete the old policy and add the new one, shown in green. For user C we also add the green one, but nothing happens for user B, because the script does not know what happened to him; it just processes the changes for the users that it finds.
If we look now under user B, we see the old policy, which has been deleted from under sys/policy, so he will not have access anymore. Also, to keep Consul happy behind the scenes, we do a cleanup: basically, we compare against the set of policies under sys/policy.
We iterate through all the LDAP users, and whatever isn't in sys/policy gets dropped; we basically rewrite the key-value store under the LDAP users.
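Reduced to its essence, the suffix swap and cleanup look roughly like the sketch below (the function, suffix format, and single-user scope are simplifications of what's described above):

```python
import os
import hvac

client = hvac.Client(
    url="https://vault.example.internal:8200",   # illustrative
    token=os.environ["VAULT_TOKEN"],
)

def rotate_user_policies(username: str, app: str, old_suffix: str, new_suffix: str) -> None:
    """Replace a user's timestamp-suffixed policies for one application with freshly suffixed ones."""
    envs = ("development", "production")
    old_policies = {f"{app}-{env}-{old_suffix}" for env in envs}
    new_policies = {f"{app}-{env}-{new_suffix}" for env in envs}

    # What the user currently has mapped under auth/ldap/users/<username>.
    current = set(client.read(f"auth/ldap/users/{username}")["data"]["policies"])

    # Drop everything carrying the old suffix, keep anything unrelated, add the new ones.
    current.difference_update(old_policies)
    current.update(new_policies)
    client.write(f"auth/ldap/users/{username}", policies=",".join(sorted(current)))

    # Cleanup: delete the stale policy objects themselves so no dead keys linger in Consul.
    for name in old_policies:
        client.sys.delete_policy(name=name)
```

A user who has been dropped from the YAML never has this run for them, so they keep pointing at a policy that no longer exists and effectively lose access.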
The YAML structure
To give you an idea, the YAML structure is very simple.
The first key is used for the default users that we get from the CMDB. Then the manual key is used for Active Directory users that you add through new commits, separated for production and development.
It's the same for AD groups. There's one more key possible here: if you'd like your application to read secrets from another mount point in Vault, you add an extra policy key with the name of that policy, and your application will have access in that place as well.
To wrap things up, the developers manage their access in a self-service fashion. They just issue the Git commits and submit the merge requests. Then we have a lint built into the GitLab pipeline that checks the YAML syntax and the validity of the AD users and groups.
Then security will review that merge request and hopefully approve it. For the final touch, a Git hook will trigger the Jenkins job that runs the Python scripts to make the update in Vault.
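The lint isn't shown in the talk; a minimal sketch of what it might check, with hypothetical key names and a stubbed-out Active Directory lookup, could be:

```python
import glob
import sys
import yaml

ALLOWED_KEYS = {"default_users", "manual", "ad_groups", "extra_policy"}  # hypothetical key names

def exists_in_ad(name: str) -> bool:
    """Placeholder: the real check queries Active Directory (e.g. over LDAP) for the user or group."""
    return bool(name)

def lint_file(path: str) -> list[str]:
    """Validate one application's access file: YAML syntax, known keys, and resolvable AD users."""
    with open(path) as fh:
        try:
            doc = yaml.safe_load(fh) or {}
        except yaml.YAMLError as exc:
            return [f"{path}: invalid YAML ({exc})"]
    errors = [f"{path}: unknown key '{key}'" for key in doc if key not in ALLOWED_KEYS]
    users = list(doc.get("default_users", []))
    for env in ("production", "development"):
        users += doc.get("manual", {}).get(env, [])
    errors += [f"{path}: unknown AD user '{user}'" for user in users if not exists_in_ad(user)]
    return errors

if __name__ == "__main__":
    problems = [err for f in sorted(glob.glob("access-control/*.yaml")) for err in lint_file(f)]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```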
Passing back to you, Cristi, with the demo.
Cristian Iaroi: Again, we use 2 datacenters for high availability, IU1 and IU2. Both are private datacenters in Ireland. This application was just deployed to both of these data centers. I’m going to try to show you the pipeline for the first one.
It goes through these various stages. It all works on our i2 framework. Every stage does something else. Obviously, we create the network, we check the capacity on OpenStack, launch_vms_os, have the run_chef or run_ansible stage, and create all of the load balancing things that we need for that particular application to work.
At the launch_vms_os stage we generate the token for the TLA. This is generated by the Go agent, as I was mentioning, the trusted deployer, and it fetches the Vault token and then it gets placed and copied over to the VM. Of course, at the run_chef stage, for instance, you could use the token directly, populate your config files, and use it at runtime.
This small web application is just a simple static page that connects to a backend, reads that secret, that token, from the box, and displays it in the UI. So thanks, Vault. The security team is kind of disappointed by this. Obviously this isn't a very secure application, but don't try to use the token; it's not really valid.
Alexandru Dima: It’s short-lived, anyway.
Cristian Iaroi: Yeah, it's short-lived anyway. It's valid for just 90 days; afterward, it gets revoked automatically. But security has enforced a 30-day redeployment policy, so you will have to redeploy every 30 days regardless.
Why Vault?
First of all, because of the great API. It was easy to use and really easy to integrate into our workflow. No API is great without great documentation, and Vault's was really easy to understand. And when we didn't fully understand the API, we had great support, so thanks to them for their understanding, regardless of the stupid questions we raised.
The tool itself is reliable, so it fit our high-availability needs. Last but not least, it's easy to set up and very secure. Yeah, thanks, Vault, for real this time. Thanks, guys.