Beauty and the build: Charlotte Tilbury’s move from CloudFormation to Terraform
Learn how, and more importantly why, Charlotte Tilbury's platform team migrated from CloudFormation to HCP Terraform.
» Transcript
I am Annem. I am a cloud engineer at Charlotte Tilbury Beauty. Fun fact: my name in Turkish literally means my mother. I go to Istanbul a fair bit and that does get a few weird looks at border control. I started my career in the Department of Transport as an Azure engineer, so we were moving from on-premises to Azure.
I moved to consultancy, worked with a variety of FinTechs around London, met some great people along the way, and that was a great experience. It was a valuable stint, but after speaking to one of my mentors, I wanted to try product-led development again. I found myself at Charlotte Tilbury last October, which feels really full circle.
I definitely see our developers as our stakeholders; making the platform more secure and scalable for them generates revenue for us. In these efforts, I also set up an infrastructure guild to share knowledge across the team. It's great to be somewhere that considers DevOps an integral culture rather than a traditional infrastructure request team. The people make the place, so honestly, I'm really enjoying my time there.
» Quiz time
I thought we'd start off with a quick quiz. There is a prize at stake: I am offering our flagship product, Charlotte's amazing Magic Cream, the best-selling moisturizer of the past few years. The first person to answer correctly wins. Put your hand up. The question is: "What is idempotency in the context of infrastructure as code?"
We'll get into that more deeply. Do find me at the end, and I'll make sure that you get your Magic Cream. If nothing else, I've emphasized the importance of moisturizing to you all today—if that's the only key takeaway that you get.
Not too long ago, I promise you, when I started my career in cloud computing, the closest my team was getting to infrastructure as code was caching credentials in the Microsoft PowerShell ISE—writing the PowerShell scripts that we needed in the command line to create instances and resources in Azure. It's pretty inelegant and defunct now. We then discovered Azure ARM (Azure Resource Manager) templates and Blueprints, which you write in some flavor of JSON—and you keep these templates in source control. It's still not readable or easily understandable at first glance by everyone.
» CloudFormation to Terraform
That brings us to Charlotte Tilbury. At the time, Charlotte Tilbury was exclusively using CloudFormation for application and infrastructure deployments. CloudFormation is AWS's own IaC tool, and it only works with AWS resources. We write resources in a nested way—in YAML or JSON—with a CloudFormation stack representing a collection of individually defined resources.
At the time, our CI/CD tool of choice was CircleCI. But this was set up in such a way that our application and infrastructure deployments were both running in parallel—and dependent on each other. So, each time we ran an application deployment—pushing code up—the corresponding infrastructure code would also run and go through each stack to see if there were any changes.
But, as a fast-paced beauty organization where things are always happening, we have something quite unique: we offer 24/7 support. So we have to be on the ball at all times to make sure we provide the best experience for our customers.
Say, for example, an incident occurs at 3:00 in the morning and an alarm goes off; the appropriate course of action at that time would be to manually update a service's scaling values in the AWS GUI—say, from 2 and 8 to 8 and 16.
It makes sense to do this manually in the AWS portal—it resolves the alarm. The next day, after the incident postmortem, we then decide whether we want to revert the values. But there's a gap between those values being manually changed and the postmortem; in that window, the service has been scaled out and not reverted back.
Even when another engineer is editing part of the codebase and pushing up their changes, CloudFormation won't recognize there's been a change in the state—there's drift—and it won't revert those values back to the original values.
» Drift management: CloudFormation vs Terraform
That brings us onto this scary thought: at any given time while we were using CloudFormation, there would be services declared in CloudFormation that didn't match what actually existed. That can be a lot of drift that goes undetected, and we're not sure where it exists or in which stacks.
While CloudFormation does have some sort of drift detection and is relatively good at enforcing the configuration declared in the code, it has flaws. Drift detection is run manually—and fixing drift can require manual intervention, as opposed to simply reapplying the CloudFormation code.
If something is changed in the infrastructure manually, like our example before, certain CloudFormation resources won't be detected as having drifted and won't go back to their original values on subsequent runs. And it's not always clear which resources and services will be flagged for drift and which won't.
With Terraform, it's harder to drift. Take our last example with the on-call engineer: even with manually updated values, on the next run Terraform recognizes there has been a change between what's in the AWS portal and what's defined in Terraform. It will revert the values back to what's declared, meaning that our Terraform configuration and state are always taken as our single golden source of truth.
» Idempotency and declarative code
Here's where we get deeper into idempotency. Terraform is also idempotent and declarative. As the saying goes, declarative code describes an intended goal rather than the steps to reach that goal. So, the ordering of resource blocks and the way they're organized isn't particularly significant. Terraform only considers implicit and explicit relationships between resources when determining an order of operations.
Here, we have an example of two different deployments of the same resource, AWS CloudTrail. CloudTrail is an AWS service that logs and audits events and activities in the portal; it is usually used for security operations (SecOps) and auditing. As well as the actual CloudTrail resource we're defining, we also have an S3 bucket, and we're also creating a CloudWatch log group, which we need.
We can define those resources in any order in Terraform, and Terraform still knows what we're creating and will order it in the apply process. Unlike CloudFormation—where we have to write each resource in a nested way, similar to how we write YAML—CloudFormation stacks have a specific template structure that we can't veer from; otherwise, it will throw an error.
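As a rough sketch of that idea—resource names here are illustrative, not our actual configuration—the trail can reference a bucket declared further down the file, and Terraform still creates the bucket first:

```hcl
# The trail is written before the bucket it depends on; Terraform infers
# the real order from the reference, not from the position in the file.
resource "aws_cloudtrail" "audit" {
  name           = "audit-trail"
  s3_bucket_name = aws_s3_bucket.trail_logs.id
  # (the bucket policy CloudTrail needs on that bucket is omitted for brevity)
}

resource "aws_cloudwatch_log_group" "trail" {
  name = "/cloudtrail/audit"
}

resource "aws_s3_bucket" "trail_logs" {
  bucket = "ct-audit-trail-logs"
}
```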
Idempotency is talked about a lot with Terraform. It means that no matter how many times you run your infrastructure as code and what your starting state is, you will end up with the same end state. A natural consequence of this idempotent approach is being able to run the same code over and over again without any side effects or changes to the resources being managed.
In the context of Terraform, we're defining exactly what is being deployed. This is really important with CI/CD tools because they can be run to change configuration—and also to verify that the configuration actually matches what we want. We can continuously run Terraform apply, whether manually or in a CI/CD job, and if the configuration matches what is actually defined in the plan and state, it won't change anything.
» Planning our Terraform migration
For our migration, we wanted to emulate best practices from Terraform and ensure our code was as modular and flexible as possible. For us, this meant defining each CloudFormation stack as a Terraform module.
Now, this was pretty easy and seamless for us because we had already defined every CloudFormation stack as a service—like CloudTrail, as you saw before, or an AWS event bus, or S3. These modules should be reusable. This means we can reference common resources across other areas of our infrastructure codebase because we've already defined the resource with parameters that are in line with Charlotte Tilbury security engineering practices.
For example, we have a policy that enforces a default limited retention policy on our S3 buckets. So, we don't want to keep data in our buckets for more than 30 days, and we want AWS to delete it afterwards. That's in line with Charlotte Tilbury engineering SecOps practices.
We already have a fully made module for S3 that we can rinse and repeat across every other area of our codebase. And if we have a case where we don't want to apply that limited retention policy, we can add other non-static parameters and properties to the module where we're using it—so we have a more bespoke approach as well. If we want to rinse and repeat that same module, we can do that too—and it'll always be in line with our engineering practices.
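As a hedged sketch of what such a module might look like—the layout, names, and override variable here are illustrative rather than our production code:

```hcl
# modules/s3/main.tf — a reusable bucket module with the default
# 30-day retention baked in, overridable where a bucket needs to differ.
variable "bucket_name" {
  type = string
}

variable "expiration_days" {
  type    = number
  default = 30 # default limited retention, in line with our SecOps practices
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_lifecycle_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    id     = "default-retention"
    status = "Enabled"

    filter {} # apply the rule to every object in the bucket

    expiration {
      days = var.expiration_days
    }
  }
}

# Elsewhere in the codebase: rinse and repeat, overriding only when needed.
module "campaign_assets" {
  source          = "./modules/s3"
  bucket_name     = "ct-campaign-assets"
  expiration_days = 365
}
```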
» HCP Terraform
Up until this point, I had only used GitHub and GitLab with Terraform, where we write bespoke YAML jobs to plan or apply Terraform in steps we define in our code. That also means writing steps to download Terraform and log into your chosen environment before deploying. And this is all before being able to read the plan that comes in the output before we apply the changes when we push them up to source control.
So, unless we write some scripts to format that plan output, it can come out in a way that isn't very user-friendly. I've had issues before where plans have failed because they've been too long, and GitLab hasn't been able to parse the plan. There can sometimes just be too much riding on it—and that's part of the reason why we decided to use HCP Terraform, which you might know better as Terraform Cloud—HashiCorp's CI/CD tool of choice for Terraform.
» Cultivating a high-velocity platform team
Whilst we, as a platform team, look after the shared infrastructure, we do emphasize a DevOps mindset and culture across the team. Each squad looks after their own services—both application and infrastructure code. That means engineers less experienced in infrastructure also use Terraform, sometimes for the first time.
We decided to use HCP Terraform as our chosen CI/CD tool for infrastructure because it takes care of that steep learning curve and inaccessibility, which can be present when running Terraform in a more generic CI/CD tool like GitHub Actions or Jenkins.
A well-formatted and laid-out Terraform plan, with the changes clearly highlighted and color-coded, was empowering for the junior devs on our team. They were able to put through small single-line changes themselves, like blocking or unblocking an IP address or amending a parameter in an alarm. It's easily isolated; it can be seen as something that isn't scary and won't break other things.
That helps build their confidence in Terraform as well. In doing that, we, the platform team, were cultivating a high-velocity infrastructure team that benefits all, shares knowledge, and minimizes bottlenecks for us as well.
» The migration
Now we've set the scene. For me, the easiest way to start writing a Terraform module was by looking at the CloudFormation stack, finding recognizable resources, and starting there—as well as cross-referencing the HashiCorp AWS provider documentation when needed to find other mandatory parameters we might need.
We can test that the module works first by calling the Terraform module we've created in the chosen environment. Of course, we started with pre-prod first and took it from there. Then, when we run the Terraform plan in the HCP Terraform GUI, we know we've been successful when the plan declares all the resources we've defined in our module configuration. But we don't want to hit the big red button and apply there first. Rather than creating our resources, we want to import them.
» Terraform import block
Why would we want to import our resources rather than create and apply them? Because the resources already exist. CloudFormation created them, but Terraform is unaware of what exists outside of its state. So, we would end up with duplicate resources—or it would just fail upon applying, saying that resources with the same names already exist in AWS.
Where we have S3 buckets that contain tags, metadata, and assets, we want to transfer them from being CloudFormation managed to being Terraform managed while keeping the assets in them. Previously, the import command was only available in the command line. I've messed up before. You do not know pain until you have to manually import Terraform resources line by line, find the unique identifier that only exists in the plan itself, and run it against terraform import.
But the Terraform import block is similar to a resource block: it's where we reference resources that exist outside of what is known to our Terraform state, like our CloudFormation-created infrastructure. We can then preview the import in our plans before it's applied, and define it in a CI/CD job as well.
Once imported, Terraform tracks a resource in the state file. We can then manage the imported resource like any other, update its parameters and properties, and destroy it as part of a standard resource lifecycle.
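In its simplest form—the bucket name here is hypothetical—an import block pairs the address of a resource block with the ID of the real resource it should adopt:

```hcl
# Adopt an existing bucket into state instead of creating a new one.
import {
  to = aws_s3_bucket.assets
  id = "ct-existing-assets" # for S3, the import ID is the bucket name
}

resource "aws_s3_bucket" "assets" {
  bucket = "ct-existing-assets"
}
```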
» Case study 101: Assets S3 bucket
This brings us onto our case study. One successful case study we did recently was on our assets repo. In this repo, we had a CloudFormation stack that created an S3 bucket containing images, thumbnails, and fonts used in customer emails.
This repo was quite small but very impactful because it meant—for some reason, at some point—we weren't loading these fonts and some thumbnails in emails that customers were getting. We triaged the issue to this repo. It only needed some additional CORS configuration pointing to the S3 bucket to be able to pull those assets into these emails. That was for our fragrance launch—if you don't know it, look on the website.
As we needed to add additional infrastructure code—and this was a CloudFormation template due to be migrated—we decided to do the migration and add the additional CORS configuration in Terraform all in one go, rather than doing a quick patch fix.
» Define the S3 bucket in a Terraform module
Here's how we did it. We created a module for the assets repo and first wrote out the resources we intended to import, together with their import blocks. We then defined empty string variables, as we were adding the values in the Terraform Cloud console for each environment workspace.
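On the module side, that looks something like this minimal sketch, with hypothetical names—the resource is declared exactly as if we were creating it, even though we intend to import it:

```hcl
# modules/assets/main.tf — declare the bucket we intend to import.
resource "aws_s3_bucket" "assets" {
  bucket = var.bucket_name # value supplied per workspace in HCP Terraform
}
```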
» Add environment-specific variables
This picks up environment-specific variables here. We only need to define one main configuration in code, but we have multiple variables that aren't static. This fits in well with DRY (don't repeat yourself) best practices: we're not repeating code that we don't need for each individual environment.
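The declarations themselves stay deliberately empty; the example values below are illustrative of what we'd set per workspace, not our real ones:

```hcl
# modules/assets/variables.tf — one declaration, many workspaces.
variable "bucket_name" {
  type    = string
  default = "" # e.g. "ct-assets-preprod" in the pre-prod workspace
}

variable "environment" {
  type    = string
  default = "" # e.g. "preprod" or "prod", set in the HCP Terraform console
}
```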
» Add import statements
After calling the module block and adding the source—which is a relative path to the module—we can then add the import statements for the existing resources. As the assets S3 bucket already exists, we can define it as a data block, which is a reference to an existing resource, with the name of the bucket being the exact name the bucket is currently called.
We can then add the rest of our import statements. Now, our import statement has two arguments: the resource we're importing to and the ID of the existing resource. We want the target to reference our module name, adding the Terraform identifier that is defined in the module. And the ID is the exact existing name of the resource.
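Putting those pieces together—module and bucket names here are hypothetical—the root configuration might look like this:

```hcl
# Call the module via a relative path.
module "assets" {
  source = "./modules/assets"
}

# The bucket already exists, so we can also reference it as a data block,
# using the exact name the bucket currently has.
data "aws_s3_bucket" "existing" {
  bucket = "ct-email-assets"
}

# `to` targets the resource address inside the module;
# `id` is the exact existing name of the resource.
import {
  to = module.assets.aws_s3_bucket.assets
  id = "ct-email-assets"
}
```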
» Testing
We can then test this either by running a CLI command locally, which triggers an HCP Terraform run, or by just pushing the changes to our branch and seeing the results in the console. We know we're successful when the Terraform plan picks up that we're importing our resources rather than creating them. There is a clear distinction in the plan between importing resources and creating them.
» Create resources
In this case, we're also creating resources as well. We needed to add the CORS configuration pointing to the assets in the bucket we're importing. So, we have the flexibility to import resources but add extra properties and parameters as needed. Terraform creates those properties in the same run—in parallel with importing the resources that we need.
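As a hedged sketch of that extra configuration—the allowed origin is illustrative—this would sit inside the module alongside the bucket it configures:

```hcl
# Allow the email templates to fetch fonts and images from the bucket.
resource "aws_s3_bucket_cors_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  cors_rule {
    allowed_headers = ["*"]
    allowed_methods = ["GET"]
    allowed_origins = ["https://www.charlottetilbury.com"] # illustrative origin
    max_age_seconds = 3600
  }
}
```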
Importing resources rather than deleting and recreating is especially important in this context because we don't want to delete any images, fonts, or assets stored in the assets S3 bucket that we're bringing into the state.
We also had a deletion policy on the bucket, covering its contents, so we had that fail-safe—more on that in a bit. Before applying the imports, one thing to note is to add that deletion policy to your CloudFormation stack. Otherwise, when you go to decommission the CloudFormation stack, you may well delete your resources as well—which is something that we definitely do not want.
» Tie up any loose ends
Now we've imported our resources, we have no need for the CloudFormation stack. Those CloudFormation resources now exist in parallel with the Terraform ones, and we don't want two different tools governing the same resource. So, we can safely decommission the CloudFormation stack. But first, we want to comb over our Terraform resources and check the names, policies, and bucket policies to ensure everything matches what is defined in the state.
Then we need to check that all the resources have the appropriate deletion and retention policy. Then we can nullify the CloudFormation stack: we remove the resources from the stack, and CloudFormation will remove them from its state when we push up the empty stack. Now only our Terraform resources exist, and they are the single source of truth. And now that we've imported everything, we can safely delete the stack, as it is empty.
» Pitfalls
These are some things to keep in mind if you are considering moving from CloudFormation to Terraform. One pitfall we faced was that the ID of the import block differed for each resource type, and it's not always clear what it is. For example, for an S3 bucket, the ID to import—as we said before in the example—is simply the name of the bucket.
Whereas for a CloudWatch resource, it could be the path of the log group or a unique string concatenated with the region. Check the documentation, though. As far as I know, this has changed in the last few weeks: the AWS provider documentation now shows what an import block looks like and gives you an example of your target, your name, and the ID as well.
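As a quick illustration—both IDs below are hypothetical—the same import block shape takes very different IDs depending on the resource type:

```hcl
import {
  to = aws_s3_bucket.assets
  id = "ct-email-assets" # S3: just the bucket name
}

import {
  to = aws_cloudwatch_log_group.app
  id = "/aws/lambda/ct-app" # CloudWatch Logs: the log group's name/path
}
```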
With the plan, if you're not going through it with a fine-tooth comb and checking all the resources, you can sometimes miss changes in the configuration. Again, with the assets repository, when we reached the final stages of the import plan, the Route 53 DNS record declared in the plan was changing upon import. Terraform flagged it as being imported, but when you expanded the resource, the actual string of the DNS record was changing—it was adding .S3 or something.
It didn't throw an error, but we realized that the ID of the import block for the DNS record was concatenated wrongly. So, it's always worthwhile, before you hit that apply button, to go through, check, and double-check.
The last one was pretty much all on me. I didn't realize that when you delete the CloudFormation stack, AWS still believes that the now-Terraform-managed resource belongs to it. That's why we need to include that resource deletion policy in the empty CloudFormation stack.
With the policy in place, CloudFormation won't delete the resources themselves; it will just remove the stack, so it disappears from the AWS CloudFormation console. Luckily for me, I only got as far as pre-prod before I realized that our assets bucket had no policy attached to it. Don't do as I did.
» Some key takeaways
AWS CloudFormation generally works really well in a single-CSP environment, which is what we had. But with the way our infrastructure was set up—our infrastructure and application deployments dependent on each other—and as we were adding new services and our platform was getting bigger, that migration was the next natural move for us.
HCP Terraform helped us with this, especially with a team like ours, where we have so many developers with such varied experience—some of whom had never used Terraform before. I think Terraform Cloud really provides a user-friendly experience for them. It empowers them to make changes themselves, which means the infrastructure team can focus on the good stuff: making our platform better and more secure, and scaling it.
If you've gotten to this point, there's no Q&A. But if you'd like to find me at the networking reception—even just to say hi, or if you have any questions—I would love to speak to you. Or reach out on LinkedIn if you'd prefer; there's my QR code. Prize winner, come find me, and I'll make sure that you are united with your Magic Cream. Please do connect with me if you'd like.
Thank you all for being here. It's great to be on stage with you. Thank you. Big shout out to HashiCorp and Claudia for making sure that everything ran so well today. And, big shout out to the Charlotte Tilbury engineering team for getting this presentation across the line. It was a huge group effort, and I really appreciate all the work everyone put in to help me get to this stage.
Thank you ever so much, and I hope you have a good rest of the afternoon.