Bugsnag Uses HashiCorp Terraform to Quickly Provision and Safely Maintain Their Infrastructure
This guest blog is by George Kontridze, Production Engineer at Bugsnag. Bugsnag is an automated production error monitoring tool that supports over 50 platforms. Bugsnag provides open-source libraries for the most popular programming languages, making it easy for customers to integrate Bugsnag into their workflows. Once integrated, Bugsnag automatically detects application exceptions and provides the data and tooling to prioritize and fix the errors with the greatest user impact.
At Bugsnag, part of the challenge we face is the fast pace of iteration, whether external, such as connecting to a third-party API to make a new integration available to our customers, or internal, such as regularly provisioning and scaling the clusters of machines that run our services as our system’s performance characteristics evolve.
As our product evolves, it becomes incredibly important to have tools in place that help us evolve the infrastructure running our services. The time and effort we invest in these tools is significant, so we need to choose them wisely.
On the infrastructure side, we need to be able to ship configuration changes to production, whether we are changing existing resources or adding new ones. Either way, we need to do it with ease and high visibility. This is where the HashiCorp toolset comes into play.
It’s important for us to be able to quickly and reliably test configuration changes to our infrastructure and then push them to production. A change may be as small as a node-local configuration update or as large as integrating an entirely new cloud service that requires setting up multiple interdependent resources. Either way, it must be testable efficiently, without manual, error-prone steps.
» Our setup with HashiCorp Vagrant, Packer, and Terraform
We use three HashiCorp tools: Vagrant, Packer, and Terraform.
» Using Vagrant to quickly launch local VMs for testing
We use Vagrant to quickly launch virtual machines on our developer laptops and test configuration changes against a fresh Linux system. Vagrant’s integration with Chef’s testing toolkit, Test Kitchen, fits neatly into this workflow: after making cookbook changes, we run “kitchen converge” to see whether the change applies cleanly or breaks the Chef run.
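As a rough illustration, a minimal .kitchen.yml along these lines tells Kitchen to drive Vagrant and run Chef against the resulting VM; the platform, suite, and recipe names here are hypothetical, not Bugsnag’s actual configuration:

# .kitchen.yml (sketch; the run list below is illustrative)
driver:
  name: vagrant

provisioner:
  name: chef_zero

platforms:
  - name: ubuntu-16.04

suites:
  - name: default
    run_list:
      - recipe[example_cookbook::default]   # hypothetical cookbook

With this in place, “kitchen converge” boots the VM through Vagrant, applies the Chef run list, and leaves the machine up for inspection; “kitchen destroy” tears it down.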
» Using Packer to create reproducible machine images from source configuration
Sometimes we need to launch VM instances into production quickly to support increased load when our customers send us a spike of traffic. Packer saves us time by creating reproducible images from which we launch identical VMs. We bake the initial Chef run into the image itself, so a VM created from that image is ready for work as soon as it boots. This saves the few minutes an initial Chef run takes when launching an instance, which has a big impact when we need to scale quickly.
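A sketch of what such a build might look like, using Packer’s HCL2 template format and its (since-deprecated) chef-solo provisioner; the project ID, image names, and run list are assumptions for illustration:

locals {
  # Timestamp the image name so every build produces a fresh image.
  ts = formatdate("YYYYMMDDhhmmss", timestamp())
}

source "googlecompute" "base" {
  project_id          = "example-gcp-project"   # hypothetical project
  source_image_family = "ubuntu-1604-lts"
  zone                = "us-west1-b"
  image_name          = "base-chef-${local.ts}"
  ssh_username        = "packer"
}

build {
  sources = ["source.googlecompute.base"]

  # Bake the initial Chef run into the image so instances boot ready.
  provisioner "chef-solo" {
    cookbook_paths = ["cookbooks"]
    run_list       = ["recipe[example_base]"]   # hypothetical run list
  }
}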
» Using Terraform’s efficient execution model to interact with Amazon’s and Google’s cloud APIs
Terraform plays an integral role in our day-to-day operations to provision and maintain our infrastructure on GCP and AWS. The following code snippet is an example of using our mongod module to provision one of our Mongo data nodes in GCE:
module "bugsnag11-db2" {
source = "./modules/mongod"
name = "bugsnag11-db2"
availability_zone = "us-west1-b"
machine_type = "n1-highmem-16"
image = "ubuntu-1604-lts"
replicaset_name = "bugsnag11"
disk_size = 1500
chef_version = "${var.chef_version}"
}
This allows us to use a consistent workflow to interact with almost all of our external infrastructure APIs, including DNS records, CDN caches, object storage buckets, VM instances, and so on. We encode all of this information in the infrastructure repository as Terraform configuration. We use a standard branch-change-pull-request workflow on this repository, so product engineers are just as aware of changes to our infrastructure as the infrastructure engineers are. This also encourages collaboration: product engineers can open pull requests against the repository when they need to provision infrastructure and have their changes reviewed and deployed.
The most important part of our Terraform setup is that we store all of our production database machine configuration in it. Chef has a concept of “node attributes”: machine-specific configuration information that generic configuration logic (the “recipe”) can look up. These attributes can include which block devices to use for data storage, whether a machine is marked for performing database backups, and other settings that drive important database operations and maintenance tasks.
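To make this concrete, here is a hypothetical sketch of what the inside of a module like ./modules/mongod could look like, with node-specific values handed to the instance as metadata for the first Chef run to pick up. The resource layout and the metadata mechanism are assumptions, not Bugsnag’s actual module:

# Variables mirror the inputs from the module call shown earlier.
variable "name" {}
variable "availability_zone" {}
variable "machine_type" {}
variable "image" {}
variable "replicaset_name" {}
variable "disk_size" {}
variable "chef_version" {}

resource "google_compute_instance" "mongod" {
  name         = "${var.name}"
  zone         = "${var.availability_zone}"
  machine_type = "${var.machine_type}"

  boot_disk {
    initialize_params {
      image = "${var.image}"
      size  = "${var.disk_size}"
    }
  }

  network_interface {
    network = "default"
  }

  # Node-specific Chef attributes ride along as instance metadata;
  # exactly how the Chef run consumes them is an assumption here.
  metadata {
    chef_version       = "${var.chef_version}"
    mongodb_replicaset = "${var.replicaset_name}"
  }
}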
We store our Terraform state in S3 with object versioning turned on, so we can restore a previous state file if a bad Terraform run ever corrupts the current one.
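Wiring this up takes a backend block along these lines; the bucket name and key are hypothetical:

terraform {
  backend "s3" {
    bucket  = "example-terraform-state"           # hypothetical bucket
    key     = "infrastructure/terraform.tfstate"
    region  = "us-west-2"                         # hypothetical region
    encrypt = true
  }
}

Versioning itself is a property of the bucket, enabled once with, for example, “aws s3api put-bucket-versioning --bucket example-terraform-state --versioning-configuration Status=Enabled”.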
» Final thoughts
Everyone in our company is welcome and encouraged to make changes to the infrastructure repository. When a pull request is submitted, our Buildkite CI setup runs a global (non-targeted) “terraform plan” so we can see the changes before applying them. After review, a production engineer merges the pull request, pulls down the new changes, and runs “terraform apply” locally.
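The Buildkite side of this can be as small as a one-step pipeline; the step label and exact commands below are illustrative assumptions:

# .buildkite/pipeline.yml (sketch)
steps:
  - label: "terraform plan"
    command:
      - terraform init -input=false   # non-interactive init
      - terraform plan -input=false   # global, non-targeted plan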
Overall, we are happy with these tools, and especially with Terraform, which we use extensively to continuously and safely provision and evolve our infrastructure while keeping those changes visible across the company.