How Datadog scales support test environments with Vagrant and Terraform
Learn how Datadog uses Vagrant and Terraform to build, manage, and scale test environments as a team.
Datadog loves HashiCorp tools. They're power users of Terraform and Consul. And HashiCorp loves using Datadog as well.
In this talk, you'll see how Datadog's engineers built a “Sandbox” that used Vagrant and Terraform to help solve their scalability challenges.
Datadog’s monitoring solution touches on hundreds of different technologies and it runs on thousands of possible stack combinations. Their solutions team is regularly challenged to build test environments that reproduce unique customer conditions and issues. Building test environments is onerous and time-consuming, but with the help of HashiCorp’s Vagrant and Terraform, they’ve developed a more scalable system with which they can build and manage test environments as a team and easily re-use each other’s work.
Speakers
- Stephen Lechner, Product manager, Datadog
Transcript
I’m just here to tell a story about a problem that we were facing on Datadog’s solutions team. We were facing a scalability problem, and I just want to tell that story of how we ran up against the problem and how, with Vagrant and Terraform, we were able to find a solution to it, and a solution that I found pretty interesting. So I wanted to share it. Hopefully you find it interesting as well. But before I get too far: About me, just to get some of those details out of the way. I’m currently a product manager at Datadog, but I spent a good year and a half on Datadog’s solutions team, as we now call it. Back in my day, we called it the support team. When somebody has a technical question, a problem, an issue rather, or feature request, or just, “How do I do this or that with Datadog?” they’d reach out to us, and our team was the one that would handle all of those questions, all those tickets, as we would receive them.
And some of my time on the solutions team is where this story comes from. Before that, I spent a couple years as a product manager at another, smaller company that no longer exists, sadly. But as far as the story that I’m going to tell today, if anybody has any questions about it, or even better, if anybody has any suggestions about how we can better do what we are doing now, please drop me an email. That email is stephen.lechner@datadoghq.com.
So, what was the problem? It was a problem of scalability. How do we support it all? What do I mean by this? First off, we have to start with some context. Datadog is the context of the problem that we faced. Datadog is the company that had the product that we’re talking about. Hopefully, a lot of you here in this room are already familiar with Datadog, and if you’re not, we’ve got this great booth out in the booth section where we’ve got a couple engineers who are demoing the product to whoever is interested. Definitely recommend taking a look at it. It’s a great product. I love it a lot, and I always like people to know about it.
What Datadog is, briefly, is an infrastructure performance monitoring solution. It’s our goal, our mission as a company, to make it easy for everybody to be able to have their applications work the way that they want them to work. To just reduce time to resolution of any issue that comes up. Hopefully to reduce the amount of issues that you face. We just want you to do you with your applications, and we want to make it easy to solve any problems that come along on the way. That’s our goal.
Now what that means is we have to offer a large number of integrations, because there are so many tools out there that might end up in people’s stacks that they might need to monitor on. We have a huge number of integrations, which is part of the context to the problem that we faced, the scalability problem.
But then also, part of the context of the problem that we’re going through today is just the fact that Datadog, over the last couple years, has grown an awful lot. Which is a good problem. Today we consume trillions of data points every day, which is a ton. We have lots and lots of new customers. This is wonderful. But with all that came a lot more support tickets, a lot more people reaching out with questions about, again, “There’s a problem and I can’t figure it out with how the product’s working.” “Can you add this or that feature?” Or, “Hey, how do I set this up? I’m not sure how I can get to this end goal that I have.” And again, the solutions team, our team, was the one that would handle all those. So we just had a huge volume increase in terms of how many problems we needed to solve on a regular basis. Before I get too far into the actual demo, though, I do also want to point out our careers page. We’re still growing a lot and we love HashiCorp. We use it a ton internally in Datadog, so I would love to have some new engineers come from this conference. That would be awesome. I did want to make sure to encourage you all to take a look at our careers page. It’s a good thing to do.
But moving on. When I say we have a lot of integrations, I mean a lot, a lot. This is just a GIF of me running through our integrations page, and it just takes an awful long time just to scroll down the whole thing. So, that’s a lot. And again, why this is part of a problem that we were facing is that on the solutions team, there are 200 integrations and counting. There are all these integrations that a customer could ask a question about—“How do I set it up?” “There’s a problem with it.” “Can you add something to it?”—That the solutions team would have to handle. There’s a huge range of the different kinds of technical questions we would be faced with on a regular basis.
Again, the question is, How do you support it all? All these integrations. How do you support them all? When in fact, every customer environment out there is unique, is different in its own way. And when you have 200 and counting integrations, nobody can actually have working knowledge of all of the things, let alone expertise in all of the things, when something’s going wrong or teaching somebody how to do it. So what do you do about that? Well, you make test environments, which means virtual environments. Which for us on the solutions team simply meant Vagrant.
Very early on, we started using Vagrant. While I was preparing this talk, I asked the first team lead on the solutions team, “Hey, Garner, when did we start using Vagrant in the solutions team?” And he just stopped and laughed, and went, “Since the beginning. There was never a time where we didn’t use Vagrant.” Our dev team was already using Vagrant to provision our dev environment, so it was just the natural thing for us to use on the solutions team.
The problem is, on the solutions team, we were pretty lazy about the way that we used Vagrant. So what we would do is, of course, just “vagrant init ubuntu/precise,” or whatever kind of box you wanted, then “vagrant up” to spin up the box, and SSH into it. And then you’d actually do the stuff, in order to build an environment that reproduced the issue or the question that the customer had.
Now, the thing is that we would be doing all that stuff manually. All the setting up of an environment that would have the kind of tools that the customer was asking about or had a problem with, you’d have to install manually. What does that mean? Maybe somebody has a question on how to set up the Tomcat integration with Datadog. That’s not maybe something that we use a whole lot internally, so we’d have to spin up a box, install Tomcat, install Datadog, and configure the integration for it before we’d finally be getting the data into our Datadog account. Then at that point, we could start tweaking the environment to mimic what the customer was doing. There was an awful lot of setup work involved in any one reproduction environment, a testing environment for these tickets.
We had some problems with this manual approach. One, specifically, is that each engineer ended up having tons and tons of Vagrant boxes, which turned out to be very hard to keep track of. That’s just a GIF of me running “vagrant global-status,” with all of my Vagrant boxes just pouring out there. Any five or six of them might have been running without me realizing it, which eats up an awful lot of RAM and is just hard to keep track of and manage.
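As a rough illustration of how one might check how many boxes are quietly running, the sketch below counts the machines that `vagrant global-status` reports as running. It assumes the usual table layout of that command's output (a header row, a dashed separator, then one row per machine with the state in the fourth column). The function name `count_running` is made up for this example and is not something from the talk.

```shell
# Count machines whose state column reads "running".
# Reads `vagrant global-status` output on stdin.
count_running() {
  awk '
    NR <= 2 { next }            # skip the header row and the dashed separator
    NF == 0 { exit }            # the machine table ends at the first blank line
    $4 == "running" { n++ }     # columns: id, name, provider, state, directory
    END { print n + 0 }         # print 0 when nothing is running
  '
}

# Usage (requires Vagrant on the host):
#   vagrant global-status | count_running
```

Stopping at the first blank line keeps the explanatory footer that Vagrant prints after the table from being counted.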
Another problem we had was that, as you can imagine, it would just take an awful lot of time to get to the point where we could start reproducing a customer’s problem or issue. And with more and more tickets coming in—again, the company’s growing very well—we had lots and lots of tickets. When it takes so long to build a reproduction environment, a testing environment, it just makes it less feasible for us to properly investigate what the customers needed us to investigate. That was directly impacting the quality of the support that we were able to give.
The third problem was that, across the team there was just a ton of overlap in terms of the kinds of environments that we were building. A lot of us were doing the same work over and over again. And then on top of that, we found ourselves reproducing our own work individually. Maybe it’s been three weeks since the last time I had to work on a Tomcat ticket, and the environment that I have is all tainted by the fact that I had changed it a lot for that other ticket. Now I have a new ticket that came in about Tomcat. I would have to install it again, but it’s been a while, so I’d have to remember how to do it all again manually, which means I probably have to start over from scratch, run back over the documentation and relearn it all again, and that just took more and more time. Lots of redundant work.
So those are the problems we were facing. What were the solutions? Well, it turns out that the solution that we ended up going with was already at our fingertips. The provisioning feature within Vagrant is something that as a company we knew about. In fact, our dev teams were using it already an awful lot, very heavily. On the solutions team, we knew about it, but we just hadn’t used it very much. We hadn’t really thought a whole lot about the kinds of problems it could solve for us.
It’s a simple idea. It would be the same use case, it would be the same init, Vagrant up, SSHing into it and all that. But the difference is when you provision, you can have all of this stuff that you need to do in order to set up the test environment, you can have it all happen automatically when you hit Vagrant up.
The thing you have to do is take all those manual steps you would take to install Tomcat, the Datadog agent, the integration for it. You just had to save them to a script that you would run, a provisioning script, if you will. Which, if you want to get all fancy, you can do Chef, Puppet, and Ansible and all that. But what we were working with were simple test environments. So Bash was certainly enough for our use case.
You just needed to save all those steps in a Bash script. And then we had to find a good way to share those provisioning scripts with each other, which is as easy as a GitHub repo, right? So, that’s what the solution was. In order to get it working and share it with our entire team, the key was to get a protocol in place, so that way we could build the test environments, which we ended up calling sandboxes, in a way that we could easily share them quickly with each other and not have to worry about logistical problems.
There are five things that went into each of our sandboxes. This is just a directory tree of what our sandbox repo looks like. I’ll go through all the important pieces to it. Again, there are five of them. The first being, you need a working directory to run in. Each working directory would have three tiers to them. The first being what kind of operating system you’re using. In this example, we’re using an Ubuntu box, but it could be any Linux environment, Windows, what have you.
The second-level tier being the provider, if you will. In this case we’d be using like a Xenial 16 box. You could also do something like a Trusty 14, Precise 12, whatever. And then the third tier of the directory structure would be a directory for the box itself, the sandbox that you’d be running. It would generally be something you’d want to make a name that’s descriptive enough to say what kind of sandbox you’re running in this case. In this case, this would be a Kafka environment, which, if you look a little closer, will have ZooKeeper as well. And what this box would do is, if you were to use it, it would spin up a small Vagrant box. It would install Kafka and ZooKeeper and all that, the Datadog agent, the configuration files, so that way you can get the metrics for Kafka and ZooKeeper into your testing Datadog account right away. And then you can start working to try to reproduce the customer’s issue.
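To make the three-tier layout concrete, here is a minimal sketch that scaffolds one sandbox directory. The tier names (`ubuntu`, `xenial16`, `kafka`) and the helper name `new_sandbox` are assumptions chosen to mirror the example described above; the actual repo layout may differ in its details.

```shell
# Create the three-tier working directory for a new sandbox:
#   <root>/<os>/<release>/<sandbox-name>/
new_sandbox() {
  dir="$1/$2/$3/$4"
  mkdir -p "$dir/data"                      # extra config files and helper scripts
  touch "$dir/Vagrantfile" "$dir/setup.sh"  # placeholders, filled in per sandbox
  echo "$dir"
}

# Example: a Kafka/ZooKeeper sandbox on an Ubuntu Xenial box.
# A throwaway root keeps this demo from touching a real checkout.
demo_root="$(mktemp -d)"
new_sandbox "$demo_root" ubuntu xenial16 kafka
```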
The second thing you need is a configuration file, because every sandbox user, all the solutions engineers, they have some user-specific variables that would go into their environment. If you’re familiar with Datadog, you’re familiar with things like the Datadog API keys, of where to send data, application keys to act as a user with your Datadog account via the API. Tags, often times you want to be able to tag the environment by who started it, and you want to test with certain default tags all the time, host name, base string, if you will.
So we needed a way to let people have their own user-specific variables that they’d pipe into their sandboxes. We would just have that in everybody’s home directory under a hidden file, that sandbox.conf script right there. This is a quick look at what that script looks like. Those are just the four variables that are required, and then you could also add custom or optional ones. That would be used for some environments, if you wanted default passwords that were just to stand in place to make sure you can use certain features within some integrations. This is more or less what it would look like. Just a few options that you would set once. And as soon as you had that, you’d be able to use the sandbox environment.
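The slide with the real file is not reproduced in this transcript, so the following is only a sketch of what such a user-specific file could look like, plus a small check that the required values are present. The file name, the variable names, and the `check_sandbox_conf` helper are all hypothetical; the talk only says that the required values cover an API key, an application key, default tags, and a host name base string.

```shell
# ~/.sandbox.conf.sh (hypothetical name), sourced by every setup script.
DD_API_KEY="replace-with-your-api-key"   # tells the Agent where to send data
DD_APP_KEY="replace-with-your-app-key"   # lets scripts act as you via the API
DD_TAGS="owner:your-name,env:sandbox"    # default tags for every sandbox
HOSTNAME_BASE="your-name"                # prefix for sandbox host names

# Optional extras, for example a stand-in password some integrations need.
SANDBOX_DEFAULT_PASSWORD="change-me"

# Fail early if a required variable is missing or empty.
check_sandbox_conf() {
  missing=0
  for name in DD_API_KEY DD_APP_KEY DD_TAGS HOSTNAME_BASE; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "sandbox config: $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Keeping the file as plain shell assignments means a Bash setup script can load it with a single source line.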
The third thing you’d need is, of course, a file to configure the actual virtual environment. For Vagrant, it’s a Vagrantfile. One of the requirements that we had here was that we needed a solution that wouldn’t be hard for the solutions engineers, the support engineers who were doing all this stuff, to adopt. There was a reason why we were lazy Vagrant users. The time we wanted to spend is on reproducing problems and solving customer issues. We don’t want to be learning how to use the system that we have in place. So, it’s convenient that the Vagrantfile was able to be super, super simple. When you’re making a new sandbox, the only part you have to change about this file is the top one, the kind of box that you want to run, which is easy enough to put in. In this case it’s a Trusty box. The important parts of this Vagrantfile are the last four lines, the provisioning lines. What those do is take the user-specific variable file, the configuration file, and plant that into the box; take the setup script and plant that in there; and take all the supporting documentation and plant that into the Vagrant virtual machine as well. It plants it all there, and then it runs the provisioning setup script there, which we’ll get to right now.
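The Vagrantfile from the slide is not included in the transcript. As a hedged reconstruction of the idea, the sketch below writes a minimal Vagrantfile with one box line and four provisioning lines, using Vagrant's standard `file` and `shell` provisioners. The config file name `~/.sandbox.conf.sh`, the guest destinations, and the helper name `write_vagrantfile` are assumptions for illustration.

```shell
# Write a minimal Vagrantfile for a sandbox directory.
#   $1 = sandbox directory, $2 = box name (the one value that changes per sandbox)
write_vagrantfile() {
  cat > "$1/Vagrantfile" <<EOF
Vagrant.configure("2") do |config|
  config.vm.box = "$2"

  # Plant the user's variables, the setup script, and the data directory in the VM,
  # then run the setup script as the default SSH user.
  config.vm.provision "file", source: "~/.sandbox.conf.sh", destination: ".sandbox.conf.sh"
  config.vm.provision "file", source: "setup.sh", destination: "setup.sh"
  config.vm.provision "file", source: "data", destination: "\$HOME/data"
  config.vm.provision "shell", inline: "bash ~/setup.sh", privileged: false
end
EOF
}

# Demo in a throwaway directory with a Trusty box, as in the talk's example.
demo_dir="$(mktemp -d)"
write_vagrantfile "$demo_dir" "ubuntu/trusty64"
```

Because only the box name is a parameter, creating a new sandbox stays a one-line change, which matches the low-friction requirement described above.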
The fourth thing is you need the setup script. This is the script that will have all the magic that happens for a sandbox, all the installation steps, the preparation of the box. That would all happen in the setup.sh file. There’s an example at the top just to give somebody some guidance. And then each directory will have a setup.sh. The important thing is it would have to start with sourcing the user-specific variable configuration file.
So that’s where it’ll have to start, but then after that you can have it do whatever you want. Install Tomcat, install Postgres, install Kafka, whatever it is you might need for this specific test environment.
The last thing you’d need there is an additional directory within the sandbox directory, we would call it the data directory, where you could have additional files, whether configuration files or other additional scripts that would be useful for setting up this environment but aren’t things that you would necessarily be able to just keep in a Bash script. The ZooKeeper configuration files, some of the Datadog integration files can get pretty long. You don’t want to have to write a way of creating those files in the setup script itself. It’s nice to be able to just move them or copy them over. So that’s where basically anything else you might need in a setup environment that isn’t captured in either a variable or the setup script itself, you would just throw in the data directory. And that would just get copied over.
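A setup script along these lines might look like the sketch below, which sources the user's variables first, installs a package, and copies a long integration config out of the data directory. Everything specific here is an assumption: the config file name, the `postgresql` package on an apt-based box, the Agent config directory (which varies by Agent version), and the `DRY_RUN` switch, which this sketch adds so the steps can be reviewed without changing the machine. The Agent install command is deliberately left out rather than guessed.

```shell
#!/bin/sh
# setup.sh: sketch of a per-sandbox provisioning script (names are hypothetical).

# 1. Always start by sourcing the user-specific variables, if present.
SANDBOX_CONF="${SANDBOX_CONF:-$HOME/.sandbox.conf.sh}"
if [ -f "$SANDBOX_CONF" ]; then
  . "$SANDBOX_CONF"
fi

# Assumed location of Agent integration configs; adjust for your Agent version.
AGENT_CONF_DIR="${AGENT_CONF_DIR:-/etc/datadog-agent/conf.d}"

# With DRY_RUN=1 (the default in this sketch) each step is printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 2. Install and configure whatever this sandbox needs.
provision() {
  run sudo apt-get update
  run sudo apt-get install -y postgresql
  # Install the Datadog Agent here, using DD_API_KEY from the sourced file.
  # (The exact install command is omitted; take it from the Agent documentation.)
  # 3. Copy long config files from the data directory instead of generating them.
  run sudo cp "$HOME/data/postgres.yaml" "$AGENT_CONF_DIR/postgres.yaml"
}

provision
```

Inside a sandbox VM you would run this with `DRY_RUN=0` so that the steps actually execute.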
So those are the five things that we would need in all of our sandboxes. That’s the structure of the protocol that we’ve put in place for building and sharing sandboxes that were easy to use, that wouldn’t take time to run.
Now, how does it work? It’s super, super easy. First, you just go to the directory itself that contains the sandbox you want. The Kafka directory, the Postgres one, the whatever ... And then you hit Vagrant up, and that will spin up a new Vagrant box, run the setup script that’s proper to that directory for that sandbox, and it’ll prep everything for you so that way in 5 to 10 minutes or whatever when it’s done, you can just jump on in, and then further tweak it to reproduce the customer issues specifically.
This is a quick video of that. That’s me moving to a directory for a specific sandbox. This is a sandbox I’d never used before. One of my colleagues set it up. It’s a CentOS box that runs Postgres and the Datadog integration for it. I just Vagrant up, and that’s it installing Postgres, then the Datadog agent, and then at the end there, you can see the data started to pop into my test Datadog account. I didn’t have to do anything but go to the directory and Vagrant up. It took about 10 minutes to run. This is a shortened version of that video to save you some time, but that’s a good time to get coffee. Maybe work on another ticket, whatever. But the idea is I didn’t have to spend any time to get the environment set up that I needed in order to investigate the problem properly. That was the big win for us.
So what did it solve for us? Test environments at this point are now, as a team, build once and reuse. If my friend Joe over there built that CentOS box that installs Postgres, the Datadog integration for it, I don’t have to do that anymore. I don’t have to know anything about Postgres. I can just do this, get the environment running, and then I can tweak it, and I can already start trying out what the customer’s trying and then further understand it. You know, investigate it from there.
We’re all learning these technologies as we go, on the solutions team. It’s a great place to work, in a sense, because there are so many tools that you’re constantly learning about. You’re always learning new technologies. And what we found the sandbox becoming was an encyclopedia of all these different tools. You already have working examples of how to install Postgres, Tomcat, whatever, all there. And if you want to understand quickly how it works, you just run over the setup script for that particular box. You just read through it and you’ve got a nice example of how to install and integrate Datadog with it, which is great. It’s become a very nice tool for faster onboarding and training.
But we still had some problems with this. Specifically, because at this point, everything’s running locally. You can only have so many Vagrant boxes running at a time. I found once you get past say five or six on my machine, things would really start to slow down, and you’d have to start closing things off. So you have to pick and choose which test environments you want to have running at the same time, which is inconvenient.
But then on top of that, you also can’t share these environments with other people easily. So if you wanted to pass off an investigation, it’s already maybe 7 pm, you need to get home, but this ticket still needs work. You can’t easily pass it off to somebody in the next time zone, another solutions engineer, so they can continue investigating the problem. They basically have to set up their own test environment themselves and redo that work a little bit. Also, if you want to escalate a problem to a dev, which occasionally we would have to do, you couldn’t easily share the test environment. It was hard to communicate to them, What exactly have I done so far in this investigation?
So, we’ve made progress. We still had some problems. That’s where Terraform came in, though. These problems that we had, they were problems related to the fact that when we were using Vagrant, we were only using it locally, which is great that we can do that, it was all very helpful, but we found Terraform had an awful lot to offer here.
Terraform is something that in Datadog, we’d been using for a long time. Very, very early on, we started to adopt Terraform in our dev teams. We had heard a lot about it in the solutions team, but we hadn’t used it a lot. We hadn’t had a whole lot of reasons to use it per se. And we hadn’t had a whole lot of opportunities to try to find out what kinds of problems it could solve for us. So as we started looking into it, we realized that it can be used very similarly to Vagrant. It would be a gross simplification simply to say that as Vagrant is to virtual environment, Terraform is to cloud environment. But the way that we were able to use the two tools, that’s sort of how it ends up working for us.
So all of the simplification that Vagrant does for us in setting up our virtual environments, Terraform was able to help get us to that point to basically simplify how we spin up our cloud in test environments with the same ease of use. And that’s what I think is especially interesting, the way that Vagrant and Terraform were able to provide us with a solution to our problem was how easily they were both able to work together to offer us the same stuff, but to solve different problems, if that makes sense.
What our environments needed was the same thing, the same protocol was what we needed. We had all five of those things still there: the working directory to Vagrant up in; the local configuration file for your own stuff, your own variables; the configuration file for the virtual environment itself, which for Vagrant, again, is a Vagrantfile; the setup script to do all the magic; and then the directory to hold all the extra files.
We needed the same thing in this case with Terraform, but the only thing that had to change for our use case was, instead of having a Vagrantfile, we’d use a .tf file. Now, we did have to add a module that all of the sandboxes would use, in order to keep that similarity of use. Right now, again, you can do Vagrant up in a box and a lot happens to set up the Vagrant box. In order to get that use case in Terraform, we had to have this main.tf at the top of the sandbox to handle all the heavy lifting of setting up our environments. For us, that meant setting up very small EC2 instances. t2.micro is the default that we use in our AWS account. And that’s where all of our test environments using Terraform would end up. The main.tf is what does all the heavy lifting, but you also see each sandbox will have its own .tf file. In this case, it would be a Kafka.tf.
What does that configuration, that Kafka.tf look like? Similar to the Vagrantfile. It’s super, super easy. Again, another requirement that we had was that we needed to keep it simple, so that way we wouldn’t have to spend time onboarding our solutions engineers to use the system that we had. I had to be able to tell them, “Hey, it’s just like Vagrant up; it’s just using a different command.” And so that was a requirement for my use case here. To set this up, it’s just as easy as that Vagrantfile. In this case, you have to change two values instead of just one. One is you have to name your resource, which is easy enough. Just give it a custom name, whatever you like. And then the operating system, you just have to say what operating system you were using. You could add in something like Ubuntu, or Xenial, or Windows 2012 R2, whatever would be a good name for the kind of operating system you want. That string would just get piped into a dictionary in the main.tf that would map all of the possible operating system names that I would be expecting our solutions engineers to use and would map it over to the AMI that we would use that would correspond to that operating system.
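The per-sandbox .tf file is not shown in the transcript, so here is one plausible shape for it, written out by a small helper. It assumes the shared main.tf is consumed as a Terraform module three directories up and that the module accepts inputs called `name` and `os`. Those input names, the relative path, and the helper `write_sandbox_tf` are illustrative guesses, not the team's actual configuration.

```shell
# Write the small per-sandbox .tf file.
#   $1 = sandbox directory, $2 = resource name, $3 = operating system label
write_sandbox_tf() {
  cat > "$1/$2.tf" <<EOF
# Pulls in the shared module that does the heavy lifting.
module "$2" {
  source = "../../.."   # assumed path to the directory holding main.tf
  name   = "$2"         # custom name for this sandbox
  os     = "$3"         # label the module looks up in its OS-to-AMI map
}
EOF
}

# Demo in a throwaway directory: a Kafka sandbox on Xenial.
demo_tf_dir="$(mktemp -d)"
write_sandbox_tf "$demo_tf_dir" kafka xenial
```

Keeping the OS as a friendly label means nobody has to look up AMI IDs when creating a sandbox; that mapping lives in one place in the shared module.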
But those are the only two things that you would have to change in order to set up a .tf file for a sandbox. The heavy lifting would go into, again, the main.tf, which is stuff that I just had to write once, and wasn’t super complicated. When I did this, I didn’t have much Terraform experience, but it was something that I was able to figure out with just the documentation, and some trial and error, and work. It was 170 lines or so of configuration that does all the work of deciding what kind of boxes, where in AWS it will go, what region, etc.
The part that’s important and worth sharing here is the provisioning lines, which, if you look at them, are very similar to the Vagrant provisioning lines we went through before. Again, the three things it does is it grabs your individual user’s configuration file and it plants that in the box, it grabs the setup script and throws that in there, it grabs all of the attending documents in the data directory that would be necessary and it plants those in the box, and last, it runs the setup provisioning script. So it does the same thing, it’s just with Terraform instead.
So how does it work? Again, you just go to the directory, whichever one that has the sandbox that you’re interested in. And then instead of Vagrant up, you simply run Terraform get, plan, and apply. Get to grab the module, plan to, of course, see what you’re going to provision ahead of time before you do it, and then last to provision everything, the Terraform apply.
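The two workflows can be put side by side in a small wrapper, sketched below. The function name `sandbox_up` and its `local`/`cloud` modes are made up for this example. Note also that the talk predates newer Terraform releases, which require running `terraform init` (which also fetches modules) before `plan` and `apply`.

```shell
# Bring a sandbox up from its directory.
#   $1 = sandbox directory
#   $2 = "local" (Vagrant box on your laptop) or "cloud" (EC2 instance via Terraform)
sandbox_up() {
  case "$2" in
    local|cloud) ;;
    *) echo "usage: sandbox_up <dir> local|cloud" >&2; return 1 ;;
  esac
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "$1" || exit 1
    if [ "$2" = "local" ]; then
      vagrant up
    else
      # Fetch the shared module, preview the changes, then provision.
      terraform get && terraform plan && terraform apply
    fi
  )
}

# Usage (requires Vagrant or Terraform, plus cloud credentials for "cloud"):
#   sandbox_up ubuntu/xenial16/kafka local
#   sandbox_up ubuntu/xenial16/kafka cloud
```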
The really cool thing is that we use the same directories that we already had. We had the same setup scripts. The whole thing is the same sandbox, it’s just we added .tf files in addition to Vagrantfiles so that, at any point, you can go into them and either hit Vagrant up to run things locally or Terraform apply or get, plan, apply to spin up an EC2 instance that would hold your test environment and provision it appropriately, just the same way. They even use the same setup.sh scripts to prepare the boxes, which is what gets really interesting, I think.
So how does it work? Here’s another quick video of me using the Terraform version with the same CentOS 7 box that uses Postgres. That’s me going to the CentOS Postgres directory, hitting Terraform, for which I just use the alias TF: TF get. That was me authenticating with my AWS account, which was complicated to set up originally. But that’s me hitting Terraform plan to see what I’m going to do, I cleared, Terraform apply, and that’s where everything starts installing. There it is installing Postgres, then also the Datadog agent right there, and then also you see that data start to pop in there into my test account. And again, just like the other one, it took the same 10 minutes or so for it to run. Again, a good time to get coffee, work on another ticket, whatever you have.
But it’s the same use case, same user flow, and the beauty of it is that the solutions engineers who were using this product, this system that we had in place, this protocol, they didn’t have to learn a whole lot. I was the only one who had to learn how to set things up with Terraform, and of course I encouraged them all to learn as much as possible because once I found out what we have on our hands here, this is a super valuable tool. But it’s something that they can already start using without having to learn Terraform, without having to learn a whole lot about Vagrant provisioning, things like that. Which again, was a requirement for our use case.
So what problems did this solve for us? Now that everything was in the cloud, all of our test environments, everybody could run as many sandboxes as they wanted without having to pick and choose which ones would run and use their RAM, because things slow down. But you can also share sandboxes more easily that way which I’ll get to in just a second. The sandboxes can also run for longer, which is important because every once in a while you get a really complicated ticket, and in those cases, they take time to investigate appropriately. And sometimes you’d rather not have to shut things down. And you’d rather just be able to keep your examples running for more time, so you can properly troubleshoot. And so that extended the kinds of customer problems we could solve.
But then also we can share our boxes with teammates. Whether it’s because you want to send it off to the next time zone, escalate it to a dev team. In that case basically we could just share the actual box, give them access to it, so that way they can, from exactly where we left off, continue the investigation to solve the customer’s problem. There we could share the boxes themselves, not just the templates. Of course, in order to do that, we had to set up a fairly complicated way to authenticate the boxes both individually and across the team, so that way we could securely share our boxes with each other. But we were able to do that.
Now, those are the things that it solved for us that we had hoped it would solve. But then we also found that there were a bunch of other bonus points that we got from this whole Terraform setup that we really liked. One interesting thing is that, the solutions team, in addition to solving customer problems or answering their questions, taking their feature requests—because on the solutions team we end up getting to know the product in a very broad way very well, we end up being experts in the use of Datadog—that ends up being the team that’s the most suited to introduce the product technically to prospective customers. We do a lot of demoing of the product, and in this case, with the Terraform approach, quickly we could set up environments that make it easier to demo the product for using those tools that our prospective customers might be using that we aren’t using ourselves. So we can build more pointed demos, which are more interesting. That’s turning out to be a very good use of the sandbox. The second thing is that there’s a bunch of scripts that we were using, tools on the side we were using to help our support work, that we were maybe scheduling on cron jobs locally, but you’d have to make sure that your laptop was open at the time that it was scheduled to run for it to happen. We’re able to just throw these up in sandboxes instead, and have them able to be quickly provisioned so that it’s super easy to maintain those tools now, in a way that it wasn’t before. That’s something that we’ve gotten a lot of really good use out of. That’s been a huge plus.
The last part I want to talk about is just where we expect to go from here. The first thing that is the next big step we need to take is to further modularize the sandbox. Right now you’ll notice that each sandbox is limited to its directory. For every sandbox you want to have, it needs to have a directory, and so you have to set it all up. Sure, it’s easy enough to pick out a similar sandbox and then copy it and adjust it from there, but much better would be, instead of having a single setup script for each kind of sandbox, if we could break them down into setup scriptlets that match our integrations instead. That way you can have something like a configuration file where you just pick out all of the different integrations you want to install. That would just much better extend the sheer number of kinds of test environments we could build in a quick way.
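One possible shape for that idea, purely as a sketch: a plain text file lists the integrations wanted, and a runner executes one scriptlet per entry. The file format, the `<integration>.sh` naming, and the function `run_scriptlets` are all hypothetical, since the talk describes this only as a future direction.

```shell
# Run one setup scriptlet per integration named in a list file.
#   $1 = file with one integration name per line (# starts a comment)
#   $2 = directory holding <integration>.sh scriptlets
run_scriptlets() {
  while IFS= read -r integration; do
    case "$integration" in
      ''|'#'*) continue ;;               # skip blank lines and comments
    esac
    if [ ! -f "$2/$integration.sh" ]; then
      echo "no scriptlet for: $integration" >&2
      return 1
    fi
    # Redirect stdin so a scriptlet cannot swallow the rest of the list.
    sh "$2/$integration.sh" < /dev/null || return 1
  done < "$1"
}

# Usage, with a list file containing the lines "kafka" and "zookeeper":
#   run_scriptlets integrations.txt scriptlets/
```

With this layout, a new kind of test environment becomes a new list file rather than a new directory and setup script.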
So that’s, I think, where we’re going to go with version 2.0 of our sandbox. Less urgent probably, another thought would be to find a way to share our Terraform states. Because right now, our states don’t really know anything about each other. So, sure, it’s easy enough right now for us to just jump on Slack and see who’s running what kinds of sandboxes, but I can see it being helpful if there was some command-line way to find, like a Vagrant global-status type thing where you could see, What sandboxes are up? So I can just quickly jump into them and try something out really quick, without having to necessarily spin up a whole ’nother one. That’s probably less urgent, but it’s another thought as to where we might go.
But those are just some thoughts, and some people here might have some questions, which I’m happy to take after this, but may also have some suggestions as to where to go with all of this. The next few minutes I’m just going to spend over in that corner in case anybody does have any questions or suggestions. But outside of that, feel free to email me them instead. Again, it’s stephen.lechner@datadoghq.com. Hopefully you found this all interesting, and thanks for listening.