Nomad and Vault at CircleCI: Security and Scheduling are Not Your Core Competencies
See why CircleCI has adopted Nomad and Vault to solve its most challenging security and scheduling demands.
When thinking about whether you should build something in-house or adopt a ready-made solution, it’s incredibly easy to fall into the trap of, “that’s such an easy problem, we’ll just build it ourselves.” In this talk, CircleCI CTO Rob Zuber will talk about why CircleCI has adopted Nomad and Vault to solve his team’s most challenging security and scheduling demands, while decreasing complexity and increasing throughput on the things that matter to CircleCI’s customers.
In the past six years, CircleCI has grown from a few engineers building a CI platform for single-page web apps written in Rails and Python, to powering the delivery pipelines for tens of thousands of organizations across the globe, on infrastructure processing tens of millions of jobs per month. As CircleCI’s platform has grown, the need to focus and ruthlessly prioritize has grown with it.
Rob will talk about the importance of understanding your company’s core competency and value proposition to users, and how to factor that into the build vs. adopt decision. Rob will talk about how the “not invented here” syndrome still persisted for a time even at CircleCI, and how to help your team focus on delivering differentiated value to end users.
Speakers
- Robert Zuber, CTO, CircleCI
Transcript
How's everybody doing? Right on. I never expect an answer when I ask questions onstage, so we're off to a good start. Congratulations to me on creating the most ridiculous slide. It took a lot of effort to try to figure out how to fit this title on the slide. It was probably not really that well chosen, but I wanna talk a little bit about the things that are our core competencies at CircleCI, and how we make decisions about using other people's technology. As I mentioned earlier today, we use Nomad at the core of our product, and I wanna talk a little bit about that journey.
I'm gonna start with a story because that's always the best way to go. Every good story at a tech conference starts with an architecture diagram, and this is my favorite architecture diagram because it's very complex and there's a lot to reason about. I'll give you a little bit of time to sort of digest it. This is CircleCI, circa 2011, 2012. We have a simple code base. We have a server, maybe two, and we don't have a lot to reason about. We're just trying to figure out if we have a business. Every good monolith is also backed by a mono base. It's the database where you jam everything. You have no concept of separating out design components or parts of your code or whatever.
Our mono base happens to be MongoDB. It's a long story, but at the time only MongoDB, everything in MongoDB. If you're looking at this, and you know where this story's going, I can only advise that you fasten your seatbelts.
What was happening inside the monolith? It's easy to call it a monolith. It's a single piece of code, but what's described there is "app." What do we do at CircleCI? We take your arbitrary workload as one of our customers, and we place that into a container to execute a task, so 2011, 2012, kind of pre-Docker. Docker was just starting to form. We were using LXC containers directly. We had an app sitting on a very large host and partitioning that host up into separate LXC containers. Making decisions about where to put jobs as they came in. And getting all that information from the database.
Then one day you wake up and instead of a monolith, you have a few monoliths. Then you have problems. You have problems like, we need to coordinate across these different machines that have their own sets of containers on them. We need to make decisions about queuing and scheduling and priority, and we need to manage the state of what is happening in our system.
We had a hammer at the time. You already know what it was. It was called Mongo. So we said, okay, we're gonna use this tool to solve this problem. This is how to boil the frog of job scheduling, if you will. One step at a time. Does anybody here use Mongo? Somebody's used it somewhere in anger. There's a notion in Mongo, or a command, a tool called findAndModify. You say, "Find me the document that matches exactly this definition, these conditions and then change it to match this state."
I think everyone's well aware that there is not a concept of transactions, especially since this was Mongo 2.4, maybe. You can do atomic operations on single documents and they'll get validated against the filter that you've passed. So we said, "If the state of our entire fleet of these 5 boxes or whatever and all the jobs running on it looks like this, then change it to look like this. And if it doesn't, fail, and then we'll try again." It's a pretty reasonable way to make use of Mongo when you have 5 boxes.
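The pattern described here is optimistic compare-and-set on a single document. The following is a minimal sketch of that idea in Python, not CircleCI's code: `DocumentStore` is a hypothetical in-memory stand-in for the database, its `find_and_modify` method only imitates the "update if the document still matches" behavior, and the box and job names are invented.

```python
import copy


class DocumentStore:
    """Hypothetical in-memory stand-in for a store holding one fleet document."""

    def __init__(self, initial):
        self._doc = copy.deepcopy(initial)

    def read(self):
        # Hand back a copy so callers cannot mutate the stored document.
        return copy.deepcopy(self._doc)

    def find_and_modify(self, expected, replacement):
        """Replace the document only if it still equals `expected`."""
        if self._doc != expected:
            return False  # someone else changed the fleet first
        self._doc = copy.deepcopy(replacement)
        return True


def place_job(store, job_id, max_attempts=10):
    """Assign job_id to the least-loaded box, retrying when the write conflicts."""
    for _ in range(max_attempts):
        seen = store.read()
        updated = copy.deepcopy(seen)
        box = min(updated, key=lambda name: len(updated[name]))
        updated[box].append(job_id)
        if store.find_and_modify(seen, updated):
            return box
    raise RuntimeError("gave up after repeated conflicts")
```

With one writer this always succeeds on the first attempt. The retry loop only matters once several writers share the same document.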
Then one day, your fleet looks like this. This is an order of magnitude smaller than where we actually were as a business in total capacity, because I got bored making green boxes. Every one of those operations that describes the entire state of every job running on every single one of these machines has to get passed to Mongo, saying, "If the fleet looks exactly like I thought it looked when I got this document, then update. Otherwise, fail."
This is not good. This is not a good situation to be in. There were at least 2 things that went horribly wrong and then a whole list of additional things that went horribly wrong. The first is, as you can imagine, every time we tried to do that, we were wrong about what the fleet looked like because someone else, another monolith within that giant pool of monoliths, had made a similar request in advance, got queued in advance and then changed the entire state so we had to start over.
Separately in Mongo, which we were also using as a queue, because if you're not using a database as a queue, you're not trying hard enough, we were pulling jobs. Any one of those hundreds of machines could pull a job and then try to allocate it into the fleet. And when it failed, it would put it back and try again. Now we have hundreds of boxes all independently pulling jobs and then all independently failing to decide where to put it and then putting it back into the queue. So we got to a point where the more boxes we put in, the fewer jobs we could run. We were at a point of existential crisis as a business. Our job is to run jobs, and we cannot add any more capacity to our fleet. This is not a good state to be in.
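To see why adding boxes stopped helping, here is a deliberately simplified toy model, an assumption-laden sketch rather than a measurement of the real system. It assumes every worker reads the shared state before anyone writes, so exactly one compare-and-set wins per round and everyone else retries. The function name and return shape are hypothetical.

```python
def simulate(workers, jobs):
    """Count write attempts needed to place `jobs` jobs with `workers` racing writers."""
    version = 0   # stands in for the shared fleet document
    attempts = 0
    placed = 0
    while placed < jobs:
        snapshot = version  # every worker reads before any worker writes
        for _ in range(workers):
            attempts += 1
            if version == snapshot:
                # The first writer's compare-and-set succeeds.
                version += 1
                placed += 1
            # Every later writer sees a changed version and must retry.
    return {"attempts": attempts, "wasted": attempts - placed}
```

In this model the number of jobs placed per round stays at one no matter how many workers are added, while the wasted attempts grow with the worker count. The real system was worse than the model, since the failed attempts also loaded the database.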
Separately, somewhere around the same time, I think this was July of 2015, there was an incident on GitHub where for maybe 2 and a half or 3 hours, they were unable to deliver webhooks to us. It came back, and we were down for 24 hours. Basically, we were fine. We were just sitting around ... I was sitting in the Toronto airport on Wi-Fi waiting for hooks to come back, which was really interesting. The Pan Am Games were on in Toronto, so I was enjoying all the teams going by, but they came back and it was terrible.
We went through 24 straight hours of outage trying to get our systems back online because we slammed the queues and the job scheduling that I just described to you, the one that was super robust. I would say we rewrote a good chunk of our platform in the 24-hour period. One of the things we had to do was bring everything down to a complete halt and rebuild it from scratch because this document that represented the entire build state basically collapsed. Yeah, I could go on, but that's just the tip of the iceberg of problems that we had there.
This is an interesting point of reflection as you could probably imagine for us as a company. Our job is to run your jobs. We take a workflow, we break it up into jobs, we execute those on our platform. So it feels to us like job scheduling is a core competency or a core part of what we do. But ultimately, we were thinking about it the wrong way. As you probably heard at some point today, we use Nomad now at the core of our platform, and a big part of that was figuring out how we take advantage of things that other people have done to focus on delivering value to our customers.
I wanna dive in today to what I would call "How to Stand on the Shoulders of Giants." And 3 key concepts. One is understanding your core competency. Really understanding what it is that you do as a business and making smart decisions about focusing on that. The 2nd is mastering working with constraints and then 3rd, designing for change.
Understanding your core competency
Is anyone familiar with this book, "Domain-Driven Design"? That's disappointing. I'll tell you a couple things about this book. One, it is amazing content, and you should devour it and think about it and use it. Two, it's really long. It is a project to get through, and many people have tried to write it in different ways and written even longer versions of the same book, which is an interesting outcome. If anyone has really summarized this well and wants to put that out into the world, I beg of you, please do that.
Ultimately, what this book is about is understanding how your domain breaks down and then modeling that within your software. Those of us who have evolved a monolith know that we generally don't do a very good job of that. When everything is in the same code base, it becomes intertwined, and it becomes easy to add this other thing right here because I have access to this other code, et cetera.
One of the first exercises that you go through as you go through domain-driven design is to break down your domain. You break down your domain into subdomains. This is a little bit of what that looks like for us. This is a very surprising outcome because we had never done it before. If you look at some of the pieces around the perimeter, you see things that are identified as generic or supporting, and then in the middle you have things that are core.
At the core of our business, what do we do? We coordinate workflows and we execute jobs. And everything else, post-processing, which is streaming logs to a browser. Someone's figured that out. Managing identities and permissions and access control. Plans and payments. Charging credit cards. People don't come to CircleCI because they're really excited about how we charge their credit card. That is a thing that has been solved.
But until you make the effort to take your system and break it down in this way and really think about where you're adding value or as we think about it, the thing that people come to us and pay us for, then you spend time working on things that are really not your core.
What's interesting about that is, you can zoom in another level. The pattern breaks down again. As an example, I just called billing and charging generic. And when we look at our plans and payment system, we do billing and charging, which is literally, "I have a credit card, I'm gonna charge it monthly, I'm gonna charge it annually, I'm gonna put a discount on it." All those sorts of things. We have capabilities, which is, you define a plan and it has certain features that are available to it.
Those things are super generic and easily available off the shelf. But we do usage tracking. We measure how much you've consumed in terms of platform usage, and that is distinctly tied to our business. Taking out that one small piece and saying, "Okay, let's build this little piece, let's not build a whole platform. Just because we need to track usage doesn't mean we need to build a PCI-compliant credit-card-management system."
How do we break this down and think about it? Then let's talk about job execution. We just talked about task scheduling and how horribly wrong we went in terms of doing that, but to us that is a small piece of what we do in that core job execution domain. Resource availability is basically saying, "Do we have enough machines, are they online right now, and do they have the capabilities of the jobs that we need to run?" Auto-scaling groups. This is a thing that exists. We have some complexities, but this is a thing that exists out there. Task scheduling. This is a thing that exists. There are a lot of people doing task scheduling.
Environment construction and making sure it's perfectly suited to execute a CI job or to have the tools ready to do a deployment and make sure the secrets are in place, et cetera, that is core to us. That's the DSL that we provide to our customers over top of these machines to ensure that they can execute what they need to execute. Input and output. Did I cancel the job? Did it work successfully? Did it complete? Were there tests that failed, et cetera? Those are the things that are core to our domain and the things around the outside are not.
This sent us on a journey. We just talked about task execution. And we said, "Okay, that's not actually our core domain, what can we do to do task scheduling, how can we find a way to do task scheduling that aligns with what we need?" We did a little bit of a bake-off. If you don't recognize all of these, this is basically Nomad, Kubernetes, Docker Swarm and Mesosphere or Marathon, I'm not actually sure which is the product anymore. DC/OS, thank you. One of these many things.
If you look to the right, everything is now just a wrapper around Kubernetes anyway. We didn't pick any of the Kubernetes wrappers or future Kubernetes wrappers. A big part of this was that we needed something that would fit into our domain in terms of being performant. We execute builds, and we need this stuff to run quickly, and we're doing arbitrary job scheduling. We're not saying we wanna run 100 web servers and we wanna run 10 of this service, we are taking this pile of jobs that's coming at us that we have no idea what they are and asking a scheduler to find room for them in our overall fleet.
Nomad happened to be very, very good at this particular type of job scheduling. So we said, "Okay, we have a winner out of the tools that we've tried," and then we got to the next problem, which is, it doesn't do exactly what we want. This is an amazing place for all of us as engineers where we say, well this one does 97% of what we need, so let's build our own.
Mastering working with constraints
Do you wanna build 3% of a job scheduler or do you wanna build 100% of a job scheduler? The answer should be 3%, but somehow we convince ourselves that 100% is the right answer. I think in software development we are not good at recognizing our constraints.
Does anybody know what this is? Anybody? Somebody? Did anybody see "Apollo 13"? Thank you. So this is a carbon dioxide scrubber from "Apollo 13." Did anybody see that film? Please tell me somebody saw this film. Apparently the whole world saw it. These are constraints. Three people are about to die. You have tube socks, duct tape and this thing that totally doesn't fit. Make it work.
My constraints are, I have Vim and a blank page. I could build anything. We don't work with this mindset of, "What are my real constraints here?" So then we head into these projects and build these crazy things just because we could, but we're not focusing on what matters.
This is a quote from Charles and Ray Eames. I don't know if you know who they are. Furniture designers. You have sat in an Eames chair in your life, I promise you. One of the things that designers always talk about is the ability to take constraints and work with those constraints. My favorite part of this is the willingness and enthusiasm, the enthusiasm for working within these constraints. Pronoun adjustment was mine. It's an old quote.
We choose to build things ourselves, I believe, for 2 reasons. One is, it's our core business. That's a good reason. This is absolutely the core of what people pay us for. This is what we do. This is why we come to work every day. Let's do this. And the other is, we're too lazy to design to the constraints. We just don't think about them.
This was someone else's idea to quote me. Your constraints in software development are not just "Build this thing in my mind." They should be "Build this thing with these tools." What I mean by that is, at least run the thought experiment. For us, it was Nomad. If we sit down and think about it, can we build what we need to build using this tool? And there's a risk here.
Don't get me wrong. We are software developers. We will try to use every tool that we've ever seen or at least seen a blog post about and say, "Okay, I definitely need one of these in my stack." That's not what I'm proposing. But rather, as you evaluate whether there's something that you can take off the shelf and use, run the experiment in your head. It doesn't have to be complex. You don't have to spend a year building something with it and just say, "How can I reshape the problem to make this particular tool fit?"
What problems did we run into? I showed this chart a little bit earlier. We take a workflow, we break it up into CircleCI jobs, which is a representation of the work we're gonna do, and then we take that and break it down. When we set out within our jobs, we have this concept of parallelism. You can say, "I wanna run this job and I wanna run 5 of them at the same time." Nomad doesn't or didn't. Maybe it does now, but it definitely didn't then. We said, "Okay, I guess we can't use Nomad." Of course, that's the initial response.
Then we thought, "What if we just break it up ourselves? What if we take our concept of a job, break it into five parallel tasks," we call them tasks within our language, but they'd be Nomad jobs, and submit those to Nomad to execute. Then we lose the coordination of those parallel steps. This is the engineering thinking because today our product has this notion of lockstep between all of those things that are happening in parallel. So we go and talk to the product team, and we say, "We have this problem with using Nomad." They say, "Everyone hates lockstep. Every single customer wants that to go away." Oh. Then we have a perfect solution.
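Breaking one job with a parallelism setting into independent units of work can be sketched as below. This is an illustration only: the `fan_out` function, the dictionary fields, and the `TASK_INDEX` / `TASK_TOTAL` variable names are hypothetical, and the output is not a real Nomad job specification.

```python
def fan_out(job_name, command, parallelism):
    """Expand one logical job into `parallelism` independent task descriptions."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [
        {
            "id": f"{job_name}-{index}",
            "command": list(command),
            # Each task learns its own slot so it can pick its share of the work.
            "env": {"TASK_INDEX": str(index), "TASK_TOTAL": str(parallelism)},
        }
        for index in range(parallelism)
    ]
```

Each resulting task can be submitted on its own, so the scheduler never needs to know the tasks are related. The cost is exactly the one described above: nothing here keeps the tasks in lockstep.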
We're constraining ourselves based on this perception of how things can work, but if we change the problem a little bit to say, if we run all these jobs independently and allow them to finish and give you a workflow at a higher level where you as a customer can define what you want to be dependent and when you want things to be able to continue because the dependencies have been resolved, then you have a much more powerful solution. Side effect, we can free up a lot of capacity in our system and fit other things in at the same time and use a lot less compute to run our entire platform.
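The workflow idea, where each job starts as soon as the jobs it depends on have finished, can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). The `run_in_waves` helper and the job names in the usage note are hypothetical, and jobs run one after another here rather than in parallel.

```python
from graphlib import TopologicalSorter


def run_in_waves(dependencies, run):
    """Run jobs in dependency order.

    `dependencies` maps each job name to the set of jobs it waits for.
    Returns the jobs grouped into waves, where every job in a wave had
    all of its dependencies resolved by earlier waves.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        for job in ready:
            run(job)
        sorter.done(*ready)  # unblocks whatever was waiting on these jobs
        waves.append(ready)
    return waves
```

For a graph such as `{"test": {"build"}, "lint": {"build"}, "deploy": {"test", "lint"}}`, `test` and `lint` land in the same wave because neither waits on the other, which is where the freed-up capacity comes from.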
Additionally, we had this concept of injecting secrets. You need secrets within your jobs as they execute and we said, "Oh, we have this way to do this in our existing system, we can't find a way to do it in Nomad." Guess what? Vault cubbyholes are awesome.
Here's another problem. This is a typical day in the life of CircleCI. I have no idea if I can put this chart up here. I cut the metrics off the side at least. It's from Datadog for anyone who doesn't recognize it. Probably hard to see. I forget which color's which, to be honest. This is basically today from another day, a week before and a day before. What are we looking for? We're looking for changes in patterns of how much consumption we have and anything that's interesting happening in system. And we tailor our total available capacity to match the demand. It's the cloud. It's elastic. You should not have computers that you don't need.
One of the things that we have to do here is scale down. As we scale down, we drain jobs. We drain machines actually and free up all the jobs. The thing is, a job scheduler designed to run a website, when you say, "Shut down this piece of hardware," it says, "Oh, you must want me to take this job, kill it and start it somewhere else," which is exactly the opposite of what you want in a CI platform. Your CI build is an hour in. If I said, "Oh, we'll just start it over in this other place," you would not be very happy.
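The drain behavior a CI platform wants can be sketched as a small state machine: a draining machine refuses new work but lets running jobs finish, and only then becomes safe to shut down. The `Node` class below is a hypothetical illustration and does not reflect how Nomad implements draining.

```python
class Node:
    """Hypothetical model of one machine in the fleet."""

    def __init__(self, name):
        self.name = name
        self.running = set()
        self.draining = False

    def accept(self, job_id):
        """Take a new job unless this node is being drained."""
        if self.draining:
            return False
        self.running.add(job_id)
        return True

    def finish(self, job_id):
        """Record that a job completed on its own."""
        self.running.discard(job_id)

    def drain(self):
        """Stop taking new work; running jobs are left alone, not killed."""
        self.draining = True

    def safe_to_terminate(self):
        """True only once draining has started and every job has finished."""
        return self.draining and not self.running
```

The contrast with a web-service scheduler is in `drain`: it never touches `running`, so an hour-long build is not restarted somewhere else.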
Great news. Nomad is also open source. We wrote a patch to allow the drain to happen correctly and our builds to close out properly, completing the job. Again, 3% versus 100%, especially when you're choosing open-source tools, you have the opportunity to modify to meet your specific needs and allow you to get all of the complexity and capability that comes from someone else's product and use it effectively with a minor addition of your own.
Designing for change
Finally, we have a solution. We're excited about this solution. But how excited are we about this solution? Are we willing to commit everything? Are we gonna go all in and say, "You know what, let's bake the assumption that Nomad is gonna be in our product forever and we'll just put it everywhere?" No, I don't think that's a really good idea. Let's talk a little bit about application design. Is anyone here familiar with hexagonal architecture? Ports and adapters? Clean architecture? Any of those things? Wow. Okay, I think there's maybe one. I can't actually see hands so I don't know why I'm asking these questions.
We have done abstraction incorrectly for a long time. I hope no one's offended if I say that. I will pick my favorite example, which is the ORM. We insulated ourselves for a very long period, and maybe people are still using them, I'm not sure, from our choice of relational database. And then took the assumption that we were gonna use a relational database and spread it all over our code. The notion of hexagonal architecture, ports and adapters, and other similar approaches is, I wanna take my core business logic and completely isolate it from any understanding of serialization. And serialization might be coming in.
I have an HTTP API, maybe that's REST, and tomorrow I wanna have a GraphQL. I use Rabbit for events but I'm gonna switch to Kafka. I have Postgres and Mongo and maybe I'm doing two different kinds of data storage with those two places. I don't want to know about that inside my architecture. I just want to know that this is a thing that needs to be persisted.
I wanna hand that off to a persistence layer and allow the understanding of the underlying persistence store to be represented independently. So the port is effectively an interface, for anyone who's done any kind of object-oriented programming, to say put this over here, I don't care what you do with it. Just make sure it's properly persisted. This allows me to do things like embrace the capabilities of my particular relational database.
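A port in this sense can be sketched as an interface that the business logic depends on, with the storage choice hidden behind an adapter. Everything named below (`BuildStore`, `InMemoryBuildStore`, `record_result`) is hypothetical and only illustrates the shape. A real adapter would wrap Postgres, Mongo, or whatever store fits that part of the data model.

```python
from typing import Protocol


class BuildStore(Protocol):
    """Port: what the core logic needs from persistence, and nothing more."""

    def save(self, build_id: str, status: str) -> None: ...

    def status_of(self, build_id: str) -> str: ...


class InMemoryBuildStore:
    """Adapter: one possible implementation; the core never imports it directly."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, build_id: str, status: str) -> None:
        self._rows[build_id] = status

    def status_of(self, build_id: str) -> str:
        return self._rows[build_id]


def record_result(store: BuildStore, build_id: str, passed: bool) -> str:
    """Core logic: decides the status, then hands it to whatever store it was given."""
    status = "success" if passed else "failed"
    store.save(build_id, status)
    return store.status_of(build_id)
```

Because `record_result` only sees the port, an adapter is free to use database-specific features internally without that knowledge leaking into the domain code.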
For a long time I heard this notion that ORMs were great because you would be abstracted from your change, let's say from MySQL to Postgres. But I think more commonly we say, "I need to go from Mongo to Postgres," or this part of my data model needs to go from Postgres to Redis or to Cassandra or something else that's more appropriate for it.
None of this insulation has protected me at all, but I've still managed to bury those assumptions deep inside my project. No one would let me use index hints. And that's crazy. Because I know exactly how I want this query to run, but I'm letting the database decide, and then I'm trying to trick my ORM into doing the thing that I already know I need to do. This is not an abstraction that's helpful to me.
We try to build our applications in a way that separates out the kind of work that we're doing from the mechanism that we're using to do it and that gives us the ability to make these changes downstream. We are super excited. We've been using Nomad for 2 years in production. It's working well for us. But we would not go down that kind of path at this point without building an interface that allows us to change it later if we need to.
I love this slide. I was trying to figure out a visualization for this. There isn't one. Just make great use of the tools you have. Use all the capabilities that they have but isolate them appropriately. Don't bake the knowledge that you are isolating into layers of your domain and application code.
How do we do this at a service level? There's a lot of stuff happening around this, but this is what this looks like inside of CircleCI. Nomad is a scheduler, so obviously the thing that schedules into the scheduler had to get more "-ers" (i.e., "schedulerer") on it. No one can say this word, which is really fun. Everyone gets super confused, but we have this thin service that does nothing but translate into the format needed for Nomad. Later, not only can we change a minor abstraction layer, we can just throw out a service and put another service in place that reads off the same queue and translates into a different language if that's what we need to do.
Then Nomad places our build agent onto a Nomad client, and that's where the work starts happening. All of the allocation of resources appropriate to execute this job is done by Nomad, but our knowledge of the fact that it is Nomad is isolated, allowing us to swap out any of these pieces down the road. The build agent is what then streams out events and knowledge of what's happening inside the application to downstream processing. After that it branches out into all of the log delivery and notifications and stuff like that. We have two very simple interfaces that understand the existence of Nomad and could be replaced very simply.
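The thin translation service can be sketched as a loop that reads domain tasks off a queue, converts them, and hands them to a submit function. This is a guess at the general shape under stated assumptions, not CircleCI's service: the task fields, the payload keys, and both function names are hypothetical, and the payload is not Nomad's actual job format.

```python
from queue import Empty, Queue


def to_scheduler_payload(task):
    """Translate a domain task into a scheduler-specific payload (illustrative shape)."""
    return {
        "Name": task["id"],
        "Command": task["command"],
        "Resources": {"CPU": task["cpu_mhz"], "MemoryMB": task["memory_mb"]},
    }


def translate_all(inbox, submit, translate=to_scheduler_payload):
    """Drain the inbox, passing each translated task to `submit`.

    Swapping schedulers means passing a different `translate` and `submit`;
    the producers writing to the queue are unaffected.
    """
    count = 0
    while True:
        try:
            task = inbox.get_nowait()
        except Empty:
            return count
        submit(translate(task))
        count += 1
```

Only `to_scheduler_payload` and the `submit` callable know which scheduler sits behind them, which is the isolation the talk describes.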
To recap, first of all you want to use other people's software. This is how you can focus on your own business and delivering the core value of what it is that you do. But in order to do that, you very much need to understand your business at multiple layers in terms of where is the value that I can provide, how am I delivering the thing that my customer needs, the thing that my customer comes here and pays me for every day and how do I not work on the things that are not doing that? This is a long exercise, but I highly recommend you go through it.
Second, figure out when you get to the point that you have that 80%, 90% fit, how to adjust the shape of the problem so that you can get 100% fit or augment it with small amounts of work instead of building an entire thing from scratch and taking on the maintenance and the overhead and the complexity of that over time.
Finally, insulate yourself. None of us are perfect. We don't make perfect decisions basically ever. But we feel good about the tools. Let's use them and allow ourselves to grow and change over time, whether it's choosing one tool and then switching to another or if it's starting with something you built yourself, but make sure the thing you built is tightly encapsulated so that you can replace it with an off-the-shelf tool if it gets to that point.
I mentioned this this morning. This is the same chart that I just showed you, but this is from yesterday. I also mentioned in 2015 GitHub delivering a couple hours of hooks to us and taking down our site for 24 hours. This is 24 hours worth of GitHub hooks arriving between 11 and 12. Sorry this is PDT because I'm lazy and didn't switch the Datadog thing to EPC.
We sort of just sat there and watched it. We said, "All right, well there's a bunch of jobs we gotta get done. Let's let those run." They ran. And we were back to normal. As soon as GitHub came out of their incident status, we came out of our incident status. This is a very different outcome for us, and this is founded on accepting that other people are better at writing job schedulers than we are.
What does that mean for us next? I kinda threw this in just 'cause it was in the title, but we have secrets-management challenges as well. We've made some small changes already to start adopting Vault in how we manage secrets in our platform. We're continuing to roll that out, and everywhere that we look inside of our platform and we recognize this is not the core of our business, we're taking pieces off the shelf and figuring out how to make those work for us. So that we can focus our team on the things that really matter for us and the things that will really deliver value to our customers.
All right. Thank you.