Case Study

How We Used the HashiStack to Transform the World of Roblox

Roblox is a heavy HashiCorp user. Learn how they use products like Nomad, Vault, and more.

Speakers

  • Rob Cameron
    Technical Director of Infrastructure, Roblox

Transcript

Hello, I'm Rob Cameron, and this presentation is about how we used the HashiStack to transform the world of Roblox.

I joined Roblox almost 3 years ago, and we've had an amazing adventure transforming our infrastructure.

One of the technology stacks we've used is the HashiStack from HashiCorp. Today we're going to talk about how we've used that and how it's enabled us to grow our business.

I always love to start with a quote or a funny story, and today I'm going to use my favorite quote:

"The best way to predict the future is to invent it."
—Alan Kay, the inventor of object-oriented programming and lots of other cool technology

What I love about it is, when building designs, building infrastructure, building services, a lot of times you want to take things off the shelf and just win as fast as possible.

And that's great, right? There are a lot of services out there or design patterns that may work. However, you may need to have different solutions or different ideas or options. When I look at the off-the-shelf designs, we need to say, "Is this going to work for me today? Is it going to work for me tomorrow? Or do I need a new design that maybe doesn't exist?"

This is the route we took when we looked into this project of how we're going to uplift our architectures and infrastructure into a modern world. For me, this project was about inventing the future and creating some new solutions using the HashiStack and other tools, and I'm really happy that we did it.

But let's talk a bit about how we got there. First, let me introduce myself a bit more. Why listen to this guy?

I am a technical director of infrastructure at Roblox, one of a few folks who take designs and architectures and look into the future to see how we can guide the company, specifically around the infrastructure stack.

This character that you see on this slide is my avatar on Roblox. I love Linux, so I'm of course a penguin, but, I don't know, why not have a chicken head? Why not have cookies for ears?

What I love about it is I can express myself on the Roblox platform as an avatar, by creating games or using things. I love things like making orchestration work, containers, Linux, Golang. I hate gluten, outages, and weird configuration file formats. Who needs that stuff?

What Roblox Is

Let's talk a bit more about Roblox.

When I first joined Roblox, I was like, "What is this? I'm Rob; I love lox. Roblox. It seems like a match. But what does Roblox really do?" It's confusing if you don't have a kid in your life who's 9 to 12 years old, which is our primary target market.

Roblox is primarily about this concept of powering imagination. The idea is that in this 3D environment we walk through, how can I express myself as an avatar? Maybe I want to be a penguin with cookie ears. Maybe I want to make a game where I'm fixing printers or I want to make a game where I'm building HashiStacks.

That's what Roblox is about. It's not a prescriptive way to play; it's not a prescriptive way to dress yourself. It's a way for you to do what you want, for that imagination.

First, it's a massively multiplayer online game environment. The idea is that you can play with players all over the world. You don't need to sit close to them, bring your computer over, or worry about latency. We deal with all of that for you.

If you're in the United States and you want to play with somebody in France, go ahead. We'll figure that out and give you the best possible gaming experience.

Now, who makes these games that people play? We have what we call creators or developers, external to Roblox, who create these experiences that you can play.

What I love about it is you could right now download Roblox Studio, start building a game, publish it, have it globally deployed on our infrastructure, and basically you just hit go.

My only suggestion is that you watch the rest of my talk first. Don't start creating these things just yet; do it at the end, and I definitely think it will be fun for you to do.

You could even make a virtual meeting room; are you tired of boring meetings?

We have over 150 million monthly active users, or what we call in the industry MAU. People come from all around the world to play on our platform, but what exactly do they play? Maybe you want to hack and slash through a dungeon. Maybe you want to hang out in the community and build a house. Maybe you want to go back to high school and have a fashion show.

I didn't do fashion shows in high school, so maybe that's something I'd want to do through the Roblox platform.

One of my favorite games is called Work at a Pizza Place. It's been on our platform for over 10 years.

What do you do in the game? You work at a pizza place. People come in, you take their order, you have people that will make the pizzas, people that will deliver them. That seems crazy, right? To me it's not. Maybe that's just fun to do. I love playing that game with my nieces and nephews, and I think it's just a fun experience. You don't always have to shoot people to have fun.

It could be anything you want. To me, that's powering imagination.

The Roblox Stack

Let's talk about the Roblox stack and how this operates, so that way we can talk about how we've used the HashiStack to empower it.

The Roblox engine uses globally distributed compute. We want to place compute as close to the players as possible to reduce latency. No one wants to wait 200ms, 300ms, 400ms to be able to jump or deliver a pizza, so we place the compute as close as possible.

We reduce player latency so players have the best possible gaming experience. No one wants to lag out, where you're driving a car and all of a sudden you fly right off the universe. That's a horrible experience.

By minimizing that through a series of game servers, technologies, and various elements, we reduce that challenge. And lastly, and one of the cool reasons that I joined Roblox: We use this awesome matchmaking technology to place you as equidistant as we can between the other members in your party so everybody has the best experience.

That way, instead of you saying, "I want to play on the West Coast servers, which are close to me in Los Angeles, even though my friend in France is not going to have a good experience with that," we figure that automatically.

You don't have to know. You just hit play and we do the rest for you.

To build out this Roblox stack, we also have a centralized platform. In the gaming industry, the game server is where the games are played, and the platform is where all the boring stuff happens.

This is done by having regional datacenters that power this experience. We have North America, Asia-Pacific, and Europe, and various states of moving services around. That way, we can have low-latency services for our own services.

We use a mix of technologies: a lot of C#, which is a great language for building high-scale software, plus Golang, Java, Node.js, Lua, and some other things. We want our internal developers to be able to deliver on the player experience without technology being a particular limitation.

We originally were Windows-centric. This may seem weird, because we see Linux everywhere, and there are penguins all over my presentation.

It's actually very common to use Windows for software development for games and to use that same Windows stack to run the games. This journey, though, is about how we moved away from Windows to empower us by using Linux.

I love Windows, Linux, Mac. I use them all equivalently all the time, all day—just switch in and out. It's all about the best tool for the job, and I think that's what it comes down to, to drive the player experience.

On the Roblox infrastructure side, we use a hybrid cloud model. Our goal is to build a Roblox cloud for cost savings. We wanted to focus on empowering the players and not be stuck with somebody else's stack and how they use it.

This gives us the flexibility that we need to build services and use specific hardware designs that empower what we do. We focus on this by using our bare-metal infrastructure layer. This is very difficult to do. Do you know why?

Managing servers, being able to build everything out and control everything that you're doing, it's tough. And if it wasn't tough, cloud wouldn't be a $1 trillion industry. But it's been highly advantageous for us, and it's really helped us to make the company provide a better player experience overall.

We also use customized designs where we pick specific hardware that empowers our workloads. If we want to take advantage of a new processor design, some new memory architectures or whatever, we're able to pick that and cycle the hardware in and out.

I love that, because we've seen instances where we've moved over from one processor type to another and doubled, tripled, or even quadrupled capacity, at a lower cost.

Of course, we still use cloud providers. They're really advantageous when you want to do things like bursting compute or using other services. We always focus on best of breed, so if a cloud provider has a datacenter in a region that makes sense for us, we'll take advantage of that, weighing location as well as cost.

The lower our costs, the more we can focus on hiring developers to make a better gaming experience and the less we worry about how we're going to pay next month's cloud bill.

We do some flexible bursting where, if we want to have a big compute job, some machine learning, whatever it may be, we burst into the cloud and just extend those bare-metal environments.

That's been really good for us, because we don't have to commit to buying a lot of hardware sometimes, or if we want to do some experimentation, it's very flexible.

Lastly, we use a lot of specific services like object stores, queueing, whatever it may be. We always want to take advantage of best-of-breed stuff. Sometimes we take these services in-house; sometimes we focus on using them outside.

Again, it's about the advantage and us being able to have a winning hand for our players.

In the Roblox Cloud itself, we have 2 really distinct elements. The first is edge compute, which is becoming a hotter topic. Typically, with cloud providers, you see regional datacenters. This is great, but the problem you run into with a regional datacenter is that it's not close to the player.

By building edge compute, we're able to bring the compute to the players, lower latency, and ultimately we can attach to cloud providers to get that hybrid type of experience.

The rest, as I mentioned, is in datacenters. These are core services, all the boring stuff: hundreds of little components or microservices that run things, not the fun gaming things. We also have these regionally located.

Again, advantageous for us and how we want to choose and place them, and, again, attached to cloud providers. To me, this is that hybrid cloud model. We build our own cloud, we use external cloud, we're winning all around.

Enter the HashiStack

How did we get started using HashiCorp products? I joined in December 2017 with the mandate of migrating our workloads from Windows game servers to Linux.

I love Linux and have used it for as long as I can remember, but how do we get there, and how do we build this out?

We had to build out the requirements of what was needed, all of the different capabilities. It took dozens and dozens of meetings to create a plan we could roll out. Once everyone got excited about the advantages we could have, we took a look at how long it would take.

When I joined, the goal was 24 months to change thousands of servers, but that rapidly shrank to 9 months. It seems crazy and, trust me, it was.

The thing is, it's not just about Linux. It's about optimizing our workflows. We needed to deploy services at scale, rapidly, to thousands and thousands of nodes, and we needed a design pattern that would grow with us.

We also needed to be able to manage secrets; each one of these servers uses very sensitive information to connect, manage, and interact with our platform. We don't want any of that information to get out, so how can we manage those things and rotate them appropriately?

Lastly, to do this, do we need 100 people? Do we need 1,000 people? How many people does it take, and how long does it take to build it? We need to be able to quickly manage and build a site so we can get players into the game and enjoying the things that really matter.

Orchestration Is Needed

In today's world, orchestration is the solution: having centralized services that you can use to ship code globally. You don't want to interact with each individual box; you want to interact with an API or interface and say, "Go put this out there for me."

We also want to be able to do configuration management. We don't want to go to each box and update things; there are a lot of tools to do this. We want it to flow very simply.

We also want that secrets rotation. I hammer on this because people often will just bypass security because it's inconvenient. I like to say that if you make something so easy that you cannot not do it, you'll be forced to do it. And that's what I think about secrets and why I hammer it home.

What orchestrator do we use? At the time of this project, around the end of 2017, early 2018, we looked at the big 4 players.

  1. Docker Swarm: An easy tool, very simple to set up and scale, with some features and capabilities that were nice. But we ultimately chose not to go with it, and the industry made the same choice. It still works fairly well, but I think in the long run it will go away.

  2. DC/OS: I had some previous experience with DC/OS, which runs on Mesos. Great solution, highly scalable, a lot of folks have used it. We just found challenges around being able to run Windows workloads.

  3. Kubernetes.

  4. Nomad, from HashiCorp. This being HashiConf, it's unsurprising that we didn't choose Kubernetes. But it is really crazy, because Kubernetes is the market leader for orchestration. With all of its capabilities, why would we choose Nomad?

Nomad is not just a singular product. Nomad is a solution space comprising Nomad, Consul, and Vault. Nomad provides orchestration and container management. Consul provides service discovery, distributed locking, a lot of different features, and now service mesh. And Vault provides the secrets management.

What I love about Nomad is it doesn't have a prescriptive design pattern. I could use it right now on 1 node, 5 nodes, or 10 nodes, with or without TLS, with Docker and non-Docker workloads, and with an extensible driver system to be able to run anything I want.

This is the flexibility and the future I wanted to invent to be able to build out.
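
To make that concrete, here's a minimal sketch of the kind of single-container job spec you might start with. The job, datacenter, and image names are all illustrative, not our production config, and the syntax assumes a recent Nomad release:

    # Minimal Nomad job spec (HCL); all names are illustrative.
    job "pizza-orders" {
      datacenters = ["us-west"]
      type        = "service"

      group "api" {
        count = 3

        network {
          port "http" {
            to = 8080 # container port
          }
        }

        task "server" {
          driver = "docker"

          config {
            image = "example/pizza-orders:1.0.0"
            ports = ["http"]
          }

          resources {
            cpu    = 500 # MHz
            memory = 256 # MB
          }
        }
      }
    }

From a starting point like that, you can layer on TLS, constraints, and other task drivers as you grow.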

We're also able to use as much as we want at first and then grow with it. In the beginning, we were doing very simple tasks like single pod or container per job, and we've expanded to using very complex design patterns with persistent storage from companies like Portworx. And it also supports Windows.

We can use Windows containers. We can do random Windows stuff and even extend it to run custom Windows services. This is important to us, because while we always want to move over to microservices and Linux, you don't want to bet that you're going to do it rapidly or rush it. You want to do it gracefully over time, and I think that's what Nomad provides for us.

We started on 0.7.1, and now we're always keeping within one major version release, so we're not held back.

Every new release has provided more features, and it's been an awesome experience for us so far.

I want to address this again: Why not Kubernetes? It's an awesome tool. It's super powerful. There are a lot of design patterns, a lot of capabilities, and I 100% agree.

I started looking at it, but the concern was the complexity of running it in a hybrid environment. We have some really old designs for how we provision things. Would those be compatible with the way you'd run Kubernetes in the cloud?

It turned out they weren't, and that was a concern for us. The other part is, when we started building this out to manage between 8,000 and 15,000 servers, we were going to get about 2 to 3 people to do that. How do we onboard folks?

For Nomad, we onboarded an Active Directory expert and guru, somebody who's never used Linux or containers, and now he's upgrading Nomad sites on the fly with no player downtime. To me, that's a big win.

How do we get all of these engineers in the company who don't have experience with containers or Linux to be able to use it? Can I do that with 2 people? Probably not.

Then, how do we support them once they're on the platform and explain job spec or pod spec files? We could write automation, and perhaps in the future we'll invest enough people to do it.

I really love Kubernetes. I'm running it in this room. I do a lot of fun stuff with it. Unfortunately, we can't be friends and hang out all the time. We do use Kubernetes for some specific workloads and cases, just not for generic container utilization.

The game server results: What did we get? This picture shows me and Victor, who works at Roblox, getting that first game start to happen. We couldn't believe it, and I sat in this room crying with excitement once we got it to happen.

What we gained: our deploys for new game services went from 45 minutes globally, with some failures, down to 10 minutes, working very seamlessly. That's awesome, right? That's 4.5 times faster.

It's even faster today, but let's just go with 10 minutes as a safe estimate.

Our game servers, which were running Windows, now have 2 times the amount of workloads on them. That's like getting free hardware.

We also migrated to 64-bit, enabling bigger services, bigger components, and far more addressable memory.

We now have everything containerized, so we can dynamically tune containers. Does the game need more resources? Does it need less? Do we need to clamp it down? We're in control of all of that now.

We have anywhere from 240 to 1,500 nodes per pod, or per site. This varies based on the day and what we're rebuilding.

We now build out sites in 2 days. What I love about that is it used to take weeks and weeks, and now we have that process streamlined across the organization.

We did the rollout 8 days in a row, skipping the weekend because we needed some rest. One day at a time, 8 sites came up. The initial project finished just about 2 years ago, on October 31, 2018. We're now up to 24 sites and have basically rebuilt all of them to optimize them even further.

And we're always sharpening the razor. How can we make the solution better? How can we work things out? That's what we're always looking to do, and I think that's fantastic.

To me, you can never automate enough, and you can never make things simple enough. I want a world where I drop off hardware via racks and, boom, within an hour, I can bring the site up. I'm still going to push everybody to get there; it'll just take some time.

What else did we gain out of all of this?

We now have HashiStacks all over the world, and what I love about it is that now we can run other stuff on top of them: non-game-server workloads that drive and use more compute.

We also have a design pattern of how to operate this stuff, so we're able to scale and grow the team. Now we're at 6, with a manager. We're able to scale up and have people that have a repeatable process to build and manage stuff.

On to Microservices

Now for the hard part, the journey to microservices.

I think everyone wants to be in a world where everything's CI/CD, instant deployment, instantly scaling, autoscaling, all this magic. The problem is, it's never really easy to do. So how are we moving that monolith over to microservices?

You have to plan this out. You have to understand how you can tie everything together. It's a very difficult journey for everybody. And any company, no matter what orchestrator it chooses, has to look at this.

But we have all these HashiStacks, so maybe we can take advantage of this as a forcing function to be able to move over services.

Are there challenges? Absolutely. I love to say orchestration is the perfect way to destroy your company. That's a scary and bold statement, but it's true. You're going to change everything: how the process works, how you develop software perhaps, how you do code check-ins, how you do builds, how you do push to production.

It's hard, right? And how do you get everybody aligned? When I started, we were 600 people. Now we're like 800 to 900 people. How do I scale that to 3,000, 4,000, 5,000, 6,000 people? It's tough to do. Even when you have expertise around the orchestrator, just getting people on board is hard.

We also need to be able to onboard external technologies, even the HashiStack. For so long, we were always building our own stuff, so it got a bit easier as we learned and as we grew.

But bringing HashiStack in, with command-line tools and flags and environment variables, it's weird for folks. And that's OK. We can help each other understand.

Lastly, there's shared services versus dedicated services. With a dedicated server, I deploy my 1 app, and I can go there and do whatever I want: install Spotify, stream music, the sky's the limit. With shared services, you get a tiny little slice of compute on the system, and things might act weird. It's a devastating journey for an organization to go through, but as you take advantage of it, you end up really winning.

This is something that's expected to go on through 2021 and probably beyond. Transforming the org doesn't involve just code or infrastructure; there's a lot to it.

Now with infrastructure: If you look at the history, we have these old applications from 20 years ago, thousands of lines of config, manual synchronization, random redundancy protocols like VRRP. But as you're moving toward orchestration or other services, it's about reusing and carving those elements out of your application and relying more on infrastructure.

From a Consul perspective, this allows us to do service discovery and get rid of those static host lists. I remember first joining, when we had a node go down and, boom, half the platform failed out of thousands of nodes. I'm like, "What's going on?" It was because the host lists were static. By integrating with Consul, we can dynamically update our configs, find the next node, find the closest one, all available to us.
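
As a sketch, a node can register a service with its local Consul agent using a definition like this (the service name, port, and health endpoint are illustrative), and consumers then resolve it by name instead of through a static host list:

    # Consul agent service definition (HCL); names are illustrative.
    service {
      name = "game-config"
      port = 9090

      check {
        http     = "http://localhost:9090/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

Anything that can do a DNS lookup can then find a healthy instance at game-config.service.consul.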

We're also able to do unification. We can run Consul on Windows, Linux, even Mac. We can run it in Nomad for your services, and we could run it in Kubernetes if we needed to. All of them now have that singular layer for how they find each other, which is amazing.

And then, from a key-value perspective, we can use Consul for configuration management: we store configuration there, make changes, and rapidly respond as those changes roll out.

From Vault, I'm going to hammer on that secrets management, because you want those secrets to be secure and easy. We have dynamic secrets, so I can change a key in Vault, update a certificate and, boom, it's automatically rolled out to the systems. With secrets management on game servers, it would sometimes take us a quarter to rotate secrets. We took that down to 10 minutes or less.
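
Here's a rough sketch of what that looks like in a Nomad task: the template pulls a secret from Vault and restarts the task when the secret rotates. The policy name and secret path are illustrative, and the path assumes a KV version 2 secrets engine:

    # Illustrative Nomad task stanza consuming a Vault secret.
    task "server" {
      driver = "docker"

      config {
        image = "example/game-server:1.0.0"
      }

      vault {
        policies = ["game-server"] # Vault policy granting read access
      }

      template {
        data = <<-EOT
          {{ with secret "secret/data/game-server/api" }}
          API_KEY={{ .Data.data.key }}
          {{ end }}
        EOT

        destination = "secrets/api.env"
        env         = true      # expose the rendered file as env vars
        change_mode = "restart" # restart the task when the secret rotates
      }
    }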

The same thing as well for dynamic accounts. Do you have a database? Do you know the password? Why do you even know it? Why not have Vault manage that for you and rotate that password out?

We do this in a lot of cases and it's really nice. We rotate through hundreds of users a month, and we never even need to worry about leakage because, by the time that password gets out, it's already gone.
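
As a sketch of that pattern using Terraform's Vault provider (the connection details, role name, and SQL statements are illustrative, not our production setup):

    # Vault dynamic database credentials, managed via Terraform.
    resource "vault_mount" "db" {
      path = "database"
      type = "database"
    }

    resource "vault_database_secret_backend_connection" "games" {
      backend       = vault_mount.db.path
      name          = "games-db"
      allowed_roles = ["app-readonly"]

      postgresql {
        connection_url = "postgresql://{{username}}:{{password}}@db.example.internal:5432/games"
      }
    }

    resource "vault_database_secret_backend_role" "app" {
      backend = vault_mount.db.path
      name    = "app-readonly"
      db_name = vault_database_secret_backend_connection.games.name

      creation_statements = [
        "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
        "GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";",
      ]

      default_ttl = 3600 # seconds; a leaked password is useless within the hour
    }

Each read of the role's credentials endpoint then mints a fresh, expiring database user, so no human ever needs to know the password.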

And lastly, PKI, or public key infrastructure, with SSL. We're able to generate our own root CAs and intermediate CAs, do signing, all of these things. It eliminates the need for a lot of expensive certificates for internal services and gives us the ability to dynamically sign them with very low TTLs.

If a certificate gets leaked, we don't care. It's already expired by the time you've seen it.
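
A sketch of that kind of setup, again with Terraform's Vault provider (the domain names and TTLs are illustrative):

    # Internal PKI in Vault: our own root CA plus a short-TTL issuing role.
    resource "vault_mount" "pki" {
      path                  = "pki"
      type                  = "pki"
      max_lease_ttl_seconds = 315360000 # 10 years, for the root CA itself
    }

    resource "vault_pki_secret_backend_root_cert" "root" {
      backend     = vault_mount.pki.path
      type        = "internal" # private key never leaves Vault
      common_name = "internal.example.com"
      ttl         = "87600h"
    }

    resource "vault_pki_secret_backend_role" "service" {
      backend          = vault_mount.pki.path
      name             = "internal-service"
      allowed_domains  = ["internal.example.com"]
      allow_subdomains = true
      max_ttl          = "24h" # a leaked cert expires before it matters
    }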

And back to Nomad, which orchestrates our tasks so we can deliver applications rapidly. It's very fast at scheduling and delivering workloads.

We have dynamic templating using Consul and Consul Template, which is integrated into Nomad. Update a key value or a secret and, boom, the config is already re-rendered and the application is ready to use it.
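
For example, a Nomad template stanza like this one (the key path and config format are illustrative) re-renders and signals the task whenever the Consul key changes:

    # Nomad template stanza driven by a Consul KV entry.
    template {
      data = <<-EOT
        max_players = {{ key "config/game-server/max-players" }}
      EOT

      destination   = "local/game.conf"
      change_mode   = "signal" # tell the app to reload instead of restarting
      change_signal = "SIGHUP"
    }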

And faster deploys. By doing this with microservices versus our old ways, we're 5 to 10 times faster, and that's amazing. We minimize the effort it takes to build and maintain a microservice, minimizing code, but also making shipping much quicker.

I love this, because we want to focus on delivering player features and capabilities; that's what really matters.

What about the rest of the environment? Let's take a look at HashiStack for us. We grew into the Enterprise product space. What I love about HashiCorp is that they have a really good open-source software offering that you can use out of the gate. It's free, you can get started now, and you even have all the code to modify it, do pull requests, all of those things.

But you'll potentially get to a point where you need to grow into some different services or capabilities. One example feature is namespaces: the ability to carve each product, whether it's Consul, Nomad, or Vault, into namespaces and say what you can and can't do or what you can and can't access.

This way, you could have a multi-tenant environment. Maybe you don't need that. Maybe you'd rather run 50 instances of Vault. The option is up to you.
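
As a small sketch, with Terraform's Nomad provider, carving out a tenant namespace is a single resource (the team name here is illustrative):

    # Carving a Nomad Enterprise cluster into a tenant namespace.
    resource "nomad_namespace" "telemetry" {
      name        = "telemetry"
      description = "Jobs owned by the telemetry team"
    }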

I love that, because you can still use most of the great features. Some of the more advanced stuff is locked under Enterprise from a support costing perspective, and I totally get it. We want best practices, we want people to help us. We want to be able to eliminate any sort of confusion. If we're able to staff up, maybe it's something that's less important to us.

And being able to take advantage of support immediately was great for us. It minimized worries, and we could have them on the phone if there's an outage or an upgrade. Generally, it's gone really well for us.

Then there's the Sentinel feature: very deep policies that you can apply inside the products.

For example, say you submit a job to Nomad with the word "shoe" in it, and I don't like that word. I can find that in your job spec and have a policy that says, "No, you can't use shoe here." That eliminates a lot of challenges where people do the wrong thing, whether on purpose or accidentally.
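
A toy policy in the Sentinel language along those lines might look like this; it's illustrative, not one of our real policies, so treat the field name and operator as assumptions:

    # Sentinel policy for Nomad: fail any job whose ID mentions "shoe".
    main = rule {
      job.id not matches "(?i)shoe"
    }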

Sentinel is available across all of the products; I just use Nomad as an example. Our Hashi team is always one phone call away to help us no matter what we need, and they're always really supportive.

As we deploy HashiStacks and site after site comes up, every one enables us to take advantage of our compute in a more unique and more dynamic way.

What I love about it is that we're able to take the compute that we're buying as hardware. Maybe that server costs $3,000 to $5,000, but I want to use it as well as possible.

If I buy a car, it's going to sit in my garage 99% of the time. But if we look at compute, how can I make each server spend 80% of its time doing useful work? That's why I think this is powerful.

As we've spread this all over the world, we have that repeatable design pattern to run our workloads and capabilities no matter where we go. Now we're to the point of just bending the HashiStack to our will and expanding it to do what we want.

Expanded Windows support is important because we love containers, but not everything can be a container. We wrote a driver to run IIS-based workloads on top of Windows inside of Nomad, and guess what? We've open-sourced it. You can take advantage of it today.
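
As a rough sketch of what a job using that driver looks like (the driver name and config fields here are from memory, so check the project's README for the current schema):

    # Running an IIS website under the open-source Windows IIS driver.
    task "legacy-web" {
      driver = "win_iis" # assumed driver name; see the README

      config {
        # Path to the website's content on the Windows host (illustrative).
        path = "C:\\inetpub\\wwwroot\\legacy-web"
      }

      resources {
        cpu    = 1000 # MHz
        memory = 512  # MB
      }
    }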

We've also extended Nomad to support containerd. Normally you think of containers as aligned with Docker. Docker has been great to get us where we are, but it can be a bloated daemon that makes very difficult decisions about how it operates, its state, all of that.

We hired one of the best in the industry to help bring us containerd support. This eliminates the need for Docker and focuses on using containers as they are, in a model very similar to Kubernetes.

We've even added seccomp profiles, right around when Kubernetes provided them. Download that and you can start using it today too.

We also want to run our metal systems like a cloud, so we use Terraform to provision them. We use a service called MAAS, or Metal as a Service, from Canonical, which is also open source and free to download. With our Terraform provider, we can image servers just as if we were running in the cloud. That's awesome, because instead of deploying those images one at a time or by hand, boom, we hit Terraform apply.
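
Here's a sketch of what that looks like (the resource and attribute names are illustrative; check the provider's documentation for the real schema):

    # Allocating and deploying a bare-metal machine through MAAS.
    resource "maas_instance" "game_server" {
      allocate_params {
        min_cpu_count = 32
        min_memory    = 65536 # MB
      }

      deploy_params {
        distro_series = "focal" # Ubuntu 20.04
      }
    }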

And now everything can be deployed through Terraform Enterprise. Download the provider and get started on your own metal infrastructure.

And lastly, we wanted a little bit more detail out of Nomad and what it could provide. So we created this cool tool called Nurd.

It gives us more detailed utilization data for different applications, and it has a very simple interface that you can plug right into Grafana. Go ahead and get started with that as well.

When you're going through a big journey like I've had for the last 3 years, it's really risky when you start. You don't want to be wrong, you don't want to be fired, you don't want to let people down. But trust the Hashi process.

You can start very small with 1 application. Maybe it's Vault to manage simple secrets or generating certs. You could add in Nomad, add in Kubernetes, add in Consul, whatever you want to do so you can build your own world that you believe in with these various tools that are really well supported by the community and really well supported in cloud environments.

It may seem scary at first, but if you invest the time, if you invest the patience, if you get involved in the community, you can truly build something beautiful to power your workloads. That way you can enable your customers to work better.

This has been an amazing process, an amazing project, and I'm very proud of the work that I and the team have put into this. I'm very happy to see every time we have a weekend and I don't hear about what happens because the players just come and go and the environment keeps on running.

My advice is, if you're interested, get started with one of these products, get involved in the community, and start attending webinars. There are a lot of options for getting going, or you can use HashiCorp's great learning portal to start learning.

You can also take your workloads, no matter where they are, bring them into Nomad, and get service discovery. Trust that Hashi process and just pick the entry point where you want to begin.

Thank you for listening to my presentation. I hope you enjoy HashiConf.
