Cost reduction, risk mitigation, speed: Please choose 3 with Vault & Boundary
Hear how the platform team lead at dx.one carved a path for zero trust security at his company with HashiCorp Vault and Boundary, and see how the positive results came with few compromises to speed, cost, or risk.
» Transcript
Now, you're totally here to see me and not just sitting here because Armon is next. But anyway, I stand between you and beers and everything else. So, let's get this shindig on the road and talk about cost reduction, mitigation, and speed.
Usually, you have those triangles where you get to choose two. Why don't you choose three and start with that? Any guesses? There are a bunch of quizzes during this talk. Any guesses what this is? A password — that, in the not-too-distant past, I found on a production server accessible from the internet. I cannot prove it, but I'm pretty sure that 2001 is the year the password was set. This is the world that I used to live in, but we can do better. So, this is what we're here to share.
» What we do at dx.one
You've probably never heard of the company. If you ever bought a Volkswagen, an Audi, or a commercial vehicle, we know where you live, because we are the data processor for Volkswagen's retail partners. We get all the data from the dealerships (who's buying cars, who's bringing their cars in for service), prepare it, and make it ready for marketing and product development.
Makes sense to know what kind of people buy what type of cars if you're building new cars every now and then. We have this wonderful topic of GDPR, DSGVO in German, meaning we process the data and let the rest of the Volkswagen group know who you can call and send emails to. That is a big part of what we do.
I run the platform engineering team at dx.one. For the last 25 years, my mom has been telling her friends, "The boy does something with computers." I know Sebastian from my time at Google. I worked at AWS. So, this whole software cloud infrastructure thing used to be my gig. Now, I switched to the customer side and get to play with all the fun toys that you normally only get to talk about.
» How did we get here?
Why am I here talking about this? To walk you through a little bit of the journey (we used that term before), leaning more towards how we actually did the implementation and less towards strategy. I have smarter people to talk about that.
So, really, how did we get started on our journey away from passwords like that one? Not that that password was in our datacenter, but how do we make that journey?
I also talked about a quiz. I have a few pictures in there. Took them all myself. So can anybody point out what this is? You get points and a free beer from me at the party afterwards. Anyone know? You should know. You don't? First Google server. Home-built. Anyway, I digress.
When I took over this job, I found some unexpected challenges in the team setup. Again, coming from a modern environment, working with cloud providers, suddenly I am working with a very diverse team.
Normally, diversity is a good thing. We cherish diversity. We encourage it. But diversity also applies to technology skills — to maturity levels of the team's understanding of what a secure environment looks like.
» How do you build secure applications?
Together with me, we had a bit of a change in leadership. As predominantly happens in the automotive industry, people coming from a mechanical engineering background go into leadership positions. Suddenly we have a team that's coming from financial services — from technology, like myself. We started to look at how we can rebuild a few things that we used to take for granted — and do differently. Today I'm here to talk about a few things like access to privileged resources, for example. And, the compelling event of: why change something? Because people tend to not like change, in many cases.
As is probably true for many of you, we needed to get out of the datacenter. Then the question is always, where do I go from there? Cloud providers are an obvious choice. Hyperscalers. We did the same thing. So, we're currently starting to build some initial applications, cloud-native on AWS. And now our existing tech stack is moving over from the datacenter in Berlin to the cloud. To set the scene, this is where we're coming from. Probably not too different from many of you in your day-to-day work.
» Some of the challenges we faced in the existing environment
We decided we wanted to do this differently going forward because of connections. Connections everywhere. You have a bunch of servers and a bunch of systems in your environment, and they're all connected somehow. And it's not just the connections between them: people need to access them too.
Over a period of 15, 20 years of running a datacenter, you suddenly end up with an environment that is, on one side, fragmented because everybody seems to be doing their own thing. But then totally connected because over time this system needed to talk to that system, that person needed to access that server or database. So you end up with a very traditional datacenter infrastructure with a traditional perimeter.
There's the firewall around it. VPN access for developers. We move a lot of data, being a data processor. A lot of people work with virtual desktops that sit right next to our database servers and data warehouses just to be able to move data around very quickly.
But you end up with a bunch of long-lived credentials on top of that, growing over time, multiple sources of identity. VPN access is different from database access, which is different from access to the system itself. Way too many firewalls with way too many exceptions.
Of course, you have a skills gap in the team that is used to running things in your own datacenter. That brings with it a question of: do I upskill my team to do something like this that I can easily outsource to a cloud provider? While we're a data processor, we don't derive business value from being able to run Postgres. It's not something that brings us value. Frankly, there are other people that can do it better than we can.
No takers on this? This is a CM-5 Connection Machine. You've seen it in Jurassic Park. Crazy, crucial role in that movie.
» How do we audit our environment?
When I took over, I also needed to find a solution for how we audit all of this. We are dealing with customer data. We have way over a hundred million customer records and other types of sensitive information. In an environment like that, auditing by hand, with the traditional compliance approach of "here's an Excel sheet, please fill it out and state that you are secure and compliant," is probably not the best or most efficient way.
We had this fragmented way of looking at access, credentials, and compliance across the different data sources. It used to be a manual process of asking: who has access to what? Are they supposed to have that access? And on the other side, we still have to prove to the organization and to external regulators that we are in compliance with what we're supposed to do, because we're dealing with customer data.
This is one of the reasons I was brought in. You’ve probably seen all that, been in that situation. It's broken, go fix it. There's a million things you can look at when you're in a situation like that.
» Where do I start?
What do I fix? What can I fix more easily than other things, like culture change, or like moving a team that was able to access pretty much any system at any given time, because that's what they're used to? A modern DevOps, platform-engineering-based approach is probably a long-term goal. So, there has to be some lower-hanging fruit. And, among the many things we're working on, one is rethinking how we access critical resources.
Any takers on that? Very few people remember where they were when the world changed. I was at the Apple Worldwide Developer Conference in 2008 when they introduced a way to build applications on the iPhone. Completely changed everything.
I think this whole zero trust approach, in my little world of running IT, is a similar fundamental shift. We're moving away from the traditional sense of "I have a secure perimeter and then a soft underbelly within my datacenter," meaning if I'm within the firewall, I can do whatever I want.
Moving from there to an environment where, basically, every time something or someone accesses another resource — be it a database, server, or application — we always tie that to identity. Not just traditionally meaning you have credentials, you get access. But have that combination of you have credentials, you have access — that's great. But are you allowed to use those credentials in the current environment and at this current time?
Moving away from here's a username and password — or here's a username, password and token — towards great that you have those, but are you really the person you claim you are or the machine you claim you are? And are you supposed to do what you're doing right now at this point? And, can we prove and validate whatever you did afterward?
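The questions above (are you who you claim to be, are you allowed to do this from here and right now, and can we validate it afterwards) can be sketched as a small access decision. This is only a minimal illustration in plain Python, not how Vault, Boundary, or any identity provider implements it. The names `AccessRequest`, `POLICY`, `decide`, `access`, and the example user, resource, zone, and hours are all assumptions made up for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class AccessRequest:
    identity: str            # who or what is asking (human or machine)
    identity_verified: bool  # did the identity provider confirm the claim?
    resource: str            # what they want to reach
    network_zone: str        # the environment the request comes from
    at: datetime             # when the request is made


# Hypothetical policy: which identity may reach which resource, from where, and when.
POLICY = {
    ("alice", "orders-db"): {"zones": {"corp-workspace"}, "hours": (time(8), time(18))},
}

# Every decision is recorded so it can be validated afterwards.
AUDIT_LOG = []


def decide(req):
    """Return (allowed, reason). Holding credentials alone is never enough."""
    if not req.identity_verified:
        return False, "identity not verified"
    rule = POLICY.get((req.identity, req.resource))
    if rule is None:
        return False, "no rule grants this access"
    if req.network_zone not in rule["zones"]:
        return False, "not allowed from this environment"
    start, end = rule["hours"]
    if not (start <= req.at.time() <= end):
        return False, "not allowed at this time"
    return True, "ok"


def access(req):
    """Evaluate the request and log the outcome, whether allowed or denied."""
    allowed, reason = decide(req)
    AUDIT_LOG.append((req.at.isoformat(), req.identity, req.resource, allowed, reason))
    return allowed
```

The point of the sketch is the ordering: identity is checked first, then context (environment and time), and the decision is logged either way.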
Finally, trust no one, except maybe three parties: our identity provider within the Volkswagen group; AWS, because that's our cloud provider of choice (and if you don't trust your cloud provider, maybe you shouldn't move to the cloud); and HCP.
We decided to run all of our critical infrastructure from HashiCorp on the HashiCorp Cloud Platform, and we're super happy with it. There you go. Yes. Are we happy? Good.
» What are our tenets?
Not tenants, tenets. Those were the big things that we established that we are trying to communicate clearly to the team and everybody who works for us.
» Access is always temporary
What has changed or what is changing in regard to access? Access is always temporary. I don't want any passwords that have a year in them from when I was a lot younger and prettier and thinner than I am now. Wherever possible (there are always exceptions), we don't want to grant long-lived credentials to either resources or complete environments. That includes database servers, AWS accounts, and so on.
» We assume that no network is secure
We're probably preaching to the choir here. This is not something that should shock anyone. But for many people coming from a traditional datacenter environment, this is a big shift. You're thinking, well, it's my datacenter — it's super secure. I can treat things running in the datacenter differently to what's out there on the big bad internet. We don't make that assumption. We assume that no network is secure. There are always bad people out there, and we treat them accordingly.
» Always show ID
As I mentioned, everything that anyone ever does is tied to identity. Identity can be a virtual machine with an instance profile, if we have machine-to-machine communication. Or, when somebody (a human) accesses something, that access always needs to be validated and assigned to an ID in our identity provider.
» How did we do it?
» HCP Vault
How did we make that jump from "here's a username and password to a database server that has all of our information" to "how can we be more sensible about that?" We had a HashiCorp conference. We implemented two solutions.
We're working with Vault, an obvious one. Everybody loves Vault. We manage all of our privileged credentials in Vault, and we make them dynamic whenever possible. That's the notion that you don't get a username and password on the database that lives indefinitely, for as long as you are with the company and often beyond.
» HCP Boundary
But the moment you connect to a database server, a new username and password is created for you. You do your work. We audit your work. And, the minute you log off from that server, we delete your credentials. We use basic Vault functionality, but together with Boundary, to allow people to go from their workspaces, on our network or in the VPN, to the machines they need to connect to.
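The lifecycle described here (a fresh credential per connection, with deletion at log-off) can be modeled in a few lines. This is a toy in-memory stand-in, assuming a time-to-live as a backstop in case log-off never happens. It does not call Vault, and the class name `DynamicCredentialStore`, its methods, and the `v-` username prefix are hypothetical.

```python
import secrets
import time


class DynamicCredentialStore:
    """Toy stand-in for a secrets engine that issues per-session database users."""

    def __init__(self, default_ttl_seconds=3600.0):
        self.default_ttl = default_ttl_seconds
        self._leases = {}  # username -> (password, expiry timestamp)

    def issue(self, identity, now=None):
        """Create a fresh username/password tied to the caller's identity."""
        now = time.time() if now is None else now
        username = f"v-{identity}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(24)
        self._leases[username] = (password, now + self.default_ttl)
        return username, password

    def is_valid(self, username, password, now=None):
        """A credential works only if it exists, matches, and has not expired."""
        now = time.time() if now is None else now
        lease = self._leases.get(username)
        if lease is None:
            return False
        stored_password, expiry = lease
        return now < expiry and secrets.compare_digest(stored_password, password)

    def revoke(self, username):
        """Delete the credential, for example when the user logs off."""
        self._leases.pop(username, None)
```

Because every call to `issue` produces a new username, a leaked credential is tied to one identity and one session, and it stops working after `revoke` or after the time-to-live runs out, whichever comes first.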
So, this is what we started with: every time anyone wants to access a production resource, they go through this. It's actually my recommendation to you. Don't start by enforcing zero trust across everything. That's an end goal, something to journey towards.
Somebody talked about developer experience before. You want to make sure this is something that is accepted by the people who are supposed to use it. Don't be too rigorous and draconian and say you don’t get to connect to any system anymore. But rather gather some experience, see what works in the organization or doesn't, and iterate from there.
That combination with Boundary moves us away from the notion of "I have to poke a hole through a firewall for an indefinite amount of time so somebody can connect to a machine." Instead, the moment you need to connect to that database server, you connect. Boundary opens up the connection to the server or database you need and injects dynamic credentials, and once you're done, we tear down the whole connection. That brings us a wide range of benefits, which I'll talk about in a little bit.
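The "open on demand, inject credentials, tear down afterwards" pattern maps naturally onto a context manager. The sketch below is plain Python and only models the idea. It is not Boundary's API, and `brokered_session`, `open_paths`, `live_credentials`, and `audit_trail` are hypothetical names introduced for illustration.

```python
import secrets
from contextlib import contextmanager

# Hypothetical state: network paths that are currently open, and live credentials.
open_paths = set()      # (user, target) pairs
live_credentials = {}   # username -> password
audit_trail = []        # ("open" | "close", user, target)


@contextmanager
def brokered_session(user, target):
    """Open a path to `target` only for the duration of the `with` block."""
    username = f"tmp-{user}-{secrets.token_hex(4)}"
    live_credentials[username] = secrets.token_urlsafe(16)
    open_paths.add((user, target))
    audit_trail.append(("open", user, target))
    try:
        # The caller receives injected credentials instead of a standing password.
        yield username, live_credentials[username]
    finally:
        # Teardown runs even if the work inside the block raises.
        open_paths.discard((user, target))
        live_credentials.pop(username, None)
        audit_trail.append(("close", user, target))
```

A caller would write `with brokered_session("alice", "orders-db") as (name, password): ...`. Once the block exits, no path and no credential remain, so there is no standing firewall exception left to audit.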
» Where are we now?
Before I get to that, the one thing to keep in mind is that nobody's asking for this. Even the people who ask for it don't really know why they should. The phrase I heard is, if we needed this, we would've done it already.
It was actually in German. This whole idea of zero trust, not just from an end-user perspective but also for access to infrastructure, is very new to very many people. That's especially true on the operator side, for many seasoned engineers or datacenter people. For them, it's this weird pride thing: "I'm the one who has access to production. I'm very proud of it, and you're not taking that away from me." Yes, I am. But let's talk about this.
» Challenges
» Manage operator acceptance
Make sure that you are working on change management. You're educating people about why we are doing this and what the benefits are. Because, as a technologist, this sh*t is really cool.
But make sure this is not just a case of "here's the new technology and you have to use it." It takes a certain amount of change management to move people who are used to working differently over to this.
Not necessarily because they abused their privileges, but because what I like to call well-meant disasters happen more often than you think. "Oh, I thought I dropped the table on the development environment. No, it was the production environment. Sorry. Same username." Happens all the time. So, be sure of that.
» Prepare your security team and CISO
Interestingly, for me, the bigger discussions happened with the security team because, as I mentioned, it's a new concept. I remember the first time presenting what we call "firefighter access" in our production environments. This means when somebody needs to access production, which is not something we grant by default (it's always an exception), they have to fill out a ticket. Under the four-eyes principle, it needs to be approved by two people. They do their work. We actually use the audit logging both on AWS and in Boundary to track what that person did.
So, we use CloudTrail and the SSH recording capabilities in Boundary to really see that they ran these commands and made these changes to the AWS environment. Afterwards, we look over the commands that the person ran and map them to the NIST SP 800 catalog. Did they do something that would breach any controls or make changes to any controls? For example, did somebody turn off encryption for an S3 bucket? Just a simple example.
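The firefighter flow has two checkable parts: a ticket that needs two approvals, and a review of recorded activity afterwards. Below is a minimal sketch of both, under some loudly stated assumptions. The rule that a requester cannot approve their own ticket is an assumption, not something stated in the talk. The class `FirefighterTicket`, the function `review`, and the control labels are hypothetical. The event names are used as illustrative examples of recorded AWS API calls, and the labels are placeholders rather than real catalog identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class FirefighterTicket:
    requester: str
    reason: str
    approvers: set = field(default_factory=set)

    def approve(self, approver):
        # Assumption: the requester cannot approve their own ticket.
        if approver == self.requester:
            raise ValueError("requester cannot approve their own ticket")
        self.approvers.add(approver)

    @property
    def approved(self):
        # Two distinct approvers are required; repeat approvals do not count twice.
        return len(self.approvers) >= 2


# Hypothetical mapping from recorded event names to the control they may affect.
# The control labels are placeholders, not real catalog identifiers.
CONTROL_RELEVANT_EVENTS = {
    "DeleteBucketEncryption": "encryption-at-rest",
    "StopLogging": "audit-logging",
}


def review(recorded_events):
    """Return (event, control) pairs that a human should look at after the session."""
    return [
        (event, CONTROL_RELEVANT_EVENTS[event])
        for event in recorded_events
        if event in CONTROL_RELEVANT_EVENTS
    ]
```

For example, `review(["ListBuckets", "DeleteBucketEncryption"])` flags only the second event. The real process described in the talk works on CloudTrail logs and Boundary session recordings, so the input here is a deliberately simplified list of event names.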
That is so far beyond what many security teams are used to and can deal with that you really need to get your head around it. I was super excited and like, "This is more advanced than any of your colleagues can appreciate within the organization." The only reaction I got from our security team when we outlined this process was, "But that four-eyes principle, those are two people in your department. So, this isn't secure at all." Be prepared for some discussions, and for a lot of enablement to help people understand what this concept is all about and what changes.
» Initial results
What does it bring to the organization? From our perspective, there are three obvious ones:
» Cost reduction
We're able to do away with a lot of VPN access. It used to be that people needed to connect to our environment — to our datacenter — via VPN because at some point they might need to access some database or work on some server. The same goes for virtual desktops.
If we can provision access dynamically and securely and tie it to identity, we don't need long-lived working environments within our environment. The cost reduction for us, fingers crossed, might be enough to offset the cost we have for the HashiCorp Cloud Platform anyway.
Not having to completely rework our network infrastructure in an existing datacenter, and instead moving things to the cloud and relying on Boundary and Vault to provide secure access, significantly reduced our project scope.
» Risk mitigation
What's the benefit? Much simpler firewall rules! Oh my God, just being able to reduce the number of security group rules that you would normally need to make sure that everybody gets from point A to point B as they need to is something that makes my life a lot easier because it's a lot less to audit.
And also, just running the infrastructure. For us, that's a benefit of running on HCP — on the cloud environment — because we don't need to run things like Vault ourselves. I mean, I love the product, but if you're running Vault yourself, I have two comments: Either I want to hire you, or I think you're insane. It's just not for the faint of heart to run this critical infrastructure yourself. So, being able to outsource that to people who know what they're doing worked well for us, and we can actually put a number to that — what that saved us.
» Speed
We're able to take a lot of teams out of the loop when it comes to provisioning access to a new environment. And we're able to tie the question of who gets what kind of access to which environment into our regular deployment processes. As part of our CI/CD pipeline, we roll out the Boundary configuration and create the dynamic credentials in Vault. It's not a segmented element; it works across different teams. For example, taking the team that used to work on desktop firewall rules out of the loop reduces the effort that team needs to spend on security aspects, quite dramatically, as we found out.
» Summing up
There are some really cool things you can try out and work towards by just rethinking how people and users access the environment. What happens? How do you set up a situation where the metal meets the meat? We're moving away from traditional static boundaries, pun intended, to a very strongly segmented environment where every team has four tenants (four different environments) that have no connectivity in between. But still, we're able to have a very manageable, very simple set of firewall rules that allows them access, because we can centralize things around Boundary infrastructure rather than having to think about where all these different people come in from.
» It’s all worth it
We Germans tend to see the problems where everyone else sees potential. It's funny because it's true. There are problems. This is a very new concept for many organizations. But it's totally worth it because, from an operational perspective, this removes so many headaches and heartaches that you normally have.
When I learned about Boundary for the first time two years ago-ish, I wanted to use this because this is what I've always wanted — in the sense of how I work with access to infrastructure.
And for the young folks, what's this? Anyone? Ether killer, 220 volts on one side, Cat-5 on the other? On your last day at work, plug it into a timer, leave it, and ping the external router of your network. Be happy once the pings stop coming back.
Kidding. But you never know what's going to happen. So, this is why I love this cloud computing stuff because that is one less problem I need to worry about. It's not my biggest one, but you never know. Anyway, thank you guys so much. We'll talk soon.