Non-technical challenges of platform engineering
This talk surveys the non-technical aspects of building infrastructure platforms, including product thinking, service design, and release planning techniques.
Building successful infrastructure platforms is not just about infrastructure as code or using the latest technologies and tooling. Poppy and Chris will draw on their experiences of creating successful infra platforms to talk about why many technically awesome infra platforms fail. This will be a session about all of the non-technical aspects of building infrastructure platforms, from strategy, product thinking, service design, and release planning, to team ways of working.
» Transcript
Chris Shepherd:
This is a treat, isn't it, on a beautiful day in Amsterdam and looking at a big room of very beautiful-looking developers and people. This is lovely. So welcome to our talk today. We're going to talk to you about some of the non-technical challenges of platform engineering.
Poppy Rowse:
Awesome. My name's Poppy Rowse. I'm a business analyst working at ThoughtWorks for about four years. And to be honest, I don't really know anything about infrastructure. I'm not technical at all, but I've been lucky enough to work on a few different clients on infrastructure and platform products. So I'm going to teach you some of the things that I learned along the way.
Chris:
And my name's Chris Shepherd, I'm a tech lead at ThoughtWorks. I've done a bunch of infrastructure, as I'm probably guessing the majority of the people in the room have, and we're finding that it is this ubiquitous inescapable thing, right? And so it's a good idea to try and do infrastructure properly.
Poppy:
Just as a reminder, when we're talking about infrastructure, we mean all the stuff that devs need to build great products. So security, scalability, performance, all of that good stuff that you need to build all the good stuff on top.
Chris:
Infrastructures are made up of these building blocks, right? And out of these building blocks, you can build more sophisticated platform products and things that have reuse across the organization. But what you tend to see is development teams, infrastructure teams, building the same things again and again, to achieve the same results. And it's this wasted effort. So this is where we start talking about platform teams, building these more elaborate components out of smaller building blocks, but doing them in one consistent way, basically, and then owning and running that platform.
Poppy:
And one of the things our talk is about is people. There are so many great practices out there. We have Agile, we have all the great ways of working, but for some reason, when people are building platform products, they just forget all of that good stuff. So hopefully there's nothing new in this presentation, but we are more going to remind you of the good stuff that you should be doing when you are building infrastructure or platform products.
Chris:
So we have four principles for you today. These are not a silver bullet or anything that you need to follow in a sequential order. But we think that they're useful things that you can take away with you, and they might help you to make better platform products in an easier way. So sit back and relax, put your seats in the upright position and your trays forward and sip your complimentary G and T whilst we take you on a crazy world of building platform products.
Poppy:
Awesome. So this is a real-life quote that we heard before. “Datadog is cool. This should be our next platform product!” But what's wrong with this statement? Actually, in this case, no one asked the users if this was really needed.
» Talk to Your Users
In this presentation, when we talk about users, we mean developers. We mean the teams who are consuming your products. So it could be that people are struggling to even get into production, and they don't need a fancy monitoring tool yet. We actually need to talk to our consumers and figure out what's most important to them. How do we do this? We do a discovery [also known as scoping phase]. We do discoveries in our regular products, so we need to do our discovery in our platform products as well. A discovery is basically a fixed period of time where we actually go out and figure out what our business needs and what our users need, and then we use that to prioritize the most important thing.
So, talk to your users. That's the number one important thing to take away from today: actually talk to the people who are going to be using your products, because it's the most important thing to understand — not just what's useful for them, but also what are the real pain points for them right now. Find out what's holding them back from delivering awesome products, because at the end of the day, that's the purpose. That's why we're all here, right? We're enabling different teams to build awesome products for our organizations and fulfilling our business needs.
My favorite way of doing this is a method called “event storming.” Event storming was created by a guy called Alberto Brandolini, and basically it's a way of storming a lot of information on a wall, be it a physical wall or a Miro board or whatever your favorite virtual tool is.
I think for a platform product, it's nice to do from the start of a project to live-in-production, or maybe even a bit further than live-in-production, to day-to-day running. And what you want to do is storm everything that happens. So storm all the things that the teams need to get sorted. You want to storm all the governance and process, literally every activity that happens.
Then on top of that, you want to put all the different pain points. This gives you a beautiful visual, like, heat map of, “Okay, we can see the areas where there's a lot of pain points going on. This is where we maybe need to focus some of our efforts.”
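If the sticky notes are transcribed after the workshop, that heat map can be approximated with a simple tally. This is only a minimal sketch: it assumes each pain point was tagged with the stage of the journey it was stuck next to, and the stage names and notes below are invented for illustration.

```python
from collections import Counter

# Hypothetical transcription of sticky notes: (stage, pain point).
# Stage names and pain points are made up for illustration.
pain_points = [
    ("request environment", "ticket takes two weeks"),
    ("request environment", "unclear who approves"),
    ("security review", "form has to be filled in by hand"),
    ("deploy to production", "manual firewall change"),
    ("request environment", "no self-service option"),
]


def heat_map(notes):
    """Count pain points per stage, hottest stage first."""
    counts = Counter(stage for stage, _ in notes)
    return counts.most_common()


# Print a crude text heat map: one '#' per pain point.
for stage, count in heat_map(pain_points):
    print(f"{stage:<22} {'#' * count}")
```

The stage with the longest bar is a candidate for where to focus first. The tally says nothing about how severe each pain point is, so it is a starting point for conversation rather than a decision.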
One of the things is, don't assume all your users are the same. I know your users are developers, and I know a lot of you are developers, so you think — okay, I already know what's the best thing and what's most appropriate here. But actually not all of the users are the same.
» Comparing Team Needs
One of the things when you're doing your event storming or even just talking to users and doing interviews is, get a wide range of people in there. So talk to different teams, talk to your teams that have been running for five years, talk to your new teams who are spinning up brand-new products from scratch. Talk to your people who are brand new to the organization, talk to the people who've been there for years, and just get that broad range. Get them all in the same room, do the event storming with them all at once.
Or you can do multiple event stormings and then compare those maps and see, okay, what's the actual differences here?
Are some people already doing things? Awesome. And other people have highlighted that as a pain point? Maybe you can just share learnings there and you don't need to necessarily create something brand-new for that.
So you've got different users, just to reiterate this point. You might have some people who are super experts on infrastructure. They do their own infrastructure as code, they’re super experienced, and they're like, "Why should I use your platform infrastructure product? I can just build my own and it'll be so much better." And then you've got other users who are like, "No one on my team has experience. They're all front-end developers, and they're really frustrated at the process and don't necessarily have the skills or capabilities to deliver infrastructure stuff themselves and spin up things themselves."
In this case you want to do personas and really be purposeful about — okay, what are my different customers here? What different kinds of people are going to be consuming my infrastructure products, and do they have different needs and do they have different requirements?
At the end of the day, as well, particularly if you're in a bigger organization, you need to accept that you might not be able to build something that's suitable for everyone. So you need to be really purposeful about the scope and who you decide to be your customer. It might be that you've got a load of legacy stuff and actually it's a bit too mangled and awful to migrate over, so actually you're going to focus on building infrastructure products that can really accelerate the delivery of your newer products. Or it might be that actually you've got a handful of products being built and they're the top priority in your organization, so you're really going to narrow in on those ones and make sure that you're supporting the delivery of those products.
» Keep Talking
One of the things I really want you to take away as well is, you don't just want to talk to your users, your developers, your customers, in this discovery period before you've actually built anything yet. Talk to them all the time, talk to them through your development process, and obviously talk to them once you finish building to get them to use your stuff. Because the number of times we find people do all the right things, they do the discovery, but then during the build process they start making assumptions about what different people and what different teams need. In the end they've built this awesome thing. But the team's like, "Oh, actually that won't work for us." It's so disappointing because you spent six months on this thing. So keep talking to the different teams that are going to use it as well.
Don't assume that people are just going to come because you've built this fancy thing and you've spent loads of time on it.
You really need to get that, like, marketing and investment from the different teams and your customers. The more you engage with them to understand their needs, the more they'll be on board and be like, "Oh, we actually know what you're building. We know what's coming up."
This is all the points that I've just gone over. This is the slide to take a picture of. We've got this at the end of every section. So do a discovery, make sure you understand your users, and just do this investigation all the way through your development process.
Chris:
Thanks, Poppy. One of the other things that we hear on the ground quite a lot as well is, “Where is this platform going? I don't understand what it is that you're actually building here.” A shared understanding of what it is that you're building is super important within your team, but so is communicating that out to your wider stakeholders: the architectural review board who are going to be giving sign-off on the thing you're building, and your actual customers as well. Being able to communicate to these people is super important too.
» Don’t Under-Communicate
And what could have led to that quote having been made is simply that the roadmap hadn't been communicated. When we talk about this, we talk about “under-communicating”: not letting your documents and your diagrams get out, circulate, and be socialized in the wild, but also communicating the wrong things, inaccurate artifacts, or things that don't actually communicate anything useful or important.
On the other hand, you've got people asking, “Who the hell even built this? How does it even work?” And this is the concept of — as much as you need to communicate what it is that you're building, to communicate out that technical vision, being able to communicate out to engineers in the future who've maybe adopted or taken ownership of the platform that you've built is also super important. So we're going to talk a little bit about those two perspectives here.
Fundamentally platform teams should be longer-lived than product teams, arguably, because you're building stuff on which other teams are going to be running their stuff, right? But folks leave, ownership changes, organizations shift, and when something goes wrong, you don't want to be in a horrible situation where you've got this thing, you don't understand how it works, and you're being called out at 4 AM to try and fix it, right? Trying to figure out why something's broken shouldn't be horribly opaque.
So understanding that “architectaeology” is a thing. Sometimes when you are on the ground, working with software, you're as much of an archaeologist, with your big magnifying glass out trying to figure why this thing isn't working and decompiling things at two in the morning, as you are an architect actually trying to build your software.
» Build a Technical Vision
So point number two is, build a technical vision. Again, this is forward-looking, but also backward-looking, to help engineers in the future understand the decisions that have been made and the thing that you're building. Now, this doesn't have to be some high-fidelity masterpiece, right? This is — write it on the back of a cigarette packet or a beer mat or something, or a whiteboard or a mural or whatever it is that you want to use. Simply just starting the process of trying to document these things is a really good spring point.
So one of the techniques to discuss today is the C4 diagram (not the Citroën C4 you see here). This was created by Simon Brown a little more than a decade ago. And it's still considered to be a really good way of documenting architecture at different levels of abstraction.
You have four different Cs. The first C is the context diagram, in which you can describe at a very high level the things that you're building: in this case, the shiny new service and some logging platform, bits and pieces. By building a diagram in this way, you can speak to stakeholders about what the overall vision is for the thing that you're building. The second C is the container diagram. By exploding an element of the context diagram, you can explain some of the underlying componentry that supports it, their responsibilities and how they hang together, that kind of thing.
The third C is the component diagram — again, where we've taken one of those components and exploded it a little bit further. And you can see here, you might be exposing things like port mapping and that kind of thing, or firewall rules, or anything like that. And again, this is useful for engineers on the ground figuring out how to use your thing or how to configure it in their environment.
And then finally, you've got the code diagram. The code diagram goes into details like UML diagrams, basically. You're looking at entity relationships and classes and that kind of thing. And that might not be super useful for your platform, given that you might be working with some off-the-shelf scripts and that kind of thing, but it can be very useful if you're trying to explain some weirdness in your solution.
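One way to picture the four levels is as a tree that you zoom into one level at a time. The sketch below is illustrative only: the `Element` class, the `outline` helper, and the example platform are all hypothetical and are not part of the C4 model or of any C4 tooling.

```python
from dataclasses import dataclass, field

# The four C4 levels, from widest view to most detailed.
LEVELS = ["context", "container", "component", "code"]


@dataclass
class Element:
    """One box on a diagram; its children appear when you zoom in a level."""
    name: str
    note: str
    children: list = field(default_factory=list)


def outline(element, depth=0):
    """Return an indented text outline, labelling each line with its C4 level."""
    level = LEVELS[min(depth, len(LEVELS) - 1)]
    lines = [f"{'  ' * depth}[{level}] {element.name}: {element.note}"]
    for child in element.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical platform, invented for illustration.
platform = Element("Infrastructure platform", "what stakeholders see", [
    Element("Logging platform", "collects and stores logs", [
        Element("Log shipper", "port mapping and firewall rules", [
            Element("ShipperConfig", "class-level detail, only if needed"),
        ]),
    ]),
])

print("\n".join(outline(platform)))
```

Each level of indentation corresponds to exploding one element of the diagram above it, which is why the audience changes as you go down: stakeholders at the top, engineers configuring the thing further in.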
» Keep a Record
We've talked then about conveying the roadmap for your product and what the intended end state for your application is, or your platform. Now, answering the other question there about looking historically in the past, architectural decision records are a really useful way of capturing decisions that have been made in the architecture for your platform. At their very core, architectural decision records are text files that you'd be committing to your Git repositories, and they simply highlight a decision that's been made, the date, and the context and all of these things. They'll be useful for engineers in the future to read through and understand why the architecture is the way it is.
You start with a lovely pithy title, like Record Architectural Decisions.
You date it — other date time formats are acceptable, but this is the one true date format for sorting purposes.
Then a status: whether the decision has been accepted, postponed, or denied.
And then crucially, you've got a context here which describes why the decision needs to be made. So this might be, "Hey, we need some highly scalable distributed database" or something like that. I don't know. But crucially, we're trying to answer the question of why we need to make a decision here.
The decision itself. So in this case, "Hey, we've adopted ADRs to help us do this lovely documentation thing."
Any consequences that have been made, or will be apparent having made that decision. Maybe you are introducing an off-the-shelf database technology, which has a limited shelf life, or you need to consider licensing and costing for the future, right? These are all things that are really important whenever you make an architectural decision.
And then finally, who was in the room when the decision was made. Now this isn't a finger-pointing thing. This is just useful context for people looking at this decision. So if they're still in the organization, you can go and have a conversation with them about it in more depth, that kind of thing.
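The fields just listed can be sketched as a small record that renders itself to a text file. This is a rough illustration, not a standard: the class name, the file naming scheme, the Markdown-style headings, and the example date are assumptions, and ISO 8601 (`YYYY-MM-DD`) is assumed to be the sortable date format referred to above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DecisionRecord:
    number: int
    title: str
    decided_on: date
    status: str          # "Accepted", "Postponed" or "Denied"
    context: str         # why a decision is needed
    decision: str        # what was decided
    consequences: str    # what follows from it (cost, licensing, shelf life)
    in_the_room: list    # who to ask for more background

    def filename(self) -> str:
        # Zero-padded number keeps files in order in the repository listing.
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.md"

    def render(self) -> str:
        sections = [
            f"# {self.number}. {self.title}",
            f"Date: {self.decided_on.isoformat()}",  # ISO 8601 sorts as plain text
            f"Status: {self.status}",
            f"## Context\n{self.context}",
            f"## Decision\n{self.decision}",
            f"## Consequences\n{self.consequences}",
            "## In the room\n" + "\n".join(f"- {name}" for name in self.in_the_room),
        ]
        return "\n\n".join(sections) + "\n"


adr = DecisionRecord(
    number=1,
    title="Record architectural decisions",
    decided_on=date(2022, 6, 1),  # illustrative date only
    status="Accepted",
    context="Future engineers need to understand why the platform looks the way it does.",
    decision="We will record significant decisions as ADRs committed to the Git repository.",
    consequences="Every significant decision needs a short write-up before it is merged.",
    in_the_room=["Poppy", "Chris"],
)

print(adr.filename())
print(adr.render())
```

Keeping the record as plain text next to the code means it is reviewed and versioned the same way as the platform itself.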
So that's point number two: build a technical vision.
Poppy:
One of the other things that we see happen so commonly is that you have stakeholders who have different reactions when you build your products. So you've got stakeholder one, who's like, "Awesome, great work." This is usually someone technical, your CTO, your technical directors. And they're like, "Awesome. We've set up this infrastructure platform. We've already got a few products on the go. We've got some consumers. It's a success."
But then you've usually got another stakeholder who's like, "Oh, but we didn't achieve what we wanted to achieve." And actually the reason for this is because we didn't get the stakeholders aligned. So one of the really important things to do in the technical space is to get your non-technical stakeholders aligned and invested in what you're actually doing. Because it happens so often — when the budget cuts come, when the companies are trying to slice thin, the first thing that's going to go is your infrastructure platform because the business people don't understand the value of it and they can see the money's adding up. So we really need to get these people invested.
» Create a Strategy
One of the ways to do this is to create a strategy, and create a strategy with your technical and non-technical stakeholders, like the people who have the money basically, so your directors, your CFO, your CTO, whatever level you need to work at. It's really important to do this.
So when I say strategy, what do I mean? It's such an overloaded term. I mean these four things, and this is a lean value tree method, which is probably the easiest way to create a strategy.
We've got a vision at the top. So the visions are, “We are going to do this and be great at this thing.”
Companies are awesome at visions, and they usually forget the next bits and they usually call their vision a strategy. It's a whole thing.
But under your vision, you're going to have goals. Your goals, like all goals, need to be measurable, need to be achievable. We need to know if we've hit them or not. Right?
Then underneath these goals, we've got bets. So bets are the things that we are going to deliver to reach our goals and therefore to achieve our vision.
I'll give you an example. One of our examples, and the most common thing we see, is actually people just want to deliver things quicker, right? So one of the reasons that we always build infrastructure platforms is, things are already there. Teams can just pull up and play, and in theory, it should accelerate the development process. And you can see here, I've just made up a couple of goals and there's a bunch of different bets that you might make if you want to deliver this vision.
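A tree like that can be written down as nested data and checked for the usual gaps. The sketch below is an assumption-laden illustration: the class names, the `gaps` helper, and every goal, measure, and bet in the example are made up, and are not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str
    measure: str = ""                          # how we know the goal was hit
    bets: list = field(default_factory=list)   # things we deliver to reach it


@dataclass
class Vision:
    statement: str
    goals: list = field(default_factory=list)


def gaps(vision):
    """List goals that cannot be measured or have no bets behind them."""
    problems = []
    for goal in vision.goals:
        if not goal.measure:
            problems.append(f"{goal.name}: no measure")
        if not goal.bets:
            problems.append(f"{goal.name}: no bets")
    return problems


# Invented example tree for a "deliver things quicker" vision.
tree = Vision(
    statement="Teams deliver to production faster",
    goals=[
        Goal("Shorter time to first deploy",
             measure="new service live within one week",
             bets=["Self-service environments", "Pipeline templates"]),
        Goal("Fewer blocked releases",
             bets=["LaunchDarkly on-premises"]),
    ],
)

print(gaps(tree))
```

Here the second goal is flagged because nobody has said how it would be measured, which is exactly the kind of thing worth settling with stakeholders before any bets are prioritized.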
Here's a completely different one. In this case, the reason that we want to create an infrastructure platform is to increase our security. Maybe we're not so good at this, maybe we've had some breaches, and actually this is one of the top priorities in our organization.
The reason I wanted to show you two different examples is because the things that we might actually build are actually quite different depending on your vision. So it's really, really important to get that alignment of, What do you want to achieve by having your infrastructure platform? Maybe it's cost. Maybe it's just pain points from all your different developers and keeping your developers happy. Maybe it's that you don't have capability within the teams so you need these centralized products that are super easy to use. I don't know. It's going to be different in your organization, but depending on what it is, you will actually prioritize completely different things to build. And there'll be some overlap, but usually they're quite different. So prioritize the most valuable bets. This is super important: pick the ones that most align to your vision.
The way I like to prioritize — because let's face it, a lot of the time when prioritization happens, it's just people who are in charge putting their finger in the air and going, “Hmm, I think this thing,” and it's almost like a roulette and no one really knows why they've prioritized it. Some people disagree with the reasons they've prioritized it. No one can really articulate why they've prioritized it either. It's just someone in charge has decided they're the most high-paid person, it's their decision, they've made it.
» Establish Priorities
So I think it's quite nice to put a prioritization framework in place. And one that I like to use is the weighted shortest job first. Now, caveat, this is a prioritization framework from the SAFe method. I do not condone SAFe. I've adapted and changed this to be actually more useful than the one that's from SAFe because that one's not very good, but essentially it's value over job size.
And what is value? Value is the thing you just identified in your vision. Your value can be to support growth. Your value can be time criticality to get things into production fast and get your products live. It can be reach across the organization. Maybe you want to reach loads of different teams and actually it's more important for you for it to be scalable. Maybe it's cost. There's all sorts of things it could be. But you need to define what value is for you and make sure that's aligned to your vision and get your non-technical stakeholders in the room when you're defining this.
Then it's basically just a little bit of maths. You define what value means to you, you do some rating, make sure it's all relative, do some adding up, and then you've got your value score.
Then you want a job size. And I know developers hate sizing things, but we need some relative thing. I know you've not defined this specific scope at this point, but just put some guesses out there to make sure that you can get that job size. Just so then you'll know.
Next, why weighted shortest job first? Because we want to front-load that value, but with the least amount of effort. So you can see on this one that the top thing we'd actually do is “LaunchDarkly on-premises.” This is what you should work on first. And depending on the size of your team, you might want to work on more than one thing at once, but I do really recommend keeping the number of products you work on small and in line with how many people are working on your platform team.
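The "little bit of maths" can be sketched in a few lines. This follows the adapted version described here (summed value ratings divided by relative job size), not the official SAFe formula, and every rating, job size, and bet name other than "LaunchDarkly on-premises" is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredBet:
    name: str
    value_ratings: dict  # value criterion -> relative rating agreed with stakeholders
    job_size: int        # relative size guess, not a detailed estimate

    @property
    def value(self) -> int:
        # Value score is the sum of the relative ratings.
        return sum(self.value_ratings.values())

    @property
    def score(self) -> float:
        # Value over job size: high value and small size float to the top.
        return self.value / self.job_size


def prioritize(bets):
    """Order bets by value over job size, highest first."""
    return sorted(bets, key=lambda bet: bet.score, reverse=True)


# Made-up ratings against three example value criteria.
bets = [
    ScoredBet("Datadog monitoring",
              {"growth": 3, "time criticality": 2, "reach": 5}, job_size=5),
    ScoredBet("LaunchDarkly on-premises",
              {"growth": 5, "time criticality": 8, "reach": 8}, job_size=3),
    ScoredBet("Self-service environments",
              {"growth": 8, "time criticality": 5, "reach": 8}, job_size=8),
]

for bet in prioritize(bets):
    print(f"{bet.name:<28} value={bet.value:>2} size={bet.job_size} score={bet.score:.2f}")
```

With these invented numbers, two bets have the same value score of 21, but the smaller job comes first. That is the point of dividing by size: the ranking can be explained to anyone who asks why something was prioritized.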
Don't rush this. Like I said, this is so important, not just to get that priority, but also to get that alignment and make sure that actually all your non-technical stakeholders really understand the value you're going to deliver here. And because you've spent that time doing it, you are fairly confident that that is going to deliver your vision and your strategy.
So, create a strategy, take a picture, make sure you are really purposeful about it, don't just prioritize the stuff that seems cool, and make sure that you are all aligned.
Chris:
Thanks, Poppy. So, you've done your user research and you've outlined your technical vision and you know what it is that you're going to build. You can start coding, right? You can just grab some templates that you found off the internet and you can spin something up and get something going.
The truth is that it's never that simple. You can find scripts, templates, Helm charts, all sorts of stuff on the interwebs that will get you going with a really simple proof of concept of the thing that you want to build. That's sometimes a bit of a false sense of security given that you will have to do additional unicorn configuration for wherever you're parachuting this thing into, right? There are other things to consider than just getting going coding, is what I'm trying to say.
What about all of the cross-functional requirements (CFRs), things like observability and hardening and that kind of thing?
» Simplify the User Journey
One of the other CFRs that a lot of folks forget about is onboarding. How do your users know that you've built this thing? How do they understand how to use it? All of these sorts of things. Unless you're thinking about that earnestly in your development process, what you have is just a bunch of separate components. You don't have a differentiator.
So point number four is to try and do service design. And we'll talk a little bit more about that. Having your technical blueprint in place is awesome. More awesome is getting it used by people and actually realizing the value of the thing that you've built. And if you can ease the process of getting your thing into the hands of your users, then you're going to be reducing a lot of unnecessary toil and you're going to be making it a lot more user-friendly for them.
One of the ways you can do this is by doing a user journey map. You start by drawing out your users' onboarding experience. Here we have three swim lanes. We've got a developer, the system that you've built, and then the platform team that's built the thing. As you can see here, there are a lot of arrows going back and forth everywhere. All of the arrows that we're seeing here are handoffs between the platform team, the team using your thing, and the system itself. And it's far from ideal, right? You are having lots of different communication points that could be removed, like time. Time spent with your users is really valuable. Don't get me wrong. But time spent with your users to have to onboard to your system is wasted effort.
» One and Done
This is a more ideal user journey, where the developer uses your system and that's it. Now, it might not be possible to get to this user journey flow easily, but there are certain things you can do to achieve that more optimal onboarding experience, right? So, keeping it self-service is really useful. What if the users already have at their disposal everything that they need just to press play, and then just onboard to use your system? One of the ways you can get things to be more user-friendly and into that self-service state is to bear in mind the one-and-done principle.
The one-and-done principle comes from customer service, right? It's the idea that — I'll give you an example. Say you have an e-commerce website, for instance, that's selling fridges and freezers, and you have a customer on the day of delivery who wants to arrange to have their fridge shipped a week later or something.
So they give the call center a call and they say, "Hey, I'd like to get my fridge delivered in two weeks' time." The customer service person then has to wade through five different screens of their app to try and find the right action to perform. They then have to put the customer on hold to go and speak to their manager. And the manager says, "Yeah, that's fine." And the whole process has taken 15 minutes. In terms of efficiencies for a call center, that's massive.
What we try and do here is to reduce that number of actions down to one. So that if you just have one single operation that you need to do to support your customer, or for your customer to onboard your system, that's a massive efficiency gain there. Your customers shouldn't have to do many, many things. They're going to get confused. They're going to have a bad time.
By taking these things into account and trying to keep it simple and reduce those operational steps down to a minimum set, you might end up with a user journey that looks a little bit more like this. The platform team's still involved here, you can see. But there's much less back and forth between everyone. This kind of user journey flow for onboarding is suboptimal, but it's probably fine for your MVP. It's probably fine for you to roll out for a private beta and get feedback and start learning about what it is that you've built and crucially getting it into the hands of your users.
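The arrows on a journey map can be counted, which gives a rough number to track as onboarding improves. This is a minimal sketch under the assumption that a journey is written down as an ordered list of steps, each tagged with its swim lane; the steps themselves are invented for illustration.

```python
def count_handoffs(journey):
    """Count how often the journey moves from one swim lane to another."""
    lanes = [lane for lane, _ in journey]
    return sum(1 for current, following in zip(lanes, lanes[1:]) if current != following)


# Hypothetical onboarding journey with lots of back and forth.
current_journey = [
    ("developer", "raises a ticket"),
    ("platform team", "asks for more details"),
    ("developer", "replies with details"),
    ("platform team", "creates an account"),
    ("system", "sends credentials"),
    ("developer", "tries to deploy"),
    ("platform team", "fixes permissions"),
    ("developer", "deploys"),
]

# Hypothetical self-service journey: the developer uses the system and that's it.
ideal_journey = [
    ("developer", "runs the onboarding command"),
    ("system", "provisions everything"),
]

print(count_handoffs(current_journey))
print(count_handoffs(ideal_journey))
```

The count ignores how long each handoff takes, so it only shows where back and forth happens, not what it costs.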
» Conclusion and Wrap-Up
Finally, eat your own dog food. If you've built a monitoring stack or something, go through the onboarding steps that you've described for your customers and use it yourselves. Flush to the surface all the little problems that are going to affect your users ultimately. That's point number four, with a cute little dog wearing a hat: do service design.
Those are the four principles that we think can help you do better at platform development. Now, this is a condensed set of material. This has evolved over a period of time, hasn't it? There's loads more stuff that we wanted to fit into this. Remember that these steps aren't a magical recipe by any stretch of imagination. Pick and choose what works for you. But we do have a TL;DR slide as well, because as the memocracy dictates, you need a nice TL;DR slide at the end.
Poppy:
So, do a discovery.
Chris:
Build your technical vision.
Poppy:
Have a strategy.
Chris:
And do that service design thing.
Poppy:
We're going to be hanging around in this area here after the talk. If you've got any questions, here's all the stuff we literally had slides on and couldn't fit in because we had only 30 minutes. So come and talk to us about any of these things, or come and talk to us about any other questions that you might have.
Chris:
Thanks very much, everyone.
Poppy:
Thanks, everyone.