Improving SLOs and Observability with HashiCorp Vault at Citrix
Learn how observability and HashiCorp Vault come together to automate secret management at Citrix.
» Transcript
Hi, everyone. I'm George Hantzaras, and I'm here to talk to you today about Citrix's journey to improving Service Level Objectives (SLOs) and the role that Vault has played in this journey. I'm part of Citrix's cloud platform engineering organization. I work with the teams that are building the new observability platform and the teams that are building the secrets management infrastructure. I'm also a huge fan of HashiCorp products. Earlier this year, I became the organizer of the Athens HashiCorp User Group.
Today's presentation is in 2 parts. The first part goes over our observability implementation journey and the role that Vault has played in it.
The second part goes deeper into secrets management and how secrets management played a role in improving our SLOs.
Finally, we're going to see the lessons we've learned through this process and what the future holds.
» Citrix's Cloud Transition
The reason this is important for Citrix is that Citrix is a driver of change, due to the nature of our products. And as we transition to the cloud, we drive digital transformation for our clients.
When we're talking about any kind of transformation, perceptions are important. It is important for our customers to trust that their service is going to be just as reliable, available, and secure as the on-premises offering, or even more so.
With that in mind, all the observability efforts and the security efforts should always encompass this customer centricity.
» Citrix's Observability Mission
Our mission has 2 goals. One is to provide better product insights for the engineering teams. The second is to optimize incident response. Both of these goals address one specific need: to improve customer experience.
This is probably the key takeaway from today: observability nowadays should be customer-centric.
We had 4 requirements when we started our observability journey:
Real-time tracking of our SLOs for our customer support teams, engineering teams, and SREs
Self-service to reduce the overhead of onboarding the observability platform by automating the process
Configurable alerting, so we could use multiple channels and multiple platforms to communicate and alert, working with the many different monitoring platforms and various data sources already in place within the organization
A new SLO framework that describes this customer centricity, operationalizes error budgets, and puts them in our everyday development life
This is pretty much what we ended up with.
» What Observability Looks Like
On screen you can see a multitude of applications, different deployment methods, different types of infrastructures. We wanted to be able to address these needs, and that's where the ingestion layer comes in. That is where you can do filtering and rate limiting. You can power this with open source tools like Kafka or with SaaS tools like Cribl.
The blue box in the middle of this diagram is the first important part: the metrics storage. Because we wanted to serve a mature organization of 2,800 engineers, a key component was being able to support different data sources and the different platforms that were already in place.
The core observability platform component is our observability API. This is where data from different sources comes together and is aggregated. The data is correlated with customer data.
Finally, we have the error budget component, where the SLIs and SLOs are calculated. We provide the interfaces for the dashboards to be built for the SLOs, customer impact, and customer user journey uptimes.
» Observability as Code
We went with this approach because we wanted 2 things from onboarding:
Self-service onboarding (GitOps and zero trust), so product teams can onboard the observability platform on their own
Metrics as code, dashboard as code, and alerting as code in order to reduce the implementation time for the product teams and to promote best practices across the organization
First, observability as code gives us faster implementation across product teams. We created reusable assets, which reduced the time to onboard for product teams. With the reusable assets, we ended up with reusable best practices coded into our tool set.
And this Git-driven approach provides versioning and automation for metrics and alerting and so on.
This is how the observability deployment pipeline looks today. It starts with the engineer cloning the repository, creating simple configuration files, and then just opening a pull request. This pull request triggers the build pipeline, which creates the final Terraform modules, the necessary configurations, and so on. Then the testing pipeline runs static analysis and integration tests, with linting, Terratest, and so on.
Finally, the assets are deployed in staging, published in Artifactory, and ready to use in production.
» The Role of Vault
The role that Vault plays in this whole process, first and foremost, is to enable zero trust at deploy time. Custom secrets engines can be developed by anyone, and secrets engines provide a way to manage secrets for different services. In that way, we can manage API keys and access keys for the different monitoring platforms like New Relic, Splunk, Datadog, Elastic, Prometheus, and so on.
The second part is that Vault enables our synthetic monitoring to be more secure. Synthetic monitoring can use passwords and API keys, and all these are managed through Vault.
Finally, access to our observability API, from the dashboards, from the CI/CD pipelines, and so on, is controlled through API access keys that Vault manages.
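None of this Vault configuration is shown in the talk, but as a rough illustration of that first role, managing one monitoring platform's keys through Vault with Terraform could look something like the minimal sketch below. The mount path, secret name, and key names are all hypothetical, and some platforms would sit behind a custom or dynamic secrets engine instead of a static KV secret.

```hcl
# Minimal sketch only: mount path, secret name, and keys are hypothetical.
resource "vault_mount" "observability" {
  path        = "observability"
  type        = "kv"
  options     = { version = "2" }
  description = "API and access keys for monitoring platforms"
}

# Static KV secret holding one platform's credentials; a custom secrets
# engine could serve dynamic keys here instead.
resource "vault_kv_secret_v2" "newrelic" {
  mount = vault_mount.observability.path
  name  = "newrelic"
  data_json = jsonencode({
    account_id = var.newrelic_account_id
    api_key    = var.newrelic_api_key
  })
}

variable "newrelic_account_id" {
  type = string
}

variable "newrelic_api_key" {
  type      = string
  sensitive = true
}
```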
The deployment, once Vault comes into play, starts with the Git workflow. Then the build is triggered, where testing and templating take place. Finally, the artifacts are published and the metadata is stored.

Then, when the deployment starts, Vault takes over. That's the part where the deployment pipeline requests the credentials from Vault, and Vault creates the dynamic API keys for the different metrics providers.
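The exact wiring isn't shown in the talk, but conceptually the deploy step reads the platform credentials from Vault and hands them to the provider. A rough sketch, assuming the hypothetical KV secret from above:

```hcl
# Rough sketch: the pipeline authenticates to Vault (via VAULT_ADDR and a
# CI-specific auth method), reads the credentials, and configures the provider.
data "vault_kv_secret_v2" "newrelic" {
  mount = "observability"
  name  = "newrelic"
}

provider "newrelic" {
  account_id = tonumber(data.vault_kv_secret_v2.newrelic.data["account_id"])
  api_key    = data.vault_kv_secret_v2.newrelic.data["api_key"]
  region     = "US"
}
```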
» Examples of Observability as Code
Let's go a bit deeper into the code and see some observability as code examples. This example is with Terraform and New Relic. On the left, we see an example of dashboards as code: a New Relic dashboard with multiple widgets being created, and specifically 1 widget for the HTTP response codes that are not 200s. We're using an NRQL query for that.
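The slide itself isn't reproduced in this transcript, but a dashboard defined along those lines with the New Relic Terraform provider's newrelic_one_dashboard resource might look roughly like this sketch; the dashboard name, layout, and exact NRQL are illustrative.

```hcl
# Illustrative dashboards-as-code sketch; names, layout, and NRQL are placeholders.
resource "newrelic_one_dashboard" "service_overview" {
  name = "Service overview"

  page {
    name = "HTTP"

    # Widget charting responses that are not HTTP 200, driven by an NRQL query.
    widget_line {
      title  = "Non-200 HTTP response codes"
      row    = 1
      column = 1
      width  = 6
      height = 3

      nrql_query {
        query = "SELECT count(*) FROM Transaction WHERE httpResponseCode != '200' FACET httpResponseCode TIMESERIES"
      }
    }
  }
}
```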
Next is an alerting as code example. In this alert policy, we define an NRQL query and some conditions, and there can be multiple conditions.

In this second example, we just have the critical condition, where we set some thresholds. At that point, we want an alert to be raised.

In the third example, we see the notification channel being configured and associated. This email channel is going to be used when the alert condition in the middle is violated.
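Putting the three pieces just described together, a Terraform sketch in the same spirit could look like the following; the policy name, NRQL query, thresholds, and recipients are placeholders rather than the actual configuration shown on the slide.

```hcl
# Illustrative alerting-as-code sketch; names, query, and thresholds are placeholders.
resource "newrelic_alert_policy" "service" {
  name = "Service availability"
}

# NRQL alert condition; only a critical threshold here, but a warning block
# could be added in the same way.
resource "newrelic_nrql_alert_condition" "error_rate" {
  policy_id                    = newrelic_alert_policy.service.id
  name                         = "High rate of non-200 responses"
  type                         = "static"
  violation_time_limit_seconds = 3600

  nrql {
    query = "SELECT percentage(count(*), WHERE httpResponseCode != '200') FROM Transaction"
  }

  critical {
    operator              = "above"
    threshold             = 5 # percent
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}

# Email notification channel associated with the policy.
resource "newrelic_alert_channel" "email" {
  name = "observability-oncall-email"
  type = "email"

  config {
    recipients              = "oncall@example.com"
    include_json_attachment = "true"
  }
}

resource "newrelic_alert_policy_channel" "email" {
  policy_id   = newrelic_alert_policy.service.id
  channel_ids = [newrelic_alert_channel.email.id]
}
```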
This is great, because it covers a lot of the needs we saw earlier. It's reusable, it's versioned, and it's testable.
But this is a Terraform script, which means the product engineers would need to understand how HCL works and be able to write it. So in order to operationalize this, we need to go one step further.
» Open SLO Example
Here we see the alert condition example again, but on the left and in the middle, we see that all the thresholds and the NRQL query have become just variables.

And on the right, we see that all we require is a YAML file: the NRQL query and the threshold are put in that configuration file. Then the CI takes over, puts this all together, and creates this alert in New Relic.

This is how we end up with something that is truly reusable. The middle and left parts can be used by all the teams just by changing the YAML definition file. And this can happen with metrics, dashboards, alerts, and even SLOs.
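The talk doesn't show the glue itself, but one common way to get there with Terraform is to read the team's YAML definition with yamldecode and feed the values into the same alert resources. A rough sketch, with a hypothetical file name and keys:

```hcl
# Hypothetical glue: CI drops a per-service YAML file next to this module.
# Expected keys (illustrative): service_name, nrql_query, critical_threshold.
locals {
  cfg = yamldecode(file("${path.module}/service-alerts.yaml"))
}

resource "newrelic_alert_policy" "from_yaml" {
  name = "${local.cfg.service_name} availability"
}

resource "newrelic_nrql_alert_condition" "from_yaml" {
  policy_id = newrelic_alert_policy.from_yaml.id
  name      = "${local.cfg.service_name} error rate"
  type      = "static"

  nrql {
    query = local.cfg.nrql_query
  }

  critical {
    operator              = "above"
    threshold             = local.cfg.critical_threshold
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}
```

The product team never touches the HCL; only the YAML definition changes from service to service.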
This is an SLO example. For defining SLOs as code, there have been a few really good open source projects lately, the most significant one being the Open SLO specification, which we can see here.
We define an SLO, which is latency-based, and we define it in Prometheus with a Prometheus query. We want this to be greater than 99%, so 2 9's is the SLO defined here. At the bottom, we can see that this is on a monthly window, which is non-rolling.
Looking a bit deeper, these are things that are standardized across the company, so the only thing that would change would be the SLO target and the PromQL query.
This can be templated with a YAML file, which would be really easy for a team to define. We saw how we can move from dashboards as code to alerts as code, and now to SLOs as code.
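The team-facing file isn't shown in the talk, so the schema below is purely hypothetical, but it illustrates the idea: a team only declares the PromQL query, the target, and the window, and the CI templating generates the full SLO definition.

```yaml
# Hypothetical per-team SLO definition; the key names are illustrative,
# not the OpenSLO specification itself.
service: desktop-launch
slos:
  - name: launch-latency
    objective: latency            # latency-based SLO
    datasource: prometheus
    query: >
      histogram_quantile(0.95,
        sum(rate(launch_duration_seconds_bucket[5m])) by (le))
    threshold_seconds: 3          # "good" if the query stays below this value
    target: 0.99                  # 2 9's
    window: 1M                    # calendar month
    rolling: false                # non-rolling window
```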
» How Citrix Visualizes SLO Today
The last example I want to show is how the next step looks for us. This is how we visualize SLOs today. These are our new SLO dashboard mockups. You see all the metrics that you would expect to see in an SLO dashboard: SLIs, error budgets, number of incidents, and so on.
Where I want to focus is the top right, where you can see the number of impacted customers. There, you can drill down even deeper and see per-incident impact per customer, and you can measure specific uptime for user journeys per customer. I want to focus there because I want to show you that observability should be customer-centric today.
» The Impact of Secrets on SLOs
Moving a bit away from observability, I want to look at a specific use case showing how secrets management and secrets rotation played a significant role in our SLOs and what impact we were able to make just by automating this management.
When a customer launches a virtual lab or a Citrix virtual desktop, we have 4 or 5 different services coming together in order to deliver this experience. One of those services is the Citrix Cloud Platform, which is what enables us to deliver SaaS products. This platform comprises roughly 135 microservices. And at any given point, we have about 500 to 1,000 secrets stored per environment. Those can be database credentials, API keys, and so on.
So imagine the number of secrets stored across all the services in order to deliver the end-to-end desktop launch experience. In order to manage that, we decided to use Vault.
» Super Fast Vault Deployment
This is pretty much how we have deployed Vault. We have deployed Vault in high-availability mode within a VPN, and access is restricted via Azure Private Link. Of course, in the background, we have other DR mechanisms as well: off-site storage, off-site backups, and so on.
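The talk doesn't go into the server configuration itself, so the following is only a generic sketch of what one Vault HA node's config can look like; integrated Raft storage and Azure Key Vault auto-unseal are assumptions here, not details the talk confirms.

```hcl
# Generic sketch of one Vault HA node; the storage backend, unseal method,
# and addresses are assumptions, not the actual configuration.
ui           = true
api_addr     = "https://vault-0.internal.example:8200"
cluster_addr = "https://vault-0.internal.example:8201"

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-0"

  retry_join {
    leader_api_addr = "https://vault-1.internal.example:8200"
  }
}

# Auto-unseal via Azure Key Vault (assumed); credentials typically come from
# a managed identity or environment variables.
seal "azurekeyvault" {
  vault_name = "example-unseal-kv"
  key_name   = "vault-unseal-key"
}
```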
On the right, we have the way infrastructure as code deploys Vault and how the observability as code is part of this pipeline. In the first step, we see Packer is used in order to build Azure VM images.
In the second step, we have Ansible, which installs the required software during this build, including the New Relic agents, which are installed and configured at build time. Then the built image is published.
The final step is that Terraform takes over in order to deploy VM scale sets with the previously built image. We see here how observability as code is built into the infrastructure as code.
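The talk describes this image pipeline rather than showing it, so here is only a stripped-down Packer sketch in that spirit; the plugin version, image names, resource group, and playbook path are all placeholders.

```hcl
# Packer HCL2 sketch; all names, versions, and paths are placeholders.
packer {
  required_plugins {
    azure = {
      source  = "github.com/hashicorp/azure"
      version = ">= 1.0.0"
    }
  }
}

locals {
  image_version = formatdate("YYYYMMDDhhmm", timestamp())
}

source "azure-arm" "vault" {
  # Azure credentials usually come from environment variables or a managed identity.
  managed_image_name                = "vault-ha-${local.image_version}"
  managed_image_resource_group_name = "rg-images"
  location                          = "westeurope"
  vm_size                           = "Standard_D2s_v3"

  os_type         = "Linux"
  image_publisher = "Canonical"
  image_offer     = "0001-com-ubuntu-server-focal"
  image_sku       = "20_04-lts-gen2"
}

build {
  sources = ["source.azure-arm.vault"]

  # Ansible installs and configures Vault plus the New Relic agents at build time.
  provisioner "ansible" {
    playbook_file = "./playbooks/vault-image.yml"
  }
}
```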
» Vault Onboarding
This is how new services are onboarded in Vault. We start with the team that wants to be onboarded.
Our repository is cloned, a new branch is created, and then a pull request is opened that contains the configuration files for this new service. After this is merged, the automation is triggered, Vault namespaces are created, and access to the VNet is granted for the new product tenant.
A little spoiler here: A lot of what we're doing is a work in progress, so this is only half true at the moment. The other half is still manual.
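The onboarding automation isn't shown in the talk, and as mentioned, part of it is still manual; conceptually, though, the Terraform that runs after the pull request is merged could look something like this sketch. The namespace path, mount, and policy are hypothetical, and namespaces require Vault Enterprise.

```hcl
# Hypothetical onboarding sketch; namespaces require Vault Enterprise,
# and all paths and names are placeholders.
resource "vault_namespace" "product" {
  path = "product-team-a"
}

# A KV mount inside the new namespace for the product's secrets.
resource "vault_mount" "product_kv" {
  namespace = vault_namespace.product.path
  path      = "secrets"
  type      = "kv"
  options   = { version = "2" }
}

# Read-only policy scoped to the namespace for the product's workloads.
resource "vault_policy" "product_read" {
  namespace = vault_namespace.product.path
  name      = "product-team-a-read"

  policy = <<-EOT
    path "secrets/data/*" {
      capabilities = ["read", "list"]
    }
  EOT
}
```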
I talked about the impact that Vault and secrets management have had on our SLOs, so let's look at that. For us, the annual cost of manually managing secrets, rotating them, and so on is about 250 to 300 human days.

This means that automating this process gives you back about an engineer and a half to work on roadmap work, feature work, things that are going to make a direct impact on the customer.
The most important thing we found about our SLOs was that around 20% of our incidents were related to manually managing secrets. Just by rolling out Vault, you are able to eliminate that.
» Lessons Learned
We've talked today about the observability API, the core platform where data is aggregated, correlated with customer data, and then error budgets and SLIs are calculated, and all the dashboards are created. The second part is the dashboard creation: reporting the SLOs and providing insight to the product teams, to leadership, and so on.

Of course, this is significantly powered by synthetic testing, simulating the user experience in order to measure user impact. And finally, there's the first part of observability as code. Since it's a big journey, we have started with alerting as code, and this has been working for quite some time now.
I have 3 important takeaways.
First, on-call is different from reporting. What I mean is that when you implement monitoring and alerting for reporting, it's a totally different use case with totally different requirements than when you implement it for SREs and operations teams. You have to have these 2 different sets of requirements in mind, and there's no one-size-fits-all that addresses both at the same time.
The second is customer-focused observability. This means that with everything you do, you should have in mind measuring customer impact. Sometimes the service might be up, but the customer experience might be degraded.
Finally, you should start somewhere. Observability is not a project that's going to run for 2 or 3 months and then you're done. Observability is a big journey, and you have to map it out, see what the low-hanging fruit is for your organization, start somewhere, and keep a continuous improvement mindset.
» What the Future Holds
As for the future, customer impact is really important, so we have to focus more on improving customer impact measurement: being able to know exactly what each of our customers is experiencing and how they are experiencing our services.
Then there's rolling out more observability as code assets. We've been using alerting as code, but we also want to roll out metrics as code and dashboards as code, and SLOs as code are now partially being rolled out.
The third part is performance SLOs. As I mentioned on the previous slide, a service might be available, but the experience might be highly degraded. As we saw in the Prometheus-based SLO example before, we need latency SLOs, where you measure what the customer is experiencing rather than whether a service or a user journey is up. Rolling that out is also important.
Finally, putting it all together…
Putting the CI/CD together with observability
Being able to measure the impact of a change in production
Being able to use the error budget in deploy time
These are the important next steps.