How to Operate and Scale Infrastructure without Alienating Humans
Infrastructure changes at lightning speed today. Large companies move services from machines to virtual machines to containers. They use multiple clusters and cloud providers and add numerous microservices every year. And one question central to any infrastructure change is how it will affect humans—the end-users who rely every day on services and applications to do their jobs.
But when infrastructure changes are made, users are often totally dependent on DevOps for support, resulting in confusion about the level of service available to them and lost efficiency. To be able to make necessary infrastructure changes today without impacting operational efficiency, DevOps teams need to put more capability and control in the hands of users.
HashiCorp customer Criteo—an advertising platform on the open Internet—recently spoke at HashiConf Digital 2020 where they discussed how the company has moved from service-on-machine to pure service architectures using HashiCorp Consul.
Just as importantly, the company highlighted how they’ve been able to give their people the tools to discover, learn about, manage, and investigate the services they use. Here’s a closer look at how Criteo uses Consul to manage networked services and bring more autonomy to service users.
» Keeping Human Users Front and Center
Criteo employs more than 2,500 people with around 500 on the R&D team using their services directly. R&D team members often have questions about their applications, including how to call a particular microservice API or how to create a new service, and the development team is often tasked with answering those questions — frequently a time-consuming process.
Criteo was already using Consul for service discovery and networking, but the DevOps team wanted a better way for employees to see what was happening on their systems, so that the team could spend less time fielding questions. They built a custom UI, called the Consul TemplateRB, and released it as an open-source application. It shows live information from Consul and is served by a simple static web server. The JavaScript-based Consul TemplateRB can be plugged into any organization, is easily customizable, and scales to accommodate thousands of concurrent users.
The Consul TemplateRB also shows service metadata, which is especially useful because other systems use the same data, including to decide how to provision infrastructure. The UI reflects exactly what’s provisioned with Consul and is updated live.
With the Consul TemplateRB, users can:
- Find services by typing a few words instead of complex queries
- Use tags to filter systems, including FTP, HTTP, and Swagger
- Directly call APIs
- Reach all systems and get specific information for their services, even if the user doesn’t have specific Consul knowledge
- Search for data quickly and reach it in a matter of seconds
- Investigate past service incidents without getting DevOps involved
The data these features produce also makes it possible for the DevOps team to proactively address additional user challenges by:
- Easily tuning the UI and linking it to more systems
- Showing how alerts and load balancers are configured for systems, as well as viewing the changelog
- Identifying owners of each service and the team tasked with handling it
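The word-and-tag search described in the first list above can be illustrated with a minimal sketch. This is not code from the Consul TemplateRB. It assumes a catalog shaped like the response of Consul’s `/v1/catalog/services` endpoint (a mapping of service name to a list of tags), and the service names and the `find_services` helper are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical catalog snapshot, shaped like the JSON returned by
# Consul's /v1/catalog/services endpoint: {service name: [tags]}.
CATALOG: Dict[str, List[str]] = {
    "adserving-bidder": ["http", "swagger"],
    "adserving-reports": ["http"],
    "storage-archive": ["ftp"],
}


def find_services(
    catalog: Dict[str, List[str]],
    words: str = "",
    tags: Iterable[str] = (),
) -> List[str]:
    """Return sorted service names that contain every word and carry every tag."""
    wanted_words = words.lower().split()
    wanted_tags = {tag.lower() for tag in tags}
    matches = []
    for name, service_tags in catalog.items():
        has_words = all(word in name.lower() for word in wanted_words)
        has_tags = wanted_tags <= {tag.lower() for tag in service_tags}
        if has_words and has_tags:
            matches.append(name)
    return sorted(matches)


# A few typed words, optionally narrowed by tags, are enough to locate a service.
print(find_services(CATALOG, "adserving", tags=["swagger"]))
```

Matching on substrings of the service name keeps the query forgiving: users do not need to know the exact name, only a few words of it.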
To prevent users from performing modifications on other services, Criteo’s DevOps team makes sure all services are standardized and that the data can only be changed at startup. By disallowing the data from being changed later, they can avoid introducing entropy into the system.
Similarly, the data registered in Consul can only be changed by the machine itself, preventing others from modifying the service on a given machine and changing what is published into Consul. In Criteo’s world of thousands of services, identifying the stakeholders — service “owners” — is important when there’s an emergency. Identifying service owners is also key to leveraging information on usage, such as consumption of resources, to further refine and enhance the application’s functionality and value.
» Shifting Capability Even More to Users
Enabling self-service through an intuitive UI makes it easy for users to investigate services on their own, which largely removes the need for operators to answer questions. Criteo’s small, four-person DevOps team went from dealing with dozens of questions per day to around one question per week.
To help users help themselves even more, the DevOps team put in place some additional policies and mechanisms, including:
- Standardizing the naming of metadata by prefixing all services with a team name
- Developing a feature called “service metadata” to bind metadata into the service and allow configuration of external systems
- Giving users more predictability by allowing them to explore what’s happening in the system themselves
- Relying on business semantics, rather than metadata related to tools, to allow the infrastructure to evolve over the long term
- Putting all configuration of other systems in one place—into the service itself—so users don’t have to call yet another API to provision a network or to add alerts
- Enabling DevOps to change infrastructure systems without having to notify users afterward
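As an illustration of the naming and metadata conventions above, here is a minimal sketch that builds a request body for Consul’s `PUT /v1/agent/service/register` endpoint. The `Name`, `Port`, and `Meta` fields are part of that API. Everything else is an assumption made for the example: the `build_registration` helper, the hyphen used to join team and service name, and the team, service, and metadata keys are all hypothetical, not Criteo’s actual values.

```python
import json
from typing import Dict


def build_registration(team: str, service: str, port: int, meta: Dict[str, str]) -> dict:
    """Build a body for Consul's PUT /v1/agent/service/register endpoint."""
    if not team or not service:
        raise ValueError("team and service are required")
    return {
        # Team-prefixed name; the "-" separator is an assumption for this sketch.
        "Name": f"{team}-{service}",
        "Port": port,
        # Consul stores service metadata as string key/value pairs.
        "Meta": {key: str(value) for key, value in meta.items()},
    }


payload = build_registration(
    team="adserving",   # hypothetical team name
    service="bidder",   # hypothetical service name
    port=8080,
    # Hypothetical business-level keys that external systems could read.
    meta={"owner": "adserving-team", "alerting": "pager"},
)
print(json.dumps(payload, indent=2))
```

Keeping the metadata in the registration itself means external systems such as alerting or load balancing can read their configuration from one place, which is the point of the "one place" policy described in the list.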
» Working with Applications & Infrastructure
Criteo has more than 4,000 applications enterprise-wide, with some of the larger services seeing upwards of 2,000 instances on a single server. To create more control around their SDK, the DevOps team implemented default queries to simplify search and discovery while improving load management.
Instead of letting all requests end up on a single server (as happens even if multiple servers are provisioned), the team set up Consul to answer calls from any server using a common query. This makes it easier to scale applications horizontally across the infrastructure. They also added a feature for external applications that lets every Consul cluster serve them the data, instead of just one machine in one place.
Now, when an application is overloaded, Consul’s health checks (Passing, Warning, and Critical) provide vital, on-demand status updates about server loads and help proactively identify potential points of failure or degradation. The Warning and Passing statuses have also been weighted. When an overloaded application tells Consul it’s in a Warning state, every application targeting that specific instance reduces the traffic it sends there. This lets the instance recover from the excessive load and helps prevent other instances from going down in turn when there’s too much traffic.
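The effect of weighting health statuses can be sketched as follows. Consul service definitions accept a `Weights` setting with `Passing` and `Warning` values. The specific weights below (10 and 1), the instance names, and the `traffic_shares` helper are assumptions made for illustration, not Criteo’s actual configuration or client logic.

```python
from typing import Dict

# Assumed weights: a passing instance counts ten times as much as a warning one.
# Any other status (for example "critical") gets a weight of zero.
WEIGHTS = {"passing": 10, "warning": 1}


def traffic_shares(instances: Dict[str, str]) -> Dict[str, float]:
    """Map each instance id to its fraction of traffic, based on health status."""
    weights = {iid: WEIGHTS.get(status, 0) for iid, status in instances.items()}
    total = sum(weights.values())
    if total == 0:
        # No instance is able to take traffic.
        return {iid: 0.0 for iid in instances}
    return {iid: weight / total for iid, weight in weights.items()}


shares = traffic_shares(
    {"web-1": "passing", "web-2": "warning", "web-3": "critical"}
)
for instance_id, share in shares.items():
    print(f"{instance_id}: {share:.1%}")
```

With these assumed weights, an instance that reports Warning still receives some traffic, just far less than a healthy one, so it can recover without its load being dumped all at once onto the remaining instances.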
At the same time, Criteo uses Consul for automated:
- Load balancing
- Metrics
- Alerts
- Availability monitoring
- DNS configuration
Consul registers new applications and any modifications to instances. It acts as a single repository for all of Criteo’s infrastructure and gives a clear, unified view of the infrastructure behind every application.
Essentially, users can publish their service on the infrastructure and change the metadata in their service on their own. Because all services use the same business semantics, there is no need to feed a Git repository. Instead, the infrastructure takes care of everything automatically, freeing the DevOps team of administrative burdens and enabling them to provide sufficient support for each of the company’s 4,000+ services.
» Setting Service Expectations with Users
For the Criteo DevOps team, defining for all stakeholders what matters to Consul also defines a clear SLA. Instead of using technical metrics, the team opts for business metrics, such as:
- The amount of time needed to register a service
- The time it takes for keys to become visible to Consul agents in the data center
- DNS response times
The team measures these outside of Consul, with other applications, and uses a dashboard to show users when a metric is under the SLA, when it is close, and when it has exceeded it. By giving users a clear set of expectations around Consul and the SLA, the DevOps team can further reduce the number of incoming questions and ensure users know exactly what has been agreed with them.
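The three dashboard states described above can be expressed as a small classification function. This is a sketch under stated assumptions: the `sla_status` helper, the 80% threshold for "close", and the example figures are hypothetical and are not taken from Criteo’s dashboard.

```python
def sla_status(measured_ms: float, sla_ms: float, close_ratio: float = 0.8) -> str:
    """Classify a measurement as 'under', 'close', or 'exceeded' relative to an SLA.

    close_ratio is an assumed threshold: measurements at or above this
    fraction of the SLA limit (but not over it) are reported as 'close'.
    """
    if sla_ms <= 0:
        raise ValueError("sla_ms must be positive")
    if measured_ms > sla_ms:
        return "exceeded"
    if measured_ms >= close_ratio * sla_ms:
        return "close"
    return "under"


# Hypothetical example: a service registration SLA of 1000 ms.
for measured in (400, 850, 1200):
    print(measured, sla_status(measured, sla_ms=1000))
```

Reporting a "close" state in addition to pass and fail gives users an early signal before the agreed limit is actually breached.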
As infrastructure changes continue apace, Criteo is making sure no human user is left behind. The company believes it’s important to boost operational efficiency and user autonomy through intuitive self-service capabilities, supported by the watchful eye of its DevOps experts. With Consul and Consul TemplateRB, Criteo users can take charge of their services and applications within defined parameters and without depending on DevOps.
To learn more about how Criteo uses Consul, view their HashiConf Digital presentation and read their case study.
» Join HashiCorp in October
We invite you to join us at our next HashiConf Digital, October 12-15 (PDT). Registration is free. Real-time product workshops are also available and require a nominal fee to reserve your seat.