Building a Service Mesh at Criteo with Consul and HAProxy
Good morning, ladies and gentlemen. I’m Pierre Souchay; I work on the Discovery team at Criteo, and I’m now also working on authentication, authorization, and security. Today, I’m going to talk to you about service mesh and discovery. You might wonder why, but it’s simply because whenever you want to deploy load balancers in a service mesh, you need working discovery in order to make it work at scale.
I’m also the author of consul-templaterb, a nice tool that generates configuration from the Consul topology, which is very convenient, for instance, for configuring tools such as HAProxy. We’re first going to see a bit of the history of Consul at Criteo, then how we historically moved toward a service mesh at Criteo and, finally, how we built a service mesh with HAProxy and Consul, listing all the features we have implemented so far.
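To give a feel for the tool, here is a minimal sketch of a consul-templaterb ERB template that renders an HAProxy backend from live Consul data. The service name `my-web-app` and the backend layout are illustrative, not from the talk; the `service()` helper and the field names come from the tool’s documented template API.

```erb
# haproxy.cfg.erb -- render one HAProxy server line per healthy instance
backend my-web-app
<% service('my-web-app').each do |snode| %>
  server <%= snode['Node']['Node'] %> <%= snode['Node']['Address'] %>:<%= snode['Service']['Port'] %> check
<% end %>
```

You would then run something like `consul-templaterb --template haproxy.cfg.erb:haproxy.cfg`, and the output file is re-rendered whenever the Consul topology changes.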
Basically, a request comes from the north, from the Internet, through various load balancers; at that time, they were F5 load balancers. It then goes to a web application. Here’s more detail on that part.
Just adding a load balancer at that point in our system would add latency, because, of course, you have one more hop to reach the microservice, and at that time it would also cost a lot: hundreds of load balancers just to route traffic to microservices. Since we had an architecture relying heavily on microservices, this would cost quite a lot. So, we built some nice libraries in both C# and JVM languages in order to talk to those systems.
So what we did first was exactly the same thing as Yammer: we added HAProxy load balancers that got their information from Mesos and Marathon and generated a configuration.
The thing is, whenever you have thousands of machines, you cannot just dump the legacy and say it’s over, because moving from bare-metal machines to containers alone takes a huge amount of your developers’ time. We were also running Windows systems and, of course, moving to containers running on Linux is not something you can do in one step.
It also involves something very interesting, which is the ability to define health checks in a centralized way. When you are deploying Consul, you are basically deploying a Consul cluster, which is a group of servers in the same way as you could do it for ZooKeeper, for instance, but you’re also deploying some agents and those agents are running on every machine at Criteo. Those agents are also responsible for checking whether the services they are hosting are working properly and dispatching this information across your cluster.
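As a concrete sketch, a service registration with a health check on a Consul agent might look like this (the service name, port, and endpoint are illustrative): the agent hosting the service runs the check locally and propagates the result across the cluster.

```json
{
  "service": {
    "name": "my-web-app",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/healthcheck",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}
```

Because the check runs on the agent next to the service, adding more consumers of the service does not add more health-check traffic.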
That’s very interesting because, whenever you reach scalability limits, each health check would otherwise have to be performed by every client; here you have a global view of the healthiness of a given service. It’s very important for us. It also provides some very interesting features, such as being notified whenever anything changes for any service of the platform, meaning that the database polling issue we had, saying, “Do I need to poll every 10 minutes or every one minute or whatever?” disappears. You can be notified whenever the service you’re targeting changes. That’s very interesting as well.
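The change-notification mechanism behind this is Consul’s blocking query: you long-poll an endpoint with the last index you saw, and the call returns as soon as anything changes. Here is a minimal sketch in Go using only the standard library (the service name and agent address are assumptions; `sanitizeIndex` implements Consul’s documented rule of resetting the index when it is zero or goes backwards):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"
)

// sanitizeIndex follows Consul's blocking-query guidance: if the index the
// server returned is 0 or went backwards, reset to 0 so the next request
// starts over instead of blocking on a stale index.
func sanitizeIndex(newIndex, lastIndex uint64) uint64 {
	if newIndex == 0 || newIndex < lastIndex {
		return 0
	}
	return newIndex
}

// watchService long-polls the health endpoint of a service on a local Consul
// agent and invokes onChange as soon as the set of healthy instances changes.
func watchService(service string, onChange func(body []byte)) error {
	var index uint64
	for {
		url := fmt.Sprintf(
			"http://127.0.0.1:8500/v1/health/service/%s?passing&index=%d&wait=5m",
			service, index)
		resp, err := http.Get(url)
		if err != nil {
			time.Sleep(time.Second) // back off on transient errors
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			return err
		}
		raw, _ := strconv.ParseUint(resp.Header.Get("X-Consul-Index"), 10, 64)
		next := sanitizeIndex(raw, index)
		if next != index { // something changed: re-render configuration, etc.
			onChange(body)
		}
		index = next
	}
}

func main() {
	_ = watchService // sketch only; needs a running Consul agent
	fmt.Println(sanitizeIndex(42, 10), sanitizeIndex(5, 10))
}
```

This is exactly the pattern that makes the “how often do I poll?” question disappear: the server holds the request open until there is something new to report.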
Of course, in the same way as etcd and ZooKeeper, it has fault tolerance built in, meaning that it’s a good candidate for using it on very large infrastructures.
At the same time, we added health checks: basically, every service was responsible for saying, “I’m okay” or “I’m not okay.” We also added lots of extra information that was really useful, for instance, for what my colleagues Pierre and William described yesterday: provisioning load balancers automatically.
Finally, we use two main languages at Criteo for now, but people say, “I’d like to use Golang. I’d like to use REST, I’d like to use Python, Ruby”, and so on. Of course, this doesn’t scale, meaning that we don’t want to port those libraries to every language, so having another way to do this with performance in mind would be quite helpful.
One and a half years ago, HashiCorp, the makers of Consul, introduced Consul Connect. What is Consul Connect? It’s the famous, fashionable service mesh infrastructure: it creates sidecar proxies that handle all the connectivity of your service, introducing TLS, service discovery, and so on. Basically, how it works is that you’ve got MyApp1 and, instead of embedding this complex client-side load-balancing code, you just talk to localhost, and this localhost forwards your requests to the various services you are targeting. That’s the kind of view you have and, of course, Consul is driving all of this because it has a clear view of the whole topology of the infrastructure.
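In Consul terms, the “just talk to localhost” pattern is declared in the service registration: the sidecar is registered alongside the service, and each upstream dependency is bound to a local port. A hedged sketch (service names and ports are made up) might look like this:

```json
{
  "service": {
    "name": "my-app-1",
    "port": 8080,
    "connect": {
      "sidecar_service": {
        "proxy": {
          "upstreams": [
            {
              "destination_name": "my-app-2",
              "local_bind_port": 9191
            }
          ]
        }
      }
    }
  }
}
```

With this in place, MyApp1 simply calls `localhost:9191`, and the sidecar forwards the request over mutual TLS to a healthy instance of `my-app-2` chosen from the catalog.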
Furthermore, it adds some benefits for us, such as handling TLS for free, with certificate rotation and so on. It also has an authorization mechanism. That’s quite interesting because in current architectures, even those relying on microservices, it’s sometimes very, very hard to know the impact of a system being faulty. I mean that a microservice can be called by another microservice that is itself called by a real app. Knowing what the possible impacts are whenever your microservice stops working is very challenging in modern architectures: being able to predict what will break when one service is down. Having a way to authorize systems allows you to build a graph of calling instances. Of course, lots of people try to get this with Zipkin or tools from the Finagle family or whatever, but it’s very, very hard to do and to be sure that you didn’t forget anything in your infrastructure.
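Consul Connect expresses this authorization as “intentions”: explicit allow/deny rules between source and destination services, which together form exactly the call graph described above. As an illustration only (the service names are made up), the CLI form looks like this; it requires a running Consul cluster:

```
# Allow my-app-1 to call my-app-2, and explicitly deny my-app-2 -> my-db
consul intention create -allow my-app-1 my-app-2
consul intention create -deny my-app-2 my-db
```

Sidecar proxies enforce these rules on every connection, so the graph is not just documentation: a call that isn’t in the graph simply doesn’t go through.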
I’m now introducing HAProxy Connect. What is HAProxy Connect? It’s an open-source implementation of Consul Connect with HAProxy. The reference implementation of Consul Connect uses Envoy, but we had a few issues with Envoy, deploying it on all our systems, for instance; also, having the ability to talk directly to people from HAProxy Technologies is a big advantage for us. Furthermore, our network engineers are very familiar with HAProxy, less so with Envoy.
We also apply a convergence mechanism: in order to be sure that HAProxy really understands what we mean, we simply send all of our commands to the Data Plane API, compare the result to what we expect and, if it doesn’t match, we reload HAProxy. This gives us the ability to be sure that HAProxy has the exact state we want applied: even if something changes on the wire, everything is going to work as expected.
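The heart of such a convergence loop is a comparison between the desired state and the state actually running. Here is a minimal, hypothetical sketch in Go (the function names and the idea of comparing backend server lists are assumptions for illustration; the real project talks to the Data Plane API):

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// desiredMatchesRunning reports whether the backend servers HAProxy is
// actually running match the set we pushed through the Data Plane API.
// Order does not matter, so both sides are compared as sorted copies.
func desiredMatchesRunning(desired, running []string) bool {
	d := append([]string(nil), desired...)
	r := append([]string(nil), running...)
	sort.Strings(d)
	sort.Strings(r)
	return reflect.DeepEqual(d, r)
}

func main() {
	desired := []string{"10.0.0.1:8080", "10.0.0.2:8080"}
	running := []string{"10.0.0.2:8080"} // one server is missing
	if !desiredMatchesRunning(desired, running) {
		fmt.Println("state diverged: reloading HAProxy")
		// reloadHAProxy() // hypothetical: re-push full config and reload
	}
}
```

When the comparison fails, the safe fallback is the one described in the talk: regenerate the full configuration and reload, so HAProxy is guaranteed to converge to the intended state.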
So, we have been testing this new piece of software for around two months in pre-production, and we have already been using it in production at Criteo for around one month. It’s working quite well. On performance, I unfortunately had no time to do thorough benchmarks, but basically it performs better than Envoy with Consul Connect, which is already quite nice: we see around a 20% improvement in both CPU time and throughput. However, it does add some extra latency compared to the plain HTTP of our client-side library approach.
Everything is open source. In order to build it, we also had to create an SPOE library for Go, because everything is written in Go. Of course, all of this is open source, and we are now talking with HAProxy Technologies about maintaining those projects over the long term. Thank you to Aestek, who is the main developer of the system. Of course, we would be glad for anyone wishing to use this kind of service mesh technology to contribute to the project. Thank you very much!