Hyperscaling Self-Service Infrastructure: Transitioning from Ticketing to Load Balancing-as-a-Service at Criteo
In this presentation, William Dauchy and Pierre Cheynier describe building a self-service infrastructure platform that supports 50,000 servers at Criteo. HAProxy Enterprise is their preferred layer 7 load-balancing technology because it provides robust health checks, log sampling, and TLS offloading. Its ability to run on commodity hardware is cost effective and allows them to scale horizontally to accommodate any sized traffic. Their platform allows load balancers to be created on-demand, giving their teams convenient Load Balancing-as-a-Service.
Hi everyone. I hope that you are awake! I’m William and this is Pierre and we are both from the Network Load Balancer team at Criteo. Today, we are going to talk about transitioning from ticketing to a Load Balancer-as-a-Service. This subject, interestingly, you will find some connections with the presentation from this morning from Booking or GitHub.
I won’t enter too much into some gory details because this morning there was an awesome talk about that from Booking and GitHub, but we’ll try to focus more on what’s specific to Criteo. The first thing we did as a team at that time was to define APIs for our end users. The way we introduced these APIs was by writing our own DSL and API extension for our existing execution environment. The goal here was to have the exact same primitives everywhere, again so that people can see the network as a flattened environment completely agnostic from the execution environment. This, for sure, was tightly linked to the Consul registration.
I took a small example here. So, you have someone defining an app with a port and with a network service data set, which is in fact a matter of adding metadata. The first part here for these folks are apparently to create a service named superservice and under the domain criteo.net. Their visibility, this is a public service and they have a strategy regarding the DNS entries we would generate. Then, they can specify stuff related to HTTP. For example, they want a redirection between HTTP and HTTPS to be enforced.
Finally, at the end they can introduce routing features through these semantics, let’s say. For example, here you see that they almost do a sort of canary, meaning that for 20% of the requests on foo and bar they redirect the traffic to the foobar app, which sits within our DC. Same, they are able to offload stuff such as the security policies to an existing app. So, it fits quite well with the microservice approach where you have existing services and you can reuse them when you deploy a new app.
Here I have something to mention that is very interesting. We are really focused on making only features expressed through intent and never to mention technology. The goal at the end for our team would be to be able to completely swap technologies when we want. The consequence of that, also, is that the ownership of the network app config has completely moved from our team, the network teams, to the end-user. At the end, it’s the developer that defines self service and their requirements.
So, the idea here was also to leverage Consul and to make it a state reference. At Crito we try to contribute publicly when possible and this is why we started this initiative and these pull requests on Consul to create a dedicated endpoint. The goal of this endpoint was, for an existing app, to be able to retrieve an aggregated view of the health of an application.
At Criteo we did that in two steps. The first one was to create what we call internally a control plane, which basically takes care of consuming events and producing events on another end. The app in production is available through a WebSocket API, which is described with OpenAPI and so on. This component, there is one instance per DC to ensure the consistency when we try to resolve resources.
It has been historically written in Python, but it’s not really a matter of technology because we can swap. There is an API for that. These components run themselves on top of our platform-as-a-service infrastructure in a plain container, Linux container. Here, for example, you see that if you implement the
get_device and the
pre_provision, you enter into a sort of provisioning workflow and at the end you are able to introduce a new technology.
Also, as mentioned in other presentations as well, we provide metrics, for sure, related to everything you can think of related to our networks. We put this example. I think it’s the biggest application at Criteo: 4,000,000 QPS at the time. That’s a pretty huge one.The developer can subscribe to those metrics in order to trigger alerting, for example.
That’s where we introduced the the usage of the HAProxy configuration named tarpit in order to make sure that the developer is aware of, yeah, maybe in this kind of case where you have thousands of instances, maybe this is something you don’t want to do. If you want to benchmark your applications, please prefer the east-west communication I was mentioning at the beginning of the presentation.
As you probably understood, at Criteo we are talking quite often about big services. To give you an idea, we have 50k servers across the world, which is quite a lot. When you need to push a change or a new version, anything you can think of, we have a system called choregraphie, which helps you to somehow control what you do in the production. You select part of the infrastructure; In our case, for the load balancer—by the way it’s between 50 and 100 load balancers—so you select in our case 10% of the infrastructure and when the change is validated you go on the next batch, etc.
This is quite convenient because I had some fun looking at, while doing this presentation, how many bumps we did for HAProxy or any other software, such as the kernel. We did almost 600 deployments over the past few years, which is quite huge. I was even surprised by this number. Why do we do that much deployment? Because our team is used to looking at the Git repository and once we see an interesting fix, which probably can be triggered on our site, we do a backport and we deploy it in a few hours. That’s something which is quite enjoyable because we can do lots of deployments every week.
maxconn. Willy told you about it this morning already, but let’s go back on what happened on our side. Basically, we were looking at the number of connections per process and we said, “Oh, now we should start to increase it because we are reaching a limit. So let’s double the number and deploy it”.
Everything seemed fine because HAProxy was not doing anything. We are simply doing very simple checks and after a while everything started to become strange. So, a few hours after you are ending up with a worldwide incident. As you probably already know, when you change the max connection, HAProxy is doing its own stuff in order to adapt the number of file descriptors you put on a given process, but if it fails to increase that number it rolls back the value to another value, which could be lower than the previous one you’d already set. That’s why in that kind of situation we don’t like HAProxy to take this kind of decision. That’s why we contributed recently, in order to introduce the
strict-limits parameter and make sure that HAProxy is failing completely if it fails to increase this parameter, those limits.
What is interesting for us here is it might sound like very simple metrics, but at the end we are quite proud of it because it allows us to trigger, I would say, 99% of our issues. I would say, also, that most of our bug reports we do for HAProxy are based, at the beginning, on those metrics. We bump a new version, we trigger something weird, and this is based on those metrics.
Okay, so now that our user seems to be happy and now that we also are happy to deploy with these safe mechanisms, we can start on our side to move everything without, hopefully, that the user notices it. Here, I will only mention, how does it integrate in our workflow? We don’t really focus on what’s the technology behind. Let’s start with, where do we come from? Historically, as mentioned by William at the beginning of the presentation, we came from very specialized infrastructure within specific racks in our network.
This is what I call here hyper-converged load balancing. Basically, the idea in this set up, historically when we started this initiative, was for the application to register itself, asking for here a layer 7 service to the network, wait for the control plane to locate the correct provisioner, and then ask this provisioner to configure the load balancing technology. Then you have a client, which is an end user, which is happy, can start making its DNS requests and the traffic flows this way.
But now we have two different stacks. One of them can be completely replaced by commodity hardware and can be moved within our data center. It’s only a matter of having sufficient machines to handle the load because at layer 7 level, as you might know, there is an issue…I mean there is a lot of load that is consumed by handling TLS.
Now that we have this, we can go even further. We can redo the exact same thing on the layer 3 level, right? For example, the layer 4 load balancing can ask for a layer 3 service to the network and exactly the same: The control plane locates the proper switches, you know on top of racks, configures the peering session, and then you have a BGP session which is established. Also, you can imagine handling DDoS specifics or whatever you may need or think about. It’s only a matter of abstraction here, right?
Just to mention here that this is very easy to do since HAProxy 2.0 thanks to the
pool-purge-delay option, because it allows you to maintain a persistent connection pool between your Edge PoP and the origin server.
Also, we had one concern, which was we probably want to do that progressively and gradually. For example, by deploying ourself in our on-premises in some location, but also to rely maybe on a cloud provider and to boot a VM and install HAProxy on it, but also maybe we could want to offload that on a cloud provider. We don’t want to be locked and we want, again, to be agile on that.
One thing we would like to improve in the following HAProxy version is improving the metrics part. I’m especially talking about the counters. As you can see here in this graph, you can see some holes. Typically, in this case this was most likely some reloads, which were happening because of events from our developers or machines. So, this is something which is kind of something we would like to improve because sometimes our users are coming back to us and saying, “Oh, is there something wrong?” and of course everything is perfectly fine. That’s why this is something we would like to improve.
bindlines. It can also be worse if you are using the CPU pinning on your configuration. Why? Because you probably know that already, but each new
bindline in your HAProxy config with the TLS certificates loading will load a new object of your certificates. So that’s something which will be, as we saw in the other presentation, that this will be fixed in the 2.1. We are very happy about it.
What I wanted to highlight here is even if we do fix this issue, this is something which is very important for us because as you can see on the last graph with the memory, when we trigger lots of events it can be very challenging because suddenly the memory can go crazy. It’s very important for us in the future that we can make sure that each certificate is loaded just one single time, just because when you reload lots of times your process, it will create a new one with the new set of your certificates.