Processing Billions of Web Requests Per Day: A Journey from Hardware Load Balancers to HAProxy at DoubleVerify
In this presentation, Oren Alexandroni and Wally Barnes III bring to life their journey moving from hardware F5 load balancers to software HAProxy load balancers. Before making the switch, their hardware load balancers were costly and could not handle their traffic pattern. The new solution had to be scalable, provide low latency, be cost efficient, offer high availability and fault tolerance, and adapt to their CI/CD processes. They found that HAProxy exceeded their expectations and gave them improved observability over their services, optimized utilization of their CPU resources, and helped them to process billions of requests per day.
Morning! Thank you all for joining us so early this morning. My name is Oren and this is Wally. We’re going to spend the next couple of minutes talking a little bit about DoubleVerify.
Right, so I’m going to hand it to Wally and he’s going to talk about how we took that and moved that to production.
Good morning! I know it’s early, we’re all sleeping off the effects of doing tech reviews of HAProxy 2.0 documents. So, you’re going to see what we use here in terms of the elements, config elements of HAProxy. They’re right along the base elements that pretty much everyone knows. While what we’ve done is notable, what we use is probably not going to seem high-tech to you.
To help keep you riveted to what’s going on up here, I’m going to do two things that you probably shouldn’t do at a tech conference. The first thing I’m gonna do is I’m going to poke fun at my manager, who happens to be here on stage with me. This might result in my expense reports not being approved, I’m just saying. The second thing is I’m going to poke fun at the company that’s sponsoring this event. I don’t know what’s going to happen, they might cut my mic, there might be a new bug introduced in production called the Wally bug that somehow affects everything. We’ll see. You with me? You ready to watch me burn? Okay, alright, let’s do it.
Let’s take a quick journey back about three years. I’m interviewing with this manager at DoubleVerify, he sees on my resume “HAProxy experience” and he mentions, “Hey Wally, we’re working on an HAProxy project,” and with an ego born of two-plus decades of systems engineering and administration I say, “Yeah, I’ve used HAProxy to put some web servers behind at Bloomberg, Morgan Stanley, you know,” and I look across the table and he has this peculiar smile on his face. If you want an example, it looks like that, that was him.
So let’s fast forward to my first day and Oren says, “Hey remember that HAProxy project I mentioned? We want you to work on it, help get it established, get it rolled out and tested.”
“No problem! What web application are we gonna put behind it?”
“Well, we’re going to replace F5, so all of them; but don’t worry, we’re going to start with just our main application. It’s about a couple hundred million requests a day, but in about a year we’re looking at it going to about a billion.”
“So you’re going to use HAProxy, the software load balancer, to replace F5, the dedicated device? Okay! We’re on it!”
Having a file-based configuration means we can treat it like code: we can manage it centrally, it can be reviewed, and it’s rather straightforward. We’re going to be adding live, disparate applications. These applications were not built together; one dev team did not build alongside another, so they don’t do anything alike. If we’re going to put them behind the same systems, we need to be able to configure what we need for each one of these applications, and being able to configure at multiple levels (frontends, backends, the defaults, listen sections, everything) comes into play for us.
Having access to metrics is also important. Once again, with black box solutions, if you want this kind of insight, if you can even get it, you wind up paying for some third-party product that’s very expensive and may not even give you what you need. The flexibility we get from encryption comes because we use OpenSSL. What’s in it, we get. OpenSSL is generally pretty robust, so that gives us a lot of options. We also need to be able to grow not only vertically, adding processes and memory, but also out in terms of boxes.
To tie this up we use Ansible and Git. This is, of course, config management and code management, these two tied together. For the last piece, we use two offerings from Akamai. I’m going to talk about them in reverse here. Let’s start with the load balancing property of Akamai. This gives us the logical construct that we call a pool. This is how we group our systems together, and then we have the second property…I’m sorry, so these pools, what we have is a pool per region. We have four main data centers: Singapore, Frankfurt, New York and Santa Clara, California. We have two other smaller ones, one of them is in LA, but they’re dedicated, and each one of them is a pool of HAProxy systems.
Geo-mapping: because this is a DNS-based system, the geo-mapping ensures that when these requests come in, you’re sent to the pool that is geographically as close as possible to where you are.
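To make the idea of "the pool geographically closest to you" concrete, here is a minimal Python sketch. It is not how Akamai decides (the talk does not describe that); it simply picks the nearest of the four main data centers by great-circle distance. The pool names and coordinates are illustrative assumptions.

```python
import math

# Approximate (latitude, longitude) of the four main data centers named in
# the talk. Names and coordinates are illustrative assumptions.
POOLS = {
    "singapore": (1.35, 103.82),
    "frankfurt": (50.11, 8.68),
    "new_york": (40.71, -74.01),
    "santa_clara": (37.35, -121.96),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_pool(client_location, pools=POOLS):
    """Return the name of the pool closest to the client's (lat, lon)."""
    return min(pools, key=lambda name: haversine_km(client_location, pools[name]))
```

A client in Paris, for example, would be handed the Frankfurt pool: `nearest_pool((48.85, 2.35))`. A real DNS-based service typically sees the location of the resolver rather than the end user, so the result is only ever "as close as possible".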
This gives us some keen avenues for troubleshooting and insight. If we have a processor burning hot and we want to know why, we find the process pinned to it, we connect to it with strace, and we can tell exactly what that processor is working on. We know the traffic, we can go in and make changes as necessary. It’s just insight we did not have before.
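The "find the process pinned to it" step could look roughly like the Python sketch below. It is an assumption about tooling, not what the team actually runs: it scans `/proc` on Linux and uses `os.sched_getaffinity` to find processes whose allowed-CPU set is exactly the hot CPU. The function names are hypothetical.

```python
import os


def pids_pinned_to(cpu, affinities):
    """Return the PIDs whose allowed-CPU set is exactly {cpu}.

    `affinities` maps a PID to the set of CPUs it may run on.
    """
    return sorted(pid for pid, cpus in affinities.items() if set(cpus) == {cpu})


def read_affinities():
    """Collect the CPU affinity of every visible process (Linux only)."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            result[int(entry)] = os.sched_getaffinity(int(entry))
        except OSError:
            continue  # the process exited or cannot be inspected
    return result
```

With CPU 3 running hot, `pids_pinned_to(3, read_affinities())` gives the candidate PIDs, and `strace -p <pid>` can then be attached to one of them.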
The second one is multi-socket binding. This is our way of now dividing those processes to do specific work, so certain processes will be bound to do HTTP processing, frontend processing, others will be bound to do HTTPS processing, and then yet another set for backend work. If we look and we see groups of them doing more work than the others and they’re struggling, we know we have to rebalance. If we get to the point where we can’t rebalance, we know it’s time to scale one way or the other.
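The rebalance-or-scale reasoning can be sketched in a few lines of Python. This is only an illustration of the decision described above: the role names, the process numbering and the 80%/50% thresholds are assumptions, not values from the talk.

```python
from statistics import mean


def group_loads(cpu_by_process, groups):
    """Average CPU utilization (0-100) for each role group of processes."""
    return {
        role: mean(cpu_by_process[proc] for proc in procs)
        for role, procs in groups.items()
    }


def advise(loads, hot=80.0, cool=50.0):
    """Suggest an action from per-group load.

    "ok"        : no group is struggling
    "rebalance" : some group is hot while another has spare capacity
    "scale"     : a group is hot and nobody has capacity to give up
    """
    hot_groups = [role for role, load in loads.items() if load >= hot]
    cool_groups = [role for role, load in loads.items() if load <= cool]
    if not hot_groups:
        return "ok"
    return "rebalance" if cool_groups else "scale"
```

For instance, with two processes on HTTP, four on HTTPS and two on backend work, a hot HTTPS group next to a quiet HTTP group yields `"rebalance"`. If every group is busy, the answer becomes `"scale"`.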
I mentioned that we have datacenters across the world…four basic locations, not across the world yet. We’ll get there. Singapore, Frankfurt, Santa Clara, California and New York. These two kind of go together, in that we can increase and decrease our pools relatively quickly. This is not just for capacity reasons, but when we were doing…we started out with the Community edition of 1.8. When we were doing the tests to move to the Enterprise edition, we were able to siphon off groups of these systems and have them working at the same time to do comparison and make sure we had good stability tests. We’re able to do all of these things with our setup.
We can add backend functionality. Now this might confuse you because when I say backend what I mean is our backend applications. So if I have a new web application, I can roll out the config to all of our pools with no customer impact and once we’re ready we turn DNS to Akamai and things work. In fact, we’re usually ready before the developers are. We have our setup good and gone two weeks before they’re even ready to start testing. In fact, we have to remember what we did half the time by the time they’re ready! Of course, with Ansible and Git we can centrally manage our configs, our deployments, all of these.
Distributing functionality across independent systems: What this means is these systems don’t know about each other. There’s a logical construct from Akamai that ties them together, but there’s no way one of the HAProxies knows that there are a hundred other systems like it. So we’re able to spread out our functionality across these systems using this design.
So for almost the first year, we were missing this insight that we get from logging and we had to figure out a way to gain this access. What we wound up doing was employing RabbitMQ, utilizing named pipes, and having a consumer that polled that named pipe constantly, pulled those messages off it and sent them into an exchange in RabbitMQ. From there we have the flexibility to do whatever we want with that. We can do anything.
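A reader that drains a named pipe and hands each log line to a publisher might be sketched like this in Python. The function names are hypothetical, and `publish` is left as a plain callable so the sketch stays free of any RabbitMQ client. In a real setup it would wrap whatever client call sends a message to the exchange. The talk describes polling, while this sketch uses blocking reads, which is a swapped-in simplification.

```python
import os


def forward_lines(stream, publish):
    """Publish every non-empty line from `stream`; return how many were sent."""
    sent = 0
    for line in stream:
        message = line.rstrip("\n")
        if message:
            publish(message)
            sent += 1
    return sent


def consume_pipe(path, publish):
    """Read a named pipe forever, reopening it each time the writer closes it."""
    if not os.path.exists(path):
        os.mkfifo(path)  # create the FIFO if the log writer has not done so
    while True:
        # open() blocks here until some process opens the pipe for writing.
        with open(path, "r") as pipe:
            forward_lines(pipe, publish)
```

Keeping the line handling in `forward_lines` means it can be exercised with any iterable of strings, without a real pipe or broker.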
Right now we have multiple things happening simply from that one exchange. We store it all in MySQL so that there’s just an overall store of all these messages. We also have another consumer…sorry, it’s a producer that sent it to exchange, excuse my terminology…but we have another consumer that takes that same set of messages and it sends a subset of that to Splunk, like we needed anyway, the warnings, the errors, the things we need to alert on, things of that nature; but just by doing that—once again open source to the rescue—we’re able to solve some of our challenges.
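The "store everything, alert on a subset" pattern can be illustrated with a short Python sketch. In RabbitMQ this would more likely be two queues bound to the same exchange, each with its own consumer. Here the two paths are collapsed into one function for brevity. The `severity` field and the set of alert levels are assumptions about the message format, which the talk does not specify.

```python
# Severity names treated as alert-worthy. This is an assumed convention.
ALERT_LEVELS = frozenset({"warning", "err", "crit", "alert", "emerg"})


def fan_out(messages, store_all, send_alert, alert_levels=ALERT_LEVELS):
    """Send every message to the full store and only alert-worthy ones onward.

    `store_all` stands in for the consumer that writes to the overall store
    (MySQL in the talk); `send_alert` stands in for the one feeding Splunk.
    """
    for message in messages:
        store_all(message)
        if message.get("severity") in alert_levels:
            send_alert(message)
```

Passing `list.append` for both callables is enough to see the split: every message lands in the first list, and only warnings and errors land in the second.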
Unified Config doesn’t seem like it would be a challenge right away, until you have enough disparate web applications behind it and you make a change at the wrong level and affect more than one application. We’ve had to be overly diligent about what we add and where we add it. Most things on an application-specific level are added to the backends, but we can’t do that for everything. Certain information is not available at that level, so we’re diligent about when we add it. When a developer group wants to add something to their app, that’s all they see. Why can’t we do that? Well, there are 20 other apps who will tell you why. We’re responsible to make sure that doesn’t happen. Having a unified config has been a challenge in that we’ve often bandied about should we break this out? Should we have different pools? So far, we haven’t had to do it, but it’s caused us to be hyper-vigilant about how we go about things.