Building a Global PoP Network Using HAProxy
In this presentation, Luke Seelenbinder, founder of Stadia Maps, describes the benefits his company received by building a custom Point of Presence (PoP) network using HAProxy. Their previous solution required placing authentication, authorization and quota enforcement into the backend applications. Also, failover was done at the DNS layer, which had a slow recovery time. By building a PoP network based on HAProxy, cross-cutting concerns like authentication, authorization, and rate limiting could be moved to the edge of the network. They increased their service reliability, gained observability over their traffic, and removed single points of failure.
Good afternoon. I hope you all got some coffee and are ready for the last few sessions. I’m Luke Seelenbinder, as he said, cofounder of Stadia Maps. Let’s start with a little bit of background about Stadia Maps.
Our goal in infrastructure is to be able to deliver fast, affordable and very reliable location data services to all of our customers.
We run about 50 servers in eight geographic regions and we serve anywhere between 150 and 200 requests a second globally. So, as a very small team, our goal in infrastructure is to be able to deliver fast, affordable, and very reliable location data services to all of our customers. Our customers are spread around the world, which means we need a solution that lets us spread servers around the world so that all of our customers get good response times.
All of our auth and statistics were baked into NGINX via nginx_auth_request. We had DNS failover: if one region went down, its DNS record would eventually get updated to point to the next region. This comes with a few drawbacks, unfortunately.
If you need to do authentication, if you need to do authorization, if you need to do any kind of quota enforcement, all that lives on a backend server.
Then there’s DNS-only failover. As we all know, DNS TTLs are more of a suggestion than a hard-and-fast rule. So, if you set a 60-second TTL, you might still have clients whose requests fail for a minute, for two minutes, even for an hour after a server goes down, and that’s obviously a very bad solution.
You have to build a point-of-presence edge network at some point.
The client requests and connections were never severed and they have an overall much better experience.
Here’s a diagram of how this works. You have your clients, and they all talk to their nearest server. So, let’s say one client is in Dallas, one client is in Tokyo, and another client is in Amsterdam talking to Frankfurt; and they all go back to your backend servers via some kind of discovery mechanism. As a comparison to the previous failover method we saw, let’s say that our map service in Newark goes down. Instead of a client seeing 500 errors, or not getting a response at all because the server is down, those client requests go directly to a failover server. The client requests and connections are never severed, and clients have an overall much better experience because they never see a 500 error, and you’re able to deal with the problem.
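The failover described above maps naturally onto HAProxy’s health checks and backup servers. This is only a minimal sketch; the backend name, addresses, and health-check path are illustrative, not taken from Stadia Maps’ actual configuration:

```haproxy
# If the primary (Newark) fails its health checks, HAProxy sends new
# requests to the backup (Frankfurt) without the client seeing an error.
backend maps
    option httpchk GET /health
    server newark    203.0.113.10:80 check
    server frankfurt 203.0.113.20:80 check backup
```

Because the PoP holds the client connection, the retry to the backup happens server-side and the client never notices the failure.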
Why would you want to build a PoP network? Originally, this talk was going to be more of a walkthrough of how we built it with HAProxy and a few other tools, but because I have a much shorter time period, I’m just going to try to convince you why you need an edge network. Then in a few weeks, a blog post will be on the Stadia Maps website, and possibly also the HAProxy website, with more of the technical details.
A PoP network gives you the opportunity to prioritize user requests at the edge.
Also, a PoP network gives you the opportunity to prioritize user requests at the edge. Let’s say you have a commodity version of your product and an enterprise version. With an edge network, you can prioritize your enterprise traffic, which gives those customers much better performance, and you can meet your SLAs for your enterprise customers without too badly affecting your commodity traffic.
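One way to express this kind of prioritization in HAProxy is with priority classes, where lower values are dequeued first. This is a hedged sketch, not the actual Stadia Maps configuration; the ACL, the key file, and the `api_key` parameter are all hypothetical:

```haproxy
# Classify requests by plan and schedule enterprise traffic ahead of
# commodity traffic in the backend queue (lower class = served first).
frontend edge
    bind :80
    acl is_enterprise url_param(api_key) -f /etc/haproxy/enterprise_keys.lst
    http-request set-priority-class int(-10) if is_enterprise
    http-request set-priority-class int(10)  if !is_enterprise
    default_backend maps
```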
We also have the opportunity, as we’ve seen in the last couple of talks, to block or deprioritize bad traffic, or perhaps over-quota traffic. Each of our plans at Stadia Maps has a quota, and when people go over the quota we don’t block them, because then their users wouldn’t get maps; but we can implement rate limiting and, perhaps, make those requests a little bit slower so that we maintain quality for our other customers without completely cutting off the customer that went over quota.
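A stick table at the edge can track per-key request rates, so over-quota customers are deprioritized rather than denied. Again, this is an illustrative sketch under assumed names (the `api_key` parameter and the 1000 req/min threshold are invented for the example):

```haproxy
frontend edge
    bind :80
    # Track per-API-key request rate over a one-minute window.
    stick-table type string size 100k expire 10m store http_req_rate(1m)
    http-request track-sc0 url_param(api_key)
    acl over_quota sc_http_req_rate(0) gt 1000
    # Don't cut them off; just serve their requests after everyone else's.
    http-request set-priority-class int(10) if over_quota
    default_backend maps
```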
Often, we ask the question: Is this server up? With HAProxy it’s much, much easier to just look at the Stats page.
Also, with a global network, if you have a region that’s experiencing traffic loss, or, as sometimes happens to us, a cache server that’s performing very badly, you want the ability to reroute traffic from, say, London to Frankfurt to maintain quality. Having an edge network allows you to do that, because you can dynamically say, “Just don’t send any traffic to London,” and the client requests are automatically pushed to where they need to be.
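In HAProxy, that kind of dynamic rerouting can be done without a reload through the Runtime API (commands sent to the stats socket, for example via `socat`). The backend and server names here are hypothetical:

```
# Stop sending new traffic to the London server; existing requests finish.
set server maps/london state drain

# Once the problem is fixed, bring it back into rotation.
set server maps/london state ready
```

With the London server in drain, the load-balancing algorithm automatically shifts its share of the traffic to the remaining servers, such as Frankfurt.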
You have one source of truth for your service state. Often, we ask the question: Is this server up? When you have just direct connections to the backend it is sometimes difficult to answer that question, but with HAProxy it’s much, much easier to just look at the Stats page as we all like to look at it and see what servers are actually up on the backend.
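Exposing that Stats page takes only a few lines of configuration. The port and credentials below are placeholders, not real values:

```haproxy
# Minimal built-in Stats page: one place to see which servers are up.
listen stats
    bind :8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats auth admin:changeme
```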
It also gives you one centralized place to enforce your authentication and access control, and in our case we do a lot of quota enforcement. To improve your client experience, it’s also a great place to gather performance metrics, because you’re right at the edge: you see exactly what your client sees in regard to your backend services.
You add hitless reloads so you can change the configuration as much as you want.
We changed our authentication from a method that sometimes took up to a second because of network latency to an instant process; every single one of our requests is authenticated instantly. You add hitless reloads so you can change the configuration as much as you want, and as long as you don’t have to deal with too many long-lived TCP connections, you have an uninterrupted experience for the client as you refresh your HAProxy configuration.
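For those hitless (seamless) reloads, HAProxy 1.8+ can hand its listening sockets to the newly started process over the stats socket, so not even in-flight connection attempts are dropped. A sketch of the relevant global setting (socket path is illustrative):

```haproxy
global
    # Expose listener file descriptors so a reloading process can
    # inherit them instead of re-binding the ports.
    stats socket /var/run/haproxy.sock mode 600 level admin expose-fd listeners
```

The new process is then started with `-x /var/run/haproxy.sock` to fetch the sockets and `-sf <old pid>` to let the old workers finish their connections and exit.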
Then, for us, we use DNS SRV records for our service discovery. I looked at a lot of different options, and HAProxy has some of the best integrated support for DNS SRV records; you can control everything from how often records are refreshed to how long to keep them if you can’t actually look up the service in the backend.
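SRV-based discovery in HAProxy combines a `resolvers` section with `server-template`. This is only a sketch; the nameserver address, SRV record name, and timings are assumptions for illustration:

```haproxy
resolvers internal
    nameserver dns1 10.0.0.2:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    hold valid    10s   # how long a resolved record stays valid
    hold obsolete 30s   # how long to keep a server whose record vanished

backend maps
    # Provision up to 10 server slots, filled from the SRV record;
    # SRV answers supply both the target host and the port.
    server-template maps 10 _maps._tcp.internal.example.com resolvers internal check
```

The `hold` directives are what give you control over how long servers are kept around when the lookup temporarily fails.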
I think one of the biggest differences is the community that HAProxy offers. When we had just pushed out our solution using HAProxy, our PoP network, we ran into an issue when we enabled HTTP/2. I didn’t know what was going on, so I found the mailing list and sent a bug report describing my problem. Within just a few hours I got a response from Willy, and I think from a few other people, saying, “We found the problem. We’re going to fix it.” As an open source user of a product that I had found only a few months ago, getting a response like that is something I had not seen in the open source community. So, keep that up.
It’s also incredibly cost effective. We deployed our entire edge network with eight locations for $80 a month, and this handles all of our traffic and could handle much more. Our error rates during normal operation dropped to practically nothing. I don’t have good statistics from before we deployed the network, unfortunately, because we didn’t have a good way to gather them. When we upgraded from HAProxy 1.9 to HAProxy 2.0…so for those of you who haven’t upgraded yet, there’s something very cool in it…we added layer 7 retries and, as you can see, this is an error graph over, actually, I think about a week.
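Layer 7 retries are enabled with the `retry-on` directive introduced in HAProxy 2.0. A minimal sketch, with invented backend and server names, of retrying failed requests on another server:

```haproxy
backend maps
    retries 3
    # Retry when the connection fails, the server returns nothing,
    # times out, or answers 503.
    retry-on conn-failure empty-response response-timeout 503
    # Allow the retry to be redispatched to a different server.
    option redispatch 1
    server nyc1 203.0.113.10:80 check
    server nyc2 203.0.113.11:80 check
```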