In this presentation, Neal Shrader describes how DigitalOcean leverages HAProxy to power several key components within its infrastructure. First, HAProxy is used as a component of DigitalOcean’s Load Balancer-as-a-Service product. Second, it’s used as a frontend to its Regional Network Service, which is responsible for orchestrating changes within their software-defined network. Last, HAProxy is used for load balancing traffic to their edge gateways and public websites. HAProxy provides the redundancy and performance they need to satisfy both their internal infrastructure needs as well as the needs of their customers.
Hi everyone. Thank you so much for allowing me to talk today, it’s a pleasure to get to speak to you. So, I’m going to be talking about load balancers at DigitalOcean; basically how we utilize HAProxy not only in our internal services but also externally and through our product offerings as well.
If you’re not familiar with DigitalOcean, we’re a New York-based cloud hosting provider that was founded in 2011. We had a rather modest beginning, essentially being a VPS provider. We primarily offer droplets, which are our compute instances. But our primary focus is the user experience and community. I hope that some of you might have run across our tutorials in the past. Ensuring that the developer experience is as simple as possible and just enabling individual developers to succeed is our passion and kind of fundamental to everything that we do.
We use HAProxy extensively internally, both for internal services and in our product offerings.
Currently, we operate in twelve data centers across the globe and we’re managing in excess of 1,000,000 active droplets across tens of thousands of hypervisors today. We’ve grown a little bit from our beginning as just a single offering of droplets. We now have an S3-compatible object storage offering called Spaces, a DBaaS offering, a managed Kubernetes platform called DOKS and, as was mentioned before, somewhat expanded networking capabilities with Cloud Firewall and nascent VPC support. We use HAProxy extensively internally, both for internal services and in our product offerings, and I’m going to talk a bit about each of those today.
The regional network service’s primary purpose is to orchestrate this full mesh of tunnels between members.
So, I’ll start with the regional network service. The regional network service is essentially the engine of software-defined networking at DigitalOcean. Its primary responsibility is orchestrating the overlay network that ensures tenant isolation on our private network. Now, an overlay is simply the process of encapsulating traffic at the source, carrying it over a very simple IP fabric, and decapsulating it on the other side to be presented to the VM.
Every single state change that happens inside of a user’s network becomes a one-to-many action. So, if a user creates a droplet, live migrates it from one hypervisor to another, or destroys it, that change needs to be propagated out everywhere. The regional network service’s primary purpose is to orchestrate this full mesh of tunnels between members. We call it a virtual chassis, of which there’s only one today, but soon there will be many available.
This is the general shape of the architecture itself.
We’re fronted by two load balancers that front a service called north-d, which is the primary integration point for our software-defined networks. So, when our compute services want to add a member to this chassis, remove a member from the chassis, or create or destroy a chassis, that request comes inbound to this service and is persisted in a local MySQL cluster. From there an asynchronous process kicks in, an internal API called Convergence, which has the responsibility of making it so, as it were. It generates these ApplyVirtualChassis messages, which are then placed on a message queue, in our case RabbitMQ; and there are southbound workers, which we call south-d, that are responsible for invoking RPCs against a daemon local to the hypervisor called hv-flow-d.
On each hypervisor, we utilize Open vSwitch to express our data path, our pipeline. From there we translate that message into OpenFlow and then persist it in the data path and then from there we’re able to encapsulate and decapsulate accordingly. There are also some ancillary services that are responsible for projecting the state of the chassis towards our user-facing services. So, for instance, we don’t necessarily have to round trip to Bangalore to be able to say, “Okay, what’s the state of this chassis?”
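To make that concrete, here is a hedged sketch of what an encapsulate/decapsulate pair can look like when expressed as OpenFlow rules pushed into Open vSwitch with `ovs-ofctl`. The bridge name, port numbers, and addresses are invented for illustration and are not DigitalOcean’s actual pipeline:

```
# Hypothetical pipeline: port 1 is the VM's tap interface, port 10 is a
# VXLAN tunnel port. Outbound VM traffic has the remote hypervisor set
# as the tunnel destination and is sent out for encapsulation.
ovs-ofctl add-flow br0 \
  "table=0,priority=10,in_port=1,actions=set_field:203.0.113.20->tun_dst,output:10"

# Traffic arriving from the tunnel (already decapsulated by the VXLAN
# port) from that same peer is delivered to the local VM.
ovs-ofctl add-flow br0 \
  "table=0,priority=10,in_port=10,tun_src=203.0.113.20,actions=output:1"
```

In a real deployment these flows would be generated programmatically, per tenant, by something like hv-flow-d rather than entered by hand.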
Now the load balancer tier. Essentially this is consistent with any bare metal cluster at DigitalOcean. We have a process called ExaBGP, which is responsible for peering with the upstream top-of-rack switch; it announces the cluster and is actively health checking HAProxy. HAProxy itself is balancing between the north-d instances with a relatively straightforward configuration: least connections, TCP mode. It’s running 1.8 at the moment. If HAProxy does wedge or is unavailable for any reason, the host route is withdrawn and we fail over to the secondary load balancer, and back again once it recovers. This general pattern is consistent across many services that are deployed internally on bare metal. Spaces, for instance, follows a very similar pattern.
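A minimal sketch of what that HAProxy side could look like; the addresses, ports, and server names here are invented for illustration, not DigitalOcean’s actual configuration:

```
# Illustrative only: TCP mode, least-connections balancing across
# north-d instances, with active health checks on each backend.
listen north-d
    bind 10.0.0.100:8443
    mode tcp
    balance leastconn
    server northd-01 10.0.1.11:8443 check inter 2s fall 3 rise 2
    server northd-02 10.0.1.12:8443 check inter 2s fall 3 rise 2
    server northd-03 10.0.1.13:8443 check inter 2s fall 3 rise 2
```

ExaBGP sits alongside this, probing HAProxy; while the probe passes it keeps announcing the service route to the top-of-rack switch, and when HAProxy wedges the route is withdrawn so traffic shifts to the other load balancer.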
I’ll focus a bit on the edge of our network. The edge gateway is essentially our internal API gateway; it simplifies publicly exposing any internal service through our cloud control panel, which is the user interface that you use to create, destroy, and manage your droplets, and through the public API itself. It gives us a really clear path to break apart some of the monoliths internally, distribute our architecture, and really shift into a fully microservice operational pattern. It has simplified integration, letting teams develop however they wish, though we’ve standardized on Go internally at DigitalOcean. It also generalizes concerns such as feature flipping external features, doing internal rate limiting, and also authorization concerns: passing down user IDs.
This is what the general architecture looks like for an incoming request into our control plane.
DigitalOcean.com itself has its DNS served from Cloudflare; we utilize their services for DDoS mitigation and so on for our control plane. Traffic then comes inbound to NYC2, which was our second data center built in New York. From there, that HAProxy cluster balances to the edge gateways, which reside in our New York City data centers and route back to all the microservices that are responsible for orchestrating droplets, Floating IPs, and so on.
The public load balancers aren’t at a crazy scale; they’re servicing about 600 requests per second, all of it control plane traffic. They’re also responsible for routing requests towards the edge gateway along with the other static sites that we manage. So, all of our tutorials, our community sites, www, and Hacktoberfest are served through these. The one thing that we have noticed with the edge gateway specifically is that as we increase the number of processes there, if we aren’t cognizant of how health checks are implemented, they can place a bit of a burden on downstream services, and it’s something that we’ve had to be careful of in the past.
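One reason for that burden is that in HAProxy 1.8’s multi-process mode each process runs its own health checks, so check traffic multiplies with `nbproc` times the number of servers. A hedged example of dialing checks back; the backend name, addresses, and intervals are illustrative, not our production values:

```
# Illustrative tuning: longer base intervals, with faster probing only
# during state transitions, so N processes x M servers of checks don't
# hammer the downstream services.
backend edge-gateway
    server edge-01 10.0.2.11:8080 check inter 10s fastinter 2s downinter 5s rise 2 fall 3
    server edge-02 10.0.2.12:8080 check inter 10s fastinter 2s downinter 5s rise 2 fall 3
```

`fastinter` only applies while a server is transitioning between states, so the steady-state check rate stays at the longer `inter` value.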
LBaaS is our load-balancer-as-a-service product offering. We initially offered this in early 2017 as a way to find product-market fit. When Floating IP was released we noticed that users were essentially rolling their own highly available setups, and we wanted to wrap that up in such a way that it was simple, without burdening the user with the details. The initial offering of Load Balancer is built from DigitalOcean primitives, namely droplets and Floating IP, with backend droplets specified either by name or by tag.
We also have integration with Let’s Encrypt that really simplifies certificate management, allowing certificates to be fully managed, auto-renewed, and rolled automatically. Simplicity is essentially the name of the game here. It’s built on HAProxy itself. The image is HAProxy 1.8 and the product surfaces all the configuration that you would expect: a choice of algorithms between round robin or least connections, sticky sessions, Proxy Protocol, and the option to terminate TLS at the load balancer itself or just pass through to the backends directly.
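In HAProxy terms, the options the product exposes map onto familiar directives. A rough sketch of a terminating configuration; the certificate path, addresses, and server names are invented for illustration:

```
frontend lbaas-https
    bind *:443 ssl crt /etc/haproxy/certs/example.com.pem   # TLS termination
    mode http
    default_backend droplets

backend droplets
    mode http
    balance leastconn                     # or: balance roundrobin
    cookie SRV insert indirect nocache    # sticky sessions
    server web-01 10.0.3.11:80 check cookie web-01 send-proxy  # Proxy Protocol
    server web-02 10.0.3.12:80 check cookie web-02 send-proxy
```

Pass-through mode would instead use `mode tcp` on the frontend with no `ssl` on the bind line, leaving the certificate on the backend droplets.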
As I was saying, this is the current offering. When a user creates a load balancer through our public API, it re-invokes the public API to create a primary instance and a backup instance, both running our HA agent, and a Floating IP is pointed at the primary instance once that configuration is pushed out. From there, we have a regional health checker, which polls the Floating IP continuously. If it notices that the primary instance is unavailable, it fails over to the backup, destroys the primary instance, and then stages another droplet to serve as the new backup.
Now this served 80% of the use case and gave us a sense of the appetite for this in the market, but it does have its challenges, because while it’s simple, it gives us limited means to scale the current offering. Vertical scaling can only really go so far, especially given how we queue into a VM itself. Having to traverse two network stacks is also suboptimal. The idle compute instance, as well: having a backup droplet just sitting there idle isn’t the best utilization of our infrastructure, so we want to find a way to effectively eliminate that. And for each droplet that is created we allocate a public IPv4 address which, as we all know, is scarce and expensive, and is something that we want to rein back. The product is seeing increased usage due to integration with further upstream products on our platform.
So, the next evolution of Load Balancer.
I won’t belabor this architecture because we’ve covered what a software load balancer looks like in numerous sessions; essentially it’s going to look very typical. Equal-Cost Multi-Pathing into multiple layer 4 load balancers, spreading traffic evenly across HAProxy instances and from there into our backends, gives us the redundancy necessary, allows us to scale horizontally, and gives us the performance we need to satisfy customers, users, and some of these upstream products that are demanding a higher scale of usage.
How we plan on implementing this is via Kubernetes itself. Instead of kube-proxy, we’re going to drop in kube-router, which will be responsible for peering with upstream routers and handling the advertisement of the VIP itself. IPVS will function as our layer 4 balancer running Maglev to ensure consistent hashing across the load balancer pods in the cluster itself. The pods are just going to be running a similar image that we have in the existing product: HAProxy on the backend.
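For a concrete sense of the layer 4 piece: IPVS has shipped a Maglev-hashing scheduler (`mh`) since Linux 4.18. Setting one up by hand might look like the following; in our design kube-router would manage this, and the VIP and pod addresses here are invented:

```
# Illustrative only: create an IPVS virtual service on the VIP using the
# Maglev hash scheduler, including the source port in the hash and
# falling back to another real server if one goes down.
ipvsadm -A -t 203.0.113.10:443 -s mh --sched-flags mh-port,mh-fallback

# Add the HAProxy pods as real servers behind the VIP (NAT mode).
ipvsadm -a -t 203.0.113.10:443 -r 10.244.1.5:443 -m
ipvsadm -a -t 203.0.113.10:443 -r 10.244.2.7:443 -m
```

The point of Maglev hashing is that every layer 4 node computes the same consistent mapping, so ECMP can spray flows across all of them and a given connection still lands on the same HAProxy pod.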
Now, one of the complicating factors here is the integration into our existing software-defined network. So, in addition, we’re going to need to leverage Open vSwitch on Kubernetes as well and land a daemon that we’re going to call connflow-d, which will watch the placement of these pods and ingest the ApplyVirtualChassis messages from our orchestration software. From there, we’ll be able to ensure connectivity into our backend droplets and orchestrate as we expect to.
That was a very quick tour of some of the uses of HAProxy internal to DigitalOcean, and everything I wanted to speak to you about today. Thank you.