In this blog post, learn to use HAProxy, Keepalived, Terraform and Ansible to set up highly-available load balancing in AWS.
In the third part of this series, we are again tackling how to implement a highly available architecture in AWS. In the first article, HAProxy on AWS: Best Practices Part 1, you learned how to set up redundant HAProxy load balancers by placing them behind Amazon Elastic Load Balancing (ELB) in order to safeguard against one of the load balancers failing. In HAProxy on AWS: Best Practices Part 2, you learned that high availability can be achieved without ELB by having both HAProxy load balancers monitor one another using Heartbeat.
As in the previous article, the design that you’ll see here consists of two active-active HAProxy Enterprise load balancers running on EC2 instances, relaying traffic to three backend web servers. The HAProxy Enterprise AMI is a high-performance, fully tuned image, tailor-made for running HAProxy. The currently supported distributions are Ubuntu and RHEL. Either way, you’ll be using an image that’s guaranteed to get you started on the right foot.
As you’ve seen, you can leverage the AWS Command Line Interface (CLI) to dynamically reassign an Elastic IP address (EIP) to a load balancer. That means that you don’t need ELB in front of your two HAProxy instances in order to achieve automatic failover. Last time, we demonstrated how to use Heartbeat to invoke the CLI in the event of a load balancer failure. Now, we’re going to show how to configure the Keepalived service and the Virtual Router Redundancy Protocol (VRRP) to create a fault-tolerant setup.
High Availability Architecture
We often prefer Keepalived when designing for high availability, due to its proven stability and wide use. It provides a way to check on the health of a machine and trigger actions when a failure occurs. VRRP is a protocol in which a group of hosts elects a master that owns a shared virtual IP address, so that the address moves to a backup host automatically when the master fails.
The only difference from a typical configuration is that we cannot use multicast on Amazon EC2, so we will use unicast peer definitions. It’s possible to create a multicast overlay with n2n. However, a unicast configuration is much simpler to manage and requires fewer moving parts.
Here is an overview of how traffic will be received by one of the two load balancers and then forwarded to a backend server:
When either of the HAProxy nodes fails, Keepalived will quickly detect the condition and invoke our helper scripts. They, in turn, will utilize the AWS CLI to migrate the failed load balancer’s EIP to a secondary private IP address on the remaining, functional load balancer. That way, traffic can continue to flow using the same public IP address.
The remaining load balancer will pick up the slack until the other node recovers. Typically, Keepalived detects a node failure and completes a full EIP reassociation in less than five seconds. In our example setup, we have Keepalived configured to detect a range of service failures including OpenSSH failure, DHCP client failure, and HAProxy failure. It will also notice split-brain situations in which the default gateway is unreachable.
Before you’re able to utilize the CLI calls for managing EIPs, you must grant permissions by creating Identity and Access Management (IAM) policies in AWS. We will cover that in the next section.
Creating Infrastructure with Terraform
Terraform replaces point-and-click creation of cloud resources with an infrastructure-as-code approach. It works with many different cloud providers, initializing complex infrastructure using a declarative configuration language. With Terraform, you get a repeatable and clean stack that consists of only the defined resources and their dependencies.
Download the latest version of Terraform, such as:
Before using Terraform, you must set up a user account in AWS that has permissions to create infrastructure. Also, since you’ll be giving Keepalived the ability to view and change network settings, your account must have the right to grant those permissions. Log into the AWS console and go to IAM > Policies > Create Policy. Select the JSON tab on the Create policy screen and enter the following:
Click the Review policy button. Give the policy a name, such as TerraformAccess, and description, then click Create policy.
Next, go to Groups > Create New Group. Assign a group name, such as TerraformGroup, then click Next Step. On the next screen, select TerraformAccess from the list of policies, then click Next Step. Then click Create Group.
Next, go to Users > Add user. Assign a username, such as TerraformUser, and then check the box to set its access type to Programmatic access. On the next screen, check the box next to the TerraformGroup group and then click Next. On the last screen, click the Create user button. At this stage, it’s important to copy the Access key ID and Secret access key and save them somewhere. You can click the Download .csv button to do this.
You must also create an SSH key-pair or use an existing one. Log into the AWS console and go to EC2 > Key Pairs > Create Key Pair. When you create one, you’ll be prompted to save its private-key PEM file. Be sure to set appropriate permissions on it:
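A minimal sketch of that step, assuming the private key was saved as ~/.ssh/haproxy-demo.pem. That path and the restrict_key helper name are hypothetical, so substitute your own:

```shell
# Hypothetical location of the downloaded private key; adjust to your own path.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/haproxy-demo.pem}"

# Make the key readable by its owner only. The ssh client refuses to use
# private keys that other users can read.
restrict_key() {
  chmod 400 "$1"
}

# Only act when the file actually exists.
if [ -f "$KEY_FILE" ]; then
  restrict_key "$KEY_FILE"
fi
```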
Our complete example, which includes our Terraform files, is on GitHub. Use the following commands to clone it and create your own infrastructure demo.
Include a variable named key_name when you call terraform apply. The Terraform scripts install the public key of an SSH key-pair onto each virtual machine, but they need to know the name of the key-pair first.
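As a rough sketch, passing the variable on the command line could look like the following. The provision wrapper and the haproxy-demo key-pair name are assumptions made for illustration; only the key_name variable comes from the example project:

```shell
# Hypothetical key-pair name; it must match a key pair that exists in EC2.
KEY_NAME="${KEY_NAME:-haproxy-demo}"

provision() {
  # Download the providers that the configuration needs.
  terraform init || return 1
  # Create the stack, passing the key-pair name as an input variable.
  terraform apply -var "key_name=$1"
}

# Run from the directory that contains main.tf:
#   provision "$KEY_NAME"
```

Terraform prints the planned changes and asks for confirmation before it creates anything.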
The resources being created include:
- Two HAProxy Enterprise load balancers
- Three node.js web applications
- The network including VPC, subnet, Internet gateway, and route table
- Security groups for allowing some types of traffic to reach our nodes
- IAM policy permissions (regarding EIP and ENI management)
- Public and private IP addresses, which are assigned to the load balancers
While the Terraform configuration in main.tf is fairly straightforward, there are some important parts that we would like to address. The following snippet demonstrates how we are selecting the HAProxy Enterprise image from the AWS Marketplace:
This excerpt selects the most recent HAProxy Enterprise 1.8 (at this moment 1.8r2) Ubuntu AMI. This AMI is based on Ubuntu 18.04 Bionic Beaver for the AMD64 platform and supports full (HVM) virtualization. You should set product-code to the given value and use aws-marketplace as the owners so that only official images from HAProxy Technologies are selected. These filters also ensure a quick and efficient lookup.
We use similar logic to select the base Ubuntu 18.04 image for our node.js web applications:
Something else to note about the Terraform files is that we’re allowing VRRP traffic (IP protocol 112) in each load balancer’s security group.
When Terraform runs, it automatically creates the IAM role with all the necessary permissions for EIP and Elastic Network Interface (ENI) management. That role will be assigned to an instance profile and later associated with the EC2 load balancer instances.
When creating HAProxy Enterprise EC2 instances, it’s important to make sure that the unattended-upgrade utility, which regularly checks for the latest package updates and automatically performs them, does not start post-boot. Allowing it to do so would lock the package database and prevent Ansible from running immediately after Terraform. To avoid such a case, the example project uses cloud-init through the user_data argument. By specifying runcmd, commands will execute at boot that stop the apt-daily and apt-daily-upgrade services.
Post-creation Configuration with Ansible
As flexible as Terraform is, individual service installation and tuning is much easier to do with a configuration management tool like Ansible. An Ansible playbook will make sure that all instances have up-to-date software, required services running, secondary IP addresses assigned, EIP management helper scripts installed, etc.
Install the latest version of Ansible, like this (note that we are running this on an Ubuntu 18.04 workstation that has Python 2 installed):
To execute the playbook, Ansible must be able to connect to your running virtual machines via SSH. If you haven’t already done so, you need to create an SSH key-pair and add its public key to each instance. See the previous section about running Terraform.
Next, update ansible.cfg so that the private_key_file variable is set to the path to the PEM file that you saved.
Then use the ansible-playbook command to connect to all EC2 instances and apply the configuration.
Here’s a summary of the Ansible roles, describing their purposes:
Ansible roles applied to HAProxy instances
| Role | Purpose |
|---|---|
| secondary-ip | Ensures that each HAProxy instance is able to configure a secondary private IP on boot, as that doesn’t happen by default on Amazon EC2. |
| ec2facts | Gathers ENI and EIP configuration details for further use in the Keepalived configuration and the Keepalived EIP helper scripts. |
| hapee-lb | Auto-generates the hapee-lb.cfg configuration file from a Jinja2 template and populates private IPs in the backend server definition. |
| keepalived | Installs and configures Keepalived, defining two VRRP instances (one for each of the secondary EIPs), and generates random VRRP instance passwords and the EIP migration helper scripts, which are different for each instance. |
Ansible role applied to web backend nodes
| Role | Purpose |
|---|---|
| nodejs | Handles installation and configuration of the node.js web server. |
In the following sections, we’ll explain these roles in detail.
Let’s take a look at the secondary-ip Ansible role. It installs a helper script onto each load balancer server. The script runs as a systemd service and queries the instance metadata to get the secondary private IPv4 address associated with the instance. It then assigns that address to the eth0 network interface as an additional IP alias.
This secondary IP is a private address, not yet exposed outside the AWS network. However, if Keepalived detects that the other load balancer has failed, the failed node’s public-facing EIP is automatically paired with this IP. In essence, the secondary IP is an empty slot to which the EIP can migrate.
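The following is a minimal sketch of what such a helper could do, not the script shipped with the role. It assumes that the metadata service is reachable without a session token, that the primary address is listed first in local-ipv4s, and that the alias should use the subnet’s prefix length. The function names are hypothetical:

```shell
# Base URL of the EC2 instance metadata service (token-less access is assumed).
MD="http://169.254.169.254/latest/meta-data"

# Print the second private IPv4 address attached to the primary interface.
secondary_ip() {
  local mac
  mac="$(curl -s "$MD/mac")"
  # local-ipv4s returns one address per line; we assume the primary comes first.
  curl -s "$MD/network/interfaces/macs/$mac/local-ipv4s" | sed -n '2p'
}

# Add that address to eth0 as an additional alias.
assign_secondary_ip() {
  local mac addr cidr
  mac="$(curl -s "$MD/mac")"
  addr="$(secondary_ip)"
  cidr="$(curl -s "$MD/network/interfaces/macs/$mac/subnet-ipv4-cidr-block")"
  # Reuse the subnet's prefix length (the part after the slash) for the alias.
  [ -n "$addr" ] && ip addr add "$addr/${cidr#*/}" dev eth0
}
```

A systemd unit would then call assign_secondary_ip once at boot, with root privileges.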
The example project gets many of its variables from ec2.py, which is used for creating a dynamic inventory. However, this doesn’t populate all of the information you need. The ec2facts role gathers information about the EC2 ENIs and EIPs, which is needed when configuring Keepalived. This includes interface IDs, allocation IDs and private IPv4 addresses allocated to the instance and its interfaces.
To get this data, the role uses two Ansible modules: ec2_eni_facts and ec2_eip_facts. Together, these generate the rest of the necessary variables.
The hapee-lb role generates a complete HAProxy Enterprise configuration file, hapee-lb.cfg, that does round-robin load balancing of the node.js web applications. As is typical when using Ansible, we’re leveraging the extensible Jinja2 templating language, which permits various dynamic expressions and references to Ansible variables.
The template generates listen sections that, when rendered, include all of the private IP addresses of the backend node.js servers. While the template is fairly generic, you can easily extend it to enable threading, TLS termination and other sorts of load balancing essentials.
The final and most complex Ansible role installs the AWS CLI and HAProxy Enterprise version of Keepalived, generates the Keepalived service configuration, and installs the EIP management helper scripts. We should mention a specific code block near the beginning of the task definition:
This particular task relates to the unattended-upgrade utility mentioned earlier. If the utility is already running, Ansible will pause and wait until it has finished. If we proceeded without waiting, the package database would be locked and the apt/dpkg utilities would fail to install the required packages, which would ultimately cause the Ansible role to fail.
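The example project implements this wait as an Ansible task. The same idea can be sketched in plain shell, which may help when debugging an instance by hand. The wait_while and dpkg_locked helpers below are hypothetical names, and the sketch assumes that fuser (from the psmisc package) is installed:

```shell
# Keep polling while the given command succeeds, sleeping between attempts.
# Usage: wait_while <seconds> <command> [args...]
wait_while() {
  local interval="$1"
  shift
  while "$@" >/dev/null 2>&1; do
    sleep "$interval"
  done
}

# Succeeds while some process holds the dpkg lock file open.
dpkg_locked() {
  fuser /var/lib/dpkg/lock
}

# Example: block until the package database is free, then install packages.
#   wait_while 5 dpkg_locked
```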
The keepalived Ansible role configures two VRRP instances on each load balancer, one of which always starts in the BACKUP state. By default, the BACKUP instance has a priority of 100 and the MASTER has a priority of 101. The instance with the higher priority within the group becomes the MASTER. Remember that this role is being run on both load balancers, so each load balancer is the MASTER instance of its own EIP. The two load balancers operate in an active-active configuration. In the event of a failure, one load balancer assumes control of both EIPs.
The configuration is more complex than the one shown in the last blog post for the Heartbeat service. Primarily, that’s due to more complex logic in defaults/main.yml that sets all the required variables for later interpolation in the templates/hapee-vrrp.cfg.j2 Jinja2 template. Take a look at the template file and notice the many variables being referenced. Here’s a snippet:
This file also includes VRRP configuration that performs the following health checks:
The process checks use pkill -0, which tests whether a process is running without delivering a signal to it. These checks perform the following actions:
- Default gateway ICMP echo checks (to avoid a split-brain situation): An error status will cause a FAULT state until ICMP checks start returning OK exit statuses.
- DHCP client checks: An error status will subtract 4 from the initial priority.
- OpenSSH service checks: An error status will subtract 4 from the initial priority.
- HAProxy process checks: An OK status will add 6 to the initial priority.
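To see how a pkill -0 check behaves, you can try the idiom from a shell. This is only an illustration; the is_running helper is a hypothetical name and the exact check commands in the project’s Keepalived configuration may differ:

```shell
# Succeeds when at least one process with exactly this name exists.
# Signal 0 runs the existence and permission checks but delivers no signal,
# so the target process is not disturbed.
is_running() {
  pkill -0 -x "$1"
}

# Example: report whether an sshd process is present.
if is_running sshd; then
  echo "sshd is running"
else
  echo "sshd is not running"
fi
```

Keepalived treats an exit status of 0 from a check script as OK and any other exit status as an error.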
Weights assigned to the script checks are intentionally larger than the MASTER-BACKUP priority gap so that check failures cause appropriate VRRP instance state transitions. To make a check a hard requirement, so that any failure puts the instance into the FAULT state, use a weight of 0, as we’ve done for the chk_gw check, which verifies that the gateway is reachable. Setting a hard requirement here avoids split-brain by disabling the server regardless of whether it is currently the MASTER instance.
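The arithmetic behind those weights can be worked through with a small shell function. It uses the priorities and weights given above (101 and 100, minus 4 for a failed DHCP client or OpenSSH check, plus 6 for a healthy HAProxy process). The effective_priority function itself is only an illustration and is not part of the example project:

```shell
# Effective VRRP priority of one instance.
# Arguments: base priority, then 1 (OK) or 0 (failed) for the
# DHCP client, OpenSSH and HAProxy checks, in that order.
effective_priority() {
  local prio="$1" dhcp_ok="$2" ssh_ok="$3" haproxy_ok="$4"
  # Negative weights: subtract 4 when the check fails.
  [ "$dhcp_ok" -eq 1 ] || prio=$((prio - 4))
  [ "$ssh_ok" -eq 1 ] || prio=$((prio - 4))
  # Positive weight: add 6 only while the check succeeds.
  [ "$haproxy_ok" -eq 1 ] && prio=$((prio + 6))
  echo "$prio"
}

effective_priority 101 1 1 1   # healthy MASTER: prints 107
effective_priority 100 1 1 1   # healthy BACKUP: prints 106
effective_priority 101 1 1 0   # MASTER without HAProxy: prints 101
effective_priority 101 1 0 1   # MASTER with failed OpenSSH check: prints 103
```

In both failure cases the MASTER drops below the healthy BACKUP’s 106, so the BACKUP wins the election and takes over the EIP.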
As mentioned previously, there is no multicast capability on Amazon EC2 so we’re using unicast local and peer IPv4 addresses and authenticating each of the VRRP instances separately. Upon VRRP instance state transitions, appropriate EIP helper scripts are executed, each using the AWS CLI to perform EC2 API calls.
The portion of the rendered Keepalived configuration that defines the VRRP instances will look like this:
The EIP helper scripts, update-EIP1.sh.j2 and update-EIP2.sh.j2, while rather complicated in the template form, render into very simple shell scripts. This, for example, is the update-EIP1.sh script:
The script will reassign an EIP with a specific ID to the current load balancer’s local ENI using a specific private IP address.
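A hedged sketch of that reassignment using the AWS CLI’s associate-address command follows. The move_eip function name and the IDs in the usage comment are made up for illustration; in the example project, Ansible fills in the real values when it renders the script. The sketch also assumes that the CLI already has a default region and that the instance profile grants the required permissions:

```shell
# Move an Elastic IP to a private address on this instance's network interface.
# Arguments: EIP allocation ID, local ENI ID, private IPv4 address on that ENI.
move_eip() {
  local allocation_id="$1" eni_id="$2" private_ip="$3"
  # --allow-reassociation lets the EIP be taken over even though it is
  # still associated with the failed load balancer.
  aws ec2 associate-address \
    --allocation-id "$allocation_id" \
    --network-interface-id "$eni_id" \
    --private-ip-address "$private_ip" \
    --allow-reassociation
}

# Example with made-up values:
#   move_eip eipalloc-0123456789abcdef0 eni-0123456789abcdef0 10.0.0.11
```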
The End Result
Once these Ansible roles are executed, you’ll have two HAProxy Enterprise load balancers that accept traffic directly over publicly routed Elastic IP addresses. Remember to update your DNS settings to point to both EIPs. If a fatal error occurs with either HAProxy instance, whether it is a networking, kernel or service-related issue, the remaining live instance will take over the failed instance’s EIP automatically. Once the original load balancer recovers, the EIP will return to it automatically.
For example, here’s a screenshot of the two load balancers running healthily:
However, when we manually stop one, the other takes over its EIP:
Remember to use terraform destroy to tear down the resources in AWS when you no longer need them so you don’t get charged for the usage.
This concludes our HAProxy Enterprise Keepalived HA example. As a reminder, all of the example code is available and contains the complete Terraform and Ansible configurations. Use the provided source code to build your own HAProxy AWS deployment. Any contributions are encouraged!