RTL’s Journey to Kubernetes with HAProxy
In this presentation, Vincent Gallissot from M6 describes how his company leveraged HAProxy to migrate their legacy application to the cloud and Kubernetes. They utilized many techniques, including infrastructure-as-code, deployments triggered by Git pulls, canary releases, and traffic shadowing. M6 also uses the HAProxy Kubernetes Ingress Controller to easily scale up or down their Kubernetes pods and gain first-class observability.
Let’s talk about the journey to Kubernetes with HAProxy. Three years ago, UEFA Euro 2016 took place in France. It was a big event for us because we broadcast the whole matches live on our streaming platforms.
I’m Vincent. I’m a Network and Systems engineer. I work at M6, where I lead the operations team. M6 is a French private TV channel, and after some years M6 became a group owning 14 TV channels and doing all sorts of things. This group is part of the RTL group, and for the RTL group we manage the French VOD platforms, the Belgian VOD platform, the Hungarian VOD platform, and the Croatian VOD platforms.
This represents some nice users and we have some nice use cases. It’s a nice playground. We have some nice API calls and we have a seasonal use. So, we have different API calls during the day, during a week, and during a year. Everything’s different, so it’s really nice to follow.
We chose to migrate from on-premises to the cloud. For that, we checked our legacy platform. We were using almost all available CPU on our ESX cluster; 98% of CPU was used. So, it was time for us to migrate.
We checked. We had 30 microservices at that time, but they all followed the same pattern. Our backends are written in PHP and our frontends are written in NodeJS.
For all those applications, the traffic goes through a bunch of Varnish servers, and the Varnish servers forward the requests to virtual machines. The virtual machines all run NGINX for the HTTP part and PHP-FPM or NodeJS.
Migration to the Cloud
We decided to migrate. The first step was to create a platform in the cloud. For that we’re using Terraform. Terraform is a really nice tool because you can control the resources you have in the cloud, whether on AWS or Google Cloud. We even control our Fastly CDN configuration through Terraform. A really important thing for us is to write each project’s infrastructure-as-code inside the GitHub repository of that project, since it gives a lot of autonomy to our developers and this is really nice to follow.
For example, we have an API to generate images. We have a lot of videos on our platform, so we have a lot of images for which we create thumbnails. This image API has a .cloud directory, and in that .cloud directory we have a docker subdirectory in which we have the Dockerfiles and every configuration file needed to create Docker images. We have the Jenkins directory, in which we have Jenkinsfiles to control the CD pipelines and every test run against that Docker image. We have a charts directory, in which we have the Helm charts to deploy that Docker image inside a Kubernetes cluster; and we have Terraform to write all the managed services needed by that API.
An example of a terraform directory: we have a single Terraform file for each managed service we use, and the differences between environments for those managed services are written in a dedicated vars file. So, if we have some sizing change, such as a different instance type, we write those differences inside a dedicated environment tfvars file. The rest of the code is the same between all environments.
Now, we have a platform in the cloud, using Terraform for AWS, the EC2 instances, the network, etc. We are using kops to deploy a Kubernetes cluster in that cloud platform. Kops is a really nice tool. It allows you to control the master nodes and the worker nodes of Kubernetes. You can do rolling updates. It’s really nice and it works almost out-of-the-box.
The configuration changes a bit. We’re using weight as Oleksii just explained before. We also added observe layer7 on the second server, on the ELB server. I appended parentheses around observe layer7 because this is really how we could secure our migration; observe layer7 is HAProxy checking for HTTP response codes.
At 100% of the traffic sent to the ELB, the first server, which is the on-premises server, was marked as backup, as Oleksii just presented, and we still used observe layer7 on the second server. Even at 100% traffic sent to the cloud, if there were problems, HAProxy would send it back on-premises.
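The behavior described above (a weighted split, passive observation of HTTP response codes, and fallback to a backup server) can be modeled with a small Python sketch. This is a simplified model and not HAProxy's implementation: the `Server` class, the `pick_server` helper, the threshold of 10 consecutive errors, and the choice to count every 5xx response as an error are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int
    backup: bool = False
    observe_layer7: bool = False
    error_limit: int = 10  # assumed threshold of consecutive errors
    errors: int = 0
    up: bool = True

    def record_response(self, status: int) -> None:
        """Update health from the status code of a live response."""
        if not self.observe_layer7:
            return
        if status >= 500:  # simplification: treat every 5xx as an error
            self.errors += 1
            if self.errors >= self.error_limit:
                self.up = False
        else:
            self.errors = 0  # a good response resets the streak


def pick_server(servers, rng):
    """Weighted choice among healthy active servers, else a healthy backup."""
    active = [s for s in servers if s.up and not s.backup and s.weight > 0]
    pool = active or [s for s in servers if s.up and s.backup]
    if not pool:
        raise RuntimeError("no server available")
    return rng.choices(pool, weights=[max(s.weight, 1) for s in pool], k=1)[0]
```

With the on-premises server marked as backup and the ELB server observed, every request goes to the ELB until it accumulates enough consecutive errors, after which `pick_server` returns the on-premises server.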
So we were like, “Okay, let’s replicate traffic again.” We replicated traffic again, reached 25,000 connections, and the application crashed again. We kept half of the traffic replicated, so we replicated only 50% of traffic, and everything worked fine. We opened a support ticket with the AWS team and they responded that, given the sizing of our Redis server, the maximum number of connections was hardcoded to 25,000, and it was not written in the doc. To get rid of this limitation we had to choose bigger servers, even if they were doing nothing in the previous sizing.
A gor script example is this one. We use a special BPF filter keyword, which allows us to capture traffic on a really specific TCP port and IP address, and we’re sending the traffic to the ELB of the application in the cloud. With GoReplay, you can do a lot of things, like change the request, add cookies, or add headers. It’s a really nice tool.
HAProxy 2.0 comes with a traffic shadowing capability, which is apparently the same thing. I never tested it, I don’t know how it works. Don’t ask me any questions about it, I don’t know the answers. Maybe the HAProxy team knows.
So, we had this application-02 and we started by adding an HAProxy that would receive the traffic, using observe layer7, as we did before.
We’re using map files for that. We have used map files from the very beginning; they were available in the configuration. We’re using the special keyword map_reg; reg stands for regular expression.
Inside the map file, we define two regular expressions. The first one matches the V2 path only. So, if a request matches the regex on the first line, the traffic is sent to the backend named application-02-migrating, and the rest, caught by the catchall on the second line, is sent to the backend named on-premises-only.
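As a rough illustration of that first-match lookup, here is a minimal Python sketch. The backend names come from the talk, but the two regular expressions, the `MAP_FILE` text, and the `load_map` and `choose_backend` helpers are hypothetical; the actual patterns used in production are not shown in the talk.

```python
import re

# Hypothetical map file: one "<regex> <backend>" pair per line.
MAP_FILE = """\
^/v2/ application-02-migrating
.* on-premises-only
"""


def load_map(text):
    """Parse map lines into (compiled regex, backend) pairs, keeping file order."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, backend = line.split(None, 1)
        entries.append((re.compile(pattern), backend.strip()))
    return entries


def choose_backend(path, entries, default="on-premises-only"):
    """Return the backend of the first entry whose regex matches the path."""
    for pattern, backend in entries:
        if pattern.search(path):
            return backend
    return default  # used only when no line matches
```

Calling `choose_backend("/v2/videos", load_map(MAP_FILE))` selects application-02-migrating, while any other path falls through to the catchall line. Order matters: if the catchall came first, it would shadow the V2 rule.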
With that, we could migrate the application from 1% to 10% to 25% only for the V2 path. Watching our Grafana dashboards, we could see that V1 was used less and less. So, at a certain point we could switch the DNS to point directly to the ELB. This is how we migrated even our most complex application to the cloud.
Using HAProxy Ingress Controller with Kubernetes
Now, we’re still using HAProxy as an Ingress Controller inside Kubernetes. Why do we need an Ingress Controller?
We also have previews; previews is when a developer opens a pull request. If they add the correct label, which is cd/preview, the branch code will be deployed on a dedicated environment. So, the developer will be able to test the code and see the behavior. One preview means one ELB. We have one ELB per preview, per application, per environment. It is a lot of ELBs.
So, of course, we reached the maximum possible ELBs for an account.
That came with a problem in our implementation. Because our infrastructure code is inside the Git repository of each project, we cannot control a single ALB from different sources. Also, because previews depend on pull requests, we have a lot of them, so it would mean changing that single ALB for each pull request. It would be a nightmare to maintain, and that is where the Ingress comes in.
Applications. You can have a lot of Ingresses running inside your Kubernetes cluster. To say that your application will be managed by HAProxy, we define an annotation, which is the Ingress class haproxy-v1. This YAML code is written inside the GitHub repository of a project, and when the correct Ingress class is set, the Ingress Controller will watch those objects. The specific rule here is that HAProxy will watch for the domain name foo-241.preview.bar.com and forward the traffic to the service named foo-service on the NGINX port.
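The selection logic can be pictured with a short Python sketch: a controller keeps only the Ingress objects carrying its class, then routes by host name. This is a conceptual model, not the HAProxy Kubernetes Ingress Controller's code. The `IngressRule` type, both helpers, the port name "nginx", and the second Ingress with class "other-class" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class IngressRule:
    ingress_class: str  # value of the Ingress class annotation
    host: str
    service: str
    port: str  # named service port (the name is assumed here)


def build_routes(
    ingresses: Iterable[IngressRule], watched_class: str = "haproxy-v1"
) -> Dict[str, Tuple[str, str]]:
    """Keep only Ingresses of the watched class, indexed by lowercase host."""
    return {
        i.host.lower(): (i.service, i.port)
        for i in ingresses
        if i.ingress_class == watched_class
    }


def route(host: str, routes: Dict[str, Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (service, port) for a request host, or None if nothing matches."""
    return routes.get(host.lower())


INGRESSES = [
    IngressRule("haproxy-v1", "foo-241.preview.bar.com", "foo-service", "nginx"),
    IngressRule("other-class", "ignored.preview.bar.com", "ignored-service", "nginx"),
]
```

Each new preview only adds one more entry to the routing table, which is why the number of previews no longer depends on the number of load balancers.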
observe layer7 keywords and we migrated complex applications by replicating production traffic and by migrating only a part of an application. Today, we’re still using HAProxy as the Ingress Controller and that lets us define an infinite number of projects and of previews. Thank you very much for your attention.