Migrating thredUP Infrastructure to Kubernetes with HAProxy
In this presentation, Oleksii Asiutin describes how thredUP evolved from a monolithic application to a fully containerized, distributed system that is powered by Kubernetes. His team used HAProxy as a main routing point to distribute traffic simultaneously to the existing and the new system. They gradually increased the amount of traffic that went to the new system, which allowed them to fix issues along the way with less risk. HAProxy gave them real-time observability, which they used to quickly identify potential problems.
Thank you. Hello everybody, welcome to the Kubernetes HAProxy talk. Today I will describe how we migrated our cloud-based microservices infrastructure to Kubernetes and what role HAProxy played in it. But before that, let me introduce myself again, and the company I work for. I have experience in both software development and platform engineering, which helps me to popularize DevOps culture and practices in [the] company. I live in Ukraine. We have a big community of engineers and I organize our DevOps digest newsletter, which comes out monthly. I now work as a staff software engineer at thredUP.
thredUP is [the] biggest consignment and thrift store in the world. We process about 100,000 second-hand items daily and we have about 300,000,000 items available for sale online at any time. We have a lot of distribution centers, which are powered by automation; you can see it in the video. And all this is run by software and the platforms that run the software.
In terms of the engineering team, thredUP is not so big. We have about 70 software engineers, we run about 50 microservice applications, and everything runs on about 100 EC2 nodes in AWS.
We have Ruby, we have NodeJS, we have Python and we are exploring more programming languages like Java and so on; and we started in 2009 as a monolithic application written in Ruby on Rails running on one dedicated server.
But as time passed, we grew as a company and we migrated to a microservice architecture deployed in Amazon Cloud. Here is where HAProxy came into play, because it acts as our main routing point, delivering requests to a bunch of microservices based on request URI prefix, on hostnames, and so on.
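Routing by URI prefix and by hostname can be sketched in HAProxy configuration roughly as follows. This is a minimal illustration, not thredUP's actual configuration: the hostname, path prefix, backend names, and server addresses are all hypothetical.

```haproxy
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend fe_main
    bind :80
    # Match on the Host header (hypothetical hostname)
    acl host_api hdr(host) -i api.example.com
    # Match on the request path prefix (hypothetical prefix)
    acl path_orders path_beg /orders
    use_backend be_api    if host_api
    use_backend be_orders if path_orders
    # Everything else goes to a default backend
    default_backend be_web

backend be_api
    server api1 10.0.1.10:8080 check

backend be_orders
    server orders1 10.0.1.20:8080 check

backend be_web
    server web1 10.0.1.30:8080 check
```

The `use_backend` rules are evaluated in order, so a request that matches none of them falls through to `default_backend`.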
And so we began our migration and we actually did it, within a year. We even have a case study with the Cloud Native Computing Foundation. We received a lot of gains from it, not only in our deployment process; we also got a big decrease in hardware, about fifty percent. We improved performance and we improved the scalability of our services, and you can read about it here.
But the actual point is that during the migration you have to support this type of infrastructure: you have your old one and you have your new one, they are two separate networks, and you have HAProxy load balancing between them. I will describe our migration plan and migration process, and I will give you some examples of the issues and gains we bumped into and how we solved them.
So our migration plan actually looks pretty simple. If you have a service functioning on EC2, you need to prepare a Kubernetes deployment, and for that you need to containerize it and maybe do some tweaks in [the] codebase to make it container ready. You write Dockerfiles, you update your continuous deployment pipelines, and after that you test it in a Kubernetes environment. Does it deploy? Does the container actually run? If everything is okay, you deploy to the staging cluster, you verify it works, you verify it functions and can communicate with other services. It sounds like a simple process, but I think everyone knows you can bump into a lot of issues during these steps.
But if everything goes okay, when eventually you’ve fixed everything, you deploy to production. In terms of production, I think a lot of you are familiar with blue-green deployment. It’s when you route some small percentage of traffic to your new deployment. In this example we have two percent. Below you can find backend configuration for such an initiative. After that you run your tests, you analyze logs, you analyze errors, and if everything is okay you switch 100% of the traffic.
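A backend that sends roughly two percent of traffic to the new deployment might look like the sketch below. It assumes server weights are used for the split; the backend name, server names, and addresses are hypothetical.

```haproxy
backend be_orders
    mode http
    balance roundrobin
    # Existing EC2 deployment receives about 98% of requests
    server ec2_old 10.0.1.20:8080 weight 98 check
    # New Kubernetes deployment receives about 2% of requests
    server k8s_new 10.8.0.20:80   weight 2  check
```

With `balance roundrobin`, requests are distributed in proportion to the weights, so raising the second weight and lowering the first shifts traffic gradually. Setting a server's weight to 0 stops new traffic to it, which gives a quick way to switch back.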
In case something happens, you can switch back. But classical blue-green deployment is the case where you deploy a new release alongside an old release in a pretty well-known environment. It’s your production environment and you deploy to it day after day.
This is when, in the client, you specify [a] cookie and you configure your backend service in such a way that it serves requests to the new release only if this cookie is specified. In that case you can verify, manually or with some automated integration tests, that the new release in the new environment works and functions at least close to your old production release.
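Cookie-based routing of this kind can be sketched as follows. This assumes the decision is made in the frontend with an ACL; the cookie name `use_new_release`, the backend names, and the addresses are hypothetical.

```haproxy
frontend fe_main
    mode http
    bind :80
    # True when the request carries the opt-in cookie, whatever its value
    acl wants_new req.cook(use_new_release) -m found
    use_backend be_orders_k8s if wants_new
    default_backend be_orders_ec2

backend be_orders_ec2
    mode http
    server ec2_old 10.0.1.20:8080 check

backend be_orders_k8s
    mode http
    server k8s_new 10.8.0.20:80 check
```

Regular users never send the cookie, so they stay on the old release, while a tester or an integration test suite can set it to exercise the new one in production.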
But it’s still good practice to leave your old deployment as a backup. There may be some inconvenient points, such as having to support deployment pipelines that deploy to both the old and the new infrastructure. But we bumped into cases where, for example, a new Kubernetes service functioned normally for two days and then something happened and we had downtime, and having this backup option on your old infrastructure is a really good point. We investigated it, we fixed it, and we went back to the Kubernetes service. So keeping your old deployment as a backup strategy, I would say for maybe two weeks, is a good idea.
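One way to keep the old deployment as a fallback is HAProxy's `backup` server keyword, sketched below. The names, addresses, and the `/health` check path are hypothetical, and this is only one possible setup.

```haproxy
backend be_orders
    mode http
    # Hypothetical health-check endpoint exposed by the service
    option httpchk GET /health
    # New Kubernetes deployment takes all traffic while it is healthy
    server k8s_new 10.8.0.20:80   check
    # Old EC2 deployment is used only when no non-backup server is available
    server ec2_old 10.0.1.20:8080 check backup
```

If the health check for `k8s_new` fails, HAProxy starts sending requests to `ec2_old` without a configuration change, and returns to `k8s_new` once it passes checks again.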
Okay, that was our migration plan. It was our path for migrating, as an example, a single service from the old infrastructure to the Kubernetes cluster. It sounds like a success and yes, it was a success, but the actual thing is that this path wasn’t so straightforward and we bumped into a lot of issues. I will name a few connected with HAProxy configs and with connectivity.
The first one is in Kubernetes. If you deploy a service in Amazon, it’s represented by a load balancer, and during this gradual traffic increase on our newly deployed service we started to notice that we had downtime and servers became unavailable. We switched back to the old one, to the backup service, and we started to investigate. It seems that AWS does autoscaling out of the box and behind the scenes, and it works perfectly, but it does it in such a way that during the autoscaling it changes the IP addresses behind your load balancer DNS name.
And in our case, with HAProxy, we didn’t use dynamic resolvers. HAProxy resolves your load balancer DNS name to specific IP addresses at startup, and when the AWS ELB scales, it gets new IP addresses and the old IP addresses become unavailable. And so we had service downtime.
It was [an] unpleasant case when we experienced such problems, and we quickly solved it, but it’s worth mentioning that you can run into it, so please use dynamic resolvers. I think it concerns not only AWS. It concerns every cloud provider, and if you have [a] DNS name on your backend servers and you plan to change the IP addresses behind it, please take this into account and use them.
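Dynamic resolution is configured with a `resolvers` section that a `server` line then refers to. The sketch below is illustrative only: the resolver name, the nameserver address (here assumed to be the VPC DNS server of a 10.0.0.0/16 VPC), the load balancer hostname, and the timing values are all assumptions.

```haproxy
resolvers awsdns
    # Assumed address of the DNS server reachable from HAProxy
    nameserver vpc 10.0.0.2:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    # Treat a valid answer as fresh for 10 seconds before resolving again
    hold valid 10s

backend be_orders_k8s
    mode http
    # Hypothetical load balancer DNS name; re-resolved at runtime via "resolvers"
    server k8s_elb k8s-orders-elb.example.com:80 check resolvers awsdns resolve-prefer ipv4 init-addr last,libc,none
```

Without the `resolvers` keyword, the hostname is resolved once when the configuration is loaded, which is the behavior that caused the downtime described above. The `init-addr ... none` setting lets HAProxy start even if the name cannot be resolved at startup.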
Okay, the other point worth mentioning: when you migrate your service to Kubernetes, you probably have your dependent services deployed in the old infrastructure, and during the migration you have to communicate with both the new one and the old one. Maintaining two deployments simultaneously, the new one and the old one, means you have to provide connectivity to, for example, the database layer both from Kubernetes and from your old infrastructure. And the point is that your data layer and dependent services are probably deployed in the old infrastructure. When you do the deployment, you set up Kubernetes, for example, as a single, separate network, and it’s not so easy to connect to these resources.
Okay, you migrate services. You do it day to day and sometimes you see that it’s not going so well, and monitoring is a good place to look for metrics and see whether everything is going well. HAProxy provides you with a bunch of metrics and we use DataDog as a monitoring dashboard. We track 200 responses and 500 errors, and it was very useful. For example, if you migrate a service and you notice an increase in 500 errors, you can dive in and investigate further.
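HAProxy's built-in statistics page is one common source for such metrics. The sketch below assumes a dedicated stats listener that a monitoring agent polls; the port and URI are hypothetical choices, and the talk does not say how the DataDog integration was actually set up.

```haproxy
listen stats
    mode http
    # Hypothetical port, reachable only from the monitoring agent in practice
    bind :8404
    stats enable
    stats uri /stats
    stats refresh 10s
```

Requesting `/stats;csv` returns the same data in CSV form, where the `hrsp_2xx` and `hrsp_5xx` columns hold the per-backend and per-server counts of 2xx and 5xx responses.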
Apart from that, we used a lot of metrics like response times from HAProxy to the backend and from HAProxy to the client, and sometimes that let us notice [a] problem immediately and fix it. I will name one. What is the downside of Kubernetes, if you can say so? If you deploy your service on dedicated hardware, whether it is bare metal or cloud based, you just reserve an EC2 instance with sufficient capacity. You know how much memory it has, how much CPU it has, your service works on that instance, and everything is okay. If you need more, you provision more.
HAProxy metrics help us with monitoring and to prevent such bad experiments from going to production.
We store all HAProxy configs in GitHub and we use pull requests to review.
So that’s it. As I said, we succeeded in the migration. HAProxy helped us a lot to do it reliably and in a predictable way. If you have any questions, I’m ready to answer them. Thank you very much.