How HAProxy Helped Me Get “Near Perfect” Uptime While Slashing Support Costs
In this presentation, Eric Martinson describes how PlaceWise Digital utilized HAProxy Enterprise to gain observability over their network, improve performance, and secure their applications from threats. HAProxy’s metrics revealed that a large portion of their traffic was bots, which were sapping performance for legitimate customers. Eric leveraged several techniques for thwarting these bad bots, such as enabling the HAProxy Enterprise Antibot module and the HAProxy Enterprise WAF. He also succeeded in eliminating unplanned downtime, creating robust disaster recovery mechanisms, and offloading work from the development team.
I’m going to discuss how a small company was able to really leverage HAProxy to get pretty massive improvements in both performance and stability for a pretty legacy platform.
I think one of the unique things about our environment is we act as both an agency and a platform, which adds a lot of complexity.
We’re primarily in the United States. As an independent vendor that isn’t a real estate investment firm, we own a pretty good chunk of the market: we host about 40 to 50% of all shopping center websites or supply the content for them. We serve retailers and brands through our online platform called RetailHub. It’s basically the place where anyone from Gap and H&M, to the mall operator, to the actual local stores can load their content and have it published to anyone with access to our system.
I think one of the unique things about our environment is we act as both an agency and a platform, which adds a lot of complexity. On the agency side we’ll do web development, ad creative, and publishing; we’ll manage content for clients and a lot of what an agency would do. On the platform side, we actually serve that up. So, in addition to RetailHub, we have our own custom web serving platform as well as an API. The API will feed any number of other clients, whether their websites, on-screen displays, touch screens, or sometimes even apps. It depends on the use case, but in the grand scheme of things we sell data, right? We are a content provider. We sell data and run websites.
Any one of them could take down the rest of them and it was just kind of a cascade effect.
What ended up happening is we had a legacy system that was incredibly single threaded and multi-tenant. You’d have a server that had not just one single-threaded item on there, but another eight as well. So, any one of them could take down the rest of them and it was just kind of a cascade effect. It was pretty painful. It’s a Windows-based system, so reboots would take an astronomically long time, not just because of the legacy hardware we were on, but also because of our caching strategy at the time. A reboot would take over 20 minutes, which, when you’re in the middle of troubleshooting, is awful. The physical hardware was already dated, so we had to do something about that. And we had no disaster recovery in place.
We were in the process of rebuilding the platform, one we are still in the process of. Anybody who has a pretty massive infrastructure will know the pain of that. It’s not something you can just cut off and replace with new items. You have to take them out bit by bit.
Going into PlaceWise, obviously, it’s a very small company. I had to reduce an already minimal budget; we were overpaying for a lot of the stuff that was there, and we really could only build when necessary. The infrastructure team? That’s me. That kind of goes with the mentality of a lot of startups: you just load this stuff up and go, right?
This did drive some improvements, but it took a year to do, even at our scale. We didn’t have the benefit of a large legal team, so we had contracts that we basically had to wait out to be able to do this. It was during this process that I lucked upon HAProxy, because I’d tried every other solution out there. It’s built into my firewall; I tried that, and it didn’t work well. I tried a bunch of different projects; they weren’t great for what I needed to do. I was also looking at the cloud vendors’ load balancing solutions, but I didn’t want to get tied to another vendor, right? I want to be able to move data centers if I need to and take that configuration with me. HAProxy was really the best solution for me.
The real kicker for me was being able to have an agent on the local machine to circumvent some of the stuff that we needed done. I’ll go into that configuration when we get there.
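As a rough sketch of how that agent setup can look (the server addresses and agent port here are hypothetical), HAProxy’s `agent-check` polls a small program running on each backend machine, and the machine itself reports its own health or weight back to the load balancer:

```haproxy
backend app_servers
    balance roundrobin
    # The agent is a small TCP service on each backend host (port 9777
    # is an assumption) that replies with "up", "down", "drain",
    # "maint", or a weight like "50%", letting the machine itself tell
    # HAProxy how much traffic it can take.
    server app1 10.0.0.11:80 check agent-check agent-port 9777 agent-inter 5s
    server app2 10.0.0.12:80 check agent-check agent-port 9777 agent-inter 5s
```

Because the agent runs locally, it can report on conditions HAProxy can’t see from the outside, such as application-specific state on the machine.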
When we got the logs it was pretty evident where a lot of our performance issue was coming from.
Here was a different story. We didn’t really know what was hitting us; we just had to guess. When we got the logs, it was pretty evident where a lot of our performance issues were coming from. 60% of our traffic was pretty much invisible to the tools we had before I put this solution in place, meaning it was all bots. If I had to guess today, I think the legit traffic would be even lower, just as a matter of course. It’s just too easy to put these things in place, and, I don’t know, there’s some perverse thing out there where people really want to see if they can get access to your servers for no real reason.
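That visibility comes from HAProxy’s HTTP logging. A minimal sketch of the kind of frontend logging that makes bot traffic show up (frontend and backend names are illustrative):

```haproxy
frontend web
    bind :80
    log global
    # httplog records the method, path, status code, and timers for
    # every request; capturing the User-Agent header adds it to the
    # log line, which is what exposes crawler and scanner traffic.
    option httplog
    capture request header User-Agent len 128
    default_backend app_servers
```

With the User-Agent in every log line, it becomes straightforward to measure what share of requests come from bots versus real visitors.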
We’ve got that thing and it’s a balancing act. We still have the performance issues that we had before. It’s getting better because I’m starting to filter the bad traffic, but I still have an older caching strategy that I had to deal with.
The other part of that is a lot of the search engines that we didn’t want.
The other part of that is a lot of the search engines that we didn’t want, like Yandex and Baidu; we just don’t care about them. A lot of them will look at the robots.txt file and just ignore it. I’ll see that traffic come across as they request it, and then they constantly ignore it anyway. So this was a good first step. On top of this, I already catch some other bad requests. People making WordPress requests on my network, I know, are just scanners. There are a lot of those out there, right? I immediately tarpit those.
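A sketch of that kind of filtering (the ACL names and match lists are illustrative, and matching on User-Agent alone is a simplification):

```haproxy
frontend web
    bind :80
    # Crawlers we get no value from, matched loosely on User-Agent
    acl unwanted_bot hdr_sub(User-Agent) -i yandexbot baiduspider
    # WordPress paths on a non-WordPress platform can only be scanners
    acl wp_scan path_beg -i /wp-login.php /wp-admin /xmlrpc.php
    http-request deny if unwanted_bot
    # tarpit holds the connection open before returning an error,
    # slowing the scanner down instead of letting it retry instantly
    http-request tarpit if wp_scan
    default_backend app_servers
```

The tarpit is the interesting part: rather than rejecting scanners immediately, it ties up their connection, which raises the cost of scanning your network.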
So this is what I started doing with my backup servers. This is like one pod; we have multiple pods, but this would be an example of one pod. I wanted to start using the traffic to my advantage, so I started taking the wanted bot traffic and using it to warm up my backup servers. The effect of that was pretty significant for us. Now, if we actually had an issue on our primary and it failed over, the backup was already warmed up and ready to go. Our downtime within that cutover basically went to zero. Since then, my team has re-architected the caching so that’s no longer a problem, but I still do this from a system-load perspective.
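One way to express that routing (backend names and addresses are hypothetical; a production setup would also verify crawler identity, for example by reverse DNS, rather than trusting the User-Agent alone):

```haproxy
frontend web
    bind :80
    # Send welcome crawlers to the standby pod so its cache stays hot;
    # if the primary pod fails over, the backups are already warm.
    acl good_bot hdr_sub(User-Agent) -i googlebot bingbot
    use_backend standby_pod if good_bot
    default_backend primary_pod

backend primary_pod
    server web1 10.0.0.21:80 check

backend standby_pod
    server web2 10.0.1.21:80 check
```

The bots do useful work for free: every page they crawl on the standby pod is a page that is cached and ready if real traffic ever cuts over.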
I wanted a way to take all of the remaining single points of failure, which is essentially our database, out of the picture.
I wanted a couple of things. One was a good recovery solution, one that would be a lot more scalable, because you just don’t really know at those load times what’s going to hit you. We’ve had years in the past where we got hit by massive API requests that caught us flat-footed. I wanted a way to take all of the remaining single points of failure, which is essentially our database, out of the picture. I also wanted better maintenance ability, which I had already started on with some of my other failovers.
It was really, really helpful for us in this particular instance and with our failovers it’s amazing how automated this is.
It was really, really helpful for us in this particular instance, and with our failovers it’s amazing how automated this is, so that when I get the alarm I’m not immediately trying to triage issues with getting people online again; I’m trying to find the root cause. It’s been exceptionally helpful for us. We had this happen like two weeks ago and nobody knew it; all of the sites were back up. It’s a pretty scalable solution for us. I also have scripts available on the different endpoints that I can push for our developers. So, if the developers want to do something and they know they need to switch traffic, we can switch traffic to our static backups in pretty easy fashion.
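A sketch of how a scripted switch to static backups can work, assuming a stats socket is enabled (names and paths are illustrative): servers marked `backup` only receive traffic when the primaries are down or disabled, so draining the primaries over the Runtime API flips traffic without a config reload.

```haproxy
backend app_pod
    # Switching traffic is then a one-liner against the stats socket:
    #   echo "disable server app_pod/app1" | socat stdio /var/run/haproxy.sock
    # and "enable server app_pod/app1" switches it back.
    server app1 10.0.0.11:80 check
    server static1 10.0.2.11:80 check backup
```

Wrapping those Runtime API commands in small scripts is what lets developers flip traffic themselves without touching the load balancer configuration.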
I had 2,000 SSL certs within a few days up and running.
I really only put it on traffic that I’m pretty suspect with, but this thing is amazing. It changes the game for a lot of this stuff.
The first one I did, ironically, was something I was already planning to build myself with my dev team. I don’t know why I didn’t realize they already had this in place. So, when I was on the phone with them and they were talking about this, I was ecstatic, because I could immediately lop off a lot of development time. It was drop-dead simple to put in place. Other people have talked about it; it’s pretty amazing. I’ve made a couple of mistakes when rolling it out, where I actually put it on more people than I wanted to, and the little loading screen is not ideal. So, I really only put it on traffic that I’m pretty suspect of, but this thing is amazing. It changes the game for a lot of this stuff.
It’s just a complete 180. I have complete visibility over my network and control of it.
We had downtime when I first started, in the hours per month, mostly unplanned; now it’s maybe minutes per month, planned. It’s just a complete 180. I have complete visibility over my network and control of it. It really informed the way I was able to do a disaster recovery solution and maintenance. To top it all off, I spend almost no time managing this. My job is much bigger than the network, than the infrastructure; I’ve got all kinds of other things to do. I may spend 5% of my time on this, so the investment-to-payoff ratio is pretty big.
We’re going to start rolling out a completely new platform and it’s on our schedule, not someone else’s.
It’s still legacy code and I’m always paranoid about what’s going to hit it. We’ve got to keep updating that, but most significantly, as we build out a new platform, I bought us years. We didn’t have to do any kind of emergency migration; we could do this in lockstep. Now we’re at the point where, next year, we’re going to start rolling out a completely new platform, and it’s on our schedule, not someone else’s. It’s really because of the tools within HAProxy that I was able to do this.