A New Era for Web Observability at OVH

Steven Le Roux
Engineer, OVH

In this presentation, Steven Le Roux of OVH describes how the HAProxy Stream Processing Offload Engine (SPOE) lets you build your own sophisticated solutions, such as custom tracing frameworks similar to OpenTracing. He describes the meaning and value of logs and metrics. He then explains how they can be captured as time series data. OVH has implemented a multi-stage time series database that aggregates data for various levels of retention targeting different use cases.

Transcript

Hello everyone. We’re going to speak about observability. Who is okay with the term observability? Not that much? Okay. So, what’s observability?

If I talk about logs and metrics, maybe now it speaks to you. Who is using a time-series database among you? Quite a bunch! Let’s take a vote. Who is using InfluxDB? Yeah. Prometheus? Yeah. OpenTSDB? One, okay, two. BigGraphite? Graphite? Not that many today. Yeah, okay. Something else, maybe, that I missed? Warp 10? Oh, okay, a few guys. Great. Who is operating HAProxy? Quite a bunch. So, there is convergence between the time-series database and the HAProxy instance, actually.

Observability with Logs and Metrics

The two pillars of observability are metrics and logs, but we will see that actually both are time series with different indexing strategies. The time-series database is really important at this stage. Let’s see the difference. HAProxy is really useful in terms of logging.

This image comes from the blog of HAProxy and shows the different things that you can get from the log. For example, you have counters, you have established connections, statuses. You have the queues, the connections. You have many insightful things in the log and actually it’s marvellous. When you operate a solution, to have this kind of information is really, really, really a good thing.

But indexing logs can be quite costly. I’m not saying to the previous speakers that Elasticsearch is not a good solution, but it can’t work for every workload, actually.

On the opposite side of logs you have metrics. Metrics are simple measured data evolving over time. Speaking in terms of HAProxy, what are metrics? Client session rates, counters, response times, queues, etc. These kinds of metrics are exposed through the socket API or the Prometheus exporter.
So, there are metrics, but we have seen in the previous slide that there are many metrics in the logs too. That’s quite important, and it’s how we do it: we extract metrics from the logs and store only the metrics. Why? Because, as you see in the slide, storing a log is actually not an issue, since you have good storage strategies, but if you want to query it and use Kibana, for example, you have to index it.

Indexing logs has a cost, and this cost can be mitigated if you extract the values inside the logs, aggregate them in real time, for example, and just flush the corresponding metrics. Here is the difference. Above you have the full log, and if you want to build an index on it, you have to index the different fields, etc. But if you just extract the values, for example the HTTP statuses, count them, say, every 10 seconds, and flush the result, you know that you had five 200 codes, and so on for the other codes. What you have to store at this layer is only the timestamp and the metric. You are saving a lot of volume, a lot of bandwidth, a lot of things. It’s quite useful.
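
To make the idea concrete, here is a minimal Python sketch of counting HTTP statuses per 10-second window and flushing only a timestamp, a metric name, and a count. The function names, the `http.status.<code>` naming scheme, and the sample events are assumptions made for illustration; they are not taken from OVH's pipeline.

```python
from collections import Counter, defaultdict


def aggregate_statuses(events, window=10):
    """Count HTTP statuses per fixed time window.

    `events` is an iterable of (unix_timestamp, http_status) pairs,
    i.e. the only two fields we keep from each log line.
    """
    buckets = defaultdict(Counter)
    for ts, status in events:
        window_start = int(ts) - int(ts) % window  # align to the window boundary
        buckets[window_start][status] += 1
    return dict(buckets)


def to_metrics(buckets):
    """Flatten the buckets into (timestamp, metric_name, value) points."""
    points = []
    for window_start in sorted(buckets):
        for status, count in sorted(buckets[window_start].items()):
            # Hypothetical metric naming scheme, chosen for this example only.
            points.append((window_start, f"http.status.{status}", count))
    return points


# Seven log lines collapse into three metric points.
events = [(1000, 200), (1001, 200), (1003, 404), (1004, 200),
          (1007, 200), (1009, 200), (1012, 500)]
points = to_metrics(aggregate_statuses(events))
for point in points:
    print(point)
```

The first window holds five 200 codes and one 404, so six log lines are reduced to two points; the index no longer has to cover every field of every line.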

Let’s Observe (BPE)

Let’s observe how we did it at OVH. It’s important to say that we did it BPE. It was Before Prometheus Era. There wasn’t a Prometheus exporter and actually we wrote one. It’s not exactly a Prometheus exporter because it exports a format, an HTTP format, that is kind of like Prometheus, but we don’t use Prometheus so it’s not quite the same. There was, actually, an existing HAProxy exporter, but we didn’t use it because it was not sustaining our workload. We didn’t succeed in collecting metrics quickly enough. So, we had to rework one with performance in mind.

When you collect metrics…Willy said in the keynote that there are many, many insightful stats, but maybe you don’t always get the meaning behind them. Now, there is documentation for each one of those stats. It should be useful. But here, with this exporter, you can just choose which of the metrics you want to get. You can select them at a fine grain.
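
As a rough illustration of selecting only some stats, the Python sketch below filters a CSV dump in the style of HAProxy's `show stat` output and prints the chosen fields in a Prometheus-like text form. The sample is heavily trimmed (the real output has many more columns), and the `haproxy_<field>` naming and the `select_stats` helper are assumptions for this example, not the format of OVH's exporter.

```python
import csv
import io

# Trimmed, made-up sample in the style of HAProxy's CSV stats output.
SAMPLE = (
    "# pxname,svname,qcur,scur,stot\n"
    "web,FRONTEND,,12,3400\n"
    "web,srv1,0,5,1700\n"
)


def select_stats(csv_text, wanted):
    """Keep only the `wanted` columns and render them as labelled metrics."""
    # The header line starts with "# ", which is not part of the first column name.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("# ")))
    lines = []
    for row in reader:
        for field in wanted:
            value = row.get(field, "")
            if value == "":
                continue  # this stat does not apply to this proxy or server
            labels = f'pxname="{row["pxname"]}",svname="{row["svname"]}"'
            lines.append(f"haproxy_{field}{{{labels}}} {value}")
    return lines


lines = select_stats(SAMPLE, ["scur", "qcur"])
print("\n".join(lines))
```

Selecting a handful of fields up front keeps the number of series per load balancer small before anything is pushed downstream.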

Once we get those metrics we export them, but we have a small daemon on each load balancer that is actually collecting metrics with a DFO strategy. DFO is Disk Failover. We flush the metrics to disk because if we lose the network or the service and we cannot push metrics, we will have them stored, and when the network comes back we will flush them. When we flush it…This component is open source. It’s named Beamium, on OVH’s GitHub.
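
The disk failover idea can be sketched in a few lines of Python. This is only an illustration of the principle described above, not Beamium's implementation: the `DiskFailoverBuffer` class, the file naming, and the fake sender are all hypothetical.

```python
import os
import tempfile


class DiskFailoverBuffer:
    """Spool every batch to disk first, then try to push what is spooled."""

    def __init__(self, spool_dir, send):
        self.spool_dir = spool_dir
        self.send = send  # callable expected to raise OSError when the push fails
        self.seq = 0      # simplification: a real daemon would resume numbering on restart
        os.makedirs(spool_dir, exist_ok=True)

    def push(self, batch):
        self.seq += 1
        path = os.path.join(self.spool_dir, f"{self.seq:012d}.metrics")
        with open(path, "w") as f:
            f.write("\n".join(batch))
        return self.drain()

    def drain(self):
        """Send spooled files oldest first; stop at the first failure."""
        sent = 0
        for name in sorted(os.listdir(self.spool_dir)):
            path = os.path.join(self.spool_dir, name)
            with open(path) as f:
                batch = f.read().split("\n")
            try:
                self.send(batch)
            except OSError:
                break  # network still down: keep the files for later
            os.remove(path)
            sent += 1
        return sent


# Demo with a fake sender that fails while the "network" is down.
state = {"up": False}
received = []


def fake_send(batch):
    if not state["up"]:
        raise ConnectionError("network down")
    received.extend(batch)


spool = tempfile.mkdtemp()
buf = DiskFailoverBuffer(spool, fake_send)
sent_while_down = buf.push(["1000 http.status.200 5"])
buf.push(["1010 http.status.500 1"])
state["up"] = True
sent_after_recovery = buf.drain()
print(sent_while_down, sent_after_recovery, received)
```

Writing to disk before sending means a crash or an outage between the two steps loses nothing; the cost is one extra write per batch.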

We push to a multi-stage time-series infrastructure. Why? Because, at first, we have a live instance, which is in memory, and we use it for fine-grained operations: for example, scaling, monitoring, etc. We use it, also, for aggregating data, because when you have a lot of data spread across different clusters you want to aggregate it. That’s why, actually, we have multiple stages for this.
At the first stage, we use a really short retention strategy, mostly for monitoring purposes. We first aggregate per frontend, backend, etc. on a customer scope. Then, we push these aggregated metrics to the second stage, which provides global insights into the platform. But still at this stage we don’t have a long retention. We have a day’s retention because it’s still in memory. But we are global, so we have the total view for customers of how the global load balancing experience is behaving.

Then, we aggregate per customer with different metrics and we push them on the cloud infrastructure where we can have years of retention if we want.

So, we have really different strategies for time series, and it’s quite important because there is a huge reduction factor in terms of unique time series. On the first layer we have tens of millions of unique time series. The first layer is really the fine-grained, raw data. There are millions and millions of time series. Even if it’s not an issue to have them on the cloud, because we have hundreds of millions of time series on the cloud, when you operate it’s better to manage your time series well. From, let’s say, a hundred million time series at the first stage, the second stage has only ten million, and we keep only a hundred thousand at the end for the customers. You see that if we had to keep a hundred million time series with a long retention, the cost wouldn’t be the same.
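
The reduction in unique series from stage to stage comes from dropping labels and summing what remains. The Python sketch below shows that mechanism on a toy data set; the label names (`customer`, `frontend`, `server`), the `rollup` helper, and the choice of a plain sum are assumptions for illustration and do not describe OVH's actual aggregation rules.

```python
from collections import defaultdict


def rollup(series, keep):
    """Sum the points of all series that share the same values for the labels in `keep`.

    `series` maps a tuple of (label, value) pairs to a list of (timestamp, value) points.
    """
    out = defaultdict(lambda: defaultdict(float))
    for labels, points in series.items():
        key = tuple((k, v) for k, v in labels if k in keep)
        for ts, value in points:
            out[key][ts] += value
    return {key: sorted(pts.items()) for key, pts in out.items()}


# Stage 1: raw, fine-grained series (one per customer, frontend and server).
raw = {
    (("customer", "a"), ("frontend", "web"), ("server", "s1")): [(0, 10.0), (10, 12.0)],
    (("customer", "a"), ("frontend", "web"), ("server", "s2")): [(0, 5.0), (10, 7.0)],
    (("customer", "a"), ("frontend", "api"), ("server", "s3")): [(0, 1.0), (10, 2.0)],
    (("customer", "b"), ("frontend", "web"), ("server", "s4")): [(0, 3.0), (10, 4.0)],
}

# Stage 2: one series per customer and frontend.
per_frontend = rollup(raw, {"customer", "frontend"})
# Stage 3: one series per customer, the only one kept with long retention.
per_customer = rollup(per_frontend, {"customer"})

print(len(raw), len(per_frontend), len(per_customer))
print(per_customer[(("customer", "a"),)])
```

Each stage holds fewer unique series than the one before it, which is what makes a longer retention affordable further down the chain.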

Why Collect Metrics

Why collect metrics? Actually, we do a lot of things with metrics and you can predict the future. Yeah. How do we do it?

Here is an example based on memory, but it could be HTTP requests, for example. You see there is a trend in the graph. It’s rising, rising, rising, rising.
Oh, not good because since it’s memory, there is a limit, a hardware limit. Here we materialize it with the green line. We are going to hit the wall.
What we can do is extract the trend of the series and then forecast it.
If we forecast, the point where the lines cross is when we hit the wall.
If we are going to hit the wall, we can anticipate any actions. We can say, “Oh, we have to move before the incident.” The trend is one way to forecast, but we can have other strategies. For example, we can forecast the global signal, not the trend. We see that the crossing time is not exactly the same. Depending on the workload, the use case, and the forecasting strategy, it can be different.
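
A minimal Python sketch of the trend-based forecast: fit a straight line to the series and solve for the time at which it crosses the limit. The talk does not say which fitting method is used, so the ordinary least-squares fit here is an assumption, and the sample values and the limit are invented.

```python
def fit_line(points):
    """Ordinary least-squares fit of value = slope * t + intercept."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def time_to_limit(points, limit):
    """Return the time at which the fitted trend reaches `limit`, or None if it never does."""
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None  # flat or decreasing trend: no crossing ahead
    return (limit - intercept) / slope


# Memory usage in GB sampled once per hour, with a hardware limit of 64 GB.
memory = [(0, 40.0), (1, 42.0), (2, 44.0), (3, 46.0), (4, 48.0)]
eta = time_to_limit(memory, limit=64.0)
print(f"trend reaches the limit at t = {eta:.1f} hours")
```

Forecasting the full signal instead of the trend would give a different crossing time, which is why the choice of strategy depends on the workload.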

What to do now? Well, we can alert. We can annotate in a dashboard. We can autoscale a service. For example, if it had been HTTP requests: okay, we are going to hit the wall, we need more instances, so we can actually scale the service. This is all achieved in a single request in our time series database. It’s not analysis; it’s not big data; it’s just one query for a given load balancing experience for a customer that will say, “Okay, now you need to act.” We could act through the time series database, but we don’t do it.

There is another approach to this. Here, I said that it’s a single query inside the time-series database, but we also have a different infrastructure for AutoML, Auto Machine Learning. We train models so that we can go further than this, because this example is really simple. It’s a global trend and linear aggregation; you forecast it. Okay, job done. But what if you have seasons in your signal? What if your signal has a weekly seasonality? Or a monthly seasonality? If I need to forecast, but I don’t have the global picture, my forecast doesn’t reflect what will come. This is where a trained model will help us to anticipate.

If you want to try this, it’s free. It’s a free service of OVH. You can try it on labs.ovh.com. It’s called Prescience and there is a time series forecasting algorithm based on SARIMA. SARIMA is a seasonal autoregressive integrated moving-average model, and it’s quite interesting in terms of time series forecasting.

We can also detect anomalies. With all those algorithms, we can also do ESD or z-score tests to get, actually, the deviation. Once you have this, you see on the graph that the outliers will be the most uncommon values that you have. The spikes here, where I annotated them, show where I could take action. Or if you see latencies or weird events, you can catch them. You get it from your time series. So, it’s really insightful and you can do a lot of things with a proper time series database.
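
A z-score test is simple enough to sketch in Python: measure how many standard deviations each value is from the mean and flag the ones beyond a threshold. The `zscore_outliers` helper, the threshold, and the latency values are assumptions for illustration; a time series database would typically offer this as a built-in function.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the indexes of values whose z-score exceeds `threshold` in absolute value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # constant series: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Response times in milliseconds with one obvious spike at the end.
latencies = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
outliers = zscore_outliers(latencies, threshold=2.5)
print(outliers)
```

With the population standard deviation, a single outlier among n points can never score above the square root of n - 1, so short windows need a threshold lower than the usual 3.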

New in this Era of Observability

All this is pretty classic, right? It’s just operating a service, collecting metrics, etc. Now, what is new in this era of observability? Well, it’s SPOE, actually. I was planning to explain what SPOE is, but Pierre did it before, so I will shorten the presentation a bit.

We all tend to speak of HAProxy as a Swiss Army knife, you know. Actually, SPOE is like adding a bazooka as a tool inside the Swiss Army knife, because this kind of thing makes HAProxy like a framework that you can extend, and you can do a lot of things, actually.

Here is a simple example. The idea was to explain how we could do OpenTracing, for example, based on SPOE. What is SPOE? Globally, it’s based on the filter engine, and you trigger on some events. For example, you trigger on on-frontend-http-request, and when you hit these kinds of events you can act. Here, I could get a trace-id and span-id from a request. Okay?

On the response I can get the same, and if I close the time window between the request and the response, I can flush the span to the tracing framework so that I get visibility, observability and full tracing between my clients, load balancer, server, etc.
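
The request and response pairing can be sketched in Python as follows. This shows only the bookkeeping an agent might do once it has received the two events; it does not implement the SPOE protocol, and the `SpanTracker` class, its method names, and the span fields are hypothetical.

```python
class SpanTracker:
    """Pair request and response events by (trace-id, span-id) and emit closed spans."""

    def __init__(self, flush):
        self.flush = flush    # callable that ships a finished span to the tracing backend
        self.open_spans = {}  # (trace_id, span_id) -> request timestamp

    def on_request(self, trace_id, span_id, ts):
        """Called for the request-side event: open the time window."""
        self.open_spans[(trace_id, span_id)] = ts

    def on_response(self, trace_id, span_id, ts):
        """Called for the response-side event: close the window and flush the span."""
        start = self.open_spans.pop((trace_id, span_id), None)
        if start is None:
            return None  # response without a matching request: nothing to close
        span = {
            "trace_id": trace_id,
            "span_id": span_id,
            "start": start,
            "duration": ts - start,
        }
        self.flush(span)
        return span


# Demo: collect flushed spans in a list instead of a real tracing backend.
spans = []
tracker = SpanTracker(spans.append)
tracker.on_request("t1", "s1", ts=100.0)
tracker.on_response("t1", "s1", ts=100.25)
orphan = tracker.on_response("t2", "s9", ts=101.0)
print(spans)
```

A production agent would also need to expire requests that never receive a response, otherwise the table of open spans grows without bound.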

I’ve understood that HAProxy is implementing OpenTracing, right? Or something like this? There is something along those lines, but you could have your own tracing solution and maybe OpenTracing wouldn’t be compatible with your solution. This enables you to implement your own strategy for tracing or authentication, as we saw previously.

Apparently, the HAProxy team has a message for you, maybe.

So, that was a quick story about observability. Thank you!
