A New Era for Web Observability at OVH
In this presentation, Steven Le Roux of OVH describes how the HAProxy Stream Processing Offload Engine (SPOE) lets you build your own sophisticated solutions, such as custom tracing frameworks similar to OpenTracing. He describes the meaning and value of logs and metrics. He then explains how they can be captured as time series data. OVH has implemented a multi-stage time series database that aggregates data for various levels of retention targeting different use cases.
Hello everyone. We’re going to speak about observability. Who is okay with the term observability? Not that much? Okay. So, what’s observability?
Observability with Logs and Metrics
The two pillars of observability are metrics and logs, but we will see that actually both are time series with different indexing strategies. The time-series database is really important at this stage. Let’s see the difference. HAProxy is really useful in terms of logging.
But indexing logs can be quite costly. I’m not telling the previous speakers that Elasticsearch is not a good solution, but it can’t fit every workload, actually.
Indexing logs has a cost, and this cost can be mitigated if you extract the value inside the logs, aggregate it in real time, for example, and just flush the corresponding metrics. Here is the difference. Above you have the full log, and if you want to build an index on it, you have to index the different fields, etc. But you can instead extract just the value, for example the HTTP statuses, count them, say, every 10 seconds, and flush the result, so you know that you had five 200 codes, and so on for the other codes. What you have to store at this layer is only the timestamp and the metric. You are saving a lot of volume, a lot of bandwidth, a lot of things. It’s quite useful.
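The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not OVH's implementation: the event tuples, the metric name `http.responses`, and both function names are assumptions made for the example. It counts HTTP statuses per 10-second window and emits one point per window and status, so only a timestamp and a value need to be stored.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 10  # flush interval used as the example in the talk


def aggregate_status_counts(events, window=WINDOW_SECONDS):
    """Group (unix_timestamp, http_status) pairs into fixed windows.

    Returns {window_start: Counter({status: count})}.
    """
    buckets = defaultdict(Counter)
    for timestamp, status in events:
        # Align each event on the start of its window.
        window_start = int(timestamp) - int(timestamp) % window
        buckets[window_start][status] += 1
    return dict(buckets)


def to_metric_points(buckets, metric_name="http.responses"):
    """Flatten windows into (timestamp, name, labels, value) points.

    The metric name and label layout are hypothetical.
    """
    points = []
    for window_start in sorted(buckets):
        for status, count in sorted(buckets[window_start].items()):
            labels = {"status": str(status)}
            points.append((window_start, metric_name, labels, count))
    return points


# Hypothetical, already-parsed log events: (unix timestamp, HTTP status).
events = [
    (1000, 200), (1001, 200), (1003, 200), (1005, 200), (1009, 200),
    (1004, 404), (1012, 200), (1015, 500),
]
points = to_metric_points(aggregate_status_counts(events))
```

Eight log lines become four metric points here, and the ratio improves as traffic grows, because the number of points depends on the number of distinct statuses per window rather than on the number of requests.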
Let’s Observe (BPE)
Let’s observe how we did it at OVH. It’s important to say that we did it BPE. It was Before Prometheus Era. There wasn’t a Prometheus exporter and actually we wrote one. It’s not exactly a Prometheus exporter because it exports a format, an HTTP format, that is kind of like Prometheus, but we don’t use Prometheus so it’s not quite the same. There was, actually, an existing HAProxy exporter, but we didn’t use it because it could not sustain our workload. We didn’t succeed in collecting metrics quickly enough. So, we had to write a new one with performance in mind.
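For readers who have not seen it, the Prometheus text exposition format that the speaker compares against looks like `name{label="value"} value timestamp`. The sketch below renders one such line. It shows the standard Prometheus format only, not OVH's variant, which the talk does not detail; the metric name and labels are made up for the example.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_sample(name, labels, value, timestamp_ms=None):
    """Render one sample line in the Prometheus text exposition format."""
    label_text = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    line = f"{name}{{{label_text}}}" if label_text else name
    line += f" {value}"
    if timestamp_ms is not None:
        # The optional timestamp is expressed in milliseconds since the epoch.
        line += f" {timestamp_ms}"
    return line


# Hypothetical metric name and labels, for illustration only.
sample = render_sample(
    "haproxy_http_responses_total",
    {"proxy": "www", "code": "200"},
    5,
    timestamp_ms=1000000,
)
```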
Once we get those metrics we export them, but we have a small daemon on each load balancer that is actually collecting metrics with a DFO strategy. DFO is Disk Failover. We flush the metrics to disk because if we lose the network or the service and we cannot push metrics, we will have them stored, and when the network comes back we will flush them. When we flush it… This component is open source. It’s named Beamium, on OVH’s GitHub.
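The Disk Failover idea can be illustrated with a small sketch. This is not Beamium's code or API: the class name, the JSON spool files, and the `send` callback are all assumptions made for the example. Batches that cannot be pushed are written to a spool directory and replayed, oldest first, the next time a push is attempted.

```python
import json
import os
import tempfile


class DiskFailoverSink:
    """Forward metric batches; spool them to disk when forwarding fails."""

    def __init__(self, spool_dir, send):
        self.spool_dir = spool_dir
        self.send = send  # callable(batch) that raises OSError on failure
        os.makedirs(spool_dir, exist_ok=True)
        # Resume numbering after any files left by a previous run
        # (assumes the directory only contains files written by this class).
        existing = sorted(os.listdir(spool_dir))
        self._seq = int(existing[-1].split(".")[0]) if existing else 0

    def push(self, batch):
        """Replay anything spooled first, then send the new batch."""
        try:
            self.replay()
            self.send(batch)
        except OSError:
            self._spool(batch)

    def _spool(self, batch):
        self._seq += 1
        path = os.path.join(self.spool_dir, f"{self._seq:012d}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(batch, handle)

    def replay(self):
        """Send spooled batches oldest first, deleting each once it is sent."""
        for name in sorted(os.listdir(self.spool_dir)):
            path = os.path.join(self.spool_dir, name)
            with open(path, encoding="utf-8") as handle:
                batch = json.load(handle)
            self.send(batch)  # raises OSError if the network is still down
            os.remove(path)


# Simulated network: the first push fails, the second succeeds.
delivered = []
network = {"up": False}


def send(batch):
    if not network["up"]:
        raise OSError("network unreachable")
    delivered.append(batch)


spool_dir = tempfile.mkdtemp()
sink = DiskFailoverSink(spool_dir, send)
sink.push([{"ts": 1000, "name": "http.responses", "value": 5}])
network["up"] = True
sink.push([{"ts": 1010, "name": "http.responses", "value": 7}])
```

Replaying before sending keeps the batches in order, and a file is deleted only after its batch has been sent, so a failure in the middle of a replay loses nothing.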
Then, we aggregate per customer with different metrics and we push them on the cloud infrastructure where we can have years of retention if we want.
Why Collect Metrics
Why collect metrics? Actually, we do a lot of things with metrics, and you can even predict the future. Yeah. How do we do it?
What to do now? Well, we can alert. We can annotate a dashboard. We can autoscale a service. For example, if the metric had been HTTP requests: Okay, we are going to hit the wall. We need more instances, so we can actually scale the service. This is all achieved in a single request to our time series database. It’s not analysis; it’s not big data; it’s just one query for a given load balancing experience for a customer that will say, “Okay, now you need to act.” We could act through the time series database, but we don’t do it.
There is another approach to this. Here, I said that it’s a single query inside the time-series database, but we also have a different infrastructure for AutoML, Auto Machine Learning. We train models so that we can go further than this, because here this example is really simple. It’s a global trend and linear aggregation; you forecast it. Okay, job done. But what if you have seasons in your signal? What if your signal has a weekly seasonality? Or a monthly seasonality? If I need to forecast, but I don’t have the global picture, my forecast doesn’t reflect what will come. This is where a trained model will help us anticipate.
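The simple case described here, a global trend extrapolated linearly, can be sketched as a least-squares line fit. This is only an illustration in plain Python, not the query OVH runs in its time series database; the function names, the sample values, and the limit are assumptions. It deliberately ignores seasonality, which is exactly the limitation the speaker points out.

```python
def fit_line(points):
    """Ordinary least-squares fit of value = slope * t + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def time_to_reach(points, limit):
    """Timestamp at which the fitted line reaches `limit`.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical samples: (seconds, requests per second), growing steadily.
samples = [(0, 100.0), (60, 130.0), (120, 160.0), (180, 190.0)]
eta = time_to_reach(samples, limit=400.0)
```

With these samples the fitted slope is 0.5 requests per second per second, so the line crosses the limit of 400 at t = 600 seconds. Comparing that estimate with the current time is the kind of signal that could drive an alert or an autoscaling decision.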
If you want to try this, it’s free. It’s a free service of OVH. You can try it on labs.ovh.com. It’s called Prescience, and there is a time series forecasting algorithm based on SARIMA. SARIMA stands for Seasonal AutoRegressive Integrated Moving Average, so it is a moving-average algorithm that takes seasonality into account. It’s quite interesting in terms of time series forecasting.
New in this Era of Observability
All this is pretty classic, right? It’s just operating a service, collecting metrics, etc. Now, what is new in this Era of Observability? Well, it’s SPOE, actually. I was planning to explain what SPOE is, but Pierre did it before me, so I will shorten the presentation a bit.
Here is a simple example. The idea was to explain how we could do OpenTracing, for example, based on SPOE. What is SPOE? Globally, it’s based on the filter engine, and you trigger on some events. For example, you trigger on on-frontend-http-request, and when you hit these kinds of events you can act. Here, I could get a trace-id and a span-id from a request. Okay?
I’ve understood that HAProxy is implementing OpenTracing, right? Or something like this? There is something along those lines, but you could have your own tracing solution, and maybe OpenTracing wouldn’t be compatible with it. SPOE lets you implement your own strategy for tracing or authentication, as we saw previously.
So, that was a quick story about observability. Thank you!