Rate limiting controls how many requests a client can make to your applications or APIs within a given time window. It protects infrastructure from abuse and enforces fair usage across every client, all while guarding against resource exhaustion. But not all rate limiting is built the same. The difference between a basic per-IP counter and an enterprise-grade solution can determine whether a platform absorbs a traffic spike gracefully or buckles under it.
1. Understand why rate limiting has become a platform decision
Rate limiting used to be a feature you configured in application code or bolted onto an API gateway, and that approach still works for simple use cases. Modern infrastructure has outgrown it, though.
Applications now span multiple clouds and run across distributed load balancer clusters, serving traffic through Kubernetes pods that scale up and down constantly. A rate limiter that only sees traffic on a single node will miss coordinated attacks that spread requests across the fleet. Embedding one in application code adds latency to every request, even legitimate ones, and requiring a separate Redis cluster introduces a new dependency that can become a single point of failure.
IBM's Cost of a Data Breach Report 2025 found the global average breach cost fell to $4.44 million, down 9% from the year before, thanks largely to faster detection and containment. Rate limiting sits at the front line of preventing application-layer DDoS, brute force attacks, credential stuffing, and API abuse, the threats that drive those numbers in the first place.
The right rate limiting solution operates at the infrastructure layer, sees traffic globally, and integrates with the broader security posture.
2. Algorithms and where enforcement happens
How you measure and enforce traffic matters as much as the threshold you set. HAProxy supports several approaches natively, and the right one depends on what you're protecting against.
| Approach | How HAProxy implements it | Best for |
| --- | --- | --- |
| Sliding window | Stick table counters like http_req_rate(10s) recalculate the rate on every request, always looking back over the trailing time window | Most rate limiting use cases; avoids the boundary-spike problem of fixed windows |
| Fixed window | Counters like http_req_cnt accumulate until you clear them on a schedule, typically with a Runtime API call in a cron job | Calendar-based quotas, like a hard daily API cap that resets at midnight |
| Connection queueing | The maxconn, minconn, and fullconn settings queue excess connections instead of rejecting them, releasing them as backend capacity frees up | Absorbing short traffic bursts without dropping legitimate requests |
| Tarpit | http-request tarpit holds a connection open and stalls the response for a set delay before returning an error | Slowing down bots and brute-force attempts so they can't retry immediately |
| CAPTCHA / JS challenge (Enterprise) | Instead of denying outright, HAProxy Enterprise's CAPTCHA module presents a challenge the client must solve before continuing | Cutting false positives when a legitimate user gets caught by a rate limit meant for bots |
HAProxy doesn't implement classic token bucket or leaky bucket rate limiting the way some other proxies do. Instead, it builds sliding and fixed window counting directly into stick tables, then pairs that counting with queueing or tarpitting for enforcement.
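To make the boundary-spike problem concrete, here is a minimal Python simulation. It is a language-neutral sketch, not HAProxy's implementation: the class names, the limit of 10 requests per 10 seconds, and the request timestamps are all illustrative assumptions, and the sliding-window class keeps an exact log of timestamps, which is only one of several ways to look back over a trailing window.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the count resets at each boundary."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new window has started, so the counter starts over.
            self.current_window = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLimiter:
    """Counts requests seen during the trailing `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.timestamps = deque()

    def allow(self, now):
        # Forget requests that have aged out of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


def burst_across_boundary():
    """Send 10 requests just before t=10s and 10 more just after it."""
    times = [9.0 + i * 0.09 for i in range(10)]
    times += [10.0 + i * 0.09 for i in range(10)]
    fixed = FixedWindowLimiter(limit=10, window=10)
    sliding = SlidingWindowLimiter(limit=10, window=10)
    fixed_allowed = sum(fixed.allow(t) for t in times)
    sliding_allowed = sum(sliding.allow(t) for t in times)
    return fixed_allowed, sliding_allowed
```

In this scenario the fixed-window limiter admits all 20 requests within roughly two seconds, because the counter resets at the boundary, while the sliding-window limiter admits only the first 10.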
The approach matters, but so does where enforcement happens. Rate limiting at the edge (your load balancer or reverse proxy) is faster and more efficient than rate limiting inside application code. It intercepts abusive traffic before it consumes compute resources on your backend servers.
HAProxy's stick tables provide in-memory, line-speed rate limiting directly in the data plane. Stick tables are key-value stores built into the HAProxy process itself. They track metrics like HTTP request rates, connection rates, and error rates per client IP, with no external dependencies. You combine them with access control lists (ACLs) and flexible routing rules to compose rate limiting logic that matches your exact use case: per-IP, per-URL path, per-API key, or any combination of HTTP attributes.
3. Global enforcement across distributed infrastructure
Single-node rate limiting breaks down when infrastructure is distributed. If a client sends 500 requests per second and that traffic is spread evenly across five load balancer nodes, each node sees only 100 requests per second. That can fall below your threshold on each individual node, even though the aggregate is well above it.
This is the most common failure mode for rate limiting at scale, and it's the reason many organizations discover their rate limits don't actually work during a real attack. The fix requires aggregating rate data across the entire fleet in real time.
HAProxy Enterprise solves this with the Global Profiling Engine. It collects stick table data from all HAProxy Enterprise load balancer nodes in a cluster, aggregates it, and pushes the results back to each node. If LoadBalancer1 receives 200 requests from a client and LoadBalancer2 receives 300 from the same client, the Global Profiling Engine sums them to 500. That aggregate count becomes available to both nodes for enforcement, and it happens in real time, using HAProxy's native peers protocol, with no external database required.
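The arithmetic behind that example can be sketched in a few lines of Python. This only illustrates the idea of summing per-node counters; it says nothing about how the Global Profiling Engine or the peers protocol work internally. The client address and the limit of 400 are assumptions made for the example.

```python
from collections import Counter

# Per-node request counts keyed by client, using the numbers from the text.
# The client address is a hypothetical value from a documentation range.
NODE_TABLES = {
    "LoadBalancer1": {"203.0.113.7": 200},
    "LoadBalancer2": {"203.0.113.7": 300},
}

LIMIT = 400  # hypothetical threshold


def aggregate(node_tables):
    """Sum each client's counter across every node's table."""
    total = Counter()
    for table in node_tables.values():
        total.update(table)  # Counter.update adds counts from a mapping
    return total


def over_limit(rates, limit):
    """Return the set of clients whose rate exceeds the limit."""
    return {client for client, rate in rates.items() if rate > limit}
```

Checked node by node, neither table contains a client above the limit. Checked against `aggregate(NODE_TABLES)`, the same client shows a combined count of 500 and is flagged.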
The Global Profiling Engine also stores historical aggregation data. You can compare a client's current request rate against their historical baseline and set dynamic thresholds that adapt to normal traffic patterns. A legitimate client whose traffic naturally spikes during business hours won't be penalized, while an attacker generating the same volume at 3 AM gets flagged.
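One way to picture a baseline-relative threshold is the sketch below. It is a hypothetical illustration only: the hourly baseline table, the multiplier, and the floor are invented for the example and do not describe how the Global Profiling Engine stores or evaluates historical data.

```python
# Hypothetical historical baseline for one client: hour of day -> typical
# requests per minute.
BASELINE_BY_HOUR = {14: 400.0, 3: 10.0}


def dynamic_limit(baseline_rate, multiplier=2.0, floor=50.0):
    """Allow some headroom above the baseline, but never go below a floor."""
    return max(floor, baseline_rate * multiplier)


def is_anomalous(current_rate, hour, baselines=BASELINE_BY_HOUR):
    """Flag a rate that exceeds the dynamic limit for that hour of day."""
    baseline = baselines.get(hour, 0.0)
    return current_rate > dynamic_limit(baseline)
```

With these assumed numbers, 500 requests per minute at 14:00 stays under the limit of 800, while the same 500 requests per minute at 03:00 exceeds the limit of 50 and is flagged.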
Key takeaway: if your rate limiting only works per-node, it doesn't work at all in a distributed environment. Global aggregation is what separates a demo-ready rate limiter from a production-ready one.
4. Rate limiting for APIs and AI services
APIs and AI inference endpoints face distinct rate limiting challenges. A simple request counter treats every API call equally, but not all requests carry the same cost. A lightweight GET request that returns a cached response is fundamentally different from a POST that triggers an expensive database write or an LLM inference call that consumes GPU seconds.
Effective API rate limiting needs granularity:
Login pages and payment APIs need stricter per-endpoint limits than public read endpoints.
Limits should apply per API key or token, not just per IP, since many clients share addresses behind NAT or proxies.
Error-rate tracking matters too: a client generating a high rate of 4xx or 5xx responses is likely probing for vulnerabilities.
HAProxy's rate limiting integrates directly with its API gateway and AI gateway capabilities. You can define rate limits per URL path, per HTTP method, per header value, or per any other request attribute, which means you can build policies like allowing 100 requests per minute to /api/v2/search but only 10 per minute to /api/v2/admin. Stick tables also track more than request counts: error rates, connection rates, and bytes transferred are all available, so you can, for example, block any client generating more than 5 errors per second.
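The per-path policy described above can be modeled roughly as follows. This is a Python sketch of the logic rather than HAProxy configuration. It assumes exact path matching, a 60-second trailing window, and limits tracked per API key and path pair; the class name and the key names used with it are hypothetical.

```python
from collections import defaultdict, deque

# Limits from the example in the text: requests per minute for each path.
PATH_LIMITS = {"/api/v2/search": 100, "/api/v2/admin": 10}
WINDOW_SECONDS = 60


class PerPathLimiter:
    """Tracks a separate trailing-window count per (API key, path) pair."""

    def __init__(self, path_limits, window):
        self.path_limits = path_limits
        self.window = window
        self.history = defaultdict(deque)

    def allow(self, api_key, path, now):
        limit = self.path_limits.get(path)
        if limit is None:
            return True  # assumption: paths without a configured limit pass
        seen = self.history[(api_key, path)]
        # Drop timestamps that have aged out of the trailing window.
        while seen and seen[0] <= now - self.window:
            seen.popleft()
        if len(seen) >= limit:
            return False
        seen.append(now)
        return True
```

Keying on the API key rather than the source IP means two clients behind the same NAT address do not consume each other's quota.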
For AI services, where a single inference request can cost 100x more than a standard API call, this kind of granularity prevents a single client from monopolizing expensive backend resources.
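A cost-weighted budget is one way to express that difference. The sketch below charges each request a weight instead of counting it as one. The weights (1 for a cached GET, 100 for an inference call), the budget, and the class name are assumptions for illustration, and this is not a built-in HAProxy feature description.

```python
from collections import deque

# Hypothetical relative costs, following the "100x" figure in the text.
REQUEST_COSTS = {"cached_get": 1, "inference": 100}


class CostBudgetLimiter:
    """Limits the total cost a client may spend in a trailing window."""

    def __init__(self, budget, window):
        self.budget = budget
        self.window = window
        self.spent = deque()  # (timestamp, cost) pairs, oldest first
        self.total = 0

    def allow(self, cost, now):
        # Refund charges that have aged out of the trailing window.
        while self.spent and self.spent[0][0] <= now - self.window:
            _, old_cost = self.spent.popleft()
            self.total -= old_cost
        if self.total + cost > self.budget:
            return False
        self.spent.append((now, cost))
        self.total += cost
        return True
```

With a budget of 300 per minute, a client can make three inference calls or 300 cached reads, but not both, before requests are refused until earlier charges expire.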
5. How rate limiting fits into your security stack
Rate limiting is most effective when it works alongside other security layers rather than operating in isolation. A client exceeding a rate limit might be a legitimate user during a traffic spike, or it might be a bot performing credential stuffing. The response should differ.
Look for a rate limiting solution that can coordinate with:
Bot management to classify whether rate-limited traffic is human or automated
WAF to correlate rate spikes with known attack patterns
DDoS protection to distinguish application-layer abuse from volumetric attacks
Observability tools to provide real-time visibility into rate limiting decisions
When these layers share data, you can build sophisticated enforcement strategies. For example: if a client exceeds a rate limit and the bot management module classifies them as automated, present a CAPTCHA. If they exceed the limit and the WAF detects SQL injection attempts in the same session, block them entirely.
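That decision logic can be written down as a small function. The sketch below is a Python illustration of the two examples in the text. Two details are assumptions not stated above: the SQL injection signal takes precedence when both signals are present, and a client that exceeds the limit with no other signal receives a plain 429 response.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DENY_429 = "deny_429"
    CAPTCHA = "captcha"
    BLOCK = "block"


@dataclass(frozen=True)
class ClientContext:
    """Signals shared by the rate limiter, bot management, and the WAF."""
    over_rate_limit: bool
    classified_automated: bool
    sqli_detected: bool


def decide(ctx):
    """Pick a response based on the combined signals for one client."""
    if not ctx.over_rate_limit:
        return Action.ALLOW
    if ctx.sqli_detected:
        return Action.BLOCK      # assumption: strongest signal wins
    if ctx.classified_automated:
        return Action.CAPTCHA
    return Action.DENY_429       # assumption: default for a plain overage
```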
HAProxy Enterprise's multi-layered security architecture does exactly this. Rate limiting through the Global Profiling Engine works alongside the HAProxy Enterprise WAF, the Bot Management Module, and DDoS protection. All of these layers run within the HAProxy Enterprise process, share client context, and can be orchestrated together through HAProxy Fusion Control Plane using the visual Threat-Response Matrix.
6. Centralized management at scale
Managing rate limits on a single server is simple. Managing them across dozens or hundreds of load balancer instances serving different applications and teams in different environments is where operational overhead grows.
Without centralized management, each team configures rate limits independently, so policies drift and thresholds become inconsistent. Troubleshooting a false positive requires logging into individual nodes to inspect stick table data, and rolling out a new policy during an active attack means touching every node manually.
HAProxy Fusion's management GUI provides a single interface to manage rate limiting policies across your entire HAProxy Enterprise fleet. Teams can define and push rate limiting configurations from one GUI or API, view aggregated security events in real-time dashboards, and integrate policy changes into their CI/CD pipelines. The observability suite compiles over 150 metrics, so you can track rate limiting decisions, blocked request volumes, and threshold effectiveness across all clusters from a single pane.
7. Performance under load
Rate limiting adds processing to every request. If your rate limiter itself becomes a bottleneck, the protection is self-defeating. Evaluate how the solution handles rate limiting under load. Does enforcement add measurable latency per request, or does it require external network calls to Redis, a cloud API, or a separate rate limiting service? And how does memory consumption scale with the number of tracked clients?
Solutions that depend on external state stores introduce network round trips on every request, while those that run in-process with the load balancer avoid that overhead entirely.
HAProxy stick tables run in-memory within the HAProxy process. There are no external calls and no network hops, so rate limit checks add no network round trips and only negligible processing time. Stick table entries are indexed with elastic binary trees, which keeps lookups, inserts, and evictions fast as the table grows. Memory consumption is predictable and scales linearly with the number of tracked entries and counters. HAProxy's architecture is designed to hold millions of concurrent connections efficiently, so rate limiting metadata doesn't compete for resources.
At enterprise scale, this matters. Roblox uses hundreds of HAProxy instances to manage millions of requests per second. Booking.com runs HAProxy in a load-balancer-as-a-service platform. At that scale, every microsecond of overhead per request counts.
8. Flexibility beyond simple request counting
The best rate limiting solutions go beyond counting requests per IP per second. Real-world scenarios demand flexibility:
Connection rate limiting to cap how quickly new TCP connections are established from a single source
Error rate tracking to detect and block clients generating abnormally high error volumes (a strong signal for vulnerability scanning or brute force)
Byte rate limiting to throttle clients transferring excessive data volumes
Composite conditions where multiple signals (request rate AND error rate AND geographic origin) combine to trigger enforcement
HAProxy stick tables support all of these. You can store and track request rates, connection rates, error rates, byte counters, and general-purpose counters (gpc0, gpc1) simultaneously. Combined with HAProxy's ACL system, you compose rules using full logical operators against any TCP/IP information, HTTP attribute, or dynamic traffic condition. The response options are equally flexible: deny with a 429, tarpit the connection, redirect to a challenge page, route to a different backend, or present a CAPTCHA.
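As a rough model of how composite conditions combine, the Python sketch below builds rules out of small predicates joined with a logical AND. It mirrors the idea of composing ACLs but is not HAProxy syntax. The thresholds, the placeholder country code "XX", and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientStats:
    req_rate: float  # requests per second
    err_rate: float  # error responses per second
    country: str     # country code, assumed to come from a GeoIP lookup


def high_request_rate(stats):
    return stats.req_rate > 50   # hypothetical threshold


def high_error_rate(stats):
    return stats.err_rate > 5    # hypothetical threshold


def from_watched_country(stats):
    return stats.country in {"XX"}  # placeholder code, not a real country


def all_of(*predicates):
    """Combine predicates with a logical AND."""
    return lambda stats: all(p(stats) for p in predicates)


# Ordered rules: the first matching rule decides the response.
RULES = [
    (all_of(high_request_rate, high_error_rate, from_watched_country), "tarpit"),
    (high_request_rate, "deny_429"),
]


def choose_action(stats):
    for predicate, action in RULES:
        if predicate(stats):
            return action
    return "allow"
```

Ordering the rules from most specific to least specific lets the composite condition trigger the harsher response, while a plain rate overage still falls through to a 429.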
This composability is what separates a rate limiting feature from a rate limiting platform. Rather than selecting from a fixed menu of options, you build exactly the logic your applications need.
9. Total cost and operational simplicity
Rate limiting solutions carry costs beyond the license price. Factor in the infrastructure cost of external state stores, like Redis clusters or dedicated rate limiting services, plus the operational cost of managing those dependencies, including failover, scaling, and monitoring. There's also the latency cost of external network calls on every request, and the complexity cost of integrating a standalone rate limiter with the load balancer, WAF, and observability stack.
A rate limiting solution built into the application delivery platform eliminates most of these costs. There's no separate infrastructure to run and no external dependencies to monitor. Integration glue disappears too, since rate limiting runs in the same process as load balancing, SSL termination, and security enforcement.
HAProxy Enterprise's Global Rate Limiting is included with your HAProxy Enterprise instance. Stick tables are built in. The Global Profiling Engine is an enterprise module that runs alongside your load balancers. There's no additional per-request billing and no separate SaaS subscription. You don't need to manage an external state store either.
Conclusion
The best rate limiting solutions enforce limits globally and run at line speed, all while integrating with the security stack. We built Global Rate Limiting in HAProxy Enterprise to do all of that without external dependencies. See how it fits into your DDoS protection and rate limiting strategy or request a demo.