Software techniques that tolerate latency variability are vital to building responsive large-scale web services.
The tail at scale
Jeffrey Dean and Luiz André Barroso — Google
Communications of the ACM (2013)
What is the problem and how important is it?
Problem: As web services grow, maintaining low latency for the end user becomes harder. Latency typically grows as the size and complexity of the system scale up or as overall use increases.
Why is it important? Modern devices require ultra-low-latency responses. The paper cites the then-emerging Google Glass as an example of a VR-style application; today the same pressure comes from EdgeAI applications, remote-controlled robotics, the Metaverse, etc., all of which require low-latency service.
The paper outlines the causes for high-latency episodes in large online services hosted on warehouse-scale computers and shares the techniques that can reduce their severity or mitigate their effect at the whole-system performance level.
Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, large online services need to create a predictably responsive whole out of less-predictable parts; we refer to such systems as "latency tail-tolerant," or simply "tail-tolerant."
Note the phrase "lengthening the latency tail": adding more software layers can itself add to service latency.
What are the insights?
What causes the latency variability?
Shared resources: local (e.g. CPU, caches) and global (e.g. network switches and shared file systems).
Daemons: Background daemons may use only limited resources on average but when scheduled can generate multi-millisecond hiccups.
Background activities (e.g. data reconstruction in distributed file systems, periodic log compactions in storage systems like BigTable, and periodic garbage collection in garbage-collected languages) can cause periodic spikes in latency.
Queueing: Multiple layers of queueing in intermediate servers and network switches amplify this variability.
Hardware Trends: power limits (e.g. CPU throttling), garbage collection in SSDs (the need to periodically garbage collect a large number of data blocks can increase read latency by a factor of 100!) and energy management (e.g. low-power modes lower the CPU clock frequency).
Key Insights:
Even rare performance hiccups affect a significant fraction of all requests in large-scale distributed systems. Eliminating all sources of latency variability in large-scale systems is impractical, especially in shared environments.
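The first insight can be checked with the paper's own numbers: if each server is slow on just 1 in 100 requests, a 100-way fan-out leaves 63% of user requests waiting on at least one slow server. A quick sketch:

```python
# Fraction of fan-out requests that must wait on at least one slow leaf.
# Paper's example: each server is slow on only 1 in 100 requests.
def frac_affected(p_slow: float, fanout: int) -> float:
    return 1.0 - (1.0 - p_slow) ** fanout

print(f"{frac_affected(0.01, 1):.3f}")    # one server: 0.010
print(f"{frac_affected(0.01, 100):.3f}")  # 100-way fan-out: 0.634
```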
Using an approach analogous to fault-tolerant computing, tail-tolerant software techniques form a predictable whole out of less-predictable parts.
Tail-tolerant software techniques allow designers to continue to optimise for the common case while providing resilience against uncommon cases. Designing such techniques requires understanding application behaviour: how often a service is requested, what system resources it uses, how long it typically takes under low and high traffic, etc. Categorising the problems into classes helps the solutions scale efficiently.
What is the solution? (Is it feasible?)
Parallelise sub-operations across many different machines, where each sub-operation is co-located with its portion of a large dataset. Parallelisation happens by fanning out a request from a root to a large number of leaf servers and merging responses via a request-distribution tree.
The sub-operations must all complete within a strict deadline for the service to feel responsive.
Over-provisioning of resources, careful real-time engineering of software, and improved reliability can all be used at all levels and in all components to reduce the base causes of variability.
These techniques are used in Google's warehouse-scale computers.
Some software techniques for interactive requests:
Ensuring interactive requests are serviced in a timely manner through many small engineering decisions:
Differentiate service classes and use higher-level queuing: prefer scheduling requests for which a user is waiting over non-interactive requests. Keep low-level queues short so higher-level policies take effect more quickly. For example, Google uses a high-level scheduler on top of the OS-level (low-level) disk queue to give priority to interactive requests.
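A minimal sketch of class-differentiated queuing, with interactive requests always draining before batch work (class names and structure are illustrative, not Google's actual scheduler):

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1   # lower value = higher scheduling priority
_seq = itertools.count()    # FIFO tie-breaker within a service class
queue = []

def submit(request, service_class):
    heapq.heappush(queue, (service_class, next(_seq), request))

def next_request():
    # Interactive requests are always served before batch requests.
    return heapq.heappop(queue)[2]

submit("index-rebuild", BATCH)
submit("user-search", INTERACTIVE)
print(next_request())  # user-search jumps ahead of the earlier batch job
```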
Reducing head-of-line blocking: It is sometimes useful for the system to break long-running requests into a sequence of smaller requests to allow interleaving of the execution of other short-running requests; for example, Google’s Web search system uses such time-slicing to prevent a small number of very computationally expensive queries from adding substantial latency to a large number of concurrent cheaper queries.
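The time-slicing idea can be illustrated with a toy round-robin over request slices (purely illustrative; real systems interleave inside the query-processing loop):

```python
from collections import deque

def time_sliced(work_items, slice_size):
    """Break a long-running request into a sequence of smaller slices."""
    for i in range(0, len(work_items), slice_size):
        yield work_items[i:i + slice_size]

def run_interleaved(requests):
    """Round-robin the slices so cheap requests are not stuck behind an
    expensive one (head-of-line blocking)."""
    order, ready = [], deque(requests)
    while ready:
        name, slices = ready.popleft()
        try:
            next(slices)                  # execute one slice
            order.append(name)
            ready.append((name, slices))  # re-queue the unfinished request
        except StopIteration:
            pass                          # request finished
    return order

expensive = ("expensive", time_sliced(list(range(6)), 2))  # 3 slices
cheap = ("cheap", time_sliced([0], 1))                     # 1 slice
print(run_interleaved([expensive, cheap]))
# ['expensive', 'cheap', 'expensive', 'expensive']
```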
Managing background activities and synchronised disruption: Examples are log compaction in log-oriented storage systems and garbage-collector activity in garbage-collected languages. A combination of throttling, breaking heavyweight operations into smaller operations, and triggering such operations at times of lower overall load can often reduce the effect of background activities on interactive request latency. Google example: synchronise background activity across many machines simultaneously; without synchronisation, a few machines are always doing some background activity, pushing out the latency tail on all requests.
Interesting note: Caching does not directly address tail latency, aside from configurations where it is guaranteed that the entire working set of an application can reside in a cache.
Two classes of techniques to build latency-tolerant systems:
Note: the techniques described below are effective only when the phenomena causing variability do not tend to affect multiple request replicas simultaneously; fortunately, in large-scale systems many sources of variability are decoupled in this way.
Within-request immediate-response techniques: These operate at a time scale of tens of milliseconds, before longer-term techniques have a chance to react.
(i) Hedged requests: issue the same request to multiple replicas and use the result from whichever replica responds first. To keep the added load moderate at scale, defer sending a secondary request until the first has been outstanding for more than the 95th-percentile expected latency for this class of requests (roughly a 5% load increase). The overhead can be further reduced by tagging hedged requests as lower priority than the primary requests. Cancel outstanding copies quickly once the request has been serviced.
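A minimal hedged-request sketch using threads; `P95_MS` and the per-replica delays are made-up demo values, and `query_replica` stands in for a real RPC:

```python
import concurrent.futures as cf
import time

P95_MS = 50  # assumed 95th-percentile latency for this request class

def query_replica(replica_id, delay_s):
    time.sleep(delay_s)  # stand-in for the real RPC
    return f"result-from-{replica_id}"

def hedged_request(replicas):
    """Send to the first replica; hedge to a second only if the first
    is still outstanding after the 95th-percentile latency."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(query_replica, *replicas[0])
        done, _ = cf.wait([first], timeout=P95_MS / 1000)
        if done:
            return first.result()  # common case: no hedge sent at all
        hedge = pool.submit(query_replica, *replicas[1])
        done, pending = cf.wait([first, hedge], return_when=cf.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort cancellation of the losing copy
        return next(iter(done)).result()

# A slow primary (200ms) triggers a hedge to a fast secondary (10ms).
print(hedged_request([("primary", 0.2), ("secondary", 0.01)]))
# result-from-secondary
```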
(ii) Observation: Once a request is actually scheduled and begins execution, the variability of its completion time goes down substantially.
Observation: Allowing a client to choose between two servers based on queue lengths at enqueue time exponentially improves load-balancing performance over a uniform random scheme.
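The "two choices" observation is easy to reproduce in a toy simulation (all parameters arbitrary):

```python
import random

def max_queue(n_tasks, n_servers, choices, seed=0):
    """Each task probes `choices` random servers and joins the shortest
    of those queues; returns the longest resulting queue."""
    rng = random.Random(seed)
    queues = [0] * n_servers
    for _ in range(n_tasks):
        probes = [rng.randrange(n_servers) for _ in range(choices)]
        shortest = min(probes, key=lambda s: queues[s])
        queues[shortest] += 1
    return max(queues)

uniform = max_queue(10_000, 1_000, choices=1)   # uniform random placement
two_choice = max_queue(10_000, 1_000, choices=2)
print(uniform, two_choice)  # the two-choice worst queue is far shorter
```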
Google recommends: enqueue copies of a request in multiple servers simultaneously and let the servers communicate status updates about these copies to each other; Google calls requests with such cross-server status updates tied requests. For example, the server that starts executing a copy can send the cancellation to its counterpart directly, rather than the client doing so.
Observation: it is useful for the client to introduce a small delay of two times the average network message delay (1ms or less in modern data-center networks) between sending the first and the second tied request. This counters the case where, e.g., both servers dequeue and start servicing the request at nearly the same time, wasting resources.
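A toy sketch of tied requests, with an atomic claim standing in for the cross-server cancellation message and sleeps standing in for queueing delay (all timings illustrative):

```python
import threading
import time

class TiedRequest:
    """A request enqueued on two servers; the first server to dequeue it
    wins the claim, which in effect cancels the peer's copy."""
    def __init__(self):
        self._claim = threading.Lock()
        self.executed_by = None

    def try_claim(self, server_name):
        if self._claim.acquire(blocking=False):
            self.executed_by = server_name
            return True
        return False  # the peer got there first: drop this copy

def server(name, queue_delay_s, request, done):
    time.sleep(queue_delay_s)  # time the copy spends in this server's queue
    if request.try_claim(name):
        done.set()

req, done = TiedRequest(), threading.Event()
threading.Thread(target=server, args=("A", 0.2, req, done)).start()
time.sleep(0.002)  # ~2x network delay between the two enqueues
threading.Thread(target=server, args=("B", 0.02, req, done)).start()
done.wait()
print(req.executed_by)  # B's queue drains first, so A's copy is cancelled
```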
(iii) An alternative to the tied-request and hedged-request schemes is to probe remote queues first, then submit the request to the least-loaded server. It can be beneficial but is less effective than submitting work to two queues simultaneously. Why? Load levels can change between probe and request time; request service times can be difficult to estimate due to underlying system and hardware variability; and clients can create temporary hot spots by all clients picking the same (least-loaded) server at the same time.
(iv) Request is sent to one server and forwarded to replicas only if the initial server does not have it in its cache and uses cross-server cancellations.Cross-request long-term adaptations: These perform on a time scale of tens of seconds to minutes and are meant to mask the effect of longer-term phenomena.
(i) Micro-partitioning of data: a static assignment of a single partition to each machine is not ideal, since machines are not uniform in performance. Instead, generate many more partitions than there are machines in the service, and then dynamically assign and load-balance these partitions across machines. Micro-partitioning also improves failure-recovery speed, since many machines each pick up one small unit of work when a machine fails. Similar to the virtual-processor-partitioning technique.
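A sketch of micro-partitioning with failure recovery spread across survivors (machine and partition counts are illustrative):

```python
# Many more partitions than machines: 100 partitions over 5 machines.
N_MACHINES, N_PARTITIONS = 5, 100

assignment = {p: p % N_MACHINES for p in range(N_PARTITIONS)}  # ~20 each

def recover(failed_machine):
    """Spread the failed machine's partitions over all survivors, so each
    survivor picks up only a few small units of work."""
    survivors = [m for m in range(N_MACHINES) if m != failed_machine]
    orphans = [p for p, m in assignment.items() if m == failed_machine]
    for i, p in enumerate(orphans):
        assignment[p] = survivors[i % len(survivors)]

recover(failed_machine=2)
load = [sum(1 for m in assignment.values() if m == i) for i in range(N_MACHINES)]
print(load)  # machine 2's 20 partitions split evenly: [25, 25, 0, 25, 25]
```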
(ii) Selective Replication: Detect or even predict certain items that are likely to cause load imbalance and create additional replicas of these items. E.g. Google Search makes additional copies of popular and important documents in multiple micro-partitions.
(iii) Latency-induced probation: Detect situations where the system performs better by excluding a particularly slow machine, or putting it on probation. The source of the slowness is frequently a temporary phenomenon such as interference from unrelated networking traffic or a spike in CPU activity for another job on the machine, and the slowness tends to be noticed when the system is under greater load. Continue to issue shadow requests to these excluded servers, collecting statistics on their latency so they can be reincorporated into the service when the problem abates.
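Latency-induced probation might be sketched as a per-server smoothed-latency tracker; the EWMA and the 50ms threshold are assumptions for the demo, not the paper's mechanism:

```python
ALPHA = 0.2          # EWMA smoothing factor (illustrative)
PROBATION_MS = 50.0  # smoothed-latency threshold for probation

ewma = {}
on_probation = set()

def record_latency(server, latency_ms):
    """Update the server's smoothed latency; exclude it from live traffic
    while slow, but keep measuring (via shadow requests) so it can be
    reincorporated once the slowness abates."""
    prev = ewma.get(server)
    ewma[server] = latency_ms if prev is None else ALPHA * latency_ms + (1 - ALPHA) * prev
    if ewma[server] > PROBATION_MS:
        on_probation.add(server)
    else:
        on_probation.discard(server)

for ms in (5, 6, 300, 400, 350):  # server "s1" hits a latency spike
    record_latency("s1", ms)
print("s1" in on_probation)       # True: excluded from live traffic
for _ in range(10):               # shadow requests observe a recovery
    record_latency("s1", 5)
print("s1" in on_probation)       # False: reincorporated
```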
Good-enough schemes: Since waiting for exceedingly slow servers might stretch service latency to unacceptable levels, Google’s Information Retrieval (IR) systems are tuned to occasionally respond with good-enough results when an acceptable fraction of the overall corpus has been searched, while being careful to ensure good-enough results remain rare. Skip non-essentials – like spell checks and ads.
Canary Requests: Untested code paths exercised by unusual requests can cause correlated crashes across thousands of servers. To prevent such scenarios, some of Google’s IR systems employ a technique called “canary requests”; rather than initially send a request to thousands of leaf servers, a root server sends it first to one or two leaf servers. The remaining servers are queried only if the root gets a successful response from the canary in a reasonable period of time. Canaries also help ensure robustness against, e.g., malicious client requests or DoS attacks. Despite the slight increase in latency caused by canary requests, such requests tend to be used for every request in all of Google’s large fan-out search systems due to the additional safety they provide.
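A sketch of canary-request fan-out; `good_leaf` and `crashing_leaf` are hypothetical stand-ins for leaf-server RPCs:

```python
def fan_out(request, leaf_servers, canary_count=2):
    """Query one or two canary leaves first; fan out to the rest only
    if a canary succeeds, avoiding correlated crashes on all leaves."""
    for leaf in leaf_servers[:canary_count]:
        try:
            leaf(request)  # may crash or hang on an untested code path
            break          # one successful canary is enough
        except Exception:
            continue
    else:
        raise RuntimeError("canary requests failed; aborting fan-out")
    return [leaf(request) for leaf in leaf_servers]  # full fan-out

def good_leaf(req):
    return f"ok:{req}"

def crashing_leaf(req):
    raise ValueError("untested code path")

print(fan_out("q", [good_leaf] * 4))  # ['ok:q', 'ok:q', 'ok:q', 'ok:q']
# fan_out("q", [crashing_leaf] * 4) raises instead of crashing every leaf
```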
What is the takeaway message?
For large-scale information retrieval systems, returning good results quickly is better than returning the best results slowly; speed itself is part of the quality metric.
Complex warehouse-scale systems' underlying hardware components display greater performance variability when under load.
While some of the most powerful tail-tolerant techniques require additional resources, their overhead can be rather modest, often relying on existing capacity already provisioned for fault-tolerance while yielding substantial latency improvements. Many of the techniques can be encapsulated within baseline libraries and systems, and the latency improvements often enable radically simpler application-level designs.
A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first.
Will this paper win the test of time award?
Yes. The paper introduced a durable conceptual lens: the tail at scale. That phrase has become standard vocabulary.
Also the techniques have aged well. Even where implementations differ today, the architectural instinct is the same: replicate, prioritise, cancel quickly, rebalance dynamically, and avoid waiting for pathological stragglers.
Also, it clearly passes the "test of time" as it appears to have already received the 2025 SIGOPS Hall of Fame Award.
How could this paper have been improved?
The benchmarks shared are narrow. The paper gives examples like the BigTable benchmark, where hedging after 10ms dramatically improves 99.9th-percentile latency, and the tied-request results in Table 2, but it does not include a methodology spanning diverse workloads, load levels, failure modes or deployment settings.
It does not quantify the extra bandwidth/CPU/storage cost of the software techniques; only a brief figure is given for one of them (the ~5% load increase for deferred hedged requests).
Replication-based techniques work best when the cause of variability does not simultaneously affect multiple replicas. This caveat is huge. In real systems, many bad events are correlated e.g. rack-level network congestion, software rollout bugs, JVM/GC behavior across many similar instances, noisy neighbors in shared infrastructure, regional control-plane failures etc. A discussion on this can be built on top.
Replication is somewhat over-emphasised. Even in 2013, statistical methods were in use for load balancing; why not also discuss predictive approaches, e.g. ML-based load prediction?
Some discussion points
The paper argues that techniques have modest overhead and can reuse resources already provisioned for fault tolerance. That is plausible and often true, but: When is extra replication worth it? When does hedging become wasteful?
When is "good enough" acceptable? What if low-latency demand is from a product of medical systems, trading, robotics control, industrial automation, or safety-critical edge AI? Are these methods still applicable?
Also see:
P. Ranganathan and U. Holzle, "Twenty Five Years of Warehouse-Scale Computing" in IEEE Micro, vol. 44, no. 05, pp. 11-22, Sept.-Oct. 2024, doi: 10.1109/MM.2024.3409469. https://doi.ieeecomputersociety.org/10.1109/MM.2024.3409469