Paper review: Warehouse-Scale Computing: Entering the Teenage Decade

September 6, 2011

This review is based on a presentation by Luiz Andre Barroso of Google, titled "Warehouse-Scale Computing: Entering the Teenage Decade". I believe it was given this past year (2011) at FCRC. I strongly recommend it, since it covers the operational issues in running a Google datacenter and also identifies many of the research issues surrounding "warehouse-scale computation".

Key Points

The talk identified several problems, some of which have been solved and some of which have not.

  • I/O latency variability right now is terrible, with basically all durable storage displaying a long latency tail. Random accesses to spinning disks are slow, flash writes are slow, and these high-latency events muck up the latency for potentially fast events.
  • Network I/O suffers a similar problem. Using TCP and interrupts adds orders of magnitude of latency to network requests, making fast network hardware slow again in software.
  • Datacenter power efficiency as defined by PUE (Power Usage Effectiveness) has gotten pretty good (less than 10 percent of power goes to operational overhead). The real problem now is making better use of servers, to get CPU load up into the 80% range instead of the current 30%.
  • This leads into another problem: how do you share all of the resources in a cluster among many different services while also chopping off the latency tail?
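
To make the latency-tail point concrete, here is a minimal Python sketch (not from the talk) using made-up numbers: 98% of requests take 1 ms and 2% take 100 ms. The function name `percentile` and the workload are illustrative assumptions, and the percentile uses the simple nearest-rank definition.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# Hypothetical workload: 980 fast requests (1 ms) and 20 slow ones (100 ms).
latencies_ms = [1.0] * 980 + [100.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

With these numbers the median is 1 ms and the mean is only about 3 ms, yet the 99th percentile is 100 ms. A small fraction of slow events is enough to define the tail, which is why the tail rather than the average is the metric to watch.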

To summarize, there are two big ideas in the talk. First, latency and variation in latency are the key performance metrics for services these days; today's web-based applications need both low latency and low variance to provide a good user experience. This may require reexamining a lot of fundamental assumptions about I/O. Second, increasing the utilization of resources in a cluster is important from an efficiency and performance standpoint. Server hardware should be a fungible resource that can be easily shared among different services.
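
As a rough illustration of the efficiency argument (my own arithmetic, not figures from the talk), the sketch below computes PUE as total facility power divided by IT equipment power, and estimates how many servers a fixed load needs at 30% versus 80% average utilization. The power figures, the load, and the per-server capacity are all hypothetical.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def servers_needed(load_units, capacity_per_server, utilization_pct):
    """Servers required to carry load_units when each server runs at
    utilization_pct percent of capacity_per_server (rounded up)."""
    usable = capacity_per_server * utilization_pct  # in hundredths of a unit
    return -(-load_units * 100 // usable)  # integer ceiling division


# Hypothetical facility: 1100 kW drawn in total, 1000 kW reaching IT equipment.
example_pue = pue(1100.0, 1000.0)

# Hypothetical load of 2400 units, with 10 units of capacity per server.
servers_at_30 = servers_needed(2400, 10, 30)
servers_at_80 = servers_needed(2400, 10, 80)
```

Here the PUE comes out to 1.1, while the same load needs 800 servers at 30% utilization but only 300 at 80%. Under these assumed numbers, raising utilization saves far more hardware than squeezing the remaining facility overhead.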

The difference between warehouse-scale computing and datacenter-scale computing lies in scale and in the type of user experience provided. Warehouse-scale computing operates in the many-petabyte range, and allows for complete system integration of the hardware, software, power, and cooling of the cluster. The types of services hosted by a warehouse-scale computer are also scaled way up, in terms of both their latency requirements and the size of the data being crunched.

Tradeoffs

One of the key tradeoffs mentioned was between latency and throughput. Most of the I/O software stack these days is built to optimize throughput, a response to relatively slow disks and networks. An example of this is Nagle's algorithm in TCP: small writes are delayed and batched into larger packets, which reduces TCP overhead (fewer packets, and so fewer header bytes, need to be sent) but also increases latency. New technologies like flash and fast networks mean that these assumptions should be reexamined.
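
On most platforms an application can opt out of Nagle's algorithm per socket with the standard `TCP_NODELAY` option. A minimal Python sketch follows; the helper name `make_low_latency_socket` is my own, and whether disabling Nagle actually helps depends on the workload.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 sends small writes immediately instead of batching them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The trade is the one described above: each small write goes out in its own packet, so latency drops while per-packet overhead rises.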

Long-term impact

Web-based services are here to stay, and I feel confident in saying that this is an area that is going to see yet more growth. Large internet companies like Google and Facebook are already dealing with these issues internally, and there are only going to be more warehouse-scale datacenters built. It's clear that these are hard problems that aren't going away because of some deus ex machina like Moore's Law, so any solutions are likely to have a big impact.
