Paper review: The Google File System

September 24, 2011

This is a paper review for "The Google File System" by Ghemawat et al., published at SOSP in 2003. It's a fairly important paper, and it directly inspired the architecture of HDFS.

Main ideas

What Google did was look very carefully at their desired workload, and build a distributed filesystem specifically for that. GFS is very much not a general purpose filesystem, and I really like how they lay out quite clearly early on the assumptions they make:

  • Files are almost all large, many GBs
  • Target throughput, not latency
  • Writes are almost entirely appends; existing data is rarely overwritten
  • Must be distributed and fault-tolerant

The problem they were basically trying to solve was log analytics at scale, meaning mostly long sequential writes and reads of very large files. Spreading files across many disks is crucial both for throughput and for fault tolerance. GFS can be viewed sort of like a distributed version of RAID 1.
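To make the throughput argument concrete, here is a rough back-of-the-envelope sketch in Python. The disk speed and cluster size are assumed numbers of my own, not figures from the paper:

    # Back-of-the-envelope: why striping a big file over many chunkservers matters.
    # All numbers are illustrative assumptions, not figures from the GFS paper.
    FILE_SIZE_GB = 1000       # a 1 TB log file
    DISK_MBPS = 50            # assumed sequential throughput of one ~2003-era disk
    NUM_CHUNKSERVERS = 200    # assumed number of chunkservers reading in parallel

    single_disk_hours = FILE_SIZE_GB * 1024 / DISK_MBPS / 3600
    parallel_minutes = FILE_SIZE_GB * 1024 / (DISK_MBPS * NUM_CHUNKSERVERS) / 60

    print(f"one disk: {single_disk_hours:.1f} hours")                     # ~5.7 hours
    print(f"{NUM_CHUNKSERVERS} servers: {parallel_minutes:.1f} minutes")  # ~1.7 minutes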

The architecture is a single GFS master, which stores metadata for all the files, and many chunkservers, which store fixed-size 64 MB chunks of those files. The master is only consulted to look up chunk locations for a given range of a file; the actual reads and writes go directly to the appropriate chunkservers. Every chunk is replicated across multiple chunkservers for durability and load balancing. Chunkservers talk to the master via heartbeat messages, on which the master can piggyback commands such as re-replicating a chunk or asking for a chunkserver's list of chunks.
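Here is a minimal sketch of that read path from the client's point of view. The RPC stubs (master.lookup, replica.read) and their signatures are illustrative assumptions, not the real GFS interface:

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

    def gfs_read(master, filename, offset, length):
        """Read `length` bytes from `filename` starting at `offset`."""
        data = []
        while length > 0:
            chunk_index = offset // CHUNK_SIZE    # which chunk holds this offset
            chunk_offset = offset % CHUNK_SIZE    # position within that chunk
            # The master is asked only for metadata: a chunk handle plus
            # the locations of its replicas.
            handle, replicas = master.lookup(filename, chunk_index)
            # The bytes themselves come straight from a chunkserver; the master
            # never sits on the data path.
            n = min(length, CHUNK_SIZE - chunk_offset)
            data.append(replicas[0].read(handle, chunk_offset, n))
            offset += n
            length -= n
        return b"".join(data)

The nice property of this split is that the master only ever serves small metadata requests, so a single master doesn't become a bandwidth bottleneck.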

Data consistency is made a lot easier by not having to worry about overwrites. It also means clients can cache chunk locations, since those rarely change.
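A small sketch of what that client-side cache might look like; the TTL and the shape of the cache entries are my own assumptions rather than details from the paper:

    import time

    class ChunkLocationCache:
        """Caches (filename, chunk_index) -> (chunk handle, replica locations)."""

        def __init__(self, master, ttl_seconds=60):
            self.master = master
            self.ttl = ttl_seconds
            self.cache = {}  # (filename, chunk_index) -> (handle, replicas, fetched_at)

        def lookup(self, filename, chunk_index):
            key = (filename, chunk_index)
            entry = self.cache.get(key)
            if entry and time.time() - entry[2] < self.ttl:
                return entry[0], entry[1]  # still fresh: skip the round trip to the master
            handle, replicas = self.master.lookup(filename, chunk_index)
            self.cache[key] = (handle, replicas, time.time())
            return handle, replicas

        def invalidate(self, filename, chunk_index):
            # Drop an entry when a chunkserver reports a stale chunk handle.
            self.cache.pop((filename, chunk_index), None)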

Future relevance

GFS clearly does a good job at the application it was designed for: sequential reads of large files by data-parallel workloads. However, since HDFS has become something of an industry standard for storing large amounts of data, it is increasingly being used for other kinds of workloads. HBase (a more database-like column store) is one example, and it does far more random I/O. Facebook also published a paper on doing real-time queries with MapReduce (and thus HDFS). The question is how well HDFS can be squeezed into these roles, and whether other storage systems are necessary. For low-latency web serving the answer is clearly yes: memcached and other key-value stores dominate there.

In short, I don't think the MapReduce paradigm is going away, and HDFS already feels like the standard answer to storing big data; I expect it to remain so for the next decade.
