Paper review: Facebook Haystack

January 3, 2012

This is a review of Facebook's Haystack storage system, used to store the staggering number of photos uploaded to Facebook every day. Facebook Photos started out with an NFS appliance, but was forced to move to a custom solution for reasons of cost, scale, and performance. Haystack is an engineering solution that applies well-known techniques from GFS and log-structured filesystems to their distributed, append-only, key-value blob workload. The metadata management is somewhat novel, as is the CDN integration.

The paper, "Finding a needle in Haystack: Facebook's photo storage" by Beaver et al., was published at OSDI '10.

Main ideas

Facebook's design requirements break down as follows:

  • Efficient random disk access. Anything that hits Haystack has already missed in the CDN cache, and there are too many photos to fit them all in memory. Thus, there has to be at least one random disk seek; Haystack makes it just one (most of the time).
  • Efficient use of storage space. The normal "one file per image" approach blew up the number of disk accesses required to serve an image, and it also burned huge amounts of space on extraneous metadata (permissions, filenames, and the like) that doesn't matter for this workload.
  • Append-only write semantics. Once a photo is written, it cannot be modified. Application-level "overwrites" are handled by deleting the photo and adding a new one with the same key. I assume this is pretty rare.
  • Fault-tolerance. True of any distributed system.
  • Scalability and elasticity. Ditto.
  • Simplicity. Always good if you can get it.

Efficient disk access and use of storage space are handled by essentially keeping metadata in memory and collapsing many images together into a single file that is preallocated on disk, called a "physical volume". These physical volumes are on the order of a hundred gigabytes, and function basically like segments (if you think back to the memory management part of your OS class); each photo is referenced as an offset and length within a file. As long as a server can keep this offset+length metadata and the inodes of the huge files in memory, it can do almost every photo read in a single random seek (with the corner case of a photo falling on an extent boundary requiring two). Checksums, flags, and other per-photo metadata are also stored alongside each photo in the file on disk.
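
To make the single-seek read concrete, here is a minimal sketch of the idea in Python. The names and layout are mine, not Haystack's actual needle format; the point is just that the metadata lookup happens entirely in memory and the photo itself comes back in one pread.

    import os

    class PhysicalVolume:
        def __init__(self, path):
            self.fd = os.open(path, os.O_RDONLY)
            self.index = {}                          # photo_key -> (offset, size), held in RAM

        def read_photo(self, photo_key):
            offset, size = self.index[photo_key]     # metadata lookup: memory only
            return os.pread(self.fd, size, offset)   # one random disk read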

Recovery of the in-memory metadata is done through a checkpoint file, which is then brought up to date by examining the end of each volume file. This works because new photos are added sequentially, like in a log-structured filesystem. Recovering the "is deleted" flag is done lazily when a read request comes in, by checking the flag's state on disk (which is guaranteed correct, since all writes are synchronous). Synchronous writes and append-only semantics make consistency a non-issue.
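
Roughly how I picture the recovery path, under an assumed record layout that is much simpler than the real needle format (the real one carries a cookie, checksum, padding, and more): start from the checkpointed index, then scan only the tail of the volume file for records appended since the checkpoint.

    import os
    import struct

    # Hypothetical record layout for this sketch: a fixed header of
    # (key: u64, size: u32, deleted: u8) followed by the photo bytes.
    HEADER = struct.Struct("<QIB")

    def rebuild_index(volume_path, checkpoint_index, checkpoint_offset):
        index = dict(checkpoint_index)
        with open(volume_path, "rb") as f:
            f.seek(checkpoint_offset)                # only scan records appended after the checkpoint
            while True:
                header = f.read(HEADER.size)
                if len(header) < HEADER.size:
                    break                            # reached the end of the log
                key, size, deleted = HEADER.unpack(header)
                if not deleted:
                    index[key] = (f.tell(), size)    # offset of the photo bytes
                f.seek(size, os.SEEK_CUR)            # skip over the photo data itself
        return index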

Haystack combines multiple physical volumes into a single "logical volume", which is how photos are actually addressed. The mapping from logical to physical is again done via an in-memory structure. I think this is how they do geographic replication: writes are mirrored across all the physical volumes in a logical volume, which would have to contain volumes in multiple locations.
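
A tiny sketch of that fan-out, with made-up names (logical_to_physical, write_photo) standing in for whatever the directory and store actually call these things: a write to a logical volume is simply appended to every physical replica behind it.

    # Made-up mapping: one logical volume backed by two physical volume replicas.
    logical_to_physical = {
        "lv-42": ["store-east-1:/vol/42", "store-west-3:/vol/42"],
    }

    def write_photo(store, logical_id, photo_key, data):
        # 'store' stands in for the actual machines holding the volume files;
        # each replica receives the same append.
        for replica in logical_to_physical[logical_id]:
            store.setdefault(replica, []).append((photo_key, data))

    store = {}
    write_photo(store, "lv-42", 12345, b"...jpeg bytes...")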

Photo writes are optimized pretty heavily. Naturally, they are batched, but there is also some interplay with the Haystack cache (which acts like an internal CDN). Machines are marked as either write-enabled or read-only. Photos served from write-enabled machines are kept in the Haystack cache, so that read traffic doesn't drag the disk head around while those machines are busy appending. The cache also prefetches new photos, since they're very likely to be accessed soon. Read-only machines with a lot of deleted photos can be compacted and (I assume) toggled back to write-enabled to receive more photos.
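
A rough sketch of the caching rule as I read it, with invented helper names; the point is just that the cache shields write-enabled machines from reads and treats fresh uploads as worth prefetching.

    from collections import namedtuple

    StoreMachine = namedtuple("StoreMachine", ["host", "write_enabled"])

    def on_read(cache, key, data, machine):
        # Only cache photos served by write-enabled machines, sheltering
        # them from read traffic while they are still taking appends.
        if machine.write_enabled:
            cache[key] = data

    def on_write(cache, key, data):
        # "Prefetch": a newly uploaded photo goes straight into the cache,
        # since it is very likely to be read soon.
        cache[key] = data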

A few more misc takeaways:

  • XFS worked the best for making the large 100GB files used for physical volumes
  • RAID is also used underneath, for added availability
  • Many layers of cache yield diminishing returns
  • The photo URL on FB starts off pointing at a CDN, but gets stripped down successively as it goes further into Haystack. Ex: http://CDN/HayCache/HayMachine/VolumeAndPhoto.
  • Centralized master maintains the logical-to-physical mapping and load balances writes across logical volumes. Seems easy to keep in memory and replicate if necessary.

Analysis

This really was a great application for building a new system: existing solutions wouldn't do the job as well, and the requirements made it relatively easy as distributed systems go. It pulls heavily from GFS with its centralized master plus many data nodes, and also uses the LFS ideas of a sequentially-written log and keeping filesystem metadata in memory. The "write-enabled" vs. "read-only" business is essentially adding journalling to a distributed filesystem, which of course is just a mini-version of the ideas in LFS. Using giant 100GB files means they were able to build their own super-simple user-level filesystem without writing an actual filesystem, a move I applaud for practical reasons.

I can't say there was much "aha" content in the paper, though. I normally like industrial papers because of the real-world experiences, but this was a straightforward implementation paper. The system bears a strong resemblance to GFS, and being tailored to a single use case allowed them to greatly simplify the design. I'm a bit disappointed that a clearly impressive system didn't relay more interesting tidbits.
