Windows Azure Storage

February 4, 2016

What makes this paper special is that it is one of the few published papers about a production cloud blobstore. The 800-pound gorilla in this space is Amazon S3, but I find Windows Azure Storage (WAS) the more interesting system since it provides strong consistency and additional features like append, and serves as the backend not just for WAS Blobs, but also for WAS Tables (structured data access) and WAS Queues (message delivery). It also occupies a different design point than hash-partitioned blobstores like Swift and RADOS.

This paper, "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency" by Calder et al., was published at SOSP '11.

Background

Most people are familiar with filesystems. Filesystems have a hierarchical namespace composed of a nested directory tree, where directories contain files. Directories and files can have various bits of metadata attached to them like permissions, modtime, ACLs, xattrs, etc. Directories are used to logically group files together, and there are commands that work on entire recursive directory trees (mv, rm -r, chown -R, etc).

Blobstores are like filesystems, except simpler. A unifying characteristic of blobstores is that they do not provide a hierarchical namespace. Instead, you get multiple flat namespaces (which S3 calls buckets and WAS calls partitions), in which you can store blobs. Blobstores also provide fewer features than filesystems. It's typical not to support operations like rename or setting per-blob permissions, and to impose preconditions around IO (e.g. S3 requires a full-blob checksum at upload time, and allows no random writes or appends), all of which pushes complexity to the application level.
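
To make the contrast concrete, here's a minimal sketch of a blobstore-style interface (hypothetical names, loosely S3-flavored): flat namespaces, whole-object puts with an upfront checksum, and no rename, directories, or random writes.

    import hashlib

    class Blobstore:
        """Minimal sketch of a blobstore interface; note what's missing
        compared to a filesystem: no directories, no rename, no per-blob
        permissions, no partial overwrites."""

        def __init__(self):
            self.namespaces = {}   # flat: namespace -> {blob_name: bytes}

        def create_namespace(self, ns):
            self.namespaces.setdefault(ns, {})

        def put(self, ns, name, data, md5=None):
            # Whole-object upload; some stores require the checksum up front
            # and reject the write on a mismatch.
            if md5 is not None and md5 != hashlib.md5(data).hexdigest():
                raise ValueError("checksum mismatch")
            self.namespaces[ns][name] = bytes(data)

        def get(self, ns, name):
            return self.namespaces[ns][name]

        def list(self, ns, prefix=""):
            # "Directories" are emulated client-side with key prefixes.
            return sorted(k for k in self.namespaces[ns] if k.startswith(prefix))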

You might read this and think that blobstores sound terrible, but there's a very good reason for throwing away these features: horizontal scalability. It's difficult to shard a hierarchical namespace, and it's even harder to support operations like directory rename, so blobstores punt on these problems. As a result, you have a system that architecturally has infinite scale.

Overview

In the datacenter, WAS is composed of stamps, which are sets of 10-20 racks of servers. A stamp is what others might call a cell or pod; it's used as a unit of deployment and management. There are many stamps per datacenter, and many datacenters which are geographically distributed for fault-tolerance.

Users have accounts and all of the data in an account is stored on a single stamp. Accounts are another unit of management, and are migrated between stamps based on load.

WAS is a very layered system, so let's take it from the bottom up, starting with how it works within a single stamp, then talking about how multiple stamps are glued together into a global namespace.

Stream Layer

The bottom-most layer in the stack is the stream layer. The stream layer exposes a flat namespace of append-only logs called streams. Streams are composed of extents, which are the unit of replication, are about 1GB in size, and are stored as files on a local filesystem. Only the last extent in the stream can be appended to. Extents in turn are composed of blocks, which are variable-length up to 4MB in size, and are the unit of a client read or write. Blocks are also the unit of checksumming, so the entire block is read at read time to verify the checksum.

Architecturally, this looks a lot like HDFS. There is a Paxos-replicated stream master which maintains a mapping of streams-to-extents and extents-to-nodes. It chooses which nodes to use for incoming writes, routes reads to the correct nodes, and re-replicates extents when nodes fail. The stream master needs to keep these mappings in memory, and it's designed for approximately 100k streams and 50 million total extents.

Notably, the stream master does not track the extent-to-block mapping, which would not fit on a single machine. Instead, this is handled by the extent nodes, which maintain an index of the block offsets alongside the extent file.
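
Put together, the naming hierarchy and the split of metadata between the master and the extent nodes looks roughly like this (my own schematic of the paper's description, not actual WAS code):

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Block:
        offset: int     # byte offset within the extent file
        length: int     # variable-length, up to ~4MB
        checksum: int   # the whole block is read back to verify this

    @dataclass
    class Extent:
        extent_id: str
        sealed: bool = False
        # Kept by each extent node alongside the extent file,
        # NOT by the stream master:
        block_index: List[Block] = field(default_factory=list)

    @dataclass
    class Stream:
        name: str
        extent_ids: List[str] = field(default_factory=list)  # only the last is appendable

    class StreamMaster:
        """Paxos-replicated; holds only the stream->extent and
        extent->node mappings, sized for ~100k streams / ~50M extents."""
        def __init__(self):
            self.streams: Dict[str, Stream] = {}
            self.extent_locations: Dict[str, list] = {}  # extent_id -> extent node replicas, chain order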

The stream layer uses chain replication when writing an extent, the same method as HDFS but with some differences I'll highlight. Chain replication is nice since it's less complicated than Paxos replication and you can get better throughput by pipelining. The master is off the hot path during data writes; the writer goes directly to the three chosen extent nodes. One of these nodes is the primary: it pushes whole-block writes down the pipeline and acks the client when complete. These block writes are atomic (and there is even an atomic multi-block append). Combined, this has the nice property of allowing concurrent writers to an extent, since the primary orders the incoming block appends and serves all reads to the extent while it's being written to. HDFS does not have atomic appends, but does allow applications to block until data is durable (hflush/hsync), which provides similar properties if you use a length-prefixed record format with a checksum and roll a new file on failure.
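
As I read it, the basic append path looks something like the sketch below (simplified, sequential rather than pipelined, and with invented method names): the client learns the chain from the master once, then the primary assigns the offset and forwards the block before acking.

    def append_block(master, stream_name, data):
        """Sketch of a chain-replicated block append (no failure handling)."""
        extent_id = master.streams[stream_name].extent_ids[-1]   # only the last extent is writable
        primary, *secondaries = master.extent_locations[extent_id]

        # The primary picks the offset, which serializes concurrent appenders.
        offset = primary.reserve_offset(extent_id, len(data))
        primary.write_block(extent_id, offset, data)
        for node in secondaries:                                 # forwarded down the chain
            node.write_block(extent_id, offset, data)            # (pipelined in the real system)

        # Returning here corresponds to acking the client: it only happens
        # once every replica has the block, which is what makes "seal at the
        # smallest reported length" safe later on.
        return offset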

Talking more about failures: in WAS, failures during a write are handled by sealing the extent and starting a new one. Sealing the extent requires agreeing on the length of the extent, which is coordinated by the stream master. The SM asks the remaining nodes for the length of the extent, and uses the smallest length. This is safe since writes are only ack'd to the client after they are fully replicated. Longer lengths are also okay, since stream clients are required to handle duplicate blocks in a stream. Once sealed, this is the final length of the extent. If a version of the extent with a different length appears, it is safely discarded.
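
The seal decision itself fits in a few lines (again a sketch with assumed names): ask the reachable replicas for their lengths and commit the minimum, which by the ack rule above must contain every acknowledged write.

    def seal_extent(extent_id, reachable_nodes):
        """Sketch of the stream master's seal decision."""
        lengths = [node.current_length(extent_id) for node in reachable_nodes]
        sealed_length = min(lengths)    # every ack'd write is on all replicas,
                                        # so it lies within the smallest length
        for node in reachable_nodes:
            # Replicas holding extra, unacknowledged data past this point are
            # brought back to the sealed length; clients already tolerate
            # duplicate blocks, so nothing visible is lost.
            node.seal(extent_id, sealed_length)
        return sealed_length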

HDFS does something more complicated called pipeline recovery to try and keep writing rather than rolling to a new HDFS block. This is because we want to produce fewer, larger blocks to reduce NameNode memory usage, and for a long time HDFS did not support variable-length blocks (for no really good reason). HDFS pipeline recovery has been described as "an informally-specified implementation of two-phase commit", and you can read all about it (and other recovery processes) in an excellent series of blog posts written by my colleague Yongjun Zhang.

The stream layer also implements background erasure coding of sealed extents, as well as latency-levelling by a similar mechanism to Jeff Dean's The Tail at Scale work. They also do some ops tricks to further improve latency, like allocating a separate SSD to buffer writes and doing deadline IO scheduling.
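
The latency-levelling mechanism is essentially the hedged request from The Tail at Scale: if the first replica doesn't answer within a deadline, issue the same read to another replica and take whichever copy comes back first. A toy version of the idea (my own sketch, not WAS code):

    import concurrent.futures

    def hedged_read(replicas, read_fn, hedge_after_s=0.05):
        """Read from the first replica; if it's slow, race the others."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
        futures = [pool.submit(read_fn, replicas[0])]
        try:
            return futures[0].result(timeout=hedge_after_s)
        except concurrent.futures.TimeoutError:
            # First replica missed the deadline: issue the same read to the
            # remaining replicas and return whichever finishes first.
            futures += [pool.submit(read_fn, r) for r in replicas[1:]]
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
            return next(iter(done)).result()
        finally:
            pool.shutdown(wait=False)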

Partition Layer

Now, we move onto the co-designed user of the stream layer: the partition layer, which maintains the user-visible constructs like blobs, tables, and queues.

The partition layer is a range-partitioned distributed database. These ranges can be split and merged and moved around based on load.

There is a table for each of the three user-visible constructs, a table that describes the schema of these three tables, and finally a table of the mapping of ranges to servers (like the meta table in HBase). The primary key for the three object tables is a compound key of (account, partition, object name) (user-level identifiers), and other columns describe which stream, extent, offset, and length hold the corresponding data for that object. Since it's a range-partitioned distributed database that uses LSM-trees under the hood, I'm going to point to Kudu and HBase as similar systems if you want to learn more. The split and merge process looks the same as HBase, and implementation details like the memstore, bloom filters, and row caching are present in all of these systems.

Each range uses a couple of streams to maintain its state. The two important ones are a commit stream (a WAL) and a row data stream (checkpoints of WAL mutations, HFiles in HBase parlance), which are used to maintain the LSM tree. They also implement a BLOB type which writes blob data into a side stream and stores only pointers in the LSM tree, using efficient stream concat operations to avoid the write amplification of rewriting blob data.
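
Concretely, a row in the blob object table ends up looking something like this schematic (column and method names are mine): the LSM tree stores a small pointer row, while the bytes themselves go to the blob side stream.

    from dataclasses import dataclass

    @dataclass(frozen=True, order=True)
    class ObjectKey:
        # Compound primary key; contiguous ranges of this key are what get
        # split, merged, and moved between partition servers.
        account: str
        partition: str
        object_name: str

    @dataclass
    class BlobRow:
        key: ObjectKey
        # Pointer into the stream layer instead of inline data, so LSM
        # compactions rewrite only small rows, never the blob bytes.
        stream: str
        extent_id: str
        offset: int
        length: int

    def put_blob(blob_stream, object_table, key, data):
        """Sketch: blob bytes go to the side stream, a pointer goes to the LSM."""
        extent_id, offset = blob_stream.append(data)       # stream-layer append
        object_table.upsert(
            BlobRow(key, blob_stream.name, extent_id, offset, len(data)))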

One interesting point is that on a per-stamp basis, they see 75 splits and merges and 200 partition moves every day. That's a lot more than I would have guessed for HBase, but since the partition layer doesn't worry about storage locality, moving a partition is cheap. An efficient stream concat operation means you can avoid rewriting data when doing merges.

Front-end service

The front-end service is a proxy which interacts with WAS on behalf of users. Front-ends are stateless, perform authentication/authorization, and cache the partition map for faster lookups. This is pretty standard in web services.

Location service

Gluing together stamps is the location service, which sits above the partition layer and maintains a global namespace of accounts-to-stamps. We now have the full picture of how to find an object given its (account, partition, name) tuple. The location service tells us which stamp has the account, the partition layer at that stamp translates partition and name to (stream, extent, offset, length), and then the stream layer translates the (extent, offset, length) into a block on disk (the read unit).
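
Written out as code, the full lookup path is short (a schematic with invented method names):

    def read_object(location_service, account, partition, name):
        """Schematic of the end-to-end lookup for one object."""
        stamp = location_service.stamp_for(account)            # account -> stamp
        row = stamp.partition_layer.lookup(account, partition, name)
        # row carries (stream, extent, offset, length); the stream layer
        # resolves that to block reads on some extent node's disk.
        return stamp.stream_layer.read(row.extent_id, row.offset, row.length)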

The location service is responsible for moving accounts between stamps for load balancing, and also for inter-stamp replication for disaster fault-tolerance. An account has a primary stamp and some number of secondaries. The location service links them together, and the partition layer in the primary stamp asynchronously replicates to the secondaries. This async process is pretty fast; they say it takes 30 seconds on average to replicate changes.

In the case of a disaster, the location service flips DNS and the VIP over to point to a secondary stamp. This does mean that something like 30 seconds of data will be lost, since replication is asynchronous. These events do happen, too. It's not just meteor strikes; I've heard some funny stories about hunters shooting down power lines or utility crews cutting network cables.

Questions and comments

How many hops / IOs?

It's interesting to try and count how many network hops and IOs are required to do operations in this system. Let's look at a typical read request.

  1. Client has the DNS cached so goes to a front-end at the primary stamp. 1 hop, 0 IOs.
  2. Front-end has the partition map cached, so goes to the partition server for that range. 1 hop, 0 IOs.
  3. Partition server looks in the row data stream for the extent/offset/length of the blob. This is an LSM tree, so it requires 0-N lookups to the stream layer (let's assume 1), and reading the blob data itself is 1 more stream lookup. For each stream lookup:
    1. Go to the stream master to figure out which extent nodes have the extent. 1 hop, 0 IOs.
    2. Go to the extent node to read the data. This requires looking in the index and then reading the actual data. 1 hop, ~3 IOs. This depends on caching of file handles and the filesystem, so it's an estimate.

My math says this is a total of 6 hops and 6 IOs to do a read. How about writing a small blob?

  1. Client has the DNS cached so goes to a front-end at the primary stamp. 1 hop, 0 IOs.
  2. Front-end has the partition map cached, so goes to the partition server for that range. 1 hop, 0 IOs.
  3. Partition server writes to the WAL stream and the side blob stream. For each write:
    1. Go to the stream master to find the primary extent node for the extent. 1 hop, 0 IOs.
    2. Go to the primary extent node, write the block. This gets pushed to the two other nodes in the pipeline. Appending to an open file is ~2 IOs (one for the data, one to update the file length in the inode). I'll assume the index is in memory while the extent is unsealed, and checksums are stored inline, else this blows up. 3 hops, 6 IOs.
  4. At this point the write can be ACK'd.

Adding it up, this is 10 hops and 12 IOs. This is multiplied by the number of stamp replicas, and again by some small factor for LSM rewrites, and by 2 since all writes are journaled.
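
For the skeptical, here's the same arithmetic spelled out, under the assumptions above (one LSM lookup per read, the block index in memory, checksums inline):

    # Read: front-end hop + partition-server hop, then two stream-layer
    # lookups (one for the LSM row, one for the blob data), each costing a
    # master hop plus an extent-node hop with ~3 IOs.
    read_hops = 1 + 1 + 2 * (1 + 1)        # = 6
    read_ios  = 0 + 0 + 2 * (0 + 3)        # = 6

    # Write: front-end hop + partition-server hop, then two stream appends
    # (the WAL stream and the blob side stream), each costing a master hop
    # plus a 3-node pipeline at ~2 IOs per node.
    write_hops = 1 + 1 + 2 * (1 + 3)       # = 10
    write_ios  = 0 + 0 + 2 * (0 + 3 * 2)   # = 12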

Account limits

S3 lets you store an unlimited number of items in a bucket with no performance implications. This is not the case in WAS, which limits accounts to 500TB of usage, 20K IOPS, and 10-30Gbps of network throughput. Remember also that an account can have multiple partitions! This is because accounts are assigned to a single stamp, so an account is limited by the capacity of that one stamp.

Forcing users to partition manually is distinctly worse than the unrestricted buckets provided by S3. Their reasoning was that pinning to a single stamp gives better performance isolation and lets users choose the geographic region in which their data is stored. I don't understand why you couldn't do the same thing with range-partitioned buckets. With system-managed partitioning, you could get even better stamp utilization by splitting/merging an account across stamps.

I'm told that Azure Data Lake is the next thing coming down the pipe; it's built on top of WAS and handles sharding across multiple accounts to get around these limits. At this point I begin to wonder: WAS is already a very layered system, and indirection does come at a cost.

Centralized control

They chose not to use hash-partitioning since they wanted ranged listing and more control over placement for isolation. This is a criticism of hash-partitioning I share. It's really convenient to split/move partitions around based on load. Range partitioning has the issue of hotspotting writes to sequential keys, but there are tricks like reversing or hashing the key to work around this.
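
For example, a hot sequential key (say, a timestamp) can be prefixed with a small deterministic salt so writes spread across N ranges, at the cost of turning one range scan into N smaller ones; a quick sketch:

    import hashlib

    NUM_SALT_BUCKETS = 16   # spreads a hot sequential keyspace over 16 ranges

    def salted_key(user_key: str) -> str:
        # Deterministic salt, so the same key always lands in the same bucket
        # and point lookups still work without a scatter-gather.
        salt = int(hashlib.md5(user_key.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
        return f"{salt:02d}|{user_key}"

    # A range scan over the original keys now becomes NUM_SALT_BUCKETS
    # smaller scans, one per salt prefix, merged client-side.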

Conclusion

There are lots of things to like about WAS. They built a number of useful user abstractions on top of a single storage system which provides strong consistency, a fuller feature set than S3, and geo-replication. WAS also has nice provisions to improve latency, increase utilization, and make operations easier.

I view this as a practical way of stitching together multiple clusters. This is something we support in HDFS via federation, and MapR does something similar with its concept of volumes. It's also probably true that most users can live just fine with 500TB of storage. I just find it somewhat dissatisfying that WAS rids itself of a hierarchical namespace, but doesn't exploit that fact to the fullest extent to expose a truly scale-out system.
