Static website hosting on Amazon S3 : umbrant

Static website hosting on Amazon S3

April 1, 2011

Werner Vogels, Amazon CTO, posted on his blog about a month ago on "New AWS feature: Run your website from Amazon S3". S3 now offers the ability to host static HTML pages directly from an S3 bucket, which is a great alternative for small blogs and sites (provided, of course, that you don't actually need any dynamic content). This has the potential to greatly reduce your hosting costs. A small Dreamhost/Slicehost/Linode costs around $20 a month, and I used to run this site out of an extreme budget VPS (Virpus) which was only $5 a month, but I expect to be paying only a few cents per month for S3 (current pricing is just 15¢ per GB-month). Of course, you also gain best-of-class durability, fault-tolerance, and scalability from hosting out of S3, meaning that your little site should easily survive a slashdotting.
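To put a rough number on that, here is a back-of-the-envelope sketch using only the storage price quoted above. The 50 MB site size is an assumption for illustration, not a measurement of this blog, and the calculation ignores request and data transfer charges, which I haven't quoted here.

```python
# Storage price quoted in the post; transfer and request fees are not modeled.
PRICE_PER_GB_MONTH = 0.15  # USD


def monthly_storage_cost(site_size_mb, price_per_gb=PRICE_PER_GB_MONTH):
    """Return the monthly S3 storage cost in USD for a site of the given size."""
    return (site_size_mb / 1024.0) * price_per_gb


# Hypothetical 50 MB site (an assumed size, purely for illustration).
cost = monthly_storage_cost(50)
print("Storage: $%.4f per month" % cost)
```

Even with generous assumptions about size, storage alone comes in well under a cent, so whatever you end up paying is mostly requests and bandwidth.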

The difficulty here is that most of the popular blogging engines require a backing database and generate their content dynamically on the server. That doesn't fly with S3: it is, after all, just a Simple Storage Service, so content has to be static and pregenerated. I chose to use Hyde, a Python content generator that turns templates (based on the Django templating engine) into HTML. Hyde page templates are dynamic, written in Django's templating language, which supports variables, control flow, and hierarchical inheritance. Hyde will parse these templates, fill in the dynamic content, and finally generate static HTML pages suitable for uploading to S3. Ruby folks can check out Jekyll as an alternative.

Caveats

To be clear, purely static content won't suffice for many sites out there (like anything with user-generated content). Even a simple blog like this one is only feasible because there are web services that fill in the gaps in functionality. Disqus seems to have cornered the market for comments as a service; you just include a little bit of Javascript and it's good to go. It's similarly easy to include a Twitter widget showing your recent tweets with another little blob of Javascript, and Feedburner and Google Analytics are the de facto analytics tools. There's barely a need these days to scrape, store, and serve content yourself, further obviating the need for a real server.

This is also clearly a more coding-heavy approach to blogging and site generation than most people need. With free blog services like Wordpress.com, Blogger, Tumblr, and Posterous, blogging has never been easier or more available. Google Sites is also a great way of throwing up a quick website. I went with S3 and Hyde because I wanted more customization in the look and feel of the site, I like the Django templating system, and I wanted to play with S3 (especially since Amazon offers 1 year of free AWS credit). I also feel a bit safer about my data, since it's backed by Amazon's eleven 9's of durability on S3, it's on my local machine, and it's under version control at github.

Hyde

Hyde is pretty straightforward for anyone with experience writing Django templates, since it's basically the Django template engine plus some extra magic content and context tags. The Hyde README and github wiki are somewhat helpful in laying this out. Essentially, Hyde lets you assign per-page metadata that can be accessed by other pages as they walk the directory structure of your content; your URL structure mirrors your folder structure. By default, this metadata includes a created field that fuels the magic recents template tag which gets the most recent content from a directory (like your blog). There are a few more Hyde specific features which you can read about on the wiki page on templating, and the Django templating reference is also useful.

I still found myself a little stuck, and what was most useful was reading the source for the skeleton site that Hyde generates for you initially, and the code that Steve Losh uses to generate his own blog. To help you out, I've published the code for this site on github too. It might be useful to read Steve's write up on moving from Django to Hyde as well.

A few nice features of Hyde I like are the ability to automatically compress Javascript and CSS with jsmin and cssmin, and support for writing posts in Markdown, which is a lot easier and more portable than HTML. There's also support for writing "higher level CSS" (CleverCSS, HSS, LessCSS), but I never understood the point of these and didn't use them.

The features I had to add to the skeleton code are a draft status for posts, and the "Recent Posts" and "Archive" sections on the sidebar. Drafts were done by adding a metadata draft: True tag to draft posts, and modifying all my "listing" pages to exclude these posts (like the home page, archive, recent posts, atom feed). The "Recent Posts" and "Archive" sidebar sections use page.walk to traverse the blog directory and the recents tag to get the most recent posts. These posts are then filtered with if statements to exclude drafts. This is all slightly hacky, since if you want to show the 5 most recent blog posts (as returned by recents 5), you might have fewer than 5 posts after filtering out drafts. I work around this by not dating drafts until publication, which gives them a default date in 1900, so they sort behind every published post and don't take up a slot in the recents list.
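The ordering problem is easier to see outside of templates. What follows is a minimal Python sketch, not Hyde's code: posts are plain dicts with hypothetical created and draft keys, and DEFAULT_DATE stands in for the 1900 fallback (the exact day is my assumption).

```python
from datetime import datetime

# The post says undated pages get "a default date in 1900"; the exact day is assumed.
DEFAULT_DATE = datetime(1900, 1, 1)


def created(post):
    """Creation date of a post, falling back to the 1900 default when undated."""
    return post.get("created", DEFAULT_DATE)


def is_draft(post):
    return post.get("draft", False)


def recents_then_filter(posts, count):
    """Template-style approach: take the newest `count` posts, then drop drafts."""
    newest = sorted(posts, key=created, reverse=True)[:count]
    return [p for p in newest if not is_draft(p)]


def filter_then_recents(posts, count):
    """Alternative: drop drafts first, then take the newest `count` posts."""
    published = [p for p in posts if not is_draft(p)]
    return sorted(published, key=created, reverse=True)[:count]
```

With a dated draft among the newest posts, recents_then_filter comes up short while filter_then_recents does not. Leaving the draft undated pushes it to the end of the sort, which is why the 1900 trick makes the first approach behave.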

I also had to modify Hyde's page.walk and page.walk_reverse to walk directories in lexicographically sorted order, but I'm hoping that's been fixed in git (I was using version 0.4).
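For reference, the general technique is short in plain Python. This is a sketch built on os.walk, not the patch I made to Hyde, and walk_sorted is a name made up for the example.

```python
import os


def walk_sorted(top):
    """Yield (dirpath, dirnames, filenames) like os.walk, in lexicographic order."""
    for dirpath, dirnames, filenames in os.walk(top):
        # Sorting dirnames in place controls the order in which os.walk
        # descends into subdirectories (this relies on the default topdown=True).
        dirnames.sort()
        filenames.sort()
        yield dirpath, dirnames, filenames
```

The in-place sort matters: rebinding dirnames to a new sorted list would leave the traversal order unchanged.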

S3

There is plenty of documentation on how to set up an S3 bucket as a website. It's pretty easy, and I didn't have any trouble with this.

Making your existing domain name point to your S3 bucket is a little more tricky. S3 provides a URL for your bucket (in my case, http://www.umbrant.com.s3-website-us-west-1.amazonaws.com/). The first problem is a limitation of DNS: you can't make your zone apex a CNAME. If that was gibberish, it means that you can't make your plain domain name (http://umbrant.com) an alias for another domain name, like your S3 bucket's. Subdomains don't have this limitation, which is why you're viewing this blog at http://www.umbrant.com, happily CNAME'd to my S3 bucket. My zone apex then does a redirect to the www subdomain; this redirect is a service provided by some registrars, or you can beg a friend with a server.

I just lied to you a little about how this works. Notice that if you dig www.umbrant.com, you get the following:

$ dig www.umbrant.com
 
<snip>
 
;; QUESTION SECTION:
;www.umbrant.com.      IN   A
 
;; ANSWER SECTION:
www.umbrant.com.    831 IN   CNAME  s3-website-us-west-1.amazonaws.com.
s3-website-us-west-1.amazonaws.com. 60 IN A  204.246.162.151
 
<snip>

My subdomain isn't actually CNAME'd to my S3 bucket domain name; I've set it to alias directly to s3-website-us-west-1.amazonaws.com. This is a mild optimization that saves a DNS lookup; if you dig my bucket domain name, you see that it's CNAME'd to s3-website-us-west-1.amazonaws.com anyway, which finally gets turned into the IP address for an S3 server (the A record). This server uses the hostname the browser asked for (www.umbrant.com, sent in the HTTP Host header) to look up the S3 bucket with the same name. This system also means that if someone's already made a bucket with the same name as your subdomain, you've got to choose a different subdomain (thanks to S3's flat, globally shared bucket namespace). In other words, when using S3, your bucket name and subdomain must be the same.
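The naming rules can be summed up in a few lines of Python. This is only a toy model of what I've described, not anything S3 actually runs: both function names are made up, and the endpoint format simply follows the hostname pattern shown in the dig output.

```python
def website_endpoint(bucket, region):
    """Build the S3 website hostname for a bucket, following the pattern above."""
    return "%s.s3-website-%s.amazonaws.com" % (bucket, region)


def bucket_for_host(host_header):
    """Model the lookup: the requested hostname is used as the bucket name."""
    # Drop an optional port and trailing dot, and normalise case.
    return host_header.split(":")[0].rstrip(".").lower()


print(website_endpoint("www.umbrant.com", "us-west-1"))
print(bucket_for_host("www.umbrant.com"))
```

Since the hostname is the only thing the server has to go on, a CNAME from www.umbrant.com can only ever reach a bucket named www.umbrant.com.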

Uploading files to S3 isn't too bad. I'm sure there are existing tools out there for interfacing with S3 on the command line, but I rolled my own in Python with the simples3 library available on PyPI. It's basically rsync-for-S3 with some issues: it doesn't delete old files from S3, the parsing isn't bulletproof, and it uses modtimes to check for updates instead of checksums (which I plan on implementing soon; right now it re-uploads almost my entire blog each time I re-run Hyde). However, it does work, and it is really simple to use.

from simples3 import S3Bucket
import os
import re
 
# Config options
ACCESS_KEY = 'YOUR_ACCESS_KEY'
SECRET_KEY = 'YOUR_SECRET_KEY'
# Change this
BUCKET_NAME  =  "www.umbrant.com"
# Change this too, make sure to edit your region and bucket name
BASE_URL = 'https://s3-us-west-1.amazonaws.com/www.umbrant.com'
 
# NO TRAILING SLASH
SOURCE_DIR = "/home/andrew/dev/umbrant_static/deploy"
 
IGNORE = (
          r"\..*\.swp$", r"~$", # ignore vim swap files and editor backups
         )
 
# code
 
ignore_re = []
for i in IGNORE:
    ignore_re.append(re.compile(i))
 
# open bucket
bucket = S3Bucket(BUCKET_NAME, access_key=ACCESS_KEY, 
                  secret_key=SECRET_KEY, base_url=BASE_URL)
 
# recursively put in all files in SOURCE_DIR
 
for root, dirs, files in os.walk(SOURCE_DIR):
    relroot = root[len(SOURCE_DIR)+1:]
    for f in files:
        # root directory files should not have a preceding "/"
        # puts the files in a blank named directory, not what we want
        key = ""
        if relroot:
            key = relroot + "/" + f
        else:
            key = f
        filename = root + "/" + f
 
        # check in the ignore list; search (not match) lets "~$" hit "foo~"
        ignore = False
        for i in ignore_re:
            if i.search(f):
                print "Ignoring", key
                ignore = True
                break
        if ignore:
            continue
 
        stat = os.stat(filename)
        metadata = {"modtime":str(stat.st_mtime)}
 
        # check if it's changed with modtimes
        sf = False
        try:
            sf = bucket.info(key)
        except Exception:
            # treat any failure as "key not in the bucket yet" and upload it
            contents = open(filename, "rb").read()
            bucket.put(key, contents, acl="public-read", metadata=metadata)
            print "Uploading", key
            continue
 
        # re-upload when the stored modtime is missing or differs from the file's
        if "modtime" not in sf["metadata"] or \
        sf["metadata"]["modtime"] != str(stat.st_mtime):
            bucket.put(key, open(filename, "rb").read(), acl="public-read",
                       metadata=metadata)
            print "Uploading", key
            continue
 
        print "Skipping", key
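The two missing features (checksums and deleting old files) could be handled by planning the sync before touching the bucket. Here's a rough sketch of that idea in standalone Python, separate from the script above. plan_sync and remote_md5s are hypothetical names: remote_md5s is a dict of key to MD5 hex digest that you would have to build from a bucket listing yourself (for plain single-part uploads, S3's ETag should be the MD5 of the object, but treat that as an assumption to verify).

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_sync(source_dir, remote_md5s):
    """Return (to_upload, to_delete) key lists given a {key: md5} map of the bucket."""
    local = {}
    for root, dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            # Keys are paths relative to source_dir, always with forward slashes.
            key = os.path.relpath(path, source_dir).replace(os.sep, "/")
            local[key] = file_md5(path)
    # Upload anything new or whose contents differ from what the bucket holds.
    to_upload = sorted(k for k, md5 in local.items() if remote_md5s.get(k) != md5)
    # Delete anything in the bucket that no longer exists locally.
    to_delete = sorted(k for k in remote_md5s if k not in local)
    return to_upload, to_delete
```

Because the comparison is on content rather than modtimes, regenerating the site with Hyde would only trigger uploads for pages whose output actually changed.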

Final remarks

This was a pretty reasonable and fun 2 days of effort, most of which was spent on tuning the CSS template and writing content, not wrangling code. Hyde doesn't feel very mature (documentation is lacking, example skeleton site is slightly broken, the sorting bug), but it works well enough and is good for people transitioning from Django. I'm very positive about S3 and Amazon Web Services in general (modulo Elastic Block Store being terrible, but that's a rant for another day), since my site is now essentially impervious to failure. It's also pleasing to see top management like Werner Vogels dogfooding Amazon's features.