How Acquia uses Amazon Web Services at DrupalCon SF 2010


Amazon Web Services

This talk was given by Barry Jaspan, Senior Architect at Acquia, Inc. It covers how to get a non-trivial Drupal site set up on Amazon Web Services, and how Acquia manages its AWS environment.

Benefits of AWS

  • Outsourced
  • High-end infrastructure
  • Geographic distribution
  • Lower up-front costs
  • Fast provisioning and elasticity

This Talk is about the Challenges

I'm not going to talk about the usual hosting stuff, just the things that are unique to hosting in the cloud, and on AWS in particular. Specifically, I'll be talking about how Acquia Hosting addresses these issues.

What Could Possibly Go Wrong?

You might succeed. (God forbid.)

In that case, one box is never going to be enough; you need multiple boxes. Multiple web nodes can talk to a DB master and a DB slave, with a load balancer splitting traffic between the nodes. However, in AWS you can't use a hardware load balancer, and round-robin DNS is bad for availability anyway.

ELB

AWS offers Elastic Load Balancer (ELB), which you point at with a CNAME record, so your site has to live at a name like www.example.com rather than a bare domain. There's a limited amount of flexibility in what ELB will do: it will balance traffic, but if you want something like session stickiness or SSL termination, ELB won't help (although in the last couple of weeks they did announce session stickiness).
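To make that concrete, here's a rough sketch using boto3 (a much newer library than what existed at the time of this talk; the balancer name, zones, and instance IDs are placeholders). The DNS name it returns is what your www record has to CNAME to:

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    # Create a classic load balancer listening on port 80.
    resp = elb.create_load_balancer(
        LoadBalancerName="www-balancer",                 # placeholder name
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        Listeners=[{
            "Protocol": "HTTP",
            "LoadBalancerPort": 80,
            "InstanceProtocol": "HTTP",
            "InstancePort": 80,
        }],
    )

    # Put the web nodes behind it (instance IDs are placeholders).
    elb.register_instances_with_load_balancer(
        LoadBalancerName="www-balancer",
        Instances=[{"InstanceId": "i-0aaa11112222bbbb3"},
                   {"InstanceId": "i-0ccc44445555dddd6"}],
    )

    # ELB gives you a hostname, not an IP, so www.example.com must be a
    # CNAME pointing at this value; a bare domain can't do that.
    print(resp["DNSName"])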

Elastic IP Address + Software Load Balancing

An Elastic IP lets you point a bare domain (a plain A record) at your site, and you can run a software load balancer such as nginx, but each server can have only one Elastic IP address, which becomes an issue as we will see.

How do we share files?

Drupal needs a POSIX-compatible file system for user-uploaded files. An Elastic Block Store volume can only be attached to a single instance, and rsync doesn't scale past a couple of web nodes. s3fs isn't really POSIX compatible (S3 is an object store), and we ran into lots of problems with it. Really, a network file system is required. Let's say you use NFS.

What happens if the load balancer fails?

You need a highly available load balancer. Elastic Load Balancer is good for this, since you can add a second active instance relatively easily. If you're using an Elastic IP, you can keep a hot-spare server and reassign the Elastic IP from the failed balancer to the spare, either manually or automatically. EIP reassignment is what Acquia uses.
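A minimal sketch of that reassignment, again with boto3 (the Elastic IP and instance ID are placeholders, and this is not Acquia's actual tooling):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ELASTIC_IP = "203.0.113.10"              # placeholder Elastic IP
    HOT_SPARE = "i-0spare1234567890ab"       # placeholder instance ID of the spare

    def fail_over_balancer():
        """Repoint the Elastic IP at the hot-spare load balancer."""
        # AllowReassociation lets the address move even though it is still
        # associated with the failed primary.
        ec2.associate_address(
            InstanceId=HOT_SPARE,
            PublicIp=ELASTIC_IP,
            AllowReassociation=True,
        )

    # A monitoring check (Nagios, a cron job, etc.) would call this when the
    # primary balancer stops responding to health checks.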

What happens if the filesystem fails?

NFS cannot tolerate server failure because it is not replicated. DRBD with a virtual IP could work, but it takes a fair amount of effort even in a colo, and you can't reassign virtual IPs like that in Amazon.

Amazon recommends you look at AFS, which tolerates server failure well, but it doesn't support replication.

We're using GlusterFS. It's replicated and distributed, GPL-licensed, and easy to use, but tricky to tune for high performance.

What happens if the database fails?

Active/passive MySQL replication works fine with the standard techniques you'll find in blog posts. However, all web nodes must fail over simultaneously, and Heartbeat with a virtual IP won't work because Amazon doesn't support that. We need synchronous, application-level failover. We store the DB "election" (which database is currently active) in GlusterFS. When an app tries to connect to the active DB and fails, it writes the new choice to a file on the GlusterFS volume, and subsequent requests go to the newly elected DB. You don't have to use Gluster; you can store the election anywhere that's synchronous and highly available.
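Here's a simplified Python sketch of the idea (the election file path, hostnames, and port are invented for illustration):

    import socket

    ELECTION_FILE = "/mnt/gfs/db-election"           # file on the shared GlusterFS volume
    DB_HOSTS = ["db1.internal", "db2.internal"]      # placeholder hostnames

    def active_db():
        """Return the currently elected database host."""
        try:
            with open(ELECTION_FILE) as f:
                return f.read().strip() or DB_HOSTS[0]
        except OSError:
            return DB_HOSTS[0]

    def connect_with_failover(port=3306, timeout=2):
        """Try the elected DB; if it's down, elect the standby for everyone."""
        host = active_db()
        try:
            return socket.create_connection((host, port), timeout)
        except OSError:
            # Write the standby's name to the shared file. Every web node
            # reads the same file, so they all fail over together.
            standby = next(h for h in DB_HOSTS if h != host)
            with open(ELECTION_FILE, "w") as f:
                f.write(standby)
            return socket.create_connection((standby, port), timeout)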

What happens if something changes?

If you have machines that you're spinning up on a regular basis, you need to keep them configured. Pantheon and Mercury work well for this. You can also tweak your AMI and re-bundle it, but we were worried about being able to completely reproduce the changes made to the AMI.

At Acquia, we boot a plain vanilla Ubuntu AMI from scratch, and then update the config files with an automated configuration script. That allows incremental updates, it's a good fit for source-code control, and it's a known process. It works really well for us.
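In boto3 terms, that boot-then-configure flow looks roughly like this (the AMI ID, key name, and configuration repository URL are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Cloud-init user data: a script that fetches and runs our configuration
    # scripts on first boot (the repository URL is a placeholder).
    user_data = """#!/bin/bash
    apt-get update -y
    git clone https://example.com/hosting-config.git /opt/config
    /opt/config/configure-node.sh web
    """

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",     # stock Ubuntu AMI (placeholder ID)
        MinCount=1,
        MaxCount=1,
        InstanceType="m1.large",
        KeyName="hosting-admin",             # placeholder key pair
        UserData=user_data,
    )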

Probably the best approach would be a hybrid of the two, using Puppet to manage everything in each AMI.

Dynamic Configuration

  • Load balancers need info on each web node
  • Web nodes need info on all DB servers
  • FS servers need info on which EBS volumes to use
  • FS clients need info on which FS servers to use

Every restarted instance gets a new IP; Elastic IPs are limited in number and cost more to use. All servers need the current internal IPs, and admins need the current external IPs.

We also have a controller that keeps track of information for each server in our cluster. All of our servers talk to the controller and update their configuration files and records.

You don't need this level of control just to get one site up, but when you add a new DB server, for example, you end up modifying a lot of configuration files.
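A toy version of what a balancer might do when it polls the controller (the controller URL, JSON format, and config path are all hypothetical; Acquia's controller protocol isn't public):

    import json
    import urllib.request

    CONTROLLER_URL = "https://controller.internal/cluster.json"   # hypothetical endpoint
    UPSTREAM_CONF = "/etc/nginx/conf.d/drupal-upstream.conf"       # hypothetical config file

    def refresh_balancer_config():
        """Ask the controller which web nodes exist and rewrite the upstream block."""
        with urllib.request.urlopen(CONTROLLER_URL) as resp:
            cluster = json.load(resp)

        lines = ["upstream drupal {"]
        for node in cluster["web_nodes"]:
            lines.append("    server %s:80;" % node["internal_ip"])
        lines.append("}")

        with open(UPSTREAM_CONF, "w") as f:
            f.write("\n".join(lines) + "\n")
        # A real implementation would reload nginx here; web nodes would do the
        # same kind of thing for database and GlusterFS addresses.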

How about spam?

You can't reliably send mail directly from Amazon servers, because many of their shared IPs are already on spam blacklists. So you can't send email from the cloud, but all of your servers are in the cloud.

You need a server that isn't in the cloud, say in a colo somewhere, or a third-party provider like Constant Contact. We have a server hosted in an external colo, and all our web nodes relay mail through it.
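The relaying itself is plain SMTP; here's a sketch using Python's smtplib (the relay hostname, addresses, and credentials are placeholders):

    import smtplib
    from email.message import EmailMessage

    RELAY_HOST = "relay.example.com"     # placeholder: a mail server outside the cloud

    def send_from_web_node(to_addr, subject, body):
        """Relay outbound mail through the non-cloud server instead of sending directly."""
        msg = EmailMessage()
        msg["From"] = "noreply@example.com"
        msg["To"] = to_addr
        msg["Subject"] = subject
        msg.set_content(body)

        with smtplib.SMTP(RELAY_HOST, 587) as smtp:
            smtp.starttls()
            smtp.login("webnode", "placeholder-password")
            smtp.send_message(msg)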

That's All! ..except for the other stuff

That's the AWS-specific stuff, but you also need all the normal things that come with running your own hosting, such as Nagios, backups, etc.

Thanks to Barry for the great talk. I hope this summary is helpful!

Did you enjoy this post? Please spread the word.