Don't Fail to Scale Panel


The panel was moderated by Shai Goldman @shaig of SVB. The panelists are:

  • Paul Buchheit @paultoo is at Facebook; previously he co-founded FriendFeed and helped write Gmail at Google
  • Chad Dickerson @chaddickerson is the CTO of Etsy, which is based in Brooklyn and has been around since 2005
  • Ade Olonoh @adeolonoh is the CEO of Formspring, which has about 18m registered users

Ade: We started with two single-core instances on AWS, and about two months later we had two hundred. It started out as a side project, and we had two instances behind a load-balancer. As it scaled, we ran into a number of issues; our biggest downtime was a few weeks ago, for a few hours. We handled it by getting the word out to the audience; even after we got our database back up, we had to slowly bring the users back on. Luckily we had architected the system in such a way that the users were partitioned and we could slowly bring them back onto the platform.
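
Formspring hasn't described exactly how that ramp-up worked; a minimal sketch of re-enabling partitioned users in waves, where the partition count, the delay, and the enable_partition function are all assumptions, might look like this:

```python
import time

# Hypothetical sketch: after an outage, re-enable users one partition at a
# time, pausing between waves so the freshly restored database isn't swamped.

PARTITIONS = range(16)         # assume users are hashed into 16 partitions
RAMP_DELAY_SECONDS = 5 * 60    # wait five minutes between waves

def enable_partition(partition_id):
    """Placeholder for whatever flips the 'serving' flag for one partition."""
    print(f"partition {partition_id} re-enabled")

def ramp_users_back():
    for partition_id in PARTITIONS:
        enable_partition(partition_id)
        time.sleep(RAMP_DELAY_SECONDS)  # let caches warm and load settle

if __name__ == "__main__":
    ramp_users_back()
```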

Chad: We've had outages with the exact same scenario. When I started at Etsy it was three years in, and our big problem was search; some searches were taking 30 seconds to complete. We were running full-text search in the database. Since then, we've stepped on the gas at Etsy, and we went from about 16 engineers at the end of last year to about 80 today. Hiring is difficult when you're scaling that quickly; you spend all your time building the team. You need to put processes in place so the people you bring on are effective immediately; that will help your team scale quickly. There are no shortcuts, but you can do some things to help. We built a small, in-house recruiting team of three people.

Paul: Don't believe your host's promises! Every time there's a message about maintenance not affecting the service, you have to plan for an outage anyway. Know how to restore your system, especially with memcached. We sketched a few designs for bringing the site up on AWS if the colo went down, but we decided that wasn't the biggest risk to our business and was also a relatively low-probability event.
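
The memcached point is that a restored site with a cold cache can send every request straight to the database; a minimal sketch of warming the cache before taking traffic again, assuming the pymemcache client and a hypothetical load_hot_keys() query, might look like this:

```python
from pymemcache.client.base import Client  # assumes the pymemcache library

def load_hot_keys():
    """Hypothetical: pull the most frequently read values from the database."""
    return {"user:1:profile": b"...", "user:2:profile": b"..."}

def warm_cache(server=("localhost", 11211)):
    # Pre-populate memcached before the site takes traffic again, so the
    # first wave of requests doesn't all fall through to the database.
    cache = Client(server)
    for key, value in load_hot_keys().items():
        cache.set(key, value, expire=3600)

if __name__ == "__main__":
    warm_cache()
```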

Chad: Etsy started in 2005 and raised significant funding in 2008, so Etsy has more or less always been hosted in a colo. We're using Amazon for some offline processing and running a few things out in the cloud, but by and large we're pretty happy running on our own hardware.

Ade: All our data is stored in MySQL, but we're using a lot of systems on top of that for different access layers. We're keeping an eye on other technologies, but MySQL has a great ecosystem and a lot of great people who know what to do when there are problems.

Paul: Facebook has 500m users and is primarily running on MySQL and memcached; why is your problem so novel that you need the very latest tools? I wouldn't risk my business on running some hot new database.

Chad: I agree. We're migrating a lot of our Postgres data to MySQL. Our goal is to build a great marketplace, not to test out the latest NoSQL variant. If someone runs in and says that some new database is going to solve all our problems, I get nervous.

Ade: We didn't start looking for funding until our hosting bill hit $10 in a month. We decided to do that because we didn't want to monetize too quickly. AWS is great because you don't have to put forward a lot of capital to set things up, and using their auto-scaling and load-balancers you can tune the infrastructure to how much you want to spend. Right now we're spending a lot more than we would if we had bought our own hardware, but we may be unique in that regard.
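
The panel doesn't go into configuration details, but tuning spend with auto-scaling mostly comes down to capping how far a group can scale out; a rough sketch using today's boto3 API, where the group name and sizes are assumptions, might look like this:

```python
import boto3  # assumes boto3 is installed and AWS credentials are configured

def cap_spend(group_name="web-tier", min_size=2, max_size=20):
    """Cap how far the auto-scaling group can grow, which caps hourly spend."""
    autoscaling = boto3.client("autoscaling")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=min_size,
        MaxSize=max_size,
    )

if __name__ == "__main__":
    cap_spend()
```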

Ade: Funding wasn't an easy process, but it was a relatively short one. Traction solves everything. Even if an investor didn't know the product and didn't know me, just looking at our growth and our passionate users was enough to erase any questions.

Chad: We're using AWS for a lot of things like offline processing; right now at Etsy you're hitting live servers in our colo, but some of the data served may come from services such as S3. Hosting and cloud isn't either-or; it's both-and. We were storing all our images on an expensive disk array, but over the past few months we've been migrating them to S3. In the past three months, we've migrated about 60m images up to S3 from expensive network-appliance hardware. We're planning to migrate all the hottest stuff in the cache up to the cloud.
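
Etsy hasn't described the migration tooling, but at its core moving images off a disk array into S3 is a walk-and-upload loop; a minimal sketch assuming boto3, with a hypothetical mount point and bucket name:

```python
import os

import boto3  # assumes boto3 is installed and AWS credentials are configured

def migrate_images(local_root="/mnt/images", bucket="images-archive-example"):
    """Walk the old storage tree and copy each file up to S3."""
    s3 = boto3.client("s3")
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, local_root)  # mirror the directory layout
            s3.upload_file(path, bucket, key)

if __name__ == "__main__":
    migrate_images()
```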

Scaling Demand

Ade: It comes down to building the right product and finding the product that hits what people love to do. With all this talk about scaling, I talked about how we started, but I didn't talk about the times I didn't sleep, or my wife yelling at me because I was hacking on servers on Christmas Day; even so, if I had to go back I wouldn't change anything. Scaling demand means building a product people love, and most of that work was finding the right product.

Chad: What Etsy promises is a creative life for a lot of people, and that for us is kind of a magic formula.

Paul: Whenever someone starts asking me these technical questions, my first question is, "Do you have the problem yet?" If you're worrying about scaling and you don't have any users, you're kind of approaching the wrong problem.

Code Quality

Paul: You will inevitably need to rewrite the code, and you'll need to rebuild parts of the system. It wasn't so much about some metric of quality as the knowledge that you'll have to rewrite it. If one part becomes a bottleneck, you rewrite that part, so you design the code to allow you to do that. Many times it would only take a few days to rewrite something, or change our crawler to increase the number of feeds crawled, or change from a single database to multiple databases.

Ade: We've rewritten the backend at least three times already, either to introduce caching or to improve performance. I agree that it can be dangerous to think too far ahead. I wouldn't spend a lot of time trying to figure out what's next, because for the most part you can rewrite things and move things around.
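
A caching rewrite like the one Ade mentions usually reduces to a read-through pattern; a minimal sketch, assuming the pymemcache client and a hypothetical query_database function:

```python
from pymemcache.client.base import Client  # assumes the pymemcache library

cache = Client(("localhost", 11211))

def query_database(user_id):
    """Hypothetical: the expensive query the cache is meant to shield."""
    return ("profile-for-%d" % user_id).encode()

def get_profile(user_id, ttl=300):
    # Read-through cache: try memcached first, fall back to the database,
    # then store the result so the next request is cheap.
    key = "profile:%d" % user_id
    value = cache.get(key)
    if value is None:
        value = query_database(user_id)
        cache.set(key, value, expire=ttl)
    return value

# Example: get_profile(42) hits the database once, then serves from the cache.
```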

Chad: We're always rewriting too; we just moved 80m records of 'favorites' from Postgres to a sharded MySQL database. My team is working on it right now!
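
Neither panelist spells out the sharding scheme; a common approach, sketched here with hypothetical host names, is to pick a shard from the user id so that one user's rows always live together:

```python
# Minimal sketch of routing 'favorites' reads and writes to one of several
# MySQL shards by user id. Host names are hypothetical; a real setup would
# open connections with a MySQL driver such as PyMySQL.

SHARDS = [
    "favorites-db-0.internal",
    "favorites-db-1.internal",
    "favorites-db-2.internal",
    "favorites-db-3.internal",
]

def shard_for_user(user_id):
    """Pick a shard deterministically so a user's favorites stay on one host."""
    return SHARDS[user_id % len(SHARDS)]

# Example: every query for user 1234 goes to the same host.
print(shard_for_user(1234))
```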

Boosting Integration With Other Developers

Ade: We have an API in beta, but we're keeping it closed to prevent abuse. We're still so focused on getting the product right, as opposed to putting too much out there and then changing the product so the APIs aren't relevant anymore. We have a separate cluster that the API is hosted on, but it's using a similar back-end.

Chad: Regarding growing the team, we use the API as a recruiting mechanism, or even to identify acquisitions. There are many many stories of companies acquiring third party applications that use their APIs.

Green Technology

Chad: The green ethos is part of our company; I think it's just good business. Any startup has expenses, and power is one of them. It just makes sense to buy low-power servers, like Google or Apple does.

Detecting Bottlenecks

Paul: One tactic is to just keep an eye on performance. At FriendFeed we had our performance graph up, and if response time ever crawled above 100ms we would go look and see what was wrong.
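
FriendFeed's dashboards aren't described further, but the habit Paul outlines can be sketched as a timing wrapper around request handlers; the 100ms threshold comes from his remark, while the handler and logging setup below are assumptions:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.WARNING)
THRESHOLD_MS = 100  # the threshold Paul mentions

def timed(handler):
    """Log a warning whenever a request handler takes longer than the threshold."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            if elapsed_ms > THRESHOLD_MS:
                logging.warning("%s took %.1fms", handler.__name__, elapsed_ms)
    return wrapper

@timed
def home_feed(user_id):
    time.sleep(0.15)  # simulate a slow request
    return "feed for %d" % user_id

if __name__ == "__main__":
    home_feed(42)
```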

Ade: We had to learn that the hard way. Now we measure every action that happens, so when there are performance issues we can see whether it's because people are using a feature more or because of a new patch.
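
Formspring's instrumentation isn't described in detail; one lightweight way to "measure every action" is to fire a counter per action name at a statsd-style daemon over UDP, where the daemon address and action names below are assumptions:

```python
import socket

# Minimal sketch: emit one counter per user action to a statsd-style daemon
# ("<name>:1|c" is the statsd counter line format). UDP sends are
# fire-and-forget, so instrumenting handlers this way adds almost no overhead.

STATSD_ADDR = ("localhost", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def count_action(name):
    _sock.sendto(("%s:1|c" % name).encode(), STATSD_ADDR)

# Example: call this at the top of each handler.
count_action("question.asked")
count_action("answer.posted")
```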

Chad: I can't emphasize that enough. We used to have a lot of mystery outages, but since we started measuring and keeping track, those have gone away.
