These are my notes from a talk given by Mikito Takada on July 2, 2012. Mikito works at Zendesk, and you can follow him on Twitter at @mikitotakada.
I'm a Socket.io/Engine.io contributor. I write free books! I'm going to talk about real-time at slightly larger scale.
The stack at Zendesk used to be based on ejabberd. In 2011 I came in and they wanted me to kill it; that's what we did. We deployed one version of the infrastructure based on HAProxy/F5 + Socket.io + Redis, and this year we've moved to Engine.io.
You make good choices in the beginning and try to stick to those. You want an API that doesn't expose client IDs and other implementation details like that.
- Independently scalable processes: you should be able to add capacity at each level of your app independently of others. Whatever technologies you are using, you should be able to add new processes on the back-end, for example. You should have levels of abstraction where you queue work for each layer, and that layer should not really care what the data is. You should write your API to deal with things that are meaningful to your app and ignore implementation details.
- Do as little as possible: If the processes are very simple (they should not contain management code or do more than one thing), you can manage each of them independently. If something fails, you should be able to just start a new process.
- Statelessness is very important. Anybody doing serious high-availability stuff will tell you that data is sacred, whereas computation can occur pretty much anywhere. Shared-nothing architecture: each node is independent and self-sufficient. Socket.io requires sticky sessions (for handshakes), but that's simply a protocol limitation; don't build your app in a way that uses in-memory sessions. Things that matter should be persisted elsewhere and things that don't can be automatically recovered. Avoid server-side sessions; write your APIs and authentication to work with stateless servers. You don't want to get to the point where your processes hold state that should be stored in the database but isn't. In our v.1, authentication was session-based.
- Disposability: Fail gracefully. A client crash should be fixed by a reload. After a server crash, a new process should still be able to retrieve the existing state.
- Monitoring over debugging: You want to be able to identify faults but also look at failure rates. Even in a perfect system you will still have random failures; somebody is using wifi or 3G and goes into a tunnel and that causes some strange issues. You need to have some sort of monitoring that's more sophisticated than just writing to a file. Keep it simple. If you aren't going to react in real-time, hourly is fine. Also, make collecting data easy, on the server as well as the client.
- Keep it Simple: no special casing; your end user doesn't care about your architecture; polling works fine. Don't try to tackle too many problems at once.
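
The statelessness point above (authentication that works with stateless servers instead of in-memory sessions) can be sketched roughly as follows. This is only an illustration, not what Zendesk actually runs: the talk's stack is Node-based, the example is in Python for brevity, and `SECRET`, `issue_token`, and `verify_token` are hypothetical names. The assumption is that every server process shares one signing secret through configuration, so any process can verify a token without looking up a session.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical shared secret. The assumption is that every server process
# loads the same value from configuration rather than holding per-user state.
SECRET = b"example-shared-secret"


def _b64(data: bytes) -> str:
    """URL-safe base64 as text; the output alphabet never contains '.'."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def issue_token(user_id: str, ttl_seconds: int = 3600,
                now: Optional[float] = None) -> str:
    """Return a signed token carrying the user id and an expiry time."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"uid": user_id, "exp": issued_at + ttl_seconds})
    payload_bytes = payload.encode("utf-8")
    signature = hmac.new(SECRET, payload_bytes, hashlib.sha256).digest()
    return _b64(payload_bytes) + "." + _b64(signature)


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    current = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        # Wrong number of parts or invalid base64.
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # Constant-time comparison to avoid leaking signature bytes via timing.
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    if claims["exp"] < current:
        return None
    return claims["uid"]
```

Because verification needs only the token and the shared secret, a request can land on any process, and a crashed process can be replaced without losing anyone's login. Sticky sessions are then needed only for the transport handshake, not for application state.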