These are my notes from the talk at DrupalCon Denver given by Jeff Miccolis. You can follow Jeff on Twitter @miccolis.
This is probably the only talk at DrupalCon where you'll hear the word 'node' a lot.
I used to work at Development Seed, which was a small consulting shop in Washington, DC, and we did a lot of work for governments and NGOs. I was most involved with Open Atrium, Features, Context and Strongarm.
Today, I work at MapBox. At Development Seed, we found ourselves focusing on open data, visualizations, and maps. Currently we have a hosted service at mapbox.com for hosting tile sets that you create, or integrating tile sets that we create from other open data sources. Recently, foursquare switched to our maps.
In the description of my talk, I implied that I would get into the nitty-gritty of node.js. Unfortunately, once I wrote it, that talk wasn't much fun. Instead, I'm going to tell three horror stories with Drupal and PHP that would have gone better if I had used node.js.
Story 1: Sending Lots of Email
We ran into this problem with Open Atrium, which is a collaboration suite. With any collaboration suite, you'll want to have notifications. We had an interface that lets you check off the colleagues you want to notify.
We wanted to add more channels: SMS, XMPP, Twitter, and so on. To deal with this we wrote a Drupal module called Messaging.
In Drupal, when you hit Save, the most naive way to send an email is right at that point. That's a problem, because sending email takes time. Sending a single email takes about 50ms. For 600 users, that will take about 30 seconds. That's a problem. Generally, you'll move this to when you run cron. If you run cron every 5 minutes, then that gives you a 5 minute window. But sometimes, cron may not complete, and the next time it runs, it may not catch up. We ran into this a few times with Open Atrium.
In PHP, the common way to process a lot of emails is to cycle through an array and send a mail to each recipient, checking each return value as you go. In node, it's a little different: the mail() call isn't wrapped in an 'if' block; instead it's passed a callback function which handles any error. Node.js gives us a platform with easy, accessible asynchronous programming. The upside is that you stop waiting around for actions that involve I/O.
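A minimal sketch of that callback pattern. The sendMail() function here is a stand-in, simulated with a timer, for a real mailer library; the point is that every send is started immediately and its result handled in a callback.

```javascript
// Stand-in for a real async mailer: ~50ms per mail, as in the PHP example.
function sendMail(recipient, callback) {
  setTimeout(function () {
    callback(null); // null error: the send succeeded
  }, 50);
}

var recipients = [];
for (var i = 0; i < 600; i++) recipients.push('user' + i + '@example.com');

var sent = 0;
recipients.forEach(function (recipient) {
  sendMail(recipient, function (err) {
    if (err) {
      // handle or retry the failed send here
    } else {
      sent++;
    }
  });
});
// All 600 sends are now in flight concurrently, so total wall time is
// roughly one send (~50ms), not 600 * 50ms = 30 seconds.
```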
Just by overlapping the I/O, the asynchronous approach can make this 10x faster. You do things faster by telling your I/O to do more work at once!
Story 2: Aggregation
One of the big projects that Development Seed did was a news aggregation system called Managing News, which aggregated stories and geolocated them on a public map. We had to fetch a lot of RSS feeds and Wikipedia articles, and do geolocation. We had thousands of feeds with millions of items, and gigs and gigs of data.
The testing database we used was 4.5 gigabytes, which was big enough to simulate what a real site was like. The cron-based approach to I/O failed horribly at managing these feeds: there was just not enough time to guarantee that you would hit all of them. We needed to guarantee each feed would be hit at least once a day, and cron just wouldn't get there for us.
We wrote maggied, a multithreaded Python daemon. It pulled batches of 50 items from the Drupal database, and four workers would fetch each story, tag it using a third-party service, try to geocode it, and write the results back into the Drupal database.
Retrieving an original story took about 300ms, tagging it took about 100ms, and geocoding took about 150ms, so the total time for each thread to deal with one story was about 550ms. That's not too long, but during that time my server is not doing a darn thing. It's just waiting for these APIs to respond. I have all this excess capacity on servers built to handle tons of data, but they're sitting idle.
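If the three stages are modelled as asynchronous calls, node can run many stories through the pipeline at once. The stage functions below are stand-ins (simulated with timers matching the numbers above) for real HTTP requests:

```javascript
// Simulated pipeline stages, with the same latencies as the real services.
function retrieve(story, cb) { setTimeout(function () { cb(null, story); }, 300); }
function tag(story, cb)      { setTimeout(function () { cb(null, story); }, 100); }
function geocode(story, cb)  { setTimeout(function () { cb(null, story); }, 150); }

var stories = [];
for (var i = 0; i < 50; i++) stories.push({ id: i });

var done = 0;
stories.forEach(function (story) {
  // Each story moves through retrieve -> tag -> geocode independently.
  retrieve(story, function (err, s) {
    if (err) return;
    tag(s, function (err, s) {
      if (err) return;
      geocode(s, function (err, s) {
        if (!err) done++;
      });
    });
  });
});
// A whole batch of 50 finishes in roughly 300 + 100 + 150 = 550ms of wall
// time, instead of 50 * 550ms with one blocking worker.
```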
One approach is to use lots of worker threads. For various reasons, that didn't end up working for us (although theoretically it could work). But really, is that the best idea?
If I were to go back to this problem after working with node.js, I would replace the retriever workers with a hyperactive squid:
The squid runs up and down as fast as it can dealing with each item in turn. It fires off any long running I/O operations and then moves on to the next item. When the I/O operation reports progress, it does a little more work on behalf of the corresponding item.
You have this single process (in node.js it's called the event loop) which keeps running around, checking as much I/O as it can and only doing work when there is new data. This is great: anything that leaves V8 takes a callback, and only bothers the main event loop when something has happened.
In node, the code for filesystem work, networking, stdio, timers, and child processes all looks different from its PHP equivalent: each operation takes a callback. The benefit of doing things this way is that you get to ask your machine to do more work. Your limiting factors start to change:
- How many open sockets are you allowed?
- How much bandwidth do you have?
- How fast can you issue requests?
The classic limit with PHP is your per-thread memory limit and the number of concurrent threads.
Story 3: Big files, long sessions
By big and long I mean gigabytes and hours. If you've done PHP development for any period of time, you have stories of being challenged by large files and long sessions. There are these four config options that everyone in this situation is familiar with:
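The notes leave the list out, but these are presumably the usual four php.ini settings (the values shown are just typical defaults):

```ini
; The classic knobs for large uploads and long requests in php.ini
upload_max_filesize = 8M
post_max_size = 8M
max_execution_time = 30
memory_limit = 128M
```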
As you add more features and bigger files, you'll bump these numbers up and up. I'd say it caps out at about 500M; by then you've opened yourself up to DoS attacks, and you're allowing your scripts to run for long periods of time. The classic PHP solution is to look elsewhere for very large file uploads. I saw a little chatter on whether we can use node.js for this, for Drupal.
In node.js, this becomes a lot more feasible. I've used a library called formidable, which provides a simple API for accepting very big file uploads, and can also handle incoming POSTs in general.
CouchDB has a really cool feature called the changes feed. It lets you hold open a connection to your database, over which CouchDB sends you (as HTTP) little JSON snippets describing each change to the database. We use this for MapBox hosting. We allow people to upload tile sets of up to 5GB, and we need to get those uploaded. We send uploads directly to S3, and when an upload completes we save a record in CouchDB. That record gets propagated to all of our web heads, where a long-running node.js process watching the changes feed sees that the tiles have been updated and downloads them.
What we found is that when you're moving to a system with non-blocking I/O and a single event loop, a lot of things change. It's easier to write smaller programs that are more connected.
Enough war stories. Two more things:
Package Management for Drupal
drush make will make your life easier when you have to revisit an old project and apply security updates. One of the nice things about drush make is that it can rely on a few conventions in the larger Drupal ecosystem that make it interesting: each module has its own namespace on drupal.org.
Contrast this to PEAR, which has a high threshold for new projects. Imagine if PEAR were wildly inclusive, awesomely useful, and awesomely successful. I don't bring this up because I think Drupal should use PEAR; rather, npm, the node package manager, does exactly these things for node. You should know about it even if you're just experimenting with node.
A brief comparison of the number of projects recently (early 2012):
- pear: 584
- d.o: 15,296
- npm: 7,976
Keep in mind that npm has only been around for two years.
Node.js is not good for:
- Computationally heavy tasks. If you're doing a lot of math, this is just not the kind of stuff node excels at. Every operation you do that doesn't take a callback blocks the event loop. The loop can do one thing at a time; doing a long computation means that while the loop is busy with it, everything else is waiting.
- Databases are the classic example of something that is poorly handled with node.
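A tiny self-contained demonstration of that blocking behaviour: the loop below stands in for any heavy computation, and the 10ms timer can only fire after the whole loop finishes.

```javascript
var start = Date.now();
var timerDelay = -1;

// Ask for this to run in 10ms...
setTimeout(function () {
  timerDelay = Date.now() - start; // ...but it will be measurably late
}, 10);

// A long synchronous computation with no callbacks: the event loop is
// stuck here, so the timer above cannot fire until this completes.
var sum = 0;
for (var i = 0; i < 1e8; i++) sum += i;
```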
Node.js is awesome for interacting with other services. For example, databases, web services, mail servers and web clients.