These are my notes from Dave Pacheco's talk on July 3, 2012 at NodeConf. You can follow Dave on twitter at @dapsays.
A lot of the work we do at Joyent is engineering work on the Joyent Cloud. These services are taking tasks from users to provision and de-provision servers, which is the perfect use case for node: taking HTTP requests and reacting to them. At Joyent, we found node.js to be a great environment for systems programming, especially UNIX abstractions. Node allows us to write this software much, much faster. Since it's core infrastructure, it has to be highly reliable and fast.
Dtrace provides comprehensive tracing of both kernel and application-level events in real-time. It scales arbitrarily with the number of traced events and is suitable for production systems:
- It's safe
- has minimal overhead
- can be enabled/disabled dynamically
Perhaps the most important aspect of dtrace is that it allows performance investigations to be data-driven.
There are two types of performance problems. Asynchronous problems are when the node program is waiting for other things to happen. The problem is with something else (e.g. an authentication service) This may also be an architectural problem with your program. The second case are Synchronous problems, and that's where your program is the problem. You're doing an enormous amount of work on the CPU, but that's the bottleneck that you're running into.
Let's talk about the asynchronous operations. Usually, you want to understand the latency of each task. One approach is to call Date.now() or process.hrtime() and compare timestamps between the time your code starts and finishes. It doesn't allow you to correlate different types of operations without doing a whole lot of work. With dtrace, you get a lot of this stuff for free. Paolo talked about dtrace probes. The very basic thing you can get is to look at system calls. For example, get a histogram of the time spent in the read sys call in your program. That may or may not be interesting, but a node-level event like garbage collection may be very interesting. You can use that to get a histogram of the time spent in garbage collection over the course of a large number of operations.
Let's talk about dtrace provider. You can put a line in your node code that says, fire a dtrace event. The reason I'm using restify is that restify does this automatically for you, for all the handlers you specify.
if your program is waiting for something else, it must be doing something itself. But what? System probes may still be useful. dtrace provides the ability to fire a probe on all CPUs at some arbitrary interval. You can use the dtrace provider with jsstack() to see what your program is actually doing.