
Operating Node.js in Production, with Bryan Cantrill

   

1. We're here at QCon San Francisco 2011 with Bryan Cantrill to talk about operating Node.js in production. Hi, Bryan. Would you like to introduce yourself?

Sure. I'm Bryan Cantrill, Vice President of Engineering at Joyent, a cloud computing startup here in San Francisco.

   

2. What are the main challenges of operating a platform like Node.js? I assume Joyent is probably the organization with the largest Node.js deployments.

In the words of an old razor commercial, we loved Node.js so much, we bought the company. We actually hired Ryan because we had started to use Node.js, and then ultimately bought Node.js from him, so Joyent owns Node.js and as such we are, of course, proponents of it. What's interesting is that because we came to Node by actually using it, we are deploying Node heavily in production not because we own it, not because of some sort of fiat, but because we're finding it's the right tool for so many jobs. It's really been amazing to me how many things are a good fit for Node, and surprising things.

To give you a concrete example: it is so easy to build a high-performing network service in Node that we're using it in the most surprising ways. As part of the implementation of the cloud, we have compute nodes that boot up and load the OS image from a head node. The head node has to serve DHCP and TFTP, and it needs to make some HTTP calls to an API to figure out who the compute node is.

So you end up with this kind of Frankenstein service: it's basically a DHCP server with some TFTP bolted on and some HTTP bolted on. We took ISC dhcpd and tried to get it to cooperate, and it just was not cooperative, and the engineer who was working on this said, "You know what? This thing really only wants to be a DHCP server, and I think I can actually go write this in Node."

So we gave it a shot, and three days later he had a working prototype that does DHCP and TFTP and makes the necessary HTTP calls. That's our DHCP server, that's what we deployed, and it got correct amazingly quickly and has had effectively zero issues in production.
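To show how little code such a service needs, here is a minimal, hypothetical sketch (not Joyent's actual implementation) of the UDP skeleton a Node-based DHCP-style service starts from, using only the core dgram module:

    // Hypothetical sketch, not Joyent's code: the skeleton of a
    // Node-based DHCP-style service built on the core 'dgram' module.
    var dgram = require('dgram');

    var server = dgram.createSocket('udp4');

    server.on('message', function (msg, rinfo) {
      // A real implementation would parse the BOOTP/DHCP packet here,
      // make an HTTP call to an API to identify the booting compute node,
      // and send an offer back over the same socket.
      console.log('%d-byte request from %s:%d', msg.length, rinfo.address, rinfo.port);
    });

    server.on('listening', function () {
      var addr = server.address();
      console.log('listening on %s:%d', addr.address, addr.port);
    });

    server.bind(67); // DHCP servers listen on UDP port 67 (needs privileges)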

So we're using Node a lot for those kinds of network-facing services over legacy protocols. DHCP is one; another thing we did recently is LDAP, which we use internally. We've implemented an LDAP server in Node that can sit on an arbitrary storage back-end. Mark Cavage from Joyent did that, and it was funny, because Mark was confiding in me, saying, "You know, I'm looking forward to making this open source, and I hope people will be interested in it."
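The server Mark open-sourced is ldapjs. A sketch along the lines of its documented server API (the entry and handlers here are illustrative assumptions) shows why an arbitrary storage back-end is easy to slot in: the search handler is just a callback.

    // Sketch in the style of the ldapjs server API; the entry and the
    // back-end are illustrative, not Joyent's production service.
    var ldap = require('ldapjs');

    var server = ldap.createServer();

    server.search('o=example', function (req, res, next) {
      // Fetch entries from whatever storage back-end you like, then send them.
      var entry = {
        dn: 'cn=bryan, o=example',
        attributes: { cn: 'bryan', objectclass: 'person' }
      };
      if (req.filter.matches(entry.attributes))
        res.send(entry);
      res.end();
      return next();
    });

    server.listen(1389, function () {
      console.log('LDAP server listening at %s', server.url);
    });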

   

3. It was a big story.

It's LDAP; no one's interested in it. I mean, LDAP is like a jail sentence: nobody is actually interested in LDAP, LDAP is something that is inflicted upon them. He said, "I just hope people want it," and I was trying to talk him through it and calm down his expectations: "You've got to know this is great work no matter what the rest of the world thinks. Everyone's going to ignore it, but this is terrific work."

And sure enough, we put it out there and it became one of the top stories on Hacker News, got a lot of discussion, and there's a huge amount of interest in it. And if you look at how long it took Mark to do that work, it's just not that long. I mean, Mark is a terrific engineer and he worked on it very diligently, but we're talking six weeks, eight weeks, twelve weeks at the outside to have a high-performing LDAP server. That's pretty amazing; I certainly can't imagine doing that. So we're doing that across the board.

   

4. Actually, many people have tried to do something similar, and they have been using the same libraries for more than a decade, copying each other's stuff. What about the Node.js implementation? I think it was pretty much written from scratch.

Yes, it was from scratch. That's right: you've got these libraries that have been dragged along, and they impose a bunch of complexity and constrain the implementation. So it's been terrific from a speed-of-development perspective. One of the things we were of course interested to see was: how does this thing actually work in production?

Fortunately, we had some very early production experiences with Node that were very positive, where we saw much higher load than we thought we would see, and Node seemed to be doing great. We were not seeing runaway memory consumption, which was one of my concerns, having seen all the difficulty that the GC posed for Java apps when they were first put into production.

I was really braced for serious GC issues that never really materialized, which is a huge tribute to the V8 guys, and a tribute to the fact that the Node model is so straightforward. We had, of course, some early Node bugs, but they were relatively minor in the grand scheme of things.

The thing that quickly became the issue was not Node itself, but this: when you have a Node program that misbehaves in production, how do you diagnose it? Now, one of the things I found is that, more than in any other environment I've programmed in, it is easy to get things right in Node. I've been amazed at how frequently I've coded something up in Node and said to myself, "Wait a minute, is that it?" It feels like there should be more to it, more work than that, which is not the experience we're used to as software engineers. Normally it's: okay, I'm going to do something that should be really simple, and, "Oh my God, this is so excruciatingly complicated. Why is this taking me so long?"

Node was one of the first environments where we had the opposite experience: "Okay, this is very complicated, it's going to take me a very long time... Oh my God, I'm already done... Wait a minute, something is wrong here!" But even so, programs have bugs, so how do you actually deal with these things in production? The aspect we had a very hard time with was not programs that die fatally: if a program blows up with an exception, you generally have the stack trace there, so you can make some sort of progress. It's not perfect, you want more information, but you can work with it.

Where things were brutal, and this is unusual, was programs spinning out of control. It's unusual because you don't go compute-bound in Node very frequently; you really do have to have a programming error to see a program spinning out of control. But we saw one of these before we went into production with a big release in the spring.

We were doing our testing and one of our daemons was spinning out of control, and it was very frustrating: you're trying to watch the thing and it is just totally opaque what it is doing. One of the great things about working at Joyent is that Ryan's down the hall, so I shouted, "Hey, Ryan, get over here, help us debug this thing." And the three of us stared at the goddamn thing and tried to debug it, and you don't know where it is, you don't know what it's doing, it's very hard to get context. That bug, by the way, we didn't debug, we didn't find, which to me is the worst thing imaginable.

You're saying: I had a bug, I had it in front of me, and I've got no idea what it was. It's terrifying, and to me that is an unresolved disaster, a disaster waiting to happen. We actually placed bets about how frequently we were going to see it once we stood the thing up in production. I thought we were going to see it very frequently. Fortunately we didn't; we went into production and never saw it again, but we knew it was out there. There was this bug out there.

Meanwhile, we were thinking, "We've got to go tackle this problem, because we know this bug and other bugs like it are out there." In particular, we had to tackle the problem of being able to take Node state and V8 state and understand what's going on in the program. For us, as systems guys, that meant we needed postmortem debuggability in Node: I need to be able to walk up to a Node process, take a core with gcore, take a snapshot of its state, and then go debug it.

It's a really hard problem, because if you take a core dump of a Node process and go look at the stack trace, it's all meaningless, right? Fortunately, we felt strongly enough about this that we put some really intense effort into it, and Dave Pacheco on our team at Joyent was very excited to go tackle the problem, in part because it was his software that had the bug. Actually, we didn't know whether it was his code or my code that had the bug.

Dave did all this work, terrific work that I'm very excited to be demonstrating today here at QCon: we're able to take a core dump now and get a real stack trace. We still had that core dump from back in the day, and we were able to take that core and debug it. I should say, we solved this problem just in time: we got it working about a week and a half ago, and of course the bug struck again right as I was demonstrating for customers. I had high-profile customers come in, I'm demonstrating, and it doesn't seem to be working. I'm sure you've had a demo go out on you; it's such a fight-or-flight reaction: "Oh, I'm doing something wrong. Oh God, no, it's not me. Oh God, the demo is not working. Oh, it's happening right now, to my demo." I feel the adrenaline and the cold sweat, so I did some hand-waving and went to demo on another machine.

Meanwhile, we have backup machines, so while the thing is spinning again, Dave has now done his work: we did a gcore on it, he looked at the core dump, we applied the new tooling to that core dump, and we found the bug. Sure enough, it was a stupid bug. We were passing an object into a routine with all of its parameters, and MIN and MAX were both set to the same number, which that routine, my routine, was never expecting; it didn't assert that they were different. So my code should have blown up, but it didn't; it just went into an infinite loop. A stupid bug.
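To make that failure mode concrete, here is a hypothetical routine with the same class of bug (not the actual Joyent code): it silently assumes min and max differ, so when both are the same number it spins on CPU without ever blocking or throwing.

    // Hypothetical illustration of the bug class described above.
    function pickInRange(opts) {
      // Missing assertion: nothing checks that opts.min !== opts.max.
      var val;
      do {
        val = opts.min + Math.floor(Math.random() * (opts.max - opts.min));
      } while (val >= opts.max); // with min === max this is always true
      return val;
    }

    // pickInRange({ min: 5, max: 5 });
    // No exception, no I/O: just a CPU-bound infinite loop that is
    // invisible without postmortem tooling like gcore and MDB.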

With this technology, it was immediately debuggable; without it? It doesn't matter who you are, you are not debugging it. And I know that from experience, because we poured a lot of resources into trying to debug this thing: Dave and Ryan and I really tried to debug it earlier and we couldn't do it. So we're really excited about that. I think that kind of technology is the difference between interesting technology that's fun to play with and really rock-solid, production-ready technology.

I think Node is actually far more production-ready than other environments that have been around a lot longer. I would deploy Node way before I deployed Ruby; I would deploy Node way before I deployed Python. I know people are having aneurysms all over as I incite religious wars, but from my practical experience, we can now determine more about what's going on in a Node program than in any of these other environments.

I think the only environment that has made as much progress on debuggability is Java, and of course Java has a 15-year lead. But we're more focused on this particular problem than Sun was with Java, so we are going to continue to flesh it out and make it better. Our commitment is really to make Node the best production environment, second only to C, of course, the language that God intended. Everything we do is either in C or in Node, so for those two systems, we are bound on making them absolutely debuggable in production.

   

5. Talking about systems that are rock solid in production, how do you manage the rapid evolution of the Node.js platform? I mean the new versions, the API not yet being stable, and things like dependencies: many of the plugins and modules that people use have weird dependencies that cause issues in production.

Yes, that's been a challenge, and we have consciously kept our dependencies under control. What we've been doing is pinning dependencies in sub-repos, so the repo that has the service has all of its dependencies sitting right there. When you want to pull in the latest Node, which may require you to update three of these things, and may require you to go debug something that's now not working, at least you've got that whole thing as a unit.
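As a sketch of that arrangement (the layout and module names are assumptions for illustration, not Joyent's actual repos), the service repository carries pinned copies of its dependencies, for example as git submodules, and loads them by relative path, so the service and its dependencies move as one unit:

    // Assumed repo layout, for illustration only:
    //
    //   my-service/
    //     server.js
    //     deps/
    //       some-http-lib/   <- pinned copy, e.g. a git submodule
    //       some-ldap-lib/   <- pinned copy
    //
    // server.js loads the pinned copies explicitly, so pulling in a new
    // Node and updated dependencies can be tested and rolled back together.
    var httpLib = require('./deps/some-http-lib');
    var ldapLib = require('./deps/some-ldap-lib');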

So what we have not had is production problems as a result of that; what we have had is that it slowed down development in certain regards. Fortunately, I would say that was more true six, eight, nine months ago than it is today; things are settling. A year or a year and a half ago, you'd go looking for anything and there were five modules that did it, it was very unclear which to use, and all of them did it poorly.

That was so demoralizing. I'd say, "Okay, I need an options parser. Oh, there are five options parsers," and I could see after ten minutes that all of them were wrong and I actually needed to go write a sixth. Great. That was more the case a year or a year and a half ago than it is today; these things are now settling and we're seeing more canonical implementations of a lot of this stuff.

   

6. I suppose you're using npm to manage them.

Isaac is very fond of saying that if you set out to write a package manager, you might, just might, if you're lucky, end up with a really good options parser as the primary artifact. And you know, the options parser is a bad example; I think we've probably used three or four different ones under Joyent's roof. You have to try to be flexible on some of that stuff.

The nice thing about Node is it's pretty lightweight, and we're not seeing the kind of Ruby-gems-style monkey-patching, metastasized monstrosities where you're not able to touch the environment without breaking everything; not yet, anyway. I don't think we're going to see that, because the ethos is different. The Node community, to me, very much has the Unix ethos: it's very Spartan in terms of interface and really driving towards simple, composable systems, and, the way I read it anyway, it is less concerned with what I think is superficial simplicity and more concerned with deeper simplicity.

Unix was revolutionary because of a very deep kind of simplicity, so deep that we no longer recognize what operating systems used to look like. I think Node is the same way, so it has not become the convoluted hairball that it could have been.

   

7. You mentioned a little while ago the production issues that the JVM has. Most systems have some inherent issues: with the JVM you usually have memory issues; with an Apache-plus-modules environment, you have memory and connection issues. What are the inherent issues with Node.js?

You know, I feel like a jackass saying this, but I really can't point to one. It has been so solid for us in so many ways, across so many services. It's hard for me to say, "Oh yes, when you're running Node, your CPU consumption is going to be pretty high." No: we were braced for that and never saw it.

A Node app that isn't spinning in an infinite loop barely cracks the CPU; the light CPU load is just amazing. The DRAM footprint is unbelievably small. We've got a free developer tier on our cloud where we run 128 MB zones, 128 MB virtual OS containers; 128 megs is all you get, and yet people run lots and lots of Node apps on there, because Node can actually do a lot in 128 megs.

So it's not DRAM. Because of its event-oriented architecture, we're not really seeing complicated threading problems, since obviously it's not threaded, and we're not seeing I/O problems, because you don't block on I/O: you're doing I/O, but you're still servicing requests.
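A minimal, generic sketch of what "doing I/O but still servicing requests" looks like (not one of Joyent's services): the read is handed off to the system and the single event-loop thread keeps accepting other requests while it completes.

    // Generic sketch: the file read is asynchronous, so the server keeps
    // accepting and serving other connections while it is in flight.
    var http = require('http');
    var fs = require('fs');

    http.createServer(function (req, res) {
      fs.readFile(__dirname + '/greeting.txt', 'utf8', function (err, data) {
        res.writeHead(err ? 500 : 200, { 'Content-Type': 'text/plain' });
        res.end(err ? 'error\n' : data);
      });
      // No blocking here: control returns to the event loop immediately.
    }).listen(8080);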

So I won't blame anyone for thinking, "Oh, Cantrill went to the marketing department; I thought this guy was actually in the trenches, what's going on?" But it's true. For us, the biggest issue has been: when you do have a Node bug, which fortunately has been very rare, a bug that spins on CPU has historically been undiagnosable, and that's why we did the work we've done to address that.

   

9. Yes, I mean the tools and the process for someone to debug this.

Yes, we're very much using the tool set that we have in the operating system. We have our own operating system at Joyent called SmartOS, so when you are running on no.de you're running on SmartOS, and we have a debugger built into SmartOS called MDB, and into that debugger we have built modules that understand V8.

Dave Pacheco, again at Joyent, has a terrific blog entry outlining all the things he's done. We're still working on this and rolling it out, so people can look for it in the coming weeks and months. Just to warn people in advance, it's going to be a SmartOS-only feature, not because it isn't open source, and not because we don't want to see it on other platforms, but because it builds on effectively 15 years of foundation that we have poured into the debuggability of the system.

You can't build this kind of stuff into GDB; that's not how GDB is architected. MDB is architected to do this, and MDB is itself built on foundational elements very deep in the system that are just not readily portable. We would welcome anyone who wants to port it to a different system, and we will help them out, but it's rocky. So for now, this kind of debuggability is really only going to be on SmartOS.

   

10. As one of the creators of DTrace and of Node's DTrace support, would you like to tell us how and why a DTrace-able Node.js matters? Obviously it matters, given that you have poured work into it.

Actually, I can be very inverted about that kind of stuff. Again, maybe you'd be right not to believe me, but I don't tend to invent things and then use them for their own sake; I tend to work the other way around and invent the things I need to solve the problems that I have right in front of me.

If we didn't need Node support for DTrace, I don't know that we would necessarily have done it; if I don't need it, then who cares? But we do need it. That's the reason it's important: we need it when these things misbehave, and they misbehave with this one pathology of not interacting with the outside system at all, kind of going into a black hole. We've got to be able to understand what that thing is doing, and it's not just Node. In fact, if you want to talk about production issues we have had, just so you know that I'm not in the marketing department:

Let's talk about Erlang and RabbitMQ. We can talk about production issues where we did have Rabbit losing its mind off in Erlang never-never land, and you've got no idea what's going on. Now, the Erlang community is also working on DTrace support, which is great. Obviously we're very excited about that, because we need it to be able to debug those kinds of production problems.

So when it comes to DTrace in Node, it's not just "Hey, it's nice to have." It's: if you've got this kind of problem in production, if you've got a production Node system that loses its mind, you need this technology; you need DTrace and you need MDB to be able to figure out what the hell is going on.

   

11. You have talked in the past about deeper DTrace support within V8; would you like to elaborate on that?

Yes, we've done a couple of things. One is that we have added some probes to Node itself at points of semantic relevance. There's also a terrific module that I talk about in my talk: Chris Andrews put together a module that allows a Node developer to define their own DTrace probes. Mark Cavage, again at Joyent, did this for ldapjs: ldapjs defines DTrace probes so that you can walk up to that thing and instrument it.
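That module is Chris Andrews' dtrace-provider; a sketch along the lines of its documented usage (the provider name, probe names, and arguments here are illustrative) looks like this:

    // Application-level DTrace probes via the dtrace-provider module;
    // provider name, probe names, and arguments are illustrative.
    var d = require('dtrace-provider');

    var provider = d.createDTraceProvider('myservice');
    provider.addProbe('request-start', 'char *');         // arg0: URL
    provider.addProbe('request-done', 'char *', 'int');   // arg0: URL, arg1: ms
    provider.enable();

    // Somewhere in the request path:
    provider.fire('request-start', function () {
      return ['/example'];
    });
    // ...handle the request, then:
    provider.fire('request-done', function () {
      return ['/example', 12];
    });

    // An operator can then instrument the live process from outside, e.g.:
    //   dtrace -n 'myservice*:::request-done { trace(arg1); }'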

So not only has Mark developed effectively the most scalable LDAP server, it is also observable, because he's leveraging this terrific work. Beyond that, the work Dave has done recently is the ability to take a stack trace from the kernel and translate that stack into actual JavaScript frames; that's what we will be demonstrating today. The implementation details for that are incredibly hairy, just unspeakably hairy, the tree of the knowledge of good and evil, to understand how all of that stuff works, but it actually works.

What I don't know we'll ever achieve is the ability you have with DTrace in a native environment to instrument an arbitrary function by effectively changing only that function. We don't have the ability to do that in any dynamic environment today; it's an extremely hard problem. I don't know if we'll ever get there; we hope we're going to get there.

I think we may get there incrementally, and it may be that the kinds of things we've done now, the probes, the DTrace support and so on, already give you so much that solving that other problem, which is ten times as much work as all of this software put together, is just not worth it. I don't know; time will tell.

   

12. Do you have any information about what the Google guys are doing with V8 with regards to instrumentation?

With regard to instrumentation, this is not something the VM guys are focused on; they really want to focus on tools for the developer, whereas we focus on production use. They're very supportive of the work we've been doing, and we have great conversations with them, but the reality is it will require them and us to do a lot of work to make V8 truly arbitrarily instrumentable, and that's a huge challenge. They're receptive to it, and we're receptive to it in the abstract, but I don't know if that work will happen; then again, we've done all this other stuff around the periphery, and that may be enough.

   

13. Would you like to give us a comparison between operating Node.js on your own infrastructure and using a platform like Joyent's? What would be the main differences in the experience for someone who operates Node.js?

For me, again, I don't mean to sound too much like a salesman for Joyent here, but the reality is that to stand up Node in production and be able to debug it and understand it, you need these underlying technologies: you need DTrace, you need MDB, you need the stuff we've built on them.

So you want to be on SmartOS even if you're going to build it yourself. SmartOS is open source, so anyone can go install it and play around with it. You've got DTrace, you've got ZFS, you've got a lot of goodness in there; we've ported KVM to SmartOS. There are a lot of reasons to run SmartOS, and because it's open source, you don't need to worry about becoming encumbered with Joyent in that regard.

If you were to tell me right now that Joyent had closed its doors and I had to venture forth into the world, I would almost certainly be developing in Node, because it's a great environment for these kinds of systems tasks, and I would absolutely be standing it up on SmartOS, no question.

So you've got to stand it up on SmartOS, and then the question is: do you want to manage that yourself? Obviously Joyent would be happy to sell you the orchestration software to manage it, or do you want to go with a cloud provider for that? I think the decisions fall out from there, whether you stand it up with Joyent, stand it up with a different cloud provider, or stand it up yourself.

To me, the constraint is that you've got to have the tooling to understand how and why this thing misbehaves in production, because when it does, which is not necessarily frequent, the problems are really debilitating.

   

14. How do you see operations evolving in general as new kinds of applications become popular, like real-time and data-intensive applications?

To me, as we deploy things that are more real-time into production, we have to be able to understand the behavior of those applications better. The difference between a real-time system and a non-real-time system is that timeliness is correctness for a real-time system: if it's late, it's wrong.

And if that's the case, if "it's late, it's wrong," then you need to be able to know whether it's on time or not. And in order to know whether it's on time or not, you've got to be able to dynamically instrument the system. So one of the ways I think operations needs to change is that today, largely, operations has a very metric-focused view: how many of these things are we doing per second, and how does that compare historically?

Well, the number of operations per second is not actually that relevant in a real-time system; what's relevant is: how long did it take? That's a harder question to answer, because it requires you to actually instrument the system, but that's what I think operations folks need to embrace.
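As a small, generic sketch of that shift (not a specific Joyent tool), a Node service can record per-request latency instead of just counting requests:

    // Generic sketch: record how long each request took rather than how
    // many arrived per second; latency is the real-time correctness test.
    var http = require('http');

    http.createServer(function (req, res) {
      var start = Date.now();
      res.on('finish', function () {
        var elapsedMs = Date.now() - start;
        if (elapsedMs > 100) {
          // "If it's late, it's wrong": surface the outliers, not the average.
          console.warn('slow request: %s %s took %dms', req.method, req.url, elapsedMs);
        }
      });
      res.end('ok\n');
    }).listen(8080);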

Mar 08, 2012
