1. We are here at Ruby Fringe, Toronto. Who are you?
I am Tom Preston-Werner. I work at Powerset and GitHub. Powerset does what we call natural language search or semantic search. It's trying to do search a little more intelligently. Your Google searches are all keyword based, so you type in something like "Ruby Fringe Toronto" and then you get a page that comes up with a list of things that somehow relate. At Powerset we would say you'd have a more specific question. In Google you type keywords in and try to find a page; for Powerset we wanted it to be more like "When is Ruby Fringe?" You just say that and it would try to find words on the pages that say something grammatically like "Ruby Fringe is held in Toronto on July second", or whatever the date, I don't even know what date it is. It would take meaning from the sentences; extract meaning out of those sentences. We do grammatical parsing of web pages and we also grammatically parse the query as it comes in, we index that semantic understanding of the pages, and then try to match the query to those. So it's a more intelligent search with deeper knowledge of the content of the page.
2. What content sources do you index? Do you index the whole web or only specific sources?
Sure, so right now it's Wikipedia only. We start with the Wikipedia corpus because it's relatively small compared to the entire web, so it is a manageable piece to begin with. We also do it because it's a little more structured, and we start with something that is a little bit easier than all the really crappy grammar on the web. By doing this we can refine the algorithms to a point where they work on decent text, and not have to worry about spam or really obnoxious grammar or some of the really difficult text extraction kind of things. We leave those for later. We say we are going to get this problem solved first, this piece of the puzzle, which is the really hard grammatical parsing: figuring out what the index looks like, figuring out how to query that, how to do the parsing of the grammar and of the incoming query. We do these things right now on a smaller, nicer corpus like Wikipedia. Then we branch out to things like the New York Times, like WebMD, other sources of really good knowledge, figure those out and just keep expanding and eventually go web scale. That's the dream, that's the goal.
3. What technology is Powerset built on?
We use a lot of different stuff. We use a lot of Ruby, almost entirely for internal tools, anything that people are going to use in the company for our relevance testing, things like that. That's what I work on. We use Ruby because it's a very productive language and we have a lot of good Ruby knowledge; there are some really excellent Ruby programmers in the company. We also use C and C++ for some of the runtime that needs to be super, super fast. We have an Erlang project that handles incoming queries at the first part of the runtime, taking them in and routing them to where they need to go. That's the Fuzed project that I also work on. There is also some Python for the build system, and JavaScript, HTML and CSS, obviously. We try to choose languages that are appropriate. We don't say "No, we can't use this specific language at all"; we say "Is that language appropriate for a specific domain? If it is and it makes us more efficient, then we'll do it". So all of it applies.
4. You open sourced Fuzed on GitHub I think.
Yep. That's correct. So Fuzed is an Erlang project which takes incoming requests, figures out where they need to go, and sends them there. As for the way that these backend things are configured, let's say we have multiple indexes at Powerset. We have different indexes, different builds of the index. And we have different versions of software that run on top of that to perform those index searches. When we do testing we need to run tests on, say, two consecutive indexes, to be able to then do relevance testing and figure out: that language feature that we think we fixed, is it actually better now? Did we fix that for real?
We have multiple versions of the software, and with Fuzed we can attach both of those versions to a single thing, so we have homogeneous pools of resources that are the same within a pool. But then inside Fuzed it's a heterogeneous mixture of all of these things. The resources advertise what their versions are and what API they support, and incoming requests that come in as JSON can contain a little snippet of JSON that says "I want to use software x at version y and here is the RPC call to do". So it's just a big routing machine, and at the same time it allows these resources to come in dynamically.
You don't have to change any config file or anything. They connect to the master and say "Here is who I am and here is what I offer". And then you are done. So now they join a pool: Fuzed figures out what pool they belong in, makes sure all the things in the same pool have the same configuration, and they become immediately available. Powerset was kind enough to allow us to open source that, and so it is up on GitHub now. It's Fuzed, and if you do a search for that on GitHub you'll find some repositories. The root one is actually under KirinDave; he is the other guy at Powerset that works with me. So you can find it there.
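Editor's sketch: the routing and dynamic joining described above can be pictured roughly as follows. This is a Ruby illustration only, not Fuzed code (Fuzed itself is written in Erlang). The `Router` class, the JSON field names (`software`, `version`, `method`, `params`) and the round-robin choice within a pool are all assumptions made for the example.

```ruby
require "json"

# Hypothetical in-memory router: pools are keyed by [software, version],
# so every node inside one pool offers the same thing.
class Router
  def initialize
    @pools = Hash.new { |hash, key| hash[key] = [] }
    @cursor = Hash.new(0)
  end

  # A node announces what it offers and joins the matching pool.
  # No config file has to change for this to happen.
  def register(node, software:, version:)
    @pools[[software, version]] << node
  end

  # Read the requested software and version from the JSON request,
  # pick a node from that pool (round robin) and hand it the RPC call.
  def route(raw_json)
    request = JSON.parse(raw_json)
    key = [request.fetch("software"), request.fetch("version")]
    pool = @pools.fetch(key) { raise KeyError, "no pool for #{key.inspect}" }
    node = pool[@cursor[key] % pool.size]
    @cursor[key] += 1
    node.call(request.fetch("method"), request.fetch("params"))
  end
end

# Usage: two versions of the same (made-up) software behind one router.
router = Router.new
router.register(->(name, args) { "v1.0 ran #{name} with #{args.inspect}" },
                software: "search", version: "1.0")
router.register(->(name, args) { "v1.1 ran #{name} with #{args.inspect}" },
                software: "search", version: "1.1")

puts router.route(JSON.generate(
  "software" => "search", "version" => "1.1",
  "method" => "query", "params" => ["when is ruby fringe"]
))
```

Keying the pools on the software and version pair is what keeps each pool homogeneous while the router as a whole stays heterogeneous.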
Something similar? Well, there is really not a whole lot. I mean, one of the primary reasons for creating it was that there wasn't anything that really served the need that we had. You can come up with something that is based on Apache and Nginx, I guess. There is a Ruby project, what was it called? I can't remember what it is called, but it basically allows you to do a dynamic join of Mongrels into a cluster, so it handles having just a bunch of Mongrel servers running on various machines. You can add them dynamically and they will join right up. But it was very specific to that scenario, and it was written in Ruby. It was only for that, and I don't think it was very extensible. What we needed was something different; we needed it for just generic JSON calls. At the same time, when we open sourced it we were able to make it more generic so that it could run Rails as well. It does that by just having Ruby instances that have Rails in them running behind a Rack handler. Those connect up to the master and say "Here is who I am, where I am" and then they become part of it. Oh, it is called Swiftiply. So that is sort of similar, but it is difficult to do this, and there are not a whole lot of systems that are designed to do it.
6. So at Powerset what parts are written in Ruby? Can you talk about that?
Sure. The front end website runs Merb, and that is Ruby. Fuzed is written in Erlang, but the things behind it, the things that it talks to, the resources, a lot of those are written in Ruby. There is one piece that we internally call a power mouse, and if you go to a website that is called Factz, with a z, that's written in Ruby. We have some other things, like a Lexicon server that takes in words and then does lexical expansion on them, to say "Well, a cat, maybe he meant a feline or a variety of cat like a tabby". It will do these kinds of things, and Powerset can take those words and do broader searches on them to try and figure out "You typed in that, but you might have meant this other thing which is basically equivalent". That is written in Ruby right now. A lot of the internal tools are Ruby. Anything that glues two pieces together, let's say we have some C things over here and then some Erlang thing over there, we use Ruby a lot as the go-between. So there is a lot of script writing; we have a lot of scripts that are Ruby based. We use it all over the place. We use it wherever performance isn't super critical and where we need something done expediently, development-wise. That's the sweet spot that we have for Ruby.
7. So you mostly use MRI, do you use JRuby or other projects?
We use only MRI right now; we are on 1.8.5, I believe. We have to be sure that any upgrades aren't going to break anything throughout all the hundreds of servers that we have. So we are on 1.8.5 because of the bugs that were in 1.8.6 at the time when we started. We are looking at upgrading to 1.8.6 right now, I believe, and possibly even 1.8.7, but the whole patch debacle sort of made that a little unfortunate. So we have to be very careful about upgrading, but we are only on MRI; I don't think we use JRuby for anything. But we do really want to use Rubinius, and so some of the employees at Powerset have had a chance to work on Rubinius and contribute back to it during work hours, because we really see that as the future of the Ruby VM. Once we get on that VM and it's fast, with really short startup times and whatnot, that's going to be really awesome. So we are trying to put some resources behind that effort. We would love to use Rubinius, but we're on MRI right now.
Several weeks ago we were acquired by Microsoft and that's kind of a big change for us. We have been going through a lot of meetings with Microsoft; it's specifically the Live Search team that we would be working with. It's actually been really great to meet with those guys, really smart guys over there. I didn't know what to think at first, those guys coming in, but it's really been eye opening to interact with Microsoft employees at that level. And yes, they really kind of blew me away with the knowledge that they have and the vision that they have for their product, so I think it will be pretty cool. I mean, it gives us at Powerset really good access to information and resources. We are going to have as many servers as we want now, right? And that is going to be huge. It's sort of sad anytime you lose that startup feel. You go from a startup to an acquisition, and it's awesome, but you are a little sad because it's not a startup anymore; you become more of a corporate thing. But in the long run it's exactly what Powerset needs to be super successful, and I think that Powerset will be.
Sure. I co-founded GitHub with Chris Wanstrath, and now PJ Hyett, so we are a three-man team right now. GitHub is what we call "social code hosting", which is like the next evolution of putting your code online. For people who are not familiar with Git, GitHub is a place where you put up your Git repositories. Git is very much like Subversion, only a lot more advanced. It has things like first class branching and merging; it's built for that. It has offline commits, so every clone of a Git repository is a full repository with the whole history. So you can clone a repository down to your laptop, get on an airplane, fly across the country, be hacking on the airplane and making commits on the airplane.
You can commit to your repository on the airplane, and when you get back to the ground you can push those commits back up to a place like GitHub and then share them with people. So in essence GitHub is a place where you and I can put up repositories of our code. You can come along, maybe you do a search or I tell you about it, and you come on there and you say "OK, here is all the source". You can clone it down, you can view it online, and we have tools to visualize the branch structure and how people have forked. On GitHub, taking a project and making a copy of it on your account, we call that forking.
It's forking in the good sense of "I like your project enough to fork it and start working on it and contribute back". The whole process is streamlined to make open source contributions, working together, collaborating, as simple as possible. This came out of the need that Chris and I had. We do a lot of open source stuff and it gets pretty onerous after a while to be responsible for merging all of the patches that come in from people. With Subversion you accept patches by email or whatever, you have to try to figure out how to apply them, maybe things have changed since then, and there is no shared history so the merge strategy can only be two-way. There are a lot of problems. And so Git and GitHub try to solve those problems.
10. What is the infrastructure, what is GitHub built on? What does it use on the server side?
GitHub is built on Rails, and it uses Grit, which is the open source library that I work on. We also have another developer now, Scott Chacon, who did a lot of work on a pure Ruby implementation of Git. We've actually contracted him and we have replaced a lot of the shell-out calls, where from Ruby you shell out to the Git binary. We replaced most of those, possibly all of them by now, with just pure Ruby. We go directly to the file system and read the object data from the Git repository directly. By doing that we have been able to get a 2x speedup on the web pages throughout the site. It doubled the speed of the page loads and that's been awesome.
So the Git binaries that you get when you install Git, we are using those less and less, and we are implementing the functionality that we need for the site in pure Ruby, and that's been really helpful to us. I also just finished up an Erlang Git daemon. Whenever you do a public clone of a project on GitHub right now, that runs through an Erlang Git daemon, which is just a server for pushing down the repository as a whole. There is kind of a back and forth; it has its own protocol. I implemented that one in Erlang because we needed to have more flexibility. We have grown to the size where we can't just take all the user directories and put them in a single place, because we end up with a very, very large directory. So what we do is take an MD5 of the username and then split that up into directories.
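Editor's sketch: splitting user directories by an MD5 of the username could look something like the Ruby below. The interview does not say how many levels GitHub uses or how the digest is cut up, so the `depth` of three one-character directories, the root path and the `sharded_repo_path` helper are all assumptions for illustration.

```ruby
require "digest/md5"

# Hypothetical layout: the first `depth` hex characters of the MD5 of the
# username each become one directory level, so no single directory has to
# hold every user.
def sharded_repo_path(root, username, repo, depth: 3)
  digest = Digest::MD5.hexdigest(username)
  shards = digest[0, depth].chars
  File.join(root, *shards, username, "#{repo}.git")
end

# Usage: the same username always maps to the same shard directories.
puts sharded_repo_path("/data/repositories", "alice", "example-project")
```

Because the digest is deterministic, any server that knows the username can compute the path again without a lookup table.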
And the built-in Git daemon that comes with Git isn't capable of doing that. It can only say "Here is a directory, serve all the repositories within it". What we need to do is say "Here is a directory structure, serve all the repositories within that". At the same time we want to do logging. We want to be able to log when someone does an initial clone of a project, so that we can say "Hey, your project has been cloned twenty times, a hundred times, whatever". We want to be able to tell you that as a GitHub user. Then there are the error messages that come back from the Git daemon. If you try to push to a public clone address, the regular Git daemon just severs the connection; it's just gone.
And it says "Connection closed unexpectedly". But with egitd, which is what this project is called, I can insert an error message into that, and it comes back and is shown to the user, and it can say "You can't push to this; try the address that you should be pushing to". So it gives us a few advantages. We also use bj, which is a background job processor by Ara Howard, and it has been really solid for us. There are a lot of tasks: every time you push to your GitHub repository we need to do certain things. We need to run the Git garbage collection, which compacts the objects on the server so that we can save space. We have things called web hooks, meaning if you put in a URL we can ping that every time you push, creating a JSON packet or a YAML packet or whatever you want.
And we can push that to that URL; we just do a POST to that URL with that information, so that you can create things off site that integrate with GitHub however you want, as long as you write a little service. We also have a bunch of built-in services for doing things like taking the commit messages and putting them into Campfire or IRC, or you get them emailed, or you get them on IM. There are all sorts of integrations. There is Lighthouse integration. So there are all these integration points, and for those we need to have a bunch of background jobs that handle all of that asynchronously.
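Editor's sketch: the web hook idea, a POST of a JSON packet to a URL the user supplied, can be shown with Ruby's standard `net/http`. The helper names and the payload fields (`repository`, `commits`, `message`) are hypothetical; the interview does not describe GitHub's actual payload format or delivery code.

```ruby
require "json"
require "net/http"
require "uri"

# Build the POST request for one push event.
# The payload shape is made up for illustration.
def build_hook_request(hook_url, payload)
  uri = URI.parse(hook_url)
  request = Net::HTTP::Post.new(uri.request_uri,
                                "Content-Type" => "application/json")
  request.body = JSON.generate(payload)
  [uri, request]
end

# Send the request. In a setup like the one described, this would be called
# from a background job rather than while handling the push itself.
def deliver_hook(hook_url, payload)
  uri, request = build_hook_request(hook_url, payload)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
end

# Example call (needs a reachable URL, so it is left commented out):
# deliver_hook("http://example.com/hooks/push",
#              "repository" => "example-project",
#              "commits" => [{ "message" => "Fix typo" }])
```

Keeping the request building separate from the sending makes the payload easy to inspect without touching the network.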
11. Why did you choose Erlang as an implementation language for egitd?
I chose it because I am familiar with it and it's a really productive language for writing servers. It's just the way that Erlang is structured, the way that it looks at the world. It's functional, so it is more like mathematical functions, and Erlang makes doing servers really easy because it makes sure that memory is really separated. It makes it really easy to spawn off new small lightweight processes, so every incoming connection that hits egitd gets spawned off into another lightweight process. At the same time Erlang is able to utilize all of the cores on a server. It is SMP enabled, so it's spinning up processes on all of the cores, spread right across them. You don't have to do anything like create real operating system processes, which take more memory, so it can be very lightweight on memory. And it is just the whole paradigm; I'm very in tune with that paradigm now. egitd turns out to be a very small amount of code. So that's why I chose it. It's just so strong for server applications; the way that it wants you to approach them is very in line with that kind of service requirement.
12. Did you write it as an OTP behavior for TCP Servers?
Yeah... well, it actually doesn't use OTP right now. It uses gen_tcp, which I don't think is considered officially part of the OTP stuff. gen_tcp allows you to do all the protocol stuff: you bind to a socket, then you accept a connection, then you can read and write to it, and then you close the connection. That's gen_tcp and that's all I really use right now. But now that I have it working and it's in production, I want to go in and clean that up and turn it into an OTP thing, do a little refactoring. Right now if you kill egitd it will hang on to that socket because it isn't closed cleanly, and to fix that I want to be able to send a message to that OTP process to say "Close your socket". Right now you can't do that; it just runs and spawns off processes, and you can't really get in to where it stores that socket object. So I do want to go in that direction, but right now it's not really leveraging the OTP stuff well. But Fuzed is pretty much all OTP, and OTP is awesome. I mean, that's what makes Erlang.
13. To close up let's talk about a rather divine topic. Your monitoring system God. What is that about?
God is what I call a process monitoring framework. I like to call it a framework because you write configuration files, but they are pure Ruby; a config is just a Ruby file. So in this file you can do really advanced things: you can do looping, you can have variables, you can have abstractions. The way that the config file works, it's a lot of blocks, methods that take blocks and then yield to them. So if you have a bunch of functionality that should be the same across a bunch of configuration files, you can take that and create a method that contains all of that stuff. All you do is pass it a certain block, which it then uses, and it applies all of the code that you have written.
And you can reuse that, whereas in your traditional systems like monit you don't have looping, you don't have variables, it has a very obtuse syntax that you have to learn specially, and it doesn't have extensibility. Another primary reason that I wrote God is to have an easy way to extend the framework. There are a lot of built-in conditions like "Is the process running? Is it using too much memory? Too much CPU? Does its URL respond?" These kinds of things. With God you can write a new condition in Ruby in just a few lines; all you have to do is implement a specific interface.
And when you are done you can just put that into your configuration file if you want. It could be that easy. Or you could refactor it out and include that from elsewhere. But it makes the whole concept of extending the framework so easy, because it's just Ruby and we all know Ruby so why not use that as a configuration file?
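Editor's sketch: to make the "configuration is just Ruby" point concrete, here is a tiny stand-in framework showing methods that yield to blocks, a shared helper, a loop, and a custom condition. None of the names below (`Watch`, `watch`, `server_watch`, `MemoryBelow`, `passes?`, the `my_server` command) are God's real API; they are invented for the example, and the condition interface is an assumption about the general shape only.

```ruby
# A tiny stand-in for a monitoring framework; not God's real API.
class Watch
  attr_accessor :name, :start_command
  attr_reader :conditions

  def initialize
    @conditions = []
  end

  # A condition is any object that responds to `passes?(sample)`.
  def check(condition)
    @conditions << condition
  end

  # The monitoring strategy is simply the chosen conditions taken together.
  def healthy?(sample)
    @conditions.all? { |condition| condition.passes?(sample) }
  end
end

# A custom condition: a few lines of Ruby implementing the interface.
class MemoryBelow
  def initialize(limit_mb)
    @limit_mb = limit_mb
  end

  def passes?(sample)
    sample.fetch(:memory_mb) < @limit_mb
  end
end

WATCHES = []

# The config entry point: build a watch and yield it to the caller's block.
def watch
  w = Watch.new
  yield w
  WATCHES << w
  w
end

# Shared settings factored into a method that itself takes a block.
def server_watch(port)
  watch do |w|
    w.name = "server-#{port}"
    w.start_command = "my_server --port #{port}"
    w.check MemoryBelow.new(150)
    yield w if block_given?
  end
end

# Because the config is plain Ruby, looping and variables come for free.
[8000, 8001, 8002].each { |port| server_watch(port) }

puts WATCHES.map(&:name).join(", ")
```

The same helper can be reused from any other config file, and a caller can pass a block to add extra conditions for one particular process.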
I don't know if I would really draw a parallel there. I suppose, I mean, the Erlang supervisor stuff is pretty specific. You can say "Make sure one of these is up", or "Make sure n of these are up", or "If one of these goes down then kill the rest of them", things like this. Those are sort of built-in strategies. Whereas God is really more flexible. All it says is "Here is a bunch of conditions. Choose whichever ones you want and then composite them together in order to create your monitoring strategy".