Scott Chacon on Git and GitHub

   

1. Scott, who are you?

My name is Scott Chacon. I work at GitHub.com, which is a Git hosting platform. It’s a social Git host and I’ve been working there for about 2 years now. Basically my whole life revolves around Git and GitHub: speaking about it, training, telling people about it and trying to get them to use it. That’s Scott in a nutshell.

   

2. What’s up at GitHub.com? What’s new?

There are a number of things that we’re working on. We have GitHub:FI, which is installable inside your firewall. For large companies that don’t want to host their code on GitHub we’ve been working on that. We’ve been hiring some people to help us get that to be a better product. Basically one of my big goals for the last couple of months (and probably it will be through the next year) has been making GitHub work with other version control systems. Git is very nice as a tool because it has a very simple storage mechanism and so it’s very generally applicable.

It’s relatively easy to write a server that speaks different protocols, such as the Mercurial or Subversion protocol. The goal of GitHub is not necessarily for people to use Git; it’s for people to code socially. If people want to use Subversion or Mercurial, I’m sorry, but I feel for them. Mercurial is actually a very nice tool as well. I feel mostly for the Subversion people. We do some Ruby meetups and conferences and GitHub is taking off in the Ruby community, but we want all the communities to be able to work together because that’s the strength. The real strength of GitHub is having people find your projects and be able to contribute to them very easily.

If your preferred tool is something else and you are a smart developer, then we want you to be able to participate in everybody else’s project as well. I did a project called Hg-Git, which is a Git-Mercurial translation layer. It’s a plug-in for the Mercurial client that lets you push to a Git repository, and that was sort of the first phase to help Mercurial people participate in the GitHub community. Right now I’ve been working on a Subversion DAV protocol bridge.

You can run svn checkout against svn.github.com, followed by whatever the project is. It has the same namespace as any other project - it’s just svn.github.com instead. That would do a checkout via SVN DAV with a Subversion client, and you can work on stuff and submit a patch and things like that. Eventually I’d like to have a Mercurial translation layer in the server as well, so you can hg clone a GitHub repository.

On the server, everything is still kept in Git, so we can manage that and we can do the translation relatively easily, but we want as many people as possible to be able to collaborate together, no matter what tool they happen to like the best. I think it’s a good thing for keeping GitHub relevant for the indeterminate future. If something newer and cooler and hotter than Git comes around in 10 years or so, we want to be able to support that relatively easily and eventually bring its early adopters into the rest of the community.

   

3. You mentioned GitHub:FI. How similar is that to the server-based GitHub? Do you use the same infrastructure or do you customize?

It’s basically identical. GitHub is generally a Ruby on Rails application. We also have some Erlang parts, and we use MySQL, Redis and Memcache as well. Those are the main parts of the system. GitHub:FI is very nice because it installs everything for you. For one, we run it on JRuby, so it’s an installable Java app: we package up basically the whole GitHub source code that runs the site and run it in JRuby on the JVM.

We package Memcache and MySQL and everything and run them on custom ports, so you get one binary. You can install it internally on whatever server you want and run GitHub, and it’s almost identical to GitHub.com. For any company that can, we prefer them to use GitHub.com because it’s easier for us to update and they get updates faster. With the FI they do get updates, but we have to make sure that they work properly with the installer and all that stuff.

It takes a little bit longer, but we also want people to be part of the community. If they have something that they want to open source, we would rather they just be able to make it a public project, or push to the same place that they use for work, rather than having 2 separate instances of it. But if they’re a government organization that can’t put private code on another server, or a large company that has policies against that, then we want them to be able to use a tool that makes Git easy to use.

   

4. With GitHub:FI you install things like Erlang and MySQL? Did you consider using virtualization for that, just shipping a virtual image?

Yes, we did, but we wanted it to be able to run on existing hardware. We want them to be able to reuse a server if they already have one that they use for something. A VM requires a lot of infrastructure that they would have to set up and run. If they are running Xen and we give them a VMware image, or vice versa, then they have to set up a whole other system for that and it might be a pain.

We’ve been talking about maybe providing a preinstalled server. But for GitHub:FI there is a licensing cost and a per-seat license, so it’s a little bit more expensive, and the organizations that buy it, the ones that just cannot put code on GitHub.com, tend to be large enough that they have corporate policies. They’ll want more control than a VM would generally allow, so we are just going the installer route.

   

5. You mentioned using Erlang. What do you use Erlang for at the moment?

Come to think of it, I don’t think that we installed Erlang as part of the FI because it’s all on one machine. What we use Erlang for is on the backend. We have a federated data store, so all of the Git repositories are on several different file systems and Erlang acts as the binary transfer, the RPC mechanism between the front ends that run Rails and the back ends that actually have the Git repositories on them.

When the Rails process wants some data, some log listing or tree listing or some specific content or something, Grit is the library that we use. We have split the front end code that calls the methods from the back end code that does the fork/exec calls, and bridged the two with an RPC mechanism. That RPC mechanism is based on the Erlang binary term format, and Tom Preston-Werner wrote an extension of that so we can do more than just the binary Erlang primitives. We can do all sorts of fun stuff, arrays and things like that. That’s just the RPC mechanism.

It sends an RPC call to the back end, does routing to figure out what back end that particular project is on, and passes the packet over. The back end actually runs the command and sends the data back over, and most of that is done in Erlang and with the BERT stuff. There is a Ruby component to that as well, so that Erlang can call Ruby methods on the back end and then translate the result and send it back over. But that whole mechanism of the RPC is all done in Erlang and it’s a great tool for the job. It actually works really well.
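To make the routing step concrete, here is a minimal Python sketch of a front end that looks up which back end holds a repository and forwards a call to it. Everything in it is hypothetical: the `Backend` and `Router` classes, the repository data and the `repo`/`log` names are invented for illustration, and plain Python tuples stand in for the BERT-encoded packets. The real system described above does this in Erlang.

```python
class Backend:
    """Hypothetical stand-in for a file server holding some repositories."""

    def __init__(self, name, repos):
        self.name = name
        self.repos = repos  # repo name -> {ref: [commit messages]}
        # Functions this back end exposes, keyed by (module, function).
        self.functions = {("repo", "log"): self.repo_log}

    def repo_log(self, repo, ref):
        # In the real system this is where Git would actually be run.
        return list(self.repos[repo][ref])

    def handle(self, request):
        # Requests are (tag, module, function, args) tuples.
        tag, module, function, args = request
        handler = self.functions.get((module, function))
        if tag != "call" or handler is None:
            return ("error", f"cannot handle {tag} {module}.{function}")
        return ("reply", handler(*args))


class Router:
    """Maps each repository to the back end that stores it."""

    def __init__(self, backends):
        self.location = {repo: b for b in backends for repo in b.repos}

    def call(self, repo, module, function, *args):
        backend = self.location[repo]  # the routing step
        return backend.handle(("call", module, function, [repo, *args]))


fs1 = Backend("fs1", {"schacon/showoff": {"master": ["add slides", "initial commit"]}})
fs2 = Backend("fs2", {"progit/progit": {"master": ["fix typo"]}})
router = Router([fs1, fs2])
result = router.call("schacon/showoff", "repo", "log", "master")
```

The front end never needs to know where a repository lives; only the router's lookup table does, which is what lets repositories be spread over several file systems.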

   

6. You mentioned using Redis. How do you use that?

We use Redis for exception handling and for our queue. We tried a lot of Ruby-based queuing mechanisms. Chris wrote an abstraction over the queuing mechanism. We used to use BJ and DJ, and in the super early days we tried out Amazon SQS and a lot of other queuing mechanisms, and they all fell over at one point or another with the amount of traffic that we were putting through them and the types of queries that we were trying to run against them. Eventually we moved to a Redis-based one that Chris also wrote, called Resque.

That’s open source, you can get it on GitHub, and a couple of other companies use it. It’s Redis-backed: we use Redis lists to queue up jobs and to pull the jobs back out, and it’s been really solid. If you are using DJ or something and it’s not working quite well for you, then you might want to check out Resque.
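The list-based queuing idea can be sketched in a few lines of Python. This is not Resque's code or its wire format: the `FakeRedis` class is an in-memory stand-in that mimics only the Redis `RPUSH` and `LPOP` list commands, and the key names, payload layout and `archive` job are assumptions made up for the example.

```python
import json
from collections import deque


class FakeRedis:
    """In-memory stand-in for the two Redis list commands used below."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, value):
        # Append to the tail of the list, like RPUSH; returns the new length.
        self.lists.setdefault(key, deque()).append(value)
        return len(self.lists[key])

    def lpop(self, key):
        # Remove and return the head of the list, like LPOP; None if empty.
        items = self.lists.get(key)
        return items.popleft() if items else None


def enqueue(redis, queue, job_name, *args):
    # A job is just a small JSON document pushed onto a list.
    payload = json.dumps({"job": job_name, "args": list(args)})
    redis.rpush(f"queue:{queue}", payload)


def work_one(redis, queue, handlers):
    # Pop the oldest job, if any, and run the matching handler.
    raw = redis.lpop(f"queue:{queue}")
    if raw is None:
        return None
    job = json.loads(raw)
    return handlers[job["job"]](*job["args"])


redis = FakeRedis()
handlers = {"archive": lambda repo, ref: f"archived {repo}@{ref}"}
enqueue(redis, "default", "archive", "schacon/showoff", "master")
enqueue(redis, "default", "archive", "progit/progit", "master")
first = work_one(redis, "default", handlers)
```

Pushing on one end and popping from the other gives first-in, first-out ordering, and because the jobs live in Redis rather than in the worker process, any number of workers can pull from the same list.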

   

7. You recently moved servers and you are now running on dedicated machines. What was the reason for that? What was the reason for moving from a virtualized service to dedicated machines?

The virtualized servers were at a hosting provider that only provided virtualized servers, and there were a couple of different problems that we had. We actually solved 2 unrelated problems simultaneously. One was that you get a lot more power out of bare metal than you do out of virtualized servers, and it made a huge difference, especially in the file system, because on a virtualized disk the IO throughput is generally not very high compared to raw disk access. That made a huge difference for our back end, because Git is incredibly IO intensive. We run unpatched Git on the server and we want to continue to do that, so we couldn’t really optimize Git itself for our specific needs as a Git server. Having it run on bare metal on real disks makes a huge difference in the IO throughput that we can get. That was probably the biggest thing.

The other was that at the other hosting provider we were actually using GFS, a global file system, which was just not working with the amount of IO that we were putting through it and the number of shared systems that we had on it. Federating it out with the Erlang-based BERT system and putting real, unvirtualized hardware behind it gave us a huge performance increase. The impetus was that our file systems were falling over and we needed something else, and we decided to go to bare metal instead.

   

8. You’ve been doing Git for a few years, are you getting tired of it yet? What’s so nice about Git?

I’m not getting tired of it. The nice thing about Git is that it’s very general. There is a very general storage mechanism behind it. That’s actually how I learnt Git. At my previous job we were using it as a distributed content distribution system, and we were actually using Perforce for source control. I learnt it as a distributed snapshot system, where we would create custom snapshots of content that we needed on a whole bunch of clients and then distribute them via the Git mechanism, which is really good at that: sending just deltas, only sending incremental updates and doing it very efficiently. That’s how I learnt it. That’s how I continue to think about it.

There is a lot of stuff you can do with Git that is fun to play with; it’s not just version control. I never really get tired of it. It’s very similar to a POSIX-based file system model. Anything fun that you can think of doing with just normal files you can pretty much do with Git, and you get distribution and versioning and a bunch of other stuff. There are a lot of fun projects that you can come up with for it.

The other nice thing is that at its heart Git is a key-value store. You can re-implement the Git back end in any key-value store. You can re-implement the Git back end in Redis. I did a fairly comprehensive one in Cassandra. That’s always fun to play around with. It is general enough that you can do stuff like the Subversion front end, whereas making a Subversion or Git front end for a Mercurial back end would, I think, be a lot harder to do, because Mercurial is a file-based delta system and it’s just not as generally abstract as Git is. There is a lot of fun stuff you can do with it.

   

9. You mentioned using Git as a storage format. Do you know any other projects that use that?

There are a lot of backup-based systems where people use it like Apple’s Time Machine, because the way it thinks about data is very similar: Time Machine uses hard links for files and directories that are identical in content. I don’t know how to pronounce it, but there is a project called bup that actually does virtual machine backups relatively efficiently. There are a couple of backup systems based on Git that are interesting. They are actually really cool.

I did a system called TicGit that keeps your ticketing data alongside your code. You can have multiple branches in your project that are unrelated: they don’t have a common base, so you can have branches that have nothing to do with each other. You can just have a ticket branch that has ticketing information alongside your project, but not in a subdirectory of it, which is nice because it doesn’t muddy up your actual project. You could do that with lots of stuff. You can have a documentation branch that is just documentation and isn’t a doc subdirectory, so the commits that add new documentation don’t mess with the history of the project.

In GitHub, one of the interesting things is that if you create a specially named branch, gh-pages, in your Git repository, push it to us and have static HTML in there, we’ll host that as a website for you, as documentation or a website for your project. A lot of people use that on GitHub for their project homepage: they put whatever static HTML they want in a separate branch that has nothing to do with the main project history. It’s pretty cool. There is a lot of neat stuff that you can do with Git because of its generality.

   

10. What’s the tool support for Git nowadays? Is Git still mainly the C-based binary or are there other implementations?

It’s largely the C-based binary. There are 2 different categories. One is GUIs on top of the C-based binary: tools like Git Extensions or TortoiseGit for Windows, or GitX, for example, will do fork/exec calls to the binary and interpret the data. The nice thing about Git is that there are a lot of plumbing commands. You can run much more low-level commands than just "add" and "commit" and things like that. The front ends will tend to call those, and they can do a lot more low-level, very specific stuff with them that is machine readable.

The other category is that there is a fairly complete Java implementation called JGit that Shawn Pearce from Google wrote. The JGit project is incorporated into a lot of the Java-based IDEs, so it works with anything that’s running Java. There is SmartGit, a really good cross-platform Git client that is written in Java, so you can run it anywhere and it works the same and it’s very complete. It’s actually a very nice project.

Then there is IDE integration. Eclipse.org is starting to move towards Git as the main repository format for all of their projects, so they, and Shawn as well, are concentrating on the EGit project, which is the Eclipse plug-in, and that’s getting really solid now. Any of the Java-based IDEs are using either the EGit tools or the JGit implementation to write their own plug-ins and stuff.

   

11. What’s Git on Windows like today?

I’ll find somebody that uses Windows, I’ll ask them and then I’ll tell you. There are a lot of people who use Git on Windows. I use Git on Windows if I do a corporate presentation or something where only Windows machines are available. It’s doable and it works fine for all the stuff that is in the presentation or in the tutorial, which is a lot of the stuff. It does work; it’s a little bit slower in some places because Git takes advantage of a lot of weird POSIX stuff to make things run fast. You have to have a POSIX emulation layer on Windows to make it run, but it’s certainly doable. There are a lot of people that use it, especially now with the Eclipse integration and the JGit project providing more Java-based stuff, and that runs fast as well. It sucked a year or 2 ago, it’s pretty good now, and I think it’s going to be great in the next year.

   

12. I think you have a book out about Git. What’s it called?

I do. It’s called Pro Git and it’s published by Apress.

   

13. Should everybody read it?

Everybody should read it. The nice thing about it was that the fine people at Apress allowed me to use a Creative Commons license. The book itself is Creative Commons licensed. You can read it online: progit.org is the website, all of the content is available there, and there is a blog where I write updates to it. The book itself is in Markdown. You can go to GitHub.com/progit and download all the examples in one repository and all of the Markdown for the book in another repository.

There are really cool things about it. One is that the GitHub community is great. They’ve really embraced it as being an open book, and so there are a ton of translations of it. There are full translations, I believe, in Chinese, Japanese, German and Dutch. There are people working on Arabic and Spanish and a ton of other languages as well, and they are doing a really incredible job. I incorporate those and publish them.

If you go to progit.org you can read it in any of these languages. You go down to the bottom and there are links to all of them. That’s updated regularly as new translations come in. That’s one of the cool things: it’s all Creative Commons licensed and it’s all online to read for free. You can download it and create a PDF or something to put on your Kindle. It’s been a really nice project, and the blog is for new features, stuff that I didn’t cover in the book.

It was published in August, so it’s been 7-8 months. It’s a sort of beginner to intermediate book. It gets into the advanced stuff, but not the really low-level stuff that I thought might be confusing. I’ve been doing blog posts on that. If you want to learn stuff that comes out in the newer versions of Git, or stuff that’s a little more obtuse or esoteric, then there is a series of blog posts on that as well.

The reason why I wrote it is to supplement the training that I do, because if you want corporate training you can contact GitHub and I can come out and do corporate training. One of the things that I didn’t like about it is that I only had a day or 2 to do that. I can cover a lot of stuff, but it’s kind of hard to make everybody a Git master. What I like to do is cover a lot of stuff so that people know what’s possible to do and then link to a chapter in the book.

It’s free and online so they can read it in more depth. That was sort of why I really wanted to write the book, so that for a one-hour talk or an 8-hour training I could say "Here is all the stuff, but if you want to learn it in depth you can go to this website and read all about it." It’s been really nice to be able to do that and not feel like I’m just abandoning everybody.

   

14. You mentioned Markdown. You have another project that uses Markdown for presentations. What is that?

It’s called ShowOff and it’s on GitHub.com: schacon is my username and ShowOff is the name of the project. It’s really stupid simple; it’s just a Markdown-based slideshow in a Keynote type of presentation style. I give a fair number of talks and I have a lot of different presentations. There are a bunch of different things in Keynote that are sort of frustrating for a programmer. I want to be able to extend it, and I want to be able to show a lot of typing on the command line along with its output, or show highlighted source code.

All that stuff in Keynote is either impossible or difficult to do. I wasted a lot of time trying to do that and keeping it up to date. What this is, is just Markdown. I write out the slides in Markdown. You can put a whole bunch of slides in one Markdown file and you can have a different subdirectory for each section, which makes it easy to move sections from project to project. It’s all text-based; you can do everything in Git, which I like.

You can fork a presentation and change it a little bit and do your own presentation based off of that, or steal, or rather borrow, sections from somebody else’s presentation, if it’s a really good section on something that you want to teach about. It’s easy to upload the whole presentation to Heroku; it takes 5 minutes to do that and you have it online. All that stuff just saves me a ton of time.

I can write a really good long presentation and I don’t have to deal with any of the stuff I don’t need to. It’s really stupid simple. There is no interface on it; you just edit it in TextMate or vi or whatever, and then it’s a Sinatra server that renders the Markdown, and it’s all JS-based.

   

15. I guess I can burn my copy of PowerPoint.

I would burn your copy of PowerPoint even before ShowOff and then remove Keynote, because it’s PowerPoint, then Keynote, and then ShowOff, which is what I like. I mean there are other HTML-based, JavaScript-based ones that are nice as well. There are Slidy and S5 and things of that nature, but I like it just because it’s all text files, it’s all one project. You just run showoff serve in a ShowOff directory, go to the port in your browser, full-screen it and that’s it.

I actually had a presentation the other day where I had something wrong in one of the slides, and I was able to switch over to my text editor, find it, delete it and reload the slide live just to make it look right. That would be difficult to do in Keynote, especially if it’s syntax-highlighted code or you have an image in there or something like that. That’s why I like it a lot.

   

17. It’s under your account, I guess.

Yes. schacon/ShowOff. I have a lot of Git-based projects and presentation type stuff in there. In the README I tried to incorporate examples from a lot of different people that use it. I’ve actually been pleasantly surprised at the developer uptake. Developers like to tinker, work with text files and have stuff be as simple as possible. It’s very developer-oriented. It does command line stuff well and it does syntax-highlighted source code well, and that’s stuff that developers want and not necessarily business people or something.

But it also does PowerPoints and all the normal transitions and stuff like that. I have a list of a whole bunch of the projects that people have done. The other really nice thing, the reason I like it, is that it loads any JavaScript and CSS files that you have in the project. Since most of it is JavaScript-based you can understand it very easily. Instead of having some plug-in system or something, you just put anything that you want to do in JS in a .js file and it gets loaded and run. I don’t need to incorporate it into ShowOff. You can just use it in your own project.

There are a lot of little plug-ins and stuff that add functionality that is not necessarily generally applicable. A good example is a presentation format which is 20 slides with 20 seconds each, and I forget what it’s called [Editor's note: Pecha Kucha]. Somebody wrote a JS file that just does that. It adds an extra key binding and a button, and it automatically does the slide transitions for you.

You can time something like that in Keynote, but it’s really a pain. If the JS file exists and you know of a presentation that does it, you just copy it and you have it. It’s really nice, I like it a lot.

Sep 06, 2010
