Adobe recently released to the community the Puppet recipes it uses to automate its Hadoop/HBase deployments. InfoQ spoke with Luke Kanies, founder of Puppet Labs, to learn more about what this means.
Puppet is an open source tool for data center automation that was already featured on InfoQ back in February 2010. Puppet is used in small and medium-sized companies as well as in the infrastructure of big players like Google, Digg, and Sun/Oracle.
Hadoop is an open source project of the Apache Software Foundation. It is written in Java and provides a scalable, distributed framework for dealing with large amounts of data. Inspired by Google's MapReduce paradigm, it is used by companies that have to deal with petabytes of data, such as Facebook and Twitter.
The Puppet recipes Adobe released provide a way to automate Hadoop/HBase deployments. Luke Kanies, founder and lead of Puppet Labs, the company behind Puppet, highlighted three main points about Adobe's release (a rough sketch of what such a recipe can look like follows the list):
- Large companies like Adobe are using Puppet to manage critical infrastructure
- They're managing both their traditional and cutting-edge infrastructure with Puppet
- They see real value in sharing and collaborating on the solutions they create with Puppet
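Adobe's recipes cover the full Hadoop/HBase stack, but the basic shape of such a recipe is Puppet's familiar package/configuration/service pattern. The following is a rough, hypothetical sketch only; the module layout, package, and service names are assumptions, not Adobe's actual code:

```puppet
# Hypothetical sketch of a Hadoop recipe -- not Adobe's actual module.
# Assumes a 'hadoop' package in the node's repositories and a
# core-site.xml template shipped with the module.
class hadoop::datanode {
  package { 'hadoop':
    ensure => installed,
  }

  # Cluster configuration is rendered from a template so the same
  # recipe works across environments.
  file { '/etc/hadoop/core-site.xml':
    ensure  => file,
    content => template('hadoop/core-site.xml.erb'),
    require => Package['hadoop'],
  }

  # Restart the DataNode whenever its configuration changes.
  service { 'hadoop-datanode':
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/hadoop/core-site.xml'],
  }
}
```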
Public statements by big companies that they use open source matter both to those working on open source projects and to those considering integrating open source components into their own infrastructure. When asked about his view of open source in enterprise companies, Luke said:
In my experience, the vast majority of enterprises are very comfortable with open source. Of course, my experience is naturally biased toward those companies who are comfortable paying for services and support around open source, and especially those who are replacing non-functional proprietary software with our open source software.
I know 5-10 years ago a lot of the companies using Puppet would have been very hesitant to rely so heavily on open source, but so many people do now that the market seems to have really changed.
Business-backed open source projects are good for developers as well as for customers. Asked how Puppet evolved, Luke answered:
One of the rare things about Puppet is that the project and company were launched at the same time, because of my experience consulting. I knew that if the project wasn't good enough to pay my bills, it wouldn't suffice, and that if my ability to survive didn't depend on the project's quality, then I could get away with it not being good enough. Given that Puppet is a sysadmin-focused project, it doesn't have as much contribution as many developer-focused projects, so sponsorship by Puppet Labs is that much more important.
People using Hadoop usually have to deal with large amounts of data, but Hadoop has also found its way into universities for education. Asked whether the Puppet recipes for Hadoop could be interesting for small and medium-sized companies as well, Luke stated:
I think the Puppet module could make Hadoop approachable for these smaller organizations. Without strong automation, the cost of deploying and managing Hadoop can be high enough that it's hard to justify, but with simple deployment and management, the overall project cost is low enough that the rewards don't have to be as big.
Automation is well established in big companies, but small companies in particular fear the effort it takes to get started with configuration management, automation, and deployment. Luke provided tips on how to get started with Puppet:
In starting with Puppet, my best recommendation is to start by automating the things that cause pain - the things that cause you to get paged at night, that result in many trouble tickets, that consume all of your time with little reward. They're usually actually not that complicated, and they free up your time to come up with a larger, long-term plan.
The vast majority of Puppet users started with a very small roll-out - managing very little on a small subset of machines. For instance, we generally sign support contracts with individual departments or divisions within a company, and only as the Puppet deployment spreads from group to group or server pool to server pool does the support contract get expanded.
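A pain-point manifest in the spirit of Luke's advice can be very small. A hypothetical first manifest might keep NTP installed, configured, and running; the package and service names here assume a Debian-like system:

```puppet
# Hypothetical "first manifest": stop clock-drift trouble tickets by
# keeping NTP installed, configured, and running. Names assume a
# Debian-like system; adjust for your platform.
class ntp {
  package { 'ntp':
    ensure => installed,
  }

  # Serve a known-good config from the module's files directory.
  file { '/etc/ntp.conf':
    ensure  => file,
    source  => 'puppet:///modules/ntp/ntp.conf',
    require => Package['ntp'],
  }

  # Restart the daemon whenever the config file changes.
  service { 'ntp':
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/ntp.conf'],
  }
}
```

On deploying Hadoop specifically, Luke continued: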
As to deploying Hadoop with Puppet, obviously the first thing you need is a problem that you need Hadoop to solve. But I think it's completely reasonable to build a Puppet deployment that automates Hadoop and nothing else - if you know you want Hadoop but you're not using Puppet yet, there's nothing that says that Puppet needs to manage anything but that Hadoop deployment. I've seen multiple cases of companies building a purpose-specific Puppet infrastructure. Eventually that success usually causes growth of Puppet usage, but only once the initial problem is solved.
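A purpose-specific deployment of the kind Luke describes can amount to little more than a site manifest that classifies the Hadoop pool and leaves everything else alone. A hypothetical sketch, reusing the datanode class from the earlier example (node and class names are assumptions):

```puppet
# Hypothetical site.pp for a Puppet setup that manages only Hadoop.
# Nodes outside these definitions are left untouched.
node 'namenode.example.com' {
  include hadoop::namenode
}

# All worker nodes share the datanode recipe sketched above.
node /^worker\d+\.example\.com$/ {
  include hadoop::datanode
}
```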
InfoQ readers interested in Puppet should visit the Puppet Module Repository to get an idea of how Puppet users solve their problems. The Google Group at http://groups.google.com/group/puppet-users is a good starting point for asking questions and having discussions with other Puppet users. Several videos and slide decks from PuppetCamp Europe were also recently released, for those who wish to explore Puppet in more depth.