Today, Big Data and Hadoop are taking the computer industry by storm. Hadoop is on the minds of everyone, from CEOs to CIOs to developers. According to Wikipedia:
“Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.[1] It enables applications to work with thousands of computational independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The entire Apache Hadoop “platform” is now commonly considered to consist of the Hadoop kernel, MapReduce and HDFS, as well as a number of related projects – including Apache Hive, Apache Hbase, and others.”
Unfortunately, this definition does not really explain either what Hadoop is or what its role is in the enterprise.
In this virtual panel, InfoQ talks to several Hadoop vendors and users about their views on the current and future state of Hadoop and the things that are most important for Hadoop’s further adoption and success.
The participants:
- Omer Trajman, VP Technology Solutions, Cloudera
- Jim Walker - Director of Product Marketing, Hortonworks
- Ted Dunning, Chief Application Architect, MapR
- Michael Segel, founder of Chicago Hadoop User Group
The questions:
- Can you define what Hadoop is? As architects we are trained to think in terms of servers, databases, etc. Where in your mind does Hadoop belong?
- Despite the fact that people are talking about Apache Hadoop, they rarely download it directly from the Apache web site. The majority of today’s installations use one of the distributions, whether it is Cloudera, Hortonworks, MapR, Amazon, etc. Why do you think this is happening, and what are the differences between them? (For vendors, please try to be objective; we know that yours is the best.)
- Where do you see the prevalent Hadoop usage today? Where do you think it will be tomorrow?
- With the exception of Flume, Scribe and Sqoop, there is very little integration of Hadoop with the rest of enterprise computing. Do you see Hadoop starting to play a bigger role in the enterprise IT infrastructure?
- Today Hadoop has implemented most of the Google projects, with the notable exception of Percolator. Do you think such a project is (or should be) on Apache’s radar? Do you see any other direction for real-time Hadoop?
- Many people are trying to expand their skills to use Hadoop. There are also a lot of people looking for people who know Hadoop. Still, it is not clear how one acquires Hadoop skills. Read the book? Take training? Get certified?
Q1: Can you define what Hadoop is? As architects we are trained to think in terms of servers, databases, etc. Where in your mind does Hadoop belong?
Omer Trajman: Hadoop is a new data management system that brings together the traditional world of unstructured or non-relational data storage with the processing power of compute grids. While it borrows heavily from the design patterns of MPP databases, Hadoop differentiates itself in a few critical areas. First, it is designed for low cost-per-byte economics. Hadoop runs on just about any hardware and is extremely forgiving of heterogeneous configurations and sporadic failures. Second, Hadoop is incredibly scalable. In its first version Hadoop was able to scale to several thousand nodes, and in the current version tests are ongoing to reach over ten thousand nodes. Using modern eight-core processors in two-socket servers, that’s a compute capacity of 80,000 cores. Third, Hadoop is extremely flexible with regard to the types of data that can be stored and processed. Hadoop can accept any kind of data in any format and has a rich set of APIs for reading and writing data in any format.
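As a rough illustration of that flexibility, below is a minimal sketch of writing and reading raw bytes through Hadoop's FileSystem API, which treats the data as opaque. The cluster address and path are hypothetical, and the example assumes the standard Hadoop client libraries are on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/sample.txt");

        // Write arbitrary bytes -- Hadoop imposes no schema on the data.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("any format, any data\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same bytes back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}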
Jim Walker: Hadoop is highly scalable open source data management software that allows you to easily capture, process and exchange any data. Hadoop touches nearly every layer of the traditional enterprise data stack and therefore will occupy a central position within a datacenter. It will exchange data and provide data services to both systems and users.
Technically, Hadoop is all these things, but it is also transformational to business because it brings the power of supercomputing to the masses. It is open source software that has been created by the community to bring massively parallel processing and scale-out storage to commodity hardware. It does not replace existing systems; it forces a specialization of existing tools and has its place in the modern data architecture alongside them.

Ted Dunning: Defining Hadoop precisely is probably not possible, at least not if you want everyone to agree with you. That said, you can come pretty close if you allow two definitions:
- the eponymous Apache project that has released a map-reduce implementation and a distributed file system.
- the set of projects, Apache and otherwise, that use or are somehow related to Apache Hadoop.
The first definition is also often used metonymically to refer to the software released by the Apache Hadoop project, but exactly how close any piece of software must be to a released artifact before it can or should be referred to as Hadoop or Hadoop derived is the subject of considerable dispute.
For myself, I prefer a less commonly used definition: Hadoop as primarily the community of people who use or develop Hadoop-related software, and only secondarily the code or the project. To me, the community is much more important than any particular code.
Michael Segel: I view Hadoop as a framework and a collection of tools for doing distributed/parallel processing. You have distributed storage in HDFS, a distributed compute model in the JobTracker and TaskTrackers, and a distributed persistent object store in HBase. In terms of positioning Hadoop, I think it depends on the specific solution.
It’s difficult to classify Hadoop into a single category. Some view Hadoop as a way to do intermediate processing that can’t easily be performed in a traditional RDBMS, and use Hadoop as an intermediate step while the end analytics are performed using their existing BI tools. Others are using Hadoop to provide data in subjective ‘real time’. Add to this the integration with Lucene/Solr, and you have Hadoop/HBase as part of a real-time search engine. The point is that Hadoop is being used to solve different types of problems for different organizations, even within the same enterprise. It’s probably one of Hadoop’s biggest strengths that it is a basic framework that can be adapted to solve different types of problems. As more people work with Hadoop and push it to its limits, we will see a wider variety of solutions.
Q2: Despite the fact that people are talking about Apache Hadoop, they rarely download it directly from the Apache web site. The majority of today’s installations use one of the distributions, whether it is Cloudera, Hortonworks, MapR, Amazon, etc. Why do you think this is happening, and what are the differences between them? (For vendors, please try to be objective; we know that yours is the best.)
Omer Trajman: The need for a distribution comes from two primary sources. At Cloudera we first encountered this requirement four years ago when we started engaging with customers. Every customer was running a different base version with a slightly different patch set and different client library versions. It became impossible to provide architectural guidance and support and to resolve code issues once we hit a dozen customers. The second reason for a distribution is that partners need to be able to test against the code base that customers are running. If a partner certifies on CDH4 and a customer is running CDH4, they know that their software will work with no hiccups. Cloudera first created CDH in 2009 in order to create a standardized baseline for anyone (not just our customers). If a customer is running CDH 4.2, everyone knows exactly what is in that code base and can test against it. Since it’s fully open source and Apache licensed, anyone can still make changes to their installation if required, and they can get support from anyone that provides it.
Jim Walker: We think the trend towards downloads of Hadoop-based platforms is a more recent phenomenon. As more organizations realize that they can benefit from Hadoop, they are looking for an easy-to-use and easy-to-consume experience when they first start using the technology. After they get going, they are interested in tools that make Hadoop easy to manage and monitor as they run it in production. There are many similarities between the distributions that make them easier to use and operate than open source Apache Hadoop. One of the principal differences between the distributions is that most, but not all, include proprietary software components. This does lock users into a particular distribution and doesn’t allow them to take full advantage of the open source community process. Hadoop and its related projects each have their own release cycles and version structure. A distribution provides significant value to the consumer, as it packages a known set of releases across all related Hadoop projects, so that an implementer isn’t required to tie it all together, test it and maintain a complex network of solutions.
Ted Dunning: The fact that people download packaged distributions is simply a matter of saving time and the intellectual capital required to download all the necessary pieces and ensure that they are all compatibly installed. It is common for people to focus in on one particular capability or Hadoop community project and download that directly from Apache in source form, while using a standardized distribution for the rest. This allows them to focus their efforts on what they find most important.
The fact is, building and testing a full Hadoop distribution is a major undertaking. Using a standard distribution makes that process vastly simpler. In addition, all of the major distributions have proprietary additions to Hadoop intended to make things work better. Hortonworks has the VMware-based software that they use to try to make the namenode more robust, Cloudera has proprietary management interfaces, and MapR has a proprietary file system, table storage and management capabilities. Users obviously find value in these additions.
Michael Segel: First, I think it’s important to point out that there are a lot of people who do download Hadoop directly from the Apache web site. There are many who want to investigate the latest available release, as well as those who are new to Apache. On the mailing lists, people are still asking compatibility questions like “I have X version of HBase, which version of Hadoop do I need…” Were they using a vendor’s distribution, they wouldn’t have this question.
Enterprises tend to choose vendors because they need vendor support. They want to be able to drop in a Hadoop cluster and focus on solving the problem at hand. Vendors provide free and supported versions as well as tools to help simplify the downloading of their release. (They also have their websites chock full of useful tips.)
When talking about vendors, I have to preface this by saying that all vendor distributions are in fact derivatives of Apache Hadoop. While both Cloudera and Hortonworks have pledged to be 100% Apache, I still consider them to be derivatives. Vendors have to determine which patches to apply and/or back-port to their release. This creates minor differences, even though all of the code is 100% Apache. In addition, the vendors also include additional tools and proprietary software as part of their releases. This isn’t to say that being a derivative is a bad thing; it’s a good thing. The derivations also help to differentiate the vendors.
I happen to be vendor neutral. All of the vendors that you mention support the Apache Hadoop API, so when you write a map/reduce job it should run on each release with at most a recompile.
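To make the portability point concrete, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API (assuming a Hadoop 2.x-style client); nothing in it is distribution-specific, so it should compile and run against any of the distributions mentioned. The input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}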
As I already stated, both Hortonworks and Cloudera have pledged to be 100% Apache in terms of their core Hadoop offering. Cloudera has been around the longest and has the majority of the market share. Hortonworks is relatively new with their 1.0 release and has been aggressive in its marketing. I think over time it will be difficult to choose between these two. It will come down to support and ancillary tools. Hadoop is constantly evolving; look at YARN, HCatalog, Impala and Drill, to name just a few recent innovations. Vendors can’t be complacent and have to determine which of these advances they want to support.
MapR is a bit different. From their inception, MapR decided to replace core HDFS with their own MapRFS. While MapRFS implements all of the HDFS APIs, it is a POSIX-compliant file system and can be mounted via NFS. In addition, they solved the SPOF NameNode issue found in Apache’s core release, and they have offered HA (high availability) since their initial release over a year ago. While MapR supports the Apache Hadoop API, their software is proprietary, closed source. MapR comes in three flavors: M3, a free version; M5, the supported version with all of the HA features enabled; and the recently announced M7, with their own rewrite of HBase. MapR has taken a different approach than the others, and MapR will definitely have its share of followers.
Amazon has their own release, along with supporting MapR M3 and M5 in their EMR offering. Because they are not selling a version, I wouldn’t categorize them as a vendor. And I think you would also want to include Google, since they recently announced their competing offering. Here again we see a partnership with MapR on their recent entry into the Hadoop market.
In looking at all of the options and the variety of vendors, I would definitely have to say that it’s the consumers who will eventually win. In the early ’90s we saw the RDBMS evolve due to the strong competition among Informix, Oracle and Sybase. Today? I think the marketplace is still relatively immature and we are in for a wild ride.
Q3: Where do you see the prevalent Hadoop usage today? Where do you think it will be tomorrow?
Omer Trajman: Today Hadoop is used to tackle several different challenges across a variety of industries. The most common application is to speed up ETL. For example, in financial services, instead of pulling data from many source systems for every transformation, source systems push data to HDFS, where ETL engines process the data and store the results. These ETL flows can be written in Pig or Hive or using commercial solutions such as Informatica, Pentaho, Pervasive and others. The results can be further analyzed in Hadoop or published to traditional reporting and analytics tools. Using Hadoop to store and process structured data has been shown to reduce costs by a factor of ten and speed up processing by a factor of four. Beyond traditional ETL, Hadoop is also used to gather telemetry data from internal systems such as application and web logs, as well as from remote systems on the network and around the globe. This detailed sensor data gives companies such as telecommunications and mobile carriers the ability to model and predict quality problems in their networks and devices and to take action proactively. Hadoop has also become a centralized data hub for everything from experimental analytics on cross-organizational data sets to a platform for predictive modeling, recommendation engines and fraud, risk and intrusion modeling. These applications are deployed widely in production today and offer just a glimpse of what is possible when all of an organization’s data is collected together and made available to help drive the business.
Jim Walker: The majority of organizations are just starting their Hadoop journey. They are using it to refine massive amounts of data to provide value within a business analytics practice. Some are using it to capture and use what was once thought of as exhaust data, or simply to capture more than they could before from existing systems. More advanced organizations are starting down the path of data science with exploration of big data and traditional sources. In 2013, Hadoop becomes mainstream and is considered part of a traditional enterprise data architecture, a first-class citizen alongside ETL, RDBMS, EDW and all the existing tools that fuel data in an organization today.
Ted Dunning: I don’t see a dominant use today, but I do see two huge areas of growth that I think are going to dwarf most other uses of big data. These are:
- Systems that measure the world. These include a wide variety of products which are beginning to include massive amounts of measurement capability and have the potential to produce vats of data. For instance, turbine manufacturers are instrumenting jet engines so that each aircraft becomes a huge source of data and disk drive manufacturers are building phone-home systems into individual disk drives. Similarly, retailers are literally watching how their customers react to products in the store. All of these applications have the potential to produce more data than most big data systems that already exist.
- Genomic systems. Sequencing a single human genome produces about a quarter terabyte of data. Cancerous growths include communities of cells that often have hundreds of thousands of mutations spread across hundreds (at least) of variant lines of development. These different cell lines express their genes differently and that can be measured as well. This means that medical records for a single person are likely to grow to several terabytes in size in the reasonably near future. Multiplying by the number of people who get medical care each year indicates that current electronic medical systems are probably undersized by four to six orders of magnitude.
There are also the systems that we don’t yet know about that could produce even more data than these systems.
Michael Segel: That’s a tough question to answer. I think that we tend to see more companies starting with a Hive implementation because it’s the ‘lowest hanging fruit’. Many companies have staff who know SQL, and for them Hive has the shortest learning curve. So companies have a shorter time to value, solving problems that can easily be implemented in Hive. As more companies gain internal Hadoop skills, I think they will take advantage of other Hadoop components.
Because Hadoop is a fairly generic framework, it can be used in a myriad of solutions across many different industries, from capturing and processing sensor data (offshore petroleum exploration) to determining which brands of laundry detergent to place on your store’s shelves. You also have differences in how it’s being used: is it an intermediate step in solving a problem, or is it being used to provide real-time data? Because you can extend Hadoop with additional tools, I think that we can expect to see a lot more adoption by companies that have requirements that can’t easily be met with their legacy tool set. On top of the Hadoop HDFS framework, you have HBase (included in Hadoop distributions) but also Accumulo, which is another ‘Big Table’ derivative built by the US DoD community. And on top of HBase, you have OpenTSDB and WibiData, both of which extend the capabilities of HBase. As these tools mature and are enhanced, I think we will see even greater adoption.
I think that as more companies adopt Hadoop and start to understand its potential, we will continue to see Hadoop used in different ways and to solve more complex problems.
Q4: With the exception of Flume, Scribe and Sqoop, there is very little integration of Hadoop with the rest of enterprise computing. Do you see Hadoop starting to play a bigger role in the enterprise IT infrastructure?
Omer Trajman: Hadoop is rapidly becoming the model data hub in IT. Because of the richness of connectivity via Flume for any event data, Sqoop for all structured data, HttpFS for SOA integration, and ODBC and JDBC for reporting tools, any existing data management workflow that produces data or requires data can interface seamlessly and securely with Hadoop. This expansion has led to further development of Hive as the standardized metastore for Hadoop. Originally designed to map SQL schemas onto unstructured data, Hive has gained security capabilities for defining authentication and access controls to data, as well as integration with Flume, Sqoop and other interfaces to the system. All Hadoop deployments today use these data integration capabilities to exchange data with Hadoop. In the future they will be able to exchange metadata securely with Hive.
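As a small, hypothetical illustration of the JDBC path mentioned above, a reporting application could query Hive through the HiveServer2 JDBC driver roughly as follows. The host, credentials and web_logs table are made up for the example, and the hive-jdbc driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database.
        String url = "jdbc:hive2://hive.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table of web logs ingested via Flume.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getInt("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}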
Jim Walker: There is another important component in a Hadoop stack that you have forgotten. That component is Apache HCatalog, and it provides a centralized metadata service for Hadoop so that it can more easily communicate and exchange data with traditional systems. HCatalog breaks down the impedance mismatch between structured data systems and Hadoop so that they can be deeply integrated. Great examples of this come from Teradata with their Big Analytics appliance and Microsoft with their HDInsight product. Ultimately, Hadoop plays its role in the enterprise data architecture alongside the existing tools.
Ted Dunning: My own company, MapR, has made a business out of selling a customized Hadoop distribution that integrates in enterprise infrastructure systems more easily than do other distributions. Customers find that they really have a huge use for the ability to build computational systems that cross scales and that work together with existing applications. In most big data applications, truly massive computations are only part of the story. Many components of a big data system are actually not all that big. As such, it pays off to be able to apply small data techniques to the small parts of big data systems and save time to be applied to the development of the big data parts of the system.
Michael Segel: I think that this is a bit misleading. Flume, Scribe and Sqoop are all open source projects designed to integrate Hadoop with the rest of a company’s infrastructure. Both IBM (DataStage) and Informatica have added Hadoop integration into their products. In addition, Quest Software has created some solutions to help with the integration of Hadoop as well. So there is an effort by the industry to adapt and adopt Hadoop into their enterprise offerings. In fact, if we were to talk to all of the major hardware and software companies, we would find that they all have some form of Hadoop solution as part of their offerings.
Overall, we are still relatively early on Hadoop’s growth curve. As more companies adopt Hadoop as part of their infrastructure, we will see more tools coming from the traditional vendors. Many of the existing BI tool vendors have taken advantage of the Hive Thrift Server to connect their applications. In terms of gaps, data visualization is one area that I think we can expect to see more tools coming to market. Many large enterprises have invested in BI reporting and dashboard tools. It may be more attractive to extend their existing infrastructure to support Hadoop, rather than purchase a completely separate tool.
Q5: Today Hadoop has implemented most of the Google projects, with the notable exception of Percolator. Do you think such a project is (or should be) on Apache’s radar? Do you see any other direction for real-time Hadoop?
Omer Trajman: Google has created many projects that are specific to Google’s needs. Well beyond Percolator, there are dozens of projects, some published and others still closely guarded secrets. Not all of these projects are useful or applicable to other organizations. Historically, the community and sponsoring organizations have looked to Google and other large data companies for inspiration on how to solve very large data management problems. BigTable inspired HBase, Chubby inspired ZooKeeper and F1 inspired Impala, to name a few. Today Hadoop already provides solutions for real-time data ingestion (Flume), real-time data storage (HBase) and real-time data querying (Impala). Future developments are likely and will be driven directly by the Hadoop user and development communities.
Jim Walker: We do think that there are use cases for Hadoop that require faster access to the data that Hadoop collects and analyzes. There are, of course, many other use cases that do not require fast interactive access to the data. We believe that the overall Hadoop market is best served by broad, fully open projects that take advantage of a community-driven process. This process has proven to create the stable and reliable software expected by the enterprise. The Apache Software Foundation guarantees this.
Ted Dunning: I actually think that the Hadoop community has only implemented a small fraction of the Google projects. At the application level, F1, Spanner, Dremel and Sawzall equivalents are missing, and the open source community has no good equivalent for Borg, the OpenFlow-based switches, the OS virtualization layers or the enormous variety of infrastructural services that Google uses internally. Where comparable projects do exist, they are generally pale imitations. For instance, I wouldn’t stand Mahout up against the machine learning systems that Google has, Lucene has a long way to go to match Google’s search systems, and even Hadoop doesn’t play in the same league as Google’s internal map-reduce systems.
The open source community finally has an answer to the source code management systems of Google in the form of Github, but we in the open source community have vats of work ahead of us if we want to provide some of these capabilities to the world at large.
Michael Segel: If we look at some of the current discussions on extending and enhancing Hadoop, you can see references to Percolator. On Quora there are some active discussions on the topic.
So while we have coprocessors in HBase, there is also a relatively new project, Drill, in which Ted Dunning is heavily involved. (I’ll defer to Ted to talk about Drill.) Also, Cloudera recently announced Impala. (I’ll defer to Cloudera to talk about Impala.)
There are some companies like HStreaming who are also involved in processing inbound data in real time. HStreaming was a local company and has presented at one of our Chicago Hadoop User Group (CHUG) meetings. (They have since relocated to California.)
In terms of real-time processing, I think that with the use of coprocessors and tighter integration with Lucene/Solr, we will see more use cases for HBase/Hadoop.
Q6: Many people are trying to expand their skills to use Hadoop. There are also a lot of people looking for people who know Hadoop. Still, it is not clear how one acquires Hadoop skills. Read the book? Take training? Get certified?
Omer Trajman: The best way to acquire Hadoop skills is to work on a Hadoop project. The best way to get started on a Hadoop project is to take a training course, take the certification exam and keep a good book or two handy during the project. It’s also useful to talk to a company that specializes in Hadoop and the greater ecosystem. While most projects start with core Hadoop, they quickly expand to tackle data ingestion, data serving, analytics and workflows. Our recommendation is to partner with an organization that can provide a complete Hadoop platform, from training to services, support and management software, across the entire stack and the whole Hadoop lifecycle. While Hadoop is a challenging technology, it can also solve some of the biggest problems organizations have today, and in our experience it is well worth the investment.
Jim Walker: Training is one of the best ways to quickly learn the basics for Hadoop. Once these are in hand it is then possible to experiment with Hadoop and start to learn more of the concepts.
Ted Dunning: Some of our largest customers have taken to looking for people in surprising places. Instead of competing for the same small crop of smart people at the customary top few universities, they have taken to finding the absolutely smartest students they can find at second tier universities, especially people in (roughly) technical fields outside of computer science. They then train these people on big data techniques and, based on the people I have met, are doing very well.
I have also met quite a number of people who come to HUGs or other interest group meetings and who are engaged in self-training efforts. These people are using all possible avenues to gain skills, including online courses (Andrew Ng’s course on machine learning is very popular), attending meetings, inventing personal projects and trying to find big data analytics needs at work.
My feeling is that the moral of these two trends is that there are more smart people in places companies mostly aren’t reaching, and that a much wider variety of backgrounds qualify people to work on big data than might be thought at first glance. This world is still quite new, and the skills of generalists are turning out to be extremely valuable in pioneering settings. Eventually specialists will probably dominate, but that is going to take years. In the meantime, all kinds of people can excel.
Michael Segel: I think you missed the most important thing: actual hands-on experience. Reading the book, the online guides and the examples is important, and training helps to shorten the runway, but nothing replaces hands-on experience with Hadoop.
There are a couple of cost-effective ways of getting that experience. First, most of the vendors have a free version of their product that one can download. It’s possible to run Hadoop on a single machine in pseudo-distributed mode. While most of these require Linux, there is a release that one can run on Microsoft Windows. And let’s not forget Amazon: it’s possible to spin up and spin down individual machines and clusters to run map/reduce jobs and test out your code.
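Even before setting up pseudo-distributed mode, the quickest way to experiment is Hadoop's local job runner, which executes a map/reduce job against the local file system with no daemons at all. The sketch below assumes a Hadoop 2.x client (where these property names apply) and reuses the hypothetical WordCount classes from the earlier example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run entirely on one machine: local file system, local job runner.
        conf.set("fs.defaultFS", "file:///");
        conf.set("mapreduce.framework.name", "local");

        Job job = Job.getInstance(conf, "local word count");
        job.setJarByClass(LocalWordCount.class);
        // Mapper/reducer classes from the earlier sketch (hypothetical).
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("input"));    // local directory
        FileOutputFormat.setOutputPath(job, new Path("output")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}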
Also, companies like Infochimps have placed data sets on S3 for public use and download. This makes it very easy for someone who has no hardware or infrastructure to work with Hadoop.
In addition, there are many Hadoop related user groups around the globe. So it’s easy to find a user group near you, or if none exist, you can always create one. Finding people who are also interested in learning about Hadoop makes the experience fun and takes some of the pain out of the process.
Last but not least, there are mailing lists and discussion boards where you can ask questions and get answers.
About the Panelists
Omer Trajman is Vice President of Technology Solutions at Cloudera. Focused on Cloudera's technology strategy and communication, Omer works with customers and partners to identify where Big Data technology solutions can address business needs. Prior to this role, Omer served as Vice President of Customer Solutions, which included Cloudera University, Cloudera's Architectural Services and Cloudera's Partner Engineering team. As the authority in Apache Hadoop related training and certification, pioneer of the Zero to Hadoop deployment process and leader in Hadoop ecosystem integrations, Customer Solutions ensured the successful launch of some of the largest and most complex Hadoop deployments across industries.
Prior to joining Cloudera, Omer was responsible for the Cloud Computing, Hadoop and Virtualization initiatives at Vertica. He also built and managed Vertica's Field Engineering team, ensuring success from pre-sales to production for Vertica's largest customers. Omer received his Bachelor's degree in Computer Engineering from Tufts University and was a visiting scholar at Oxford University reading in Computation and Electrical Engineering with a focus on large scale distributed systems.
Jim Walker is a recovering developer, professional marketer and amateur photographer with nearly twenty years’ experience building products and developing emerging technologies. During his career, he has brought multiple products to market in a variety of fields, including data loss prevention, master data management and now big data. At Hortonworks, Jim is focused on accelerating the development and adoption of Apache Hadoop.
Ted Dunning has held Chief Scientist positions at Veoh Networks, ID Analytics and MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects, including Hadoop, ZooKeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado, an MS degree in computer science from New Mexico State University, and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
Michael Segel is a Principal Consultant with Think Big Analytics. As a Principal, Michael is involved in working with clients, assisting with their strategy and implementation of Hadoop. Michael is also involved as an instructor with Think Big's Academy, teaching courses on Hadoop Development in Java, Hive and Pig, along with HBase.
Prior to joining Think Big, Michael ran his own consulting firm, developing solutions for customers around the Chicago area. Since 2009 Michael has been working primarily in the Big Data Space. Michael also founded the Chicago Hadoop User Group (CHUG).