It happened a long time ago actually; it was 1975, and there was a group in Colorado that was very active in writing software for the new microcomputer revolution of the time. It was natural for us to share code with each other, and to give the code away. That was the start, but it's been downhill ever since. I've been involved in open source pretty much continuously since that time.
Well, Hadoop has been a real revolution, and it's anomalous in the sense that there have been a number of technologies in the past which seemed to have revolutionary potential, but Hadoop has revolutionary actuality. It really is leading to new methods of doing things. A lot of people are talking about it, not a lot of people are doing it yet, but we are seeing some real changes. I think what happens there is that we see a difference in shape. The shape I mean is the curve that describes the cost of a particular level of analysis, and the shape change is that those curves have gone from quadratic or exponential in data size to linear. Now we are finally getting to change that linear coefficient, and that change to linear scaling changes the ultimate behavior of the society using the technology, changes the way things are adopted, and changes the fundamental economics. Most technologies have not really changed shape in that way. Hadoop is new in that way.
Yes, this is a pattern that we have seen a number of times. In the early to mid '80s there was an emulation revolution of sorts happening. Before that, in the '70s, the first virtual machine revolution happened, where you had one architecture emulated on an emulator running on the bare metal. After each one of these new layers of simulation the benefits of abstraction become clear, but then the costs of abstraction become clear again, people find more efficient ways to do things, and it looks like we collapse back toward the bare metal each time. I wouldn't say that Hadoop is near the bare metal yet, but there are a few implementations which have some exciting proximity to the bare metal, you might say; ours might be one, but I only speak about what I know. Those things, I think, are really driving that again. But you are right: by sharing a large cluster we are avoiding, in some ways, the pigeonholing that has become popular in other approaches.
There are a couple of things that are changing. Up until not so long ago, Hadoop was an excellent thing for startups and other people who were willing to bet on things continuing to work. It was not durable in the enterprise sense, but it was fast for new kinds of problems, and therefore the upside was large enough that it was worth taking the chance to run with it. That led to the change of shape I described earlier, which in turn is leading to tipping-point changes in where the economic maxima are for analytics. But we are just on the cusp now of having the things that are necessary for enterprise adoption, so that adoption can really go forward. The banks and companies like that have very strict requirements about disaster recovery, and those are only just beginning to be suitably addressed by the Hadoop community.
Yes, currently it is definitely isolated off to the side, but there is still a profound impact from that glorious isolation. In the future I think it will be much more integrated. Doug Cutting had a great quote from somebody else, whose name I forget, last summer in Berlin: people are saying that they have big data because of Hadoop, not that they have Hadoop because of big data. There's always one project that finally breaks the barriers and gets something like Hadoop adopted, but once that happens, within a few months they overflow the bounds they had projected, purely because once those capabilities become possible there is enormous demand throughout the organization to use them. So the existence of Hadoop, just its pure existence in an organization, can suddenly cause great volumes of data to materialize around it, because people had been discarding that data with the thought that some day perhaps they could analyze it. When suddenly there is a hint that you might analyze it, they start saving that data, and they become Hadoop fanatics in many cases, because they can now look at things they could never look at before.
Absolutely, but I think there is a deeper cause at work here. What is happening is that the range of applications is growing broad enough that, as Michael Stonebraker put it, "one size no longer fits all". So we are seeing specialists, in terms of applications, beginning to appear. There are in-memory databases like VoltDB that are specialized for transaction processing; they make new trade-offs, and they are able to specialize in that particular task and execute it in ways that were previously not possible. That also enables Hadoop to become its own kind of specialist and to specialize in these large-scale, out-of-core problems. The in-core things have in-core search engines and in-core databases; the very large out-of-core applications need frameworks like Hadoop to succeed. So in some sense the appearance of Hadoop is an effect as much as it is a cause: it is causing radical changes in architecture, but it is also an effect of this specialization that has been occurring over these last few years.
Well, HDFS is just unacceptable for a number of applications other than running MapReduce programs. It's pretty decent for MapReduce, but it really doesn't have the semantics, and it's not intended to have the semantics, needed for state-update kinds of problems. Those show up in statistical-counter type applications, user profile updates, or messaging. In those cases we have small data objects, relatively small, less than a megabyte, that need update semantics. So HBase is providing these update semantics, and HBase is one of many answers in this key-value, column-data-store, NoSQL sort of space; they are all attacking one particular problem, and that is that the trade-offs imposed by full SQL are unacceptable in a scaling world.
But they are attacking it in different ways, by focusing on different aspects of the problem. HBase in particular has very successfully attacked, for instance, the write volume. Where we used to assume there would always be some minimum number of reads per write hitting the database, many HBase applications now have ten thousand writes for every read, and that has opened up new areas. But we are just at the beginning of this and there are a lot of miles still to go: HBase is still difficult to run, Hadoop is still difficult to run, you need experts, and it's not a friendly out-of-the-box environment. Other contenders, notably Mongo, have attacked the ease-of-use problem to the detriment of their performance at scale. So we need to start synthesizing some of the virtues of the different systems: the out-of-the-box virtues of Mongo, the write-throughput and column-store scan-performance virtues of HBase, and perhaps the virtues of other systems as well.
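To make those update semantics concrete, here is a minimal sketch of a counter-style, write-heavy workload using the HBase Java client API. It is only an illustration: the table name, column family, and row key are hypothetical, and it assumes a reachable cluster where the table has already been created.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ProfileCounterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("user_profiles"))) {  // hypothetical table

      // Atomically bump a per-user counter: an update-in-place operation
      // that HDFS files cannot express but HBase supports directly.
      table.incrementColumnValue(
          Bytes.toBytes("user-42"),      // row key
          Bytes.toBytes("stats"),        // column family
          Bytes.toBytes("page_views"),   // qualifier
          1L);                           // amount to add

      // Overwrite a small profile field in place (update semantics on a
      // sub-megabyte object), no read required before the write.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("last_seen"),
                    Bytes.toBytes(System.currentTimeMillis()));
      table.put(put);
    }
  }
}
```

A workload like this is exactly the write-dominated pattern described above: thousands of increments and puts can land for every read that ever looks at the row.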
HBase is very exciting now. I have tried it many times over the years, at sixteen, seventeen, nineteen, and found each time that there was one thing or another missing, but now we are beginning to see where we can really drive major enterprise applications. That doesn't mean it's done, and it doesn't mean that there won't be big changes in HBase over the next few years. We are just now seeing coprocessors; we are going to see interesting text applications and analytics applications. HBase a few years from now will still be called HBase, but it will be quite an extraordinary new thing.
I think I should take that hat off and answer as an Apache member a little bit here. I think it's commonly true that the raw Apache distributions of all kinds of software are not really quite appropriate for most end users. There is packaging, there is shimming, there is documentation and support; there is just a lot of wrapping that needs to go around the gift, you need to tie a ribbon around it, things like that. The community is building the raw capabilities, but that isn't necessarily everything you need, and I think that is well recognized by the existence of so many companies trying to produce something comparable. Even DataStax with Brisk, which is Hadoop-like, is trying to get some of that wonderful virtue and use it in their own products. So there is recognition that packaging is needed, that a consumer interface is required; these are not consumer consumers, but professional consumers. But there is a lot of disagreement about what that packaging needs to be. Do we have a blue ball, or a red ball, or three pink balls? We put lots of balls on ours because we thought we needed more balls and shiny stuff; Hortonworks is saying, "Well, we should be very pure and we should have transparent wrapping".
These sorts of decisions about which is the best answer will be decided by our customers, not by us, but those different opinions are, I think, very, very healthy to have in a community like this. The heart is still there, the heart is still beating, and that's Hadoop's heart; that is by far the largest piece of any of these distributions. But as you say, none of them are exactly Hadoop, if only because you downloaded them from different places; their bits are not necessarily exactly the same. There's always something extra needed.
It proves that it's terribly fashionable to have Hadoop in your product; I am not sure what else. It remains to be seen what the practical impact of these larger offerings is. There is certainly one large company, EMC, that is selling ours, and we wish them all the best, of course. As for the more innovative offerings, call them that, though it's not the word I usually use for them, like Microsoft's, with very unusual packaging for Hadoop, whether those succeed remains to be seen. But I think there is definitely a consensus across the board that Hadoop is exciting and revolutionary, and is something that people need to address.
There's just a huge variety there, and I think new answers are also emerging: not only MapReduce but BSP-related solutions like Giraph from Apache, or some of the commercial BSP offerings that are coming out. It isn't going to be just one computation framework; there are going to be many. So it's going to get worse before it gets better. I think the best way to learn about this MapReduce thing is to invent simple problems of your own that you might be able to address with these tools. It's probably best to start with tools that hide some of the details, unless you are the kind of person who likes bits between the toes. So start with Pig or Hive, if you can get access to a working Hadoop cluster, and experiment that way. You begin to see the limitations of the framework beyond the algorithmic paradigm that you have. And there are some very simple hurdles that everybody has to get through.
They have to learn not to put static variables in their map functions, things like that; it's surprising how similar the misconceptions are across a large population. But I think the key is to start trying it. Try building some of these programs and see where you run into problems. Certainly an aggregation like word count is a good place to start, but we have a number of other examples floating around that are probably more interesting. And then there are specialized communities, like Mahout for instance for data mining; some people may be interested in that sort of problem, or distributed mathematics, or recommendations. So they can start with a problem that is closer to what they really would like to be doing.
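For reference, here is a sketch of the canonical word-count job mentioned above, written against the Hadoop MapReduce Java API. The input and output paths are whatever you pass on the command line, and the class names are just illustrative; the point is that all state lives inside each map or reduce call, not in static variables shared across tasks.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // No static mutable state: mappers run in separate JVMs on separate
  // machines, so a static counter would silently give wrong answers.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();           // all counts for one word arrive together
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```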
Machine learning is a very interesting problem, which is on the borderline of where we may or may not want to use any particular technique. Mahout's mission is scalable data mining. It is not Hadoop data mining, it is not any particular technique; it's scalable machine learning. And it's a very welcoming community, so just jumping in on the mailing list is a great way to get started. But if you identify a problem, and you will probably need help recognizing what category of problem it really is first, but if you can identify a problem and a data source, then you are halfway to the solution. That will drive the decisions. In the Mahout community we always recommend starting with the simplest possible implementation. That may be Mahout, or it may be another tool.
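As one possible "simplest implementation" in the recommendation category, here is a minimal, single-machine sketch using Mahout's Taste recommender classes; no Hadoop cluster is involved. The input file name and format (one "userID,itemID,preference" triple per line) and the specific user ID are assumptions for the example.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv (hypothetical): lines of "userID,itemID,preference"
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Classic user-based collaborative filtering, entirely in memory;
    // often enough, since "small has got very large lately".
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top three recommendations for user 42.
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}
```

Only once a data set genuinely outgrows memory does it make sense to move to the distributed, Hadoop-based parts of Mahout.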
12. Is this Mahout or Mahoot? I am always confused.
Well, that's okay; if you use it you can say it any way you like. I say Mahout because I love the pun: there is a pun with Hebrew, where the word means truth or essence, and so I like that. Also, a lot of my friends are from South India, and they tend to pronounce it that way. I know a lot of people say "Mahout"; it's fine either way. But what I was trying to say is that in any of these data mining problems, starting with the simplest techniques first, and the simplest implementations first, is a very important thing to do. Many of the problems which appear very, very large are actually small relative to what large means nowadays. In some sense, small has gotten very large lately. So you can very often use very simple techniques on what seem like very large and significant problems, and just be done. That's an important thing to realize: you can start easy. And then only go to the scalable techniques, which have more inertia and are more ponderous in some sense, more difficult, less agile than the quick and easy interpretive sorts of tools you might use in contrast to MapReduce. It's always good to have a choice of tools, to decide which one is the right one right now for your problem. So I always start small, start simple, and then we see what is really needed.
Just do it. Of course "just do it" is a hard thing to hear, though it's easy to say. My recommendation is that the best way to just do it is to get in touch with the communities, and the best way is always in person. If you live near a big city there is almost certainly a big data or Hadoop meetup that meets roughly monthly, and then there are specialized things like data mining groups that also meet roughly monthly. If you can get to one of those and talk to people there, you start to hear what they say and how it works, and you can see how you can try these things out. Amazon makes a number of large data resources publicly and completely freely available: they have large sets of genomes, they have US Census data, they have years and years of Apache email archives, sample data sets that you can try all kinds of experiments on. And, while I hate to be recommending any particular commercial service, Amazon makes MapReduce, Hadoop in small doses, available with their EMR product, so you can write and run small EMR programs against publicly available data really quite easily.
If you want to run a Hadoop cluster, you can also run small clusters in the cloud; it takes just a few dollars to run a significant amount of hardware for a reasonable amount of time, ten hours, two hours, whatever it takes to learn. You can also run on cast-off hardware; that's how I started. We had ten machines at work that were so unreliable that nobody would use them, so they were fine for me to experiment with. These sorts of resources are often available if you don't want to spend a little bit of money on EC2 instances. So that's the implementation of the very simple specification of "just do it": find people, find mailing lists, try things out on real data. Those people will give you pointers, and I would be happy to help people as well.
14. Thank you very much, Ted. And the last question: the MapR distribution comes with your services, right?
It does, yes. It's a complete distribution, and frankly, I've never been able to say no, so if you try the free version and ask me a question, I'll be answering.