1. We are here at GotoCon Aarhus 2013 and I am sitting here with Dean Wampler. Dean, who are you?
Hi. I am Dean Wampler, obviously. I am an independent consultant, mostly working in the big data space these days: Hadoop, trying to get Scala into places that need to use it, and that sort of thing. My company is called Concurrent Thought.
2. Is Hadoop still the “be all, end all” of big data, or is there something new?
I think a lot of people believe that it is the “be all, end all”, and certainly some vendors would like you to believe that, I think. But it is good for what it does. It is proven at very large scales, and it is great for batch-mode processing, meaning: I have a big data set and I need to scan it all in from hard drives and do some massive analytics or whatever. If you need to do event processing, though, or you want more CRUD (create/read/update/delete) operations, that sort of thing, it is not really designed for that, and there are alternatives. There are also some inefficiencies in Hadoop which I hope will go away eventually, so that it will be easier to use and more performant even for the things it is good at already.
3. When you say inefficiencies: are these systematic or just implementation-specific?
There are a couple of things. The MapReduce compute model is a little coarse-grained, and it is a little too inflexible to make it easy to translate the wide variety of typical computations into this framework. That is the reason why some higher-level toolkits, like my personal favorite Scalding, have emerged to let you write programs that use the typical so-called combinators, like mapping, filtering, group-by, etc., and then do the heavy lifting of translating them into MapReduce jobs. So I would love to see an API, or at least a framework, that is a little bit better at representing these higher-level concepts as composable compute modules. The Berkeley project Spark is a little bit closer to that ideal. And then there are some general inefficiencies that hopefully will go away eventually. For example, when you do that translation to MapReduce you sometimes end up with a sequence of jobs that have to run (“job” being the term they use): a map step, a reduce step, then you write to disk, read it all back in for the next map step and reduce step, and it would be great to minimize that disk I/O between jobs.
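To make that concrete, here is a rough sketch in plain Scala of the shape MapReduce imposes on a computation. The names are illustrative, and the cluster “shuffle” is simulated locally with groupBy:

```scala
val lines = Seq("scala on hadoop", "hadoop and hive")

// The "map" step: each input line becomes (key, value) pairs.
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").toSeq.map(w => (w, 1))

// The "reduce" step: fold together all values that share a key.
def reducer(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

// groupBy stands in for the shuffle that Hadoop performs between the steps.
val wordCounts = lines.flatMap(mapper)
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => reducer(word, pairs.map(_._2)) }
// Map(scala -> 1, on -> 1, hadoop -> 2, and -> 1, hive -> 1)
```

Every computation has to be contorted into this map/shuffle/reduce shape, which is exactly what the higher-level toolkits hide.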
Things like that are some of the places where the system could be made more efficient. One of the ways this is being addressed is that, in some cases, custom-built replacements for particular problems are being implemented. A good example is a query tool called Impala that Cloudera has been working on, which basically replaces MapReduce for doing Hive-style queries (Hive being the SQL tool which is popular on Hadoop). Impala is a special-purpose compute engine as opposed to a general-purpose one, so I think you will see a lot of that sort of thing as well: general-purpose replacements for MapReduce emerging alongside special-purpose ones.
Werner: You have mentioned Scalding. Remind us what Scalding is.
One of the first high-level APIs that abstracted over some of the low-level details of Hadoop is a Java API called Cascading. Then Twitter wrote a Scala API on top of Cascading called Scalding, and there is also, I should mention, a really nice Clojure API that does the same thing called Cascalog. The Clojure one actually embeds Datalog-style logic queries, which is a very interesting approach, whereas Scalding, I think, better fits the model of general ETL (extract/transform/load), where I might want to read data in, cleanse it, split it, filter it, etc., and then maybe run machine learning algorithms over it, or sometimes straight-up queries like with Hive. If I like just using one tool, then Scalding is the more flexible tool I would use. But I would say, for me, a typical workflow, especially with clients, is: if it is a query question I am asking of the data, I’ll write a Hive query, because Hive is really nice for queries, just like SQL is really nice in databases. But if it is a general-purpose problem that does not fit a query model, then I would typically use Scalding for that.
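As a flavor of what Scalding code looks like, here is the canonical word-count job in its fields-based API, a sketch roughly in the style of the 2013-era releases:

```scala
import com.twitter.scalding._

// A Scalding job: Cascading translates this chain of combinators
// into one or more MapReduce jobs that run on the cluster.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                                          // read lines from HDFS
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }                                     // count occurrences per word
    .write(Tsv(args("output")))                                    // write tab-separated results
}
```

The programmer writes flatMap and groupBy; the framework does the heavy lifting described above.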
It is especially tough for beginners to figure out what all these acronyms, names, etc. mean. The real core of MapReduce, or rather I should say Hadoop, is the distributed file system called HDFS. That is where everything ultimately ends up getting stored, although, in fact, some of the other tooling will interoperate with data in Amazon S3 or databases as well; but usually people start with the Hadoop distributed file system, virtualized over a cluster. Then on top of that sits the general-purpose compute model, MapReduce, which, like I said, is really good at some things and not so good at others. That is the core, and around it there is now a constellation of other tools. Hive is SQL; you can really call it a domain-specific language. It is generating MapReduce jobs and reading HDFS-resident data, but it gives you SQL query semantics so that you can conveniently write MapReduce jobs in SQL. A tool that fits the same section of the ecosystem as Scalding is Pig (again with the animal motif), which is more of a dataflow language for writing transformations of data. You can write Hive-like queries in Pig, but it is more traditional for people who already know SQL to go to Hive, while developers might use Pig or Scalding or Cascading to do these other kinds of calculations.
Everyone admits that the word NoSQL is kind of a bad term, partly because it has negative connotations. Mostly, when people gave up relational databases, initially it was really because they could not be scaled to the sizes required by people like Google and Amazon and Yahoo. So they sacrificed transactions, they sacrificed the relational model in order to get the scale, and then we realized that sometimes you really do not need a relational model; it is not the right fit for the data. If you are just storing key-value pairs, or documents in JSON format, then it is great to have an alternative. But at the end of the day, what I have observed in most enterprises is that you often have maybe a small team of developers and administrators, but alongside them there is a huge group of data analysts who primarily know SQL, who are really comfortable and very performant with SQL, because when it is a SQL problem there is really no better tool for it.
What we see now is that SQL is roaring back. If it weren’t for Hive, I am convinced that Hadoop would not be nearly as popular as it is, because it enables people who are not Java developers to use a tool they would otherwise be unable to use. As a result there has been more interest in alternative SQLs on Hadoop, and a few are coming online; one, for example, also built on top of Cascading, is an ANSI SQL dialect called Lingual. It is also exciting to see so-called NoSQL databases like Cassandra building in SQL query APIs that are quite powerful and quite capable, customized for the particular style of data storage and the performance and CAP trade-offs that people need from a tool like Cassandra. For people using NoSQL, it is really a fast way to ask questions of your data.
I guess the final point on this topic is that there are people who are now taking the lessons of scalability we learned from NoSQL and bringing them back into SQL databases, so there are some new SQL databases now that are designed to scale as well as the NoSQL databases. We will see how far they get, but certainly NoSQL is not going away. It is not like it was a mistake. What we realized is that not everything fits a relational model and hence we should not use it for everything. But nevertheless, I think people forgot how useful SQL is and how important it is to traditional enterprises, and now that is being kind of retrofitted, or addressed, by these new developments.
Werner: So it is a long learning process and we might reach an interesting middle ground at some point in the future.
I think so, yes. I think we will figure out the balance of what really should be SQL and what should not be. People who maybe got carried away with using NoSQL in contexts where they really should have had a relational database will go back to the relational model, and hopefully get better scalability when they do.
Werner: You mentioned Scalding and I suppose it is written in Scala.
Yes.
I still love Scala a lot. When I do introductory talks about Scala, I call them “The Seductions of Scala”, because I picked the language when I was learning functional programming just because it was a JVM language and I thought it would be a good learning vehicle, and then found that I really was seduced by it: by the ways in which it improved Java’s object model, and the ways in which it gave you features that accelerated productivity by reducing the amount of boilerplate code. Then the whole functional side of the language really drew me in. So I still see Scala as an ideal transition from Java for typical Java shops, because it is very much like Java when you want it to be, if you are just learning or trying to take an incremental approach, but you can leverage the full capabilities of functional programming to greatly improve your productivity. A lot of people thought that it was concurrency that was going to drive interest in functional programming, because functional languages tend to emphasize immutability and they often provide good constructs for writing concurrent code.
I actually think, though, that most developers will continue to find ways to avoid addressing concurrency head-on, either with toolkits or by letting the smart guys worry about it. But more people are going to run into the problem of having to write data-intensive applications, and for me, data fits the functional model much better than objects do, because it is really about reading stuff in, transforming it, filtering it, grouping it: the kind of stuff we are used to doing in SQL. I just find that that is the right paradigm even for application development that should be more data-centric and less focused on building an object model. So I think it will actually be data that drives interest in functional programming, and I think Scala will be a good choice, Clojure is a good choice, and there is the new stuff coming in Java 8: lambdas, their anonymous functions, and the Collections improvements. That will probably slow the movement to Scala and Clojure, because it will retrofit Java (retrofit may be the wrong word); nevertheless it will add to Java things that have been helping drive interest in other languages. That is going to be interesting to watch.
7. Could Java 8 bring you back to the Java language? What else would they need to bring you back?
That is an interesting question. Actually, I thought a little bit about this, this morning, when I was listening to Brian Goetz talk about Java 8. I think the way I would put it is that I won’t really be compelled to go back, because there are still a lot of things in Scala, some of which were planned for Java but will only come in time. But I will be happier when I have to code in Java for whatever reason; I will feel like I have more of the tools I really want, so it won’t be as much of a “Gee, I wish I could be using Scala for this.” I’ll be a little bit happier when I work in Java, but still, there is so much that Scala adds that I will probably still prefer to be there.
Werner: So there is no one thing. The Collections seem to be a good selling point for Scala.
I think it is the power of the Collections API, more than syntax, that actually gets work done. For me, that is very compelling. But on the other hand, a lot of the Java Collections will now be usable in sort of a Scala style, gluing combinators together to very concisely manipulate data to get from point A to point B, so a lot of what I love about Scala collections will now be doable in Java as well. Not everything, but still, I think it will be a big improvement.
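For example, in Scala today a typical chain of combinators looks like this (a toy sketch; Order is a made-up type):

```scala
case class Order(customer: String, amount: Double)

val orders = Seq(Order("alice", 10.0), Order("bob", 25.0), Order("alice", 5.0))

// Gluing combinators together to get from point A to point B:
// total spend per customer, counting only orders over 5.0.
val totals = orders
  .filter(_.amount > 5.0)
  .groupBy(_.customer)
  .map { case (customer, os) => (customer, os.map(_.amount).sum) }
// Map(alice -> 10.0, bob -> 25.0)
```

Java 8’s streams and lambdas allow a similar chained style over Java collections.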
8. The advanced typing, or the static typing: is it also a big selling point that distinguishes Scala?
Yes. I think most developers either really prefer dynamic typing or they really prefer static typing, and that usually determines which language families they go to. Ruby developers often go to Clojure, for example, because both are dynamically typed. I really do like static typing. I like the discipline it imposes on me to think about data, and its structure and shape, for the kinds of transformations I am doing. It certainly gets in the way sometimes, when I know something is going to work and an informal approach, the way I might write a Bash script, would be better than having to think about all the types explicitly. But nevertheless, when I am building bigger, richer, more sophisticated applications, I really like that kind of discipline, as well as the kind of documentation it gives me: when I look at a function signature I know exactly what is coming in and going out. So I still like that, despite the fact that Scala’s type system has a reputation for complexity. To Martin Odersky’s credit, I think he recognizes that is a barrier for people and is actively working to make the language more approachable, to make the error reporting better, and even to improve the type system so it is more approachable. So I think that is more of a transitory than a permanent problem with a language like Scala.
There are a couple of things I would mention. Going back to what we were just discussing, there are some advanced features that can be very intimidating to beginners, so one of the ways they are addressing this in the current release of Scala is that you can optionally turn some of the features on or off. For example, there is a part of the type system called higher-kinded types, or higher-order types I should say: abstracting over a List of something, a Map of something, and so forth. You can enable that or disable it if you want. So I think that is one way you could start with a sort of safer version of the language if you are being conservative, and then add these things as you see fit. Even though it was a controversial move, I think it was a wise one just from the pragmatics of addressing the concerns of the average development organization.
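Concretely (assuming the SIP-18 feature imports that shipped with Scala 2.10), the gate is an ordinary import:

```scala
// Without this import, declaring the F[_] parameter below produces
// a feature warning under -feature (an error with -Xfatal-warnings).
import scala.language.higherKinds

// A higher-kinded type parameter: Functor abstracts over "a List of
// something, a Map of something", and so forth.
trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}
```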
The other big feature that was introduced in the current release, Scala 2.10, and is being expanded for 2.11, is a macro facility. These are so-called hygienic macros; it is effectively a plug-in mechanism for the compiler where you can manipulate the abstract syntax tree of your program before it goes down to further stages like code generation. The way I just described it is both its strength and its weakness: it is a constrained form of hacking the compiler, and hence very powerful, but it also implies that a certain amount of learning has to go on before it is usable by mere mortals. I think it is going to be a really powerful feature.
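To give a flavor, here is a minimal def-macro sketch against the Scala 2.10 reflection API (Log and debug are made-up names; in practice the macro has to be compiled before the code that calls it):

```scala
import scala.language.experimental.macros
import scala.reflect.macros.Context  // the 2.10-era macro context

object Log {
  // Callers write Log.debug("..."); the call site is expanded at compile time.
  def debug(msg: String): Unit = macro debugImpl

  // The implementation runs inside the compiler and manipulates
  // syntax trees wrapped in c.Expr values.
  def debugImpl(c: Context)(msg: c.Expr[String]): c.Expr[Unit] = {
    import c.universe._
    reify { println("DEBUG: " + msg.splice) }
  }
}
```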
There are a bunch of design problems I have run into where I would like that kind of flexibility to manipulate the program while still having nice syntactic sugar for the end user of an API, but it is going to require some learning on the part of end users. There are some efforts underway to make it easier, so that people do not have to understand the full API that represents the syntax tree and so forth. As usual, any time you have something very powerful, it is controversial: for some people it is “something else I have to learn”, and for other people it is “Great, this gives me the sledgehammer I need to solve this problem that has been difficult”. So we will see. But I think in the long run it will be a real benefit for end users.
Yes, the compiler has supported plug-ins for a while, and I honestly do not know if they have removed that. The problem, of course, with general plug-ins is that it is really easy to completely screw up the compiler, generate bad code, or whatever, if you have unconstrained plug-ins. Macros are something more constrained, so you are less likely to shoot yourself in the foot, but you still have most of the power that a completely open mechanism would give you. This is probably going to be the main tool for extending the capabilities of the language.
Yes. He has done a couple of things, one of which is the ability to turn features on and off, so that you can leave the advanced features off if you want to be conservative. The other thing he announced just recently: in type-theory land, one of the hot topics these days is so-called dependent typing, which is sort of blurring the lines between objects and their types. For example, you could have a type that depends on a value: maybe a list of size 5 versus a list of size 10 is, for some reason, important to represent as a different type. Those two numbers maybe not so much, but it could be a small integer versus a large integer, which would imply things about how you represent it internally for performance and so forth. So dependent typing is a hot topic in research, and he has recently come out with a variation of that (part of it is ongoing research) called Dependent Object Types, or DOT for short, and has suggested that it might be the future for the Scala type system: that it would eliminate some of the complexity of understanding how features interact at the type level and provide a more unified picture that makes it easier for people to learn the type system and therefore exploit its power.
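Scala already gives a taste of types that depend on values through its path-dependent types, where a type is nested inside a particular object; DOT generalizes ideas in this space. A small sketch:

```scala
class Graph {
  class Node                        // Node's type belongs to a particular Graph instance
  def newNode: Node = new Node
}

val g1 = new Graph
val g2 = new Graph
val n: g1.Node = g1.newNode         // fine: a node of g1's own Node type
// val bad: g2.Node = g1.newNode    // does not compile: g1.Node and g2.Node are distinct types
```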
Obviously, the type system is one of the things people have cited as making Scala complex. Partly perception, partly reality, I think, and this is one of the areas where we will hopefully, over time, address that issue. One of the things Scala has generally been good at is giving you a lot of power by figuring out one particular feature, or some syntactic sugar, that enables a lot of sophistication for the user. I am hoping that the evolution of the type system will be along those lines: simplifying things while still giving you the power to do what you need to do.
When they made the transition from version 2.7 to 2.8, which was around the early 2010 time frame, it really was a 3.0 transition as far as its impact. For various reasons it started out as an incremental change, but they ended up doing a complete rewrite of the Collections API. They made the decision that the language was starting to get a critical mass of adopters, and that unless they made some core changes to set it on a better course for future evolution, they had better take the pain now and do it. It created a lot of transitory versioning and compatibility problems for a lot of people. It was somewhat painful in that period, but afterwards they adopted a policy of only breaking compatibility when absolutely essential, and trying to make each version backwards compatible with previous versions, as far as taking existing code and running with it.
In general, they have done a pretty good job of this. I have not really run into upgrade problems for a long time. As in any situation, I would pick a strategic point at which to migrate to a new version of the language, and in general I find that it is pretty seamless these days. If anything, the effort will be “I know there is a new feature in this release that I want to exploit, so there is some energy I have to put into updating my code for that feature”. I have not really run into many painful incompatibilities since that 2.8 transition.
Werner: But the cost is not zero.
As it would be for any language environment. But I find that it is reasonable at this point. It is just a matter of the usual trade-offs you always make: am I going to hurt users of my API who don’t want to move yet, or do I want to move quickly because there are features I want to adopt?
Werner: Well, let’s get to the dreaded last question. Dean, what is your favorite Monad?
My favorite Monad. I am not as heavy a user of Monads and category-theory things as some other people in the community, but probably the one I tend to fall back on a lot is a variation of the so-called State Monad, which is: I need to carry some context around, and these data APIs are a good example. I might be carrying context that represents “here is the actual target environment I am going after”, but some of the features of Scala make it pretty easy to hide the fact that you are carrying this information. The surface API that users see may not even mention it, but because of features like implicits I can pass this information along as I go. Personally, in my public APIs I try to avoid exposing more advanced concepts like Monads, although they are not as hard to learn as people think. Their name is scary, right? But nevertheless, it is that ability to carry context and evolve state that I think is the most useful one for me personally.

The I/O Monad is the other big one people talk about, to encapsulate the fact that I/O is inherently stateful because you are changing the world. I have not used it that much. I find it a good concept, but typically the way I think about functional versus non-functional design is that there are going to be places around the edge of a program where I have to interact with the world, and it is going to be imperative in some sense. I do not care so much if I am not using a monadic structure, if I can say that, for I/O. My personal preference is to just use the standard I/O APIs, but once I am past that surface and into the internals, I try to make everything as stateless as possible, and where I need state, I use a variation of the State Monad.
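For reference, here is a minimal State Monad sketch in Scala (illustrative names, not from a specific library): a computation of type S => (A, S) that threads a context value through a sequence of steps:

```scala
case class State[S, A](run: S => (A, S)) {
  def map[B](f: A => B): State[S, B] =
    State { s => val (a, s2) = run(s); (f(a), s2) }
  def flatMap[B](f: A => State[S, B]): State[S, B] =
    State { s => val (a, s2) = run(s); f(a).run(s2) }
}

object State {
  def get[S]: State[S, S] = State(s => (s, s))                  // read the context
  def put[S](next: S): State[S, Unit] = State(_ => ((), next))  // replace the context
}

// Threading a counter through steps; the state never appears as a mutable variable.
val step = for {
  n <- State.get[Int]
  _ <- State.put(n + 1)
} yield n
val (before, after) = step.run(10)   // (10, 11)
```

In a public API, the same effect can often be achieved more quietly with implicit parameters, as described above.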
Werner: That is definitely good to hear. Thank you, Dean.
Thank you very much.