1. [...] Maurice, welcome to QCon. Lambdas and streams – what is all the fuss?
Barry's full question: I am Barry Burd, I am a professor of computer science and mathematics at Drew University in Madison, New Jersey. I am interviewing Maurice Naftalin. He is co-author of “Java Generics and Collections” and author of “Mastering Lambdas: Java Programming in a Multicore World”. He is an Oracle Java champion, JavaOne rock star speaker in 2013 and 2014 and these days he is doing a lot of Java training, mainly working with Heinz Kabutz. Maurice, welcome to QCon. Lambdas and streams – what is all the fuss?
Lambdas are really interesting because they allow you to pass around in your program little bits of function, in a way that has not been possible in Java until now. If you wanted to essentially have a free-standing function in Java, before Lambdas were introduced, you could define an anonymous inner class. There is not a great deal that you can do with Lambdas that you could not do with anonymous inner classes, except express them in a really concise way. So, there are all kinds of times that we would like to pass a function to a method and essentially say: “Do this thing”. I want to pass in an action as a parameter. In the past, you could do this with an anonymous inner class, but the definition of an anonymous inner class was hugely bulky: you had to define the class, you had to define the method and it would take five lines of code. If you wanted to write many of these things in a single statement, it would be unreadable because they were so huge. Now, where in the past you had to define the class and its super-class and give the method a name and so on and so forth – all of that has been removed.
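As an illustration of the bulk being described (not part of the interview), here is a minimal sketch that passes the same comparison first as an anonymous inner class and then as a lambda. The class name `LambdaVersusAnonymousClass` and its methods are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class LambdaVersusAnonymousClass {

    // Pre-Java 8 style: the comparison is wrapped in an anonymous inner class.
    static List<String> sortByLengthOldStyle(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });
        return copy;
    }

    // Lambda style: only the parameters and the action remain.
    static List<String> sortByLengthWithLambda(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, (a, b) -> Integer.compare(a.length(), b.length()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "fig", "apple");
        System.out.println(sortByLengthOldStyle(words));   // [fig, apple, banana]
        System.out.println(sortByLengthWithLambda(words)); // [fig, apple, banana]
    }
}
```

Both methods behave the same; the difference is only in how much ceremony surrounds the one line that matters.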
So, with Lambdas, all you have to do is to say what the parameters are and what the action is going to be. And that is much neater and much cleaner. So, because it is easier to write, we are going to use them a lot more, and the experience with other languages has been that when you can start passing functions around, when essentially you are making functions into first class citizens, then we get more maintainable code, we get neater, more concise, more readable code and in particular, something that I wanted to emphasize in the work, we get finer grained APIs. The APIs read much more nicely and are much easier to use. So that is the fuss about Lambdas and you see major applications of them in the innovation of streams in Java. So streams are going to become the new way of doing bulk processing essentially. In the past, if we wanted to process all the elements in a collection, what we would do is we would write a loop and we would iterate over all the elements of the collection, and the action that we would do would tend to be quite long-winded. This is sometimes called external iteration, because what we are doing with external iteration is we are getting each element from the collection in turn and doing some processing on that and then typically dumping it into a new collection.
But with Lambdas and with streams, we are nowadays doing something different. We are doing internal iteration, and what internal iteration essentially means is that, instead of getting the elements in turn from the collection and doing something with them, we are taking the “do something” action and we are giving that to the collection and we are saying to the collection “You apply this action to each element internally”. This is a much better idea for a number of reasons: first of all, it makes our code a lot cleaner and secondly, it allows the collection to perform this in an optimal way. So, collections often know better than we do how to optimize internal processing, but until now we have stopped them from doing that because we have been taking the elements out and, essentially, we have been removing responsibility from the collection. So, here is an example of why it might be better to tell a collection how to do things: suppose there was an action that we wanted applied to every element of the collection and we did not care what order it was going to be done in. Now, iteration does not allow that possibility.
If you are iterating over the elements of the collection, you are forced into saying what order you want it done in. Quite often the order is not important at all. So, an analogy I invented for the book was to suggest that you had a task that you want to give to a friend – that you have got a pile of letters that need to be mailed and you want them all to be put into the mailbox. You ask your friend to put them into the mailbox, but you say “Before you put them into the mailbox, I want you to put them in alphabetical order of the addressee”. And your friend is going to say: “Hang on. I am feeling micro-managed here. That is pretty controlling and it is very inefficient, it is very slow”.
Your friend knows better how to put them into the mailbox than you do. He or she is going to take them and is going to chuck them all in. Order does not actually matter in the case of that task and yet the iteration over the collections, as we have been doing it until now, has forced us to impose an order. We had to do that. We had to give that very micro-managing command. It may well be that the collection knows how to do things much faster if no order is imposed. So, this is an example of being able to give an action as a Lambda to a collection. If the APIs are designed to accept the Lambda, as they now are, that is an example of how Lambdas are going to improve both the appearance of our code and its maintainability, but also, potentially, the efficiency of it as well.
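The contrast between external and internal iteration described above can be sketched as follows. This is an illustrative example rather than anything shown in the interview; `IterationStyles` and its method names are hypothetical, and the "mailbox" is simply modelled as a `HashSet`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class IterationStyles {

    // External iteration: we pull each element out in turn, in an order we impose.
    static List<String> upperCaseExternal(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            result.add(name.toUpperCase());
        }
        return result;
    }

    // Internal iteration: we hand the action to the stream and let it drive.
    static List<String> upperCaseInternal(List<String> names) {
        return names.stream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
    }

    // "Put them all in the mailbox": we state the action, not the order.
    static Set<String> mailAll(Collection<String> letters) {
        Set<String> mailbox = new HashSet<>();
        letters.forEach(mailbox::add);
        return mailbox;
    }
}
```

In `mailAll` the caller no longer writes a loop at all; the collection decides how to visit its own elements.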
2. What are the stages in the processing of a stream that the user needs to understand?
Well, a stream is like a Unix pipeline really, in a way. For many people that will be a helpful analogy. A stream has a source, it has intermediate operations, zero or more intermediate operations, which are like the filters within a Unix pipeline and then it has a sink, which is a terminal operation – that is what the stream API calls them. These have different characteristics: a source of a stream can in principle be almost any data source.
In practice, most often, it will be a collection. What is interesting about stream sources is that, for sequential streams, the values are essentially fed into the stream one by one, by a process which is like iteration. So, stream sources have iteration in the name. But for parallel streams, it is really useful to be able to break the data up, possibly to break it up in a recursive manner until you have segments of a suitable size to be fed into each of the parallel streams. So, that is like splitting the data and therefore the actual piece of machinery which feeds into a stream is called a “spliterator”: split for the parallel part and iteration for the sequential part. That is a stream source. The intermediate operations in a stream are operations that are pretty familiar to us in another context. “Map” and “filter” are obvious ones and there are various other ones as well: you can sort elements, you can truncate streams and so forth.
The terminal operations are pretty much like reductions in functional programming. For example, you can take all the elements of a stream and if they are numbers, you can add them together or you can get the average or you can manipulate them in various ways, you can maybe get statistics which actually embody the average and the minimum and maximum, and so forth. But for reference values, we can do more interesting things. If we are getting a series of strings coming down the stream, then we can concatenate them, for instance. But probably the most important and useful terminal operations will be those that take reference values that are coming down the stream and put them into new collections. So, these terminal operations are called “collectors” and there is a wide variety of them provided by the stream API and there is a really nicely engineered collector API which allows you to compose the collector operations together and do really quite fancy things with them.
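The "split" and "iterate" halves of a spliterator can be seen directly through `java.util.Spliterator`. The sketch below is only an illustration; `SpliteratorSketch` is an invented name, and how evenly a source splits depends on the collection.

```java
import java.util.List;
import java.util.Spliterator;

class SpliteratorSketch {

    // Splits the source once, then iterates over each part separately.
    // Returns the number of elements seen in each part.
    static long[] splitOnceAndCount(List<Integer> data) {
        Spliterator<Integer> rest = data.spliterator();
        // trySplit() hands off part of the data, or returns null if it cannot split.
        Spliterator<Integer> prefix = rest.trySplit();
        long[] counts = new long[2];
        if (prefix != null) {
            prefix.forEachRemaining(x -> counts[0]++); // the "iterate" half
        }
        rest.forEachRemaining(x -> counts[1]++);
        return counts;
    }
}
```

A parallel stream applies this kind of split repeatedly until the pieces are small enough to hand to worker threads; a sequential stream never splits and just iterates.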
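The three stages, and the kinds of terminal operation just mentioned, might look like this in code. It is a small sketch with invented names (`PipelineStages` and its methods), not an example taken from the interview.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class PipelineStages {

    // source -> intermediate operations -> terminal operation (a numeric reduction)
    static int sumOfEvenSquares(List<Integer> numbers) {
        return numbers.stream()                 // source: a collection
                      .filter(n -> n % 2 == 0)  // intermediate: filter
                      .mapToInt(n -> n * n)     // intermediate: map
                      .sum();                   // terminal: reduce to one value
    }

    // A terminal operation that yields count, min, max, sum and average at once.
    static IntSummaryStatistics stats(List<Integer> numbers) {
        return numbers.stream()
                      .mapToInt(Integer::intValue)
                      .summaryStatistics();
    }

    // Sorting and truncating are intermediate; joining concatenates the strings.
    static String firstThreeSorted(List<String> words) {
        return words.stream()
                    .sorted()
                    .limit(3)
                    .collect(Collectors.joining(", "));
    }
}
```

For the input `[1, 2, 3, 4]`, `sumOfEvenSquares` returns 20 and `stats` reports an average of 2.5.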
There is really quite a lot to learn there with a very high reward if you do that – probably the biggest single learning component of the stream API is the collector sub-API there is in it and it has this very nice characteristic that will actually allow the elements that are being delivered to it through parallel streams – it will actually administer the addition of those to non-thread safe collections. So, this is actually a major innovation.
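As a sketch of collector composition, the example below nests `Collectors.counting()` inside `Collectors.groupingBy(...)` and runs it on a parallel stream. The result is an ordinary, non-thread-safe map, which the framework fills safely by giving each thread its own container and merging them. The class name `CollectorComposition` is invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CollectorComposition {

    // Groups words by length and counts each group, using a parallel stream.
    static Map<Integer, Long> countByLength(List<String> words) {
        return words.parallelStream()
                    .collect(Collectors.groupingBy(
                            String::length,          // classifier: the key for each word
                            Collectors.counting())); // downstream collector: count per key
    }
}
```

The downstream collector can itself be another `groupingBy`, a `mapping`, a `joining` and so on, which is where the composability comes from.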
Well, the biggest single performance pitfall that people are going to come across in using Lambdas is not actually with Lambdas because Lambdas by and large are going to perform better than the anonymous inner classes that they are replacing. They already do and the way that they have been implemented using invokedynamic means that in the future, there are many possible improvements that, as the developments in the VM proceed, will make them perform better still. I am not really very concerned about that. Using sequential streams instead of iteration is going to be, in general, comparable as well. The biggest problem comes with the introduction of parallel streams.
So the way that parallelism was introduced in the stream API was what the design team called “explicit” but “unobtrusive” parallelism. So you can simply say “I want this stream processing to be executed in parallel” and what that means is that the stream API, the implementation will attempt to divide up the work that you are asking to be done and to distribute it over multiple threads and, in fact, over multiple cores. So the biggest pitfall in using parallel streams is going to be that people will use them inappropriately. They are really useful in the cases where they work, but that is quite a narrow set of cases and you have to understand the problem that you are trying to solve, the program that you are trying to execute and the data you are going to have. You have to understand those pretty well before you are going to be able to assess the benefits of going parallel.
The reason for that is that the very act of asking for a parallel stream causes overhead. The parallelization has to be set up and implemented in the fork/join framework, and that involves creating separate threads, new threads to execute the code that you want parallelized. It also involves merging the results of the parallel stream together. Those things have overheads and if you aren't making the saving you would like by parallelizing the work that needs to be done in between those two things, then you may actually find you are incurring an overhead that brings you nothing but a loss.
I would say that, certainly, you need to know those things. You need to know whether the data that you are starting off with is in a form that is reasonably splittable because dividing the data up for distribution is very important. You are absolutely right that the next thing that you really need to know is whether the work load you have in the intermediate operations – those that are actually going to be parallelized – is sufficient to justify the overhead. You need to know whether the work that is going to be parallelized will have problems with blocking on IO or similar. The tasks need to be compute intensive and you need to know as well that there isn’t going to be a great deal of interference from other processes that are going to be contending for the cores. So, what you talked about is useful, it is actually really important, but there are some additional aspects to take into account as well. I would say the most important thing is to be confident that your tasks are compute intensive and that your data is splittable.
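The "explicit but unobtrusive" switch to parallel execution can be sketched as below. The workload (a naive primality test) and the names `ParallelToggle`, `isPrime` and `countPrimes` are assumptions chosen for illustration: the task is CPU-bound and the source, an integer range, splits easily. Whether the parallel version is actually faster depends on the data size and the machine, so it should be measured rather than assumed.

```java
import java.util.stream.IntStream;

class ParallelToggle {

    // A deliberately compute-intensive helper with no IO and no shared state.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // The same pipeline, run sequentially or in parallel on request.
    static long countPrimes(int limit, boolean parallel) {
        IntStream range = IntStream.range(0, limit); // an easily splittable source
        if (parallel) {
            range = range.parallel();                // the only change needed
        }
        return range.filter(ParallelToggle::isPrime).count();
    }
}
```

Both modes return the same count; only the execution strategy differs.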
Well, side effects in Java Lambdas are not going to be a huge problem because the language really does not allow for them very much. One really big question to be asked was “Are these going to be “real” closures?” and that would have required them to have access to local variables that they can mutate and the decision was that they were not going to, that the same rules would apply for local variables as applied to anonymous inner classes. In other words, they had to be final. There's a minor syntactic relaxation of that, in that a variable that is never reassigned no longer has to be declared final, but essentially, the restriction is still the same.
Side effects for Lambdas are not that significant, but I think maybe the question is really about side effects when you are doing stream processing. One of the guiding principles of the stream API was that you would not be able to do anything, you would have no operations that could not be equally well performed in a sequential, as in a parallel mode of execution. So, the idea is that the mode of execution, whether sequential or parallel, is an implementation detail that can be adjusted at any time that you discover you want to adjust it, sort of like the idea of choosing an implementation for an interface. If you discover that the use case for your application is different, maybe you got a different kind of data from what you expected, you should be able to just switch interface implementations, to choose another one which is functionally the same, but has different performance characteristics for your different data.
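The capture rule can be sketched like this. `CaptureRules` and `adder` are hypothetical names used only for this illustration.

```java
import java.util.function.IntUnaryOperator;

class CaptureRules {

    // 'amount' is never reassigned, so it is "effectively final"
    // and the lambda may capture it without an explicit 'final'.
    static IntUnaryOperator adder(int amount) {
        return x -> x + amount;
    }

    // By contrast, this would be rejected by the compiler,
    // because the lambda tries to mutate a captured local variable:
    //
    //   int count = 0;
    //   Runnable r = () -> count++;   // error: count must be final or effectively final
}
```

So a lambda can read the local variables around it, but it cannot use them as mutable shared state.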
Exactly the same kind of idea goes for sequential versus parallel stream processing. In other words, the result you get by going parallel should be functionally identical to the result that you get from being sequential and what that means is that everything should work the same and therefore the constraints that are required for parallel processing should also apply to sequential processing as well. That means that we are going to exclude the use of side effects in stream processing. When you are processing an element of a stream, you cannot depend on the result of having processed any other stream element, because in principle, in a parallel stream, you do not know what sequence they are going to be processed in. You are not allowed to depend on mutating any field because you do not know in what order these stream elements are going to be processed and also because this is going to give rise to contention. Either it is unsafe with parallel streams or else, if you make it safe by implementing mutual exclusion, you are going to get a lot of contention which will destroy the idea of having parallel streams. You are not allowed to mutate the source of a stream once stream processing has begun. So, the restrictions on using side effects in stream processing are quite severe, but they are also quite logical.
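A short sketch of what this restriction means in practice: instead of mutating shared state from inside the pipeline, let a collector build the result. The names `SideEffectFree` and `doubled` are invented here, and the discouraged variant is shown only as a comment.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SideEffectFree {

    // Discouraged: adding to a shared list from inside the pipeline.
    // On a parallel stream this races on a non-thread-safe ArrayList:
    //
    //   List<Integer> shared = new ArrayList<>();
    //   IntStream.range(0, n).parallel().map(i -> i * 2).boxed().forEach(shared::add);

    // Preferred: no shared mutable state; the collector assembles the result.
    static List<Integer> doubled(int n, boolean parallel) {
        IntStream source = IntStream.range(0, n);
        if (parallel) {
            source = source.parallel();
        }
        return source.map(i -> i * 2)
                     .boxed()
                     .collect(Collectors.toList());
    }
}
```

Because the pipeline has no side effects, the sequential and parallel runs produce the same list in the same order.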
Once you understand the basic idiom of stream processing, they make a lot of sense. I think people might fall over those in the first place. You were asking about pitfalls earlier – I think that people might fall over those in the first place, before they get a feel of how stream processing is intended to work. It is always the case with any API. You have to get into sympathy with the ideas of the designers and this one as much as any other.
Oh, of course. Well naturally it is always possible to abuse any API. How do you do it? Well, you just overuse it, I suppose. I think the worst style offense that I have seen is in writing stream programs. Stream processing typically involves a number of intermediate operations, each one of them taking Lambdas as arguments, and it is possible to write very long Lambdas. You can write very verbose Lambdas just as much as you could write verbose anonymous inner classes. So if you have multi-line Lambdas and many of them within a stream processing statement, the result is going to be pretty unreadable. So that is a style offense I do not like to see.
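One common way to avoid the problem described above is to move longer lambda bodies into named methods and refer to them by method reference. This is a suggestion sketched with hypothetical names (`ReadablePipelines`, `normalize`, `looksLikeEmail`), not a recommendation made in the interview, and the email check is deliberately simplistic.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class ReadablePipelines {

    // Each step gets a name, so the pipeline below reads as a list of intentions.
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    // Illustrative check only: an '@' that is neither first nor last.
    static boolean looksLikeEmail(String s) {
        int at = s.indexOf('@');
        return at > 0 && at < s.length() - 1;
    }

    static List<String> cleanEmails(List<String> raw) {
        return raw.stream()
                  .map(ReadablePipelines::normalize)
                  .filter(ReadablePipelines::looksLikeEmail)
                  .collect(Collectors.toList());
    }
}
```

The pipeline stays one short statement however much logic each step contains.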
7. Now what will be the main difficulties, obstacles, in the adoption of Lambdas and streams?
The main one is going to be that Java programmers are used to imperative programming and here comes a little bit of functional style. Once you understand them properly, the advantages are really great. Your code is definitely more concise, it is more readable and it is more maintainable. But you have to get your head out of the very strong compulsion that we have of seeing everything as being ordered and iterative and thinking in terms of “I am going to tell you what to do” rather than thinking in terms of “Here is an action you can carry out”. That is the essence of the functional style here. It just requires a transition in the way that you think about things and people will find that difficult in the first place. But my experience has been that once people see it, it is a light bulb moment.
Barry: I am comparing it in my head to the transition from strictly procedural programming to object-oriented programming and I am wondering if the way that we learned to phrase things in terms of objects and classes is going to be a similar learning experience to the way that we will learn writing in terms of Lambdas.
You kind of needed a light bulb moment or a series of light bulb moments in order to get the idea of object orientation, I think. I believe that – this one is going to be an awful lot easier – it is a similar kind of move in the way that we think of how our programs work, but I do not think it is nearly such a big jump as was required. Certainly speaking for myself, the light bulb was a long while coming on in my case before I got the idea of object orientation. This, I think, is a lot easier.
8. Can I have your take on purely functional languages?
Yes. That is a very interesting question. I have seen the functional programming academics. I know some of the super stars in functional programming – Philip Wadler, my co-author on “Java Generics and Collections” is one of those. He was one of the developers of Haskell. I see a kind of look of triumph in their eyes that has been growing over a very long period, like 20 years or more than that, and they think that multi-core arriving means that we are just going to have to do things their way because “shared mutable state” are the words that strike fear into everyone’s heart, and where we could avoid those problems before, with multi-core they are coming more and more to the fore.
So, the different cores need to cooperate on tasks, all of which depend on the same data. So, actually synchronizing them or coordinating their access to the data is going to be a big problem. Now, functional programming seems to have some of the answers to that. I actually respect them and I accept their triumph to some extent. I see why they are feeling so pleased with themselves. They feel like they have been prophets in the wilderness for 20 or 30 years and finally, their dream is coming true. But I think that purely functional languages are not the future because I think the benefits of being able to mutate state – we see that hugely in Java and we do not think that Java is going to go away from mutating state. I mean actually, the major innovation of streams in Java 8, the big accomplishment, was something that I have not mentioned so far, which was an extension of the idea of functional reduction. Functional reduction only really works if you have immutable values that you are composing at the end.
In Java we have mutable collections and we are going to continue to have mutable collections and one of the big achievements of the stream framework is that they found a way of getting parallel streams to eventually dump their data into non-thread safe mutable collections and to do that in a thread safe manner. That is actually a significant achievement.
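This "mutable reduction" is exposed through the three-argument form of `Stream.collect`. The sketch below (class name `MutableReduction` is invented) collects a parallel stream into a plain `ArrayList`: each thread fills its own list via the accumulator, and the combiner merges the partial lists, so no list is ever shared while it is being mutated.

```java
import java.util.ArrayList;
import java.util.List;

class MutableReduction {

    static List<String> toArrayList(List<String> words) {
        return words.parallelStream()
                    .collect(ArrayList::new,     // supplier: a fresh container per thread
                             ArrayList::add,     // accumulator: add one element
                             ArrayList::addAll); // combiner: merge two partial results
    }
}
```

`Collectors.toList()` and the other predefined collectors package up the same three functions behind a name.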
I do not believe that Java, which depends so heavily on mutable data, is going to be replaced any time soon by a functional language because the efficiency of mutation, when you can do it safely, is really very hard to beat and although processors are much faster and memory is much cheaper, I do not see that relying entirely on copy-on-write style thread-safety is going to be practical on a large scale any time soon.
Barry: Maurice, thank you so much for doing the interview.
Thank you very much.