Transcript
Kuksenko: I'm going to talk a little about our project, Valhalla, and what we have done at Oracle in that area. I'm not going to answer the question of whether Java needs inline value types. I'm just showing you what we have, and you will decide if you need it. First of all, sorry for this, but I have to show you this slide from my company; I really like it. It means that I can tell you total crap and the company will be fine with that. I'm protected.
Who am I? I work at Oracle, on Java performance. My goal is to make Java faster. That is why I will not talk much about the functionality of inline types here; I will talk about what I see in terms of performance, what we've got, and how it works.
What Is Valhalla?
First of all, I'd like to show you one simple demo. It's the execution of two programs drawing the Mandelbrot set, zooming in on each frame, computing at every step. On the left side is classic Java code, as you could write it right now. The other program, the one captioned "inline", is using inline types. The difference between the programs is only seven characters: six characters for the keyword inline, and one space. That's all. You can see the frames per second we get here, and how it works.
What is Valhalla? Valhalla is a large project we started several years ago. It doesn't have a single goal; we have several sub-projects. The key idea is to provide new types that we call inline types. Initially in the design, if you read the whole discussion on our mailing lists, they were called value types. Later, people started looking into the Java Language Specification and the Java Virtual Machine Specification. You can find the word "value" 800 times in the Java Virtual Machine Specification and 1,000 times in the Java Language Specification. If you add more values, it will become some kind of mess. We decided to rename it.
We have almost finished inline types. We have a very good prototype, several sets of prototypes, and that part has become quite clear for us. The other large part of Project Valhalla will be specialized generics. We haven't started the implementation yet; it's only at the prototype and language-discussion level. Of course, we also have to provide library migration for all the new features. There are also sub-projects of Valhalla like Nestmates, which has already been done in Java, starting with JDK 11.
I'll just briefly describe what an inline type is. Our goal is to provide better memory layout, and we found that the one thing that prevents us from laying out data better is object identity. It could be considered as an address, a memory location, or something like that; in general, it's identity. Identity has several effects: it requires allocation on the heap, and while locking doesn't generally require identity in other languages, in Java identity is coupled with locking, monitors, and all of those things.
We decided to make types without identity. Why? The natural question is: why can't the JVM, a smart machine, understand by itself that we don't need identity, eliminate it, allocate the object on the stack, or do whatever replacement works? In reality, the JVM can do it. There is escape analysis, and there are other things that provide better performance in such cases. Just as an example, I went to the ACM digital library and searched for escape analysis articles. I got approximately 8,000 articles. Then I tried "synchronization elimination": again, approximately 8,000 articles. Then I tried other similar keywords, like stack allocation: again, 8,000 articles. They're probably largely the same articles, because the logic and analysis are similar. It means we have around 300 articles per year, if you take into account the 25 years of Java's existence.
First of all, it would be difficult to implement everything that people invent and write in articles. The other thing is the high rate of articles: almost every day there's a new one, which means that there is demand, that existing algorithms are not enough, and that we have problems with this. It's too difficult to implement a proper, effective, automatic solution. So we decided to give that task to you, the Java developers. You tell us where identity is not required, and we'll optimize it. That requires new types in Java.
What is an inline class? First of all, it's a class. Absolutely, it's a class, but with a simple keyword. It has no identity. It's immutable; that's an essential requirement. You can't assign null to variables of that type, and it has no synchronization at all, of course. This is an example of the code. We assume that everything is final: all fields are final, the class is immutable. Also, an inline type has no deep class hierarchy. Every inline class extends Object, and it's final; you can't create a class hierarchy here. One nice thing we do get is that an inline class may implement any interfaces you wish.
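For illustration, a declaration might look like this in the Valhalla prototype syntax of the time (a hypothetical sketch; it will not compile on a standard JDK, and the syntax may still change):

```java
// Valhalla prototype syntax sketch, not standard Java.
inline class Complex implements Comparable<Complex> {
    double re;   // implicitly final: inline classes are immutable
    double im;

    Complex(double re, double im) { this.re = re; this.im = im; }

    // Inline classes are final and extend Object,
    // but may implement any interface.
    public int compareTo(Complex other) {
        return Double.compare(re * re + im * im,
                              other.re * other.re + other.im * other.im);
    }
}
```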
As for changes of semantics, there is the == operation, reference comparison in Java. We had a very long discussion about what we could do here for inline types. What should == do? Should we forbid the operation? Should we make it always false? We tried that option. In the end, we chose the so-called substitutability check: a recursive comparison of two values, field by field, and if a field is itself an inline type, you go down recursively, and so on.
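As a plain-Java sketch of how that recursive comparison behaves (the reflection-based helper and the Point class are mine, purely for illustration; in Valhalla the check is built into == itself):

```java
import java.lang.reflect.Field;

public class Substitutability {
    // Hypothetical helper: recursive field-by-field comparison, imitating how
    // Valhalla's == behaves for inline types. Plain-Java simulation only.
    static boolean substitutable(Object a, Object b) {
        if (a == b) return true;
        if (a == null || b == null || a.getClass() != b.getClass()) return false;
        try {
            for (Field f : a.getClass().getDeclaredFields()) {
                f.setAccessible(true);
                Object fa = f.get(a), fb = f.get(b);
                if (f.getType().isPrimitive()) {
                    if (!fa.equals(fb)) return false;   // reflection boxes primitives
                } else if (!substitutable(fa, fb)) {
                    return false;                       // recurse into nested fields
                }
            }
            return true;
        } catch (IllegalAccessException e) {
            throw new AssertionError(e);
        }
    }

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }
}
```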
If you write an inline class, it doesn't mean that we magically allocate it on the stack. It just means that the class has no identity. The virtual machine will decide where the data for that class resides. In general, if you're a third-party implementer of a Java Virtual Machine, to be compatible with the Valhalla project you may do nothing here: just accept the inline keyword, treat the inline class like an ordinary class, and it will work. Nothing prevents you from allocating it on the stack or on the heap and doing whatever you already do. For us, though, the key goal is to embed inline types into containers, such as other classes or arrays; and for locals, to put them on the stack or eliminate them completely.
There are two other functional features that I want to emphasize here. First of all, inline types extend Object. It means that an inline value can be assigned to an Object variable, and inline arrays are covariant with Object arrays or interface arrays. Even if it's a flat array, there's no difference; you can assign that array to an Object[] variable. Also, every inline type has a box companion. Right now, the syntax for it is the type name plus a question mark. We don't like that syntax, it's ugly, and we already have an idea of how it will be changed. Anyway, it's a way to force an inline type to be allocated on the heap. It can be called boxing.
People rely on the fact that if you box primitives, you get the features of identity: normal reference comparison, synchronization, etc. For inline types, even if you box them, you won't get identity. You still won't be able to synchronize on that class, and the same goes for reference equality.
Code Like a Class, Work Like an Int
That was a very short and very shallow description; it would require the whole presentation to describe all the features of inline types, but I want to talk about performance.
For a long time, the key slogan of our project Valhalla was: code like a class, work like an int. Let's check it. Let's check that inline types work like an int. Here's a piece of code from the demo I showed you: it counts, for each point of the image, whether it belongs to the Mandelbrot set or not. We made complex numbers an inline type, and what did we get? Here's the comparison: the average time in nanoseconds for the whole computation, for the reference class, for primitives, and for the inline class. You can see that the performance of the inline class is the same as the primitives, without the heap allocations of the reference class.
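A minimal sketch of the kind of code involved, using a plain immutable class where the demo would use an inline Complex (all names are mine; the real demo code differs):

```java
public class Mandel {
    // Plain immutable class standing in for what would be `inline class Complex`
    // under the Valhalla prototype; every operation returns a new value.
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
        Complex plus(Complex o)  { return new Complex(re + o.re, im + o.im); }
        Complex times(Complex o) {
            return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
        }
        double norm2()           { return re * re + im * im; }
    }

    // Iteration count for one point: how fast |z| escapes under z -> z*z + c.
    static int iterations(Complex c, int limit) {
        Complex z = new Complex(0, 0);
        for (int i = 0; i < limit; i++) {
            if (z.norm2() > 4.0) return i;
            z = z.times(z).plus(c);
        }
        return limit;
    }
}
```

With a reference class, every plus and times allocates on the heap; an inline Complex removes those allocations, which is where the demo's speedup comes from.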
Another interesting fact about that graphical Mandelbrot demo is that it requires far fewer heap allocations, and there is a lot of performance benefit from the hardware point of view. It really has much better scalability. I took the program and ran it with different numbers of threads on a large server machine. The red line is the typical reference implementation of the complex class, and the green line is the inline class. They started fairly close: around 30 operations per second for the reference implementation and 50 operations per second for the inline implementation. When I executed 64 threads on a 64-core machine, I got a 16-times boost.
Ok, let's consider another situation. I like this function; I use it for checking method invocation: how inline types are passed as method arguments and results. I checked two variants, a reference class and an inline class, as an argument and as a result, and there is also a handwritten primitive version. The inline class works exactly like the primitive, which is what we had to check. Array access: what do we have here? On one side we have an ordinary reference array, where the array contains references to a class; on the other, the so-called flat array, where there are no references and all fields of the inline class are allocated in the same memory.
Let's check the cost of random access. In blue is the primitive, the cost of random array access; the other line is the reference. Again, the primitive and the inline class completely coincide from the performance point of view. That was random access; maybe in the case of sequential access we'll get different results, because the hardware helps significantly there. Yes, the absolute results are different, but primitives and inline types still have the same performance.
Collateral Damage in Legacy World
That was the first section. I believe I proved that inline types work like a primitive, like an int, as it should be. This section is different. In my history with Java, Valhalla is the first project, the first attempt to add a new feature to Java, where we couldn't just add the feature and leave all legacy code untouched. That had been the case for many years, probably for the whole evolution of Java. Unfortunately, for Valhalla inline types we can't do it; it's impossible. There is some impact on legacy code, and in this section I want to tell you about the results, so you know which negative effects we got.
The key point is the following. An inline type can be inlined into an array, into an enclosing class, onto the stack, into method parameters, whatever. We already know that it works like an int, like primitive types, but in some situations we have to allocate an inline class on the heap. It could be done intentionally, by using the type-name-plus-question-mark notation. It could be caused just by assigning an inline value to an Object or interface variable, since it is assignable to those. In that case, we have to create a reference, put the object on the heap, and leave it there. Or the JVM may simply decide not to allocate the object on the stack.
For example, the very aggressive optimizations of inline types are done by the HotSpot JIT compiler, while the HotSpot interpreter doesn't care about inline types and allocates them on the heap. Functionally, it works. What does that mean? It means that when you have a reference, you don't know if it refers to an inline type or not, and if it's an inline type, the semantics of some basic operations change. The first operation we have an issue with is the reference comparison operation, ==.
On one side is how we do it right now; on the other side is what we have to do in the case of the Valhalla project. The key question here is, of course, how to check whether a class is an inline class or an ordinary Java class. In the first implementation, we just put that information into the class descriptor: we go into the object, which has two internal words, the mark word and the klass pointer, and follow the klass pointer into the class descriptor to check it. That's not a good idea: it requires two dereferences, and probably costs us two cache misses.
What if we put the information about inline classes into the mark word? That was a long fight. If you look into the HotSpot sources, the mark word is a very tight place; it's packed with information. We were unable to find a single free bit in the mark word, but we found a bit pattern which means "permanently locked". Inline types cannot be locked, so that pattern could be used by us. We implemented it, and right now we do only one check and one dereference, into the mark word.
Here is code that I wrote; let's check what assembly we get after compilation. This is the code for good old Java, as you have it right now: just compare two references and do one action or another. This is Valhalla: it's a little more complicated, and it's not even the whole code, because in the right bottom corner you can see an invokedynamic. That invokedynamic performs the substitutability check for inline types, and there is a bunch of code behind it. So there is a difference, but if you don't have inline types in your program, you'll never enter the green value world where inline types exist.
Even for reference classes, the performance became more complicated. Initially we just compared references. The second comparison, against 405, is our magic pattern check for whether the class is inline or not, and it costs. Another issue we found is that if a loop contains an invokedynamic invocation, that loop won't be unrolled by HotSpot. We didn't need that before, so it's not implemented; we will implement it, but right now that's the real status. That is why this comparison has three different lines. One is for ACMP in the Valhalla implementation: not inline comparison, but the typical old reference comparison. For the baseline, before Valhalla, there are two versions, unrolled and not unrolled, because they have different costs and I want to show that.
In the worst cases, the operation became more than two times slower. For reference comparison in the best cases, it didn't get that bad. The other operation we prohibited for inline types is synchronization: it just throws an exception if you try to synchronize on an inline type. But here everything is fine; all our benchmarks show less than 1% difference, and we don't worry about that anymore.
Arrays: if we have an Object[] array at runtime, it may be a reference array, but it may also be an array of an inline class, which is flattened. That also causes additional runtime checks. And there is a problem: we have to know that our array is a flattened array, and we can't put that information inside the class descriptor, because an array of an inline class could be flattened or not flattened, represented as an array of references. Otherwise, we would have to make two different class descriptors and deal with the mess of the same class having different descriptors. We don't have a spare place inside the mark word, so we decided to use the klass pointer. We only needed 2 bits for that information, so we spoiled the klass pointer a little, which is part of every object header.
It's fine to do that, because it's impossible to write a program with such a huge number of classes that those bits of the klass pointer would actually be needed; there would not be enough memory for that many class descriptors.
What does it mean? It means that for every access to the class descriptor of an array, we have to mask those bits off. It causes a small performance regression, but in reality we checked, and HotSpot eliminates it perfectly. Anyway, for loading values from an array, compare which actions are done in the current case with what we will have to do in Valhalla. The situation is similar, but different, for the store operation: when storing something into a possibly flattened array, that also has to be checked.
I won't show all the charts; the situation for array access is quite varied. There are benchmarks that showed a very large regression, down to minus 10%. In reality, the source of all the regression we got here is not a general inefficiency of inline types; it's just that HotSpot doesn't have some particular, specific optimization that would help in this case. We simply didn't need it before; we'll implement it and check. The last example is not related to any kind of functional semantics. What do I have here? I have two Integer arrays. Very short arrays, just 1,000 elements, and they are equal except for the last element. I'm checking the performance of the Arrays.equals operation, and that is why I put the difference in the last element.
Here is the performance difference. Right now, Java does it in 600 nanoseconds and Valhalla does it in 900 nanoseconds. Can you guess what the problem is? It's Integer. It's an array of Integer, not an array of Object. Inline types have a very flat hierarchy: we know that if the static type is Integer, no inline types can be there, so the check could be dropped. The real problem is this: the size of the generated code for Arrays.equals used to be around 300 bytes, and it became more than five times larger in the case of Valhalla, just because of the invokedynamic stuff we are required to put in.
Our inlining of Arrays.equals failed, because the inlining limit is just 1,000 instructions. We know about that issue; we found it. We are not dealing with it right now, but obviously, before giving this to the public, we will either change the limits or do something specific for inline types. In general, the picture is not as bad as I'm showing you. We checked our whole benchmark database; there is no inline class code in it, we're just checking for regressions. We have no huge regression on the big benchmarks. There are a couple of percent on some large benchmarks, which are very difficult to analyze, so we started from microbenchmarks. If we deal with the microbenchmarks, the big benchmarks will be fine. When we finish with our micros, we will go to the big benchmarks.
Right now, I consider the whole picture very promising for performance, and I have a request for you. Inline types are in pretty good shape. In an upcoming release, we are going to make Valhalla a preview feature of OpenJDK. I can't say right now whether it will be 14 or 15, but it will be 14 or 15 when inline types become a preview feature in Java. Then you can try it. You don't need to write inline types; just try it with your application, with your code. Obviously, Oracle has much, much less code than you do. If you check it early and find some issues, performance regressions, or anything else in old reference legacy code, please report them. Try it. I hope it won't take a lot of time.
Brighter Post Valhalla Future
Let's finish talking about the negative effects of Valhalla, and see what it can bring us. I showed that inline types perform like primitives. What else? Here's a small set of examples where we see other benefits. I don't even need to argue that inline types are very useful for any kind of arithmetic code. We played with complex numbers, we did the Mandelbrot set demo, we played with a simple complex matrix multiplication and got a very nice result.
I played with some raytracing algorithms, and just by adding the keyword inline I got a speedup from 50 seconds of rendering to 30; I did nothing else. But let's consider some other examples. First of all, a very obvious example: java.util.Optional. java.util.Optional is our litmus test for Java library migration. It is declared to be a value-based class, and it will definitely become an inline type when that is ready.
I tried to do a little something with Optional, some benchmark to check how it may help us. Remember the old discussion about the design of Map.get: it returns null in two cases, when there is no such mapping inside the map, and when there is a mapping to null. When the Map interface was created, there was simply no Optional class yet. So let's make get return Optional, and measure what we get with a precise Optional-returning version. Here is the comparison between the current, classic reference Optional and a future inline Optional. It's the average time for one million get operations from a HashMap which holds one million pairs. I think these are pretty nice results.
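The idea can be sketched in today's Java (the helper name is mine; under Valhalla an inline Optional would make this wrapper allocation-free, while today each call may allocate a box):

```java
import java.util.Map;
import java.util.Optional;

public class OptionalGet {
    // Sketch of the benchmark idea: a get that returns Optional instead of null,
    // so "no mapping" and "mapping to null" are no longer conflated.
    static <K, V> Optional<V> getOptional(Map<K, V> map, K key) {
        return Optional.ofNullable(map.get(key));
    }
}
```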
Another example is a task that I have solved many times in my life: mapping several keys to one value. I know only two ways to do it: either I write a new composite-key class and name it somehow, or I use a map of maps of maps. If you know another solution, please tell me; I'm genuinely curious. Let's check the performance. I made a simple map from two integers to one integer, did one million gets from a map with one million entries, and compared a reference composite key, a map of maps, and an inline composite key. Which one is better? I prefer the composite key from a code-reading perspective; it's much easier to understand than a chain of maps, but a chain of maps currently works a little bit faster. Inline types eliminate that difference.
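The two approaches can be sketched like this (class and method names are mine; under Valhalla, Key would be declared inline and flattened into the table, removing the per-key allocation):

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeKeys {
    // Variant 1: a composite key class. Plain immutable class here;
    // an inline class under Valhalla.
    static final class Key {
        final int a, b;
        Key(int a, int b) { this.a = a; this.b = b; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).a == a && ((Key) o).b == b;
        }
        @Override public int hashCode() { return 31 * a + b; }
    }

    // Variant 2: map of maps, the other way to map two keys to one value.
    static void putNested(Map<Integer, Map<Integer, Integer>> m, int a, int b, int v) {
        m.computeIfAbsent(a, k -> new HashMap<>()).put(b, v);
    }

    static Integer getNested(Map<Integer, Map<Integer, Integer>> m, int a, int b) {
        Map<Integer, Integer> inner = m.get(a);
        return inner == null ? null : inner.get(b);
    }
}
```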
Another thing we are working on: a HashMap has entries, and entries are chained inside buckets. What if the first entry of each chain is flattened into the bucket array, by making the entry an inline class? We don't care about the rest of the chain; that's a question for your hash function, not for us. The performance results are a little controversial; there are some interesting effects, but in general it shows that performance improvements are possible, not huge ones, but real. In that case you don't even have to know anything about inline classes; it's an internal opportunity for the JDK.
Another example is a simple iteration over a HashMap in a benchmark. If you write that benchmark, you may find that in different circumstances, in different environments, not necessarily in this particular version, such iteration over the whole map may take significantly different amounts of time.
What's the difference? Why does it happen? Here's the answer: the problem is the iterator. When it's possible to eliminate the iterator, we get much better performance. Please don't confuse this with heap allocation. That difference, those 20 milliseconds, is not the cost of the heap allocation of the iterator: we are doing a loop of one million iterations and we allocate the iterator only once. It obviously can't cost us 20 milliseconds. The problem is that the iterator has state, that state has a reference field, and that field has to be updated on every move as you step through the iterator. And if you update a reference field, you have to execute garbage collector write barriers.
Those 20 milliseconds are the cost of the G1 GC write barriers. If you run it on ZGC, which has no write barriers, you won't see this effect, but you will see other effects, because ZGC has other barriers.
What could be done? Why can't we make the iterator an inline type? Because it's not immutable; to make something an inline type, you have to make it immutable. But what about an interface like this one? We call it a cursor. There is no way to update a cursor; it's immutable. If you want to move to the next element, you get a new cursor: you create it rather than change it. Rewriting that kind of benchmark with an inline cursor, here are the performance results. It's the same as the scalarized version, with no GC barriers. We are happy with it; the best part is that we don't rely on escape analysis and always get nice results.
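A sketch of the cursor shape in today's Java (interface and class names are mine; under Valhalla, ArrayCursor would be an inline class, so each next() would allocate nothing and update no reference fields):

```java
public class Cursors {
    // Hypothetical cursor shape: instead of a mutable iterator whose state
    // field is updated on every step, each advance returns a new immutable
    // cursor value.
    interface Cursor<E> {
        boolean exists();
        E element();
        Cursor<E> next();
    }

    static final class ArrayCursor<E> implements Cursor<E> {
        final E[] array;
        final int index;
        ArrayCursor(E[] array, int index) { this.array = array; this.index = index; }
        public boolean exists()  { return index < array.length; }
        public E element()       { return array[index]; }
        public Cursor<E> next()  { return new ArrayCursor<>(array, index + 1); }
    }

    // Loop shape: c is reassigned to a fresh cursor, never mutated in place.
    static int sum(Integer[] data) {
        int s = 0;
        for (Cursor<Integer> c = new ArrayCursor<>(data, 0); c.exists(); c = c.next()) {
            s += c.element();
        }
        return s;
    }
}
```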
This is probably my last example for the talk. In general, every program just moves data from one place to another. There are some computations in between, but it's moving data to this place, to that place, etc. It looks like the cost of moving inline types should be higher: they are larger. We no longer move just a simple four- or eight-byte reference to a heap-allocated object, leaving the object itself untouched; when moving an inline type, you have to move the whole value from one place to another.
I chose sorting algorithms: they are a very nice example for measuring data-moving performance. I did three implementations. First, I took the default TimSort implementation from the JDK as the reference. Then I reimplemented it identically for inline types: the sequence of compare operations is absolutely the same, because I only care about moving data. Also, since inline types could be very large, there was an idea that maybe we should do an optimization: first sort primitive indices, and only after that restore the order by moving the inline values.
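The index-sorting idea can be sketched like this (a simplified version using double values in place of large inline types; the helper name is mine):

```java
import java.util.Arrays;
import java.util.Comparator;

public class IndexSort {
    // Sketch of the "sort indices first" variant: sort an array of positions
    // by the values they point at, then place the (potentially large) values
    // in one final pass instead of moving them on every swap.
    static double[] sortViaIndices(double[] values) {
        Integer[] idx = new Integer[values.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));
        double[] result = new double[values.length];
        for (int i = 0; i < idx.length; i++) result[i] = values[idx[i]]; // restore order
        return result;
    }
}
```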
I can't show you all the results of the sorting measurements at once; I fit them all onto one slide, so I'll show it piece by piece. First, small arrays, when the data fits into the L1 cache. On the X axis are different sizes of my objects, from 16 up to 64 bytes. The Y axis is time in microseconds; lower is better. Here, for small arrays, for data which fits into L1 cache, I see that in all cases the simple inline class is faster than references, even though I move more data.
Let's make the array a little larger, so it doesn't fit into L1 but fits perfectly into L3, on that laptop. All the data was gathered on my laptop, just to show you. Again, inline classes are faster everywhere, with a somewhat smaller margin for the very large, 64-byte inline class. Anyway, sorting with indices helps us in that case. Let's make it larger still, so it no longer fits into L3. Here is the only case where references win, but only for large classes.
Let's make it even larger, say 400,000 elements. The situation turns back again: our inline class is faster than references. Then I did yet another, larger array, and again the inline class is faster. Indices are better in that situation, but inline classes are still better than references. The general conclusion I drew from these simple results is that the dense layout we get for inline classes matters more for performance than moving less data. The cost of chasing references is higher than the cost of moving more data, and that gives us the benefit.
I think inline classes give us quite a lot of performance benefits, and we can use them in the future if we want. Here are the links for our project. We have a publicly available wiki page which contains a lot of details. If you wish, you may participate in the discussion. We have two mailing lists: one for the specification expert group, and a general discussion list for Valhalla. You may ask any questions and show any examples of what you have. If you wish, you can download the build and try it. I promise that soon it will be available for trial, not on by default, but under an option.
Questions and Answers
Participant 1: I wonder how does Project Valhalla play with GraalVM, for example? Do you have insights on performance? Do we get anything automatically? Do you have to work on something special for it?
Kuksenko: That's really a question for the Graal team; it's a separate team, and right now there is no inline types implementation in Graal. It's just a question of implementing it. When they implement it, we will check its performance.
Participant 2: This is very similar to C++, you guys are kind of closing the gap. How do you compare to C++? One. Two, what happens when you assign two value types? Do you copy or do you do move semantics? Do you consider move semantics from C++, etc.? I think C++ went through this many years ago.
Kuksenko: It's a little bit different. In C++, even if an object lives on the stack, without a reference, not in the heap, it still has identity: it has an address.
Participant 2: You can take the address of operator, but you don't have that. You kind of simulate it.
Kuksenko: We don't have it at all. There is some description of how values are put into an array, but there is no limitation on the layout; changing it is possible. We don't play with that automatically yet; it's not a goal. For example, if you have an array of inline types, do you know about the AoS-to-SoA optimization, from array of structures to structure of arrays? If you have an array of structures, you move to a structure of arrays: every field of the original structure goes into its own array. Sometimes vectorization benefits from that optimization. Theoretically, we could do it automatically; we haven't played with that, and we don't have that goal right now.
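A sketch of the AoS-to-SoA transformation in plain Java (class names are mine; under Valhalla an array of inline Particle values would already be densely laid out, and a JVM could in theory perform this transformation itself):

```java
public class AosToSoa {
    // Array of structures: each element is an object with two fields.
    static final class Particle {
        final float x, y;
        Particle(float x, float y) { this.x = x; this.y = y; }
    }

    // Structure of arrays: the same data, one array per field.
    // Vectorizers often prefer this layout for per-field loops.
    static final class Particles {
        final float[] xs, ys;
        Particles(Particle[] aos) {
            xs = new float[aos.length];
            ys = new float[aos.length];
            for (int i = 0; i < aos.length; i++) {
                xs[i] = aos[i].x;
                ys[i] = aos[i].y;
            }
        }
    }
}
```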
There is no limitation in the specification on how we put values into an array; it's the same for any container. If, for example, we're talking about local variables, the current implementation has no stack allocation of inline types at all: a local variable of an inline type is just a set of primitive variables, one per field. It's completely scalarized. Every field of the inline type may live in a different place, and that enables very nice optimizations.
Participant 2: Lastly, what happens when you assign?
Kuksenko: On assignment, you move the whole value of the inline type.
Participant 2: You do the copy?
Kuksenko: Yes, sure.
Participant 2: And to an object? When you assign an inline type to an Object?
Kuksenko: If we assign an inline value to an Object, we have to do a heap allocation. We have to do boxing. We box it.
Participant 2: You have to box it?
Kuksenko: Yes, we have to box it, though HotSpot can eliminate the boxing when it can prove the box is not required, of course.
Participant 3: I had a question about backing arrays of inline types with native memory? Is that a heretical question?
Kuksenko: It has already become a practical question; we get that question constantly. The first task for us, the first implementation, is to join Valhalla with the Vector API. You know what the Vector API is. Vector API values should be inline classes, but the Vector API requires a specific layout of data, which depends on the hardware. In general, right now, users have no control over data layout. Somehow that problem will be solved for the Vector API: its values will be inline classes, but with the required hardware layout. The general question is how to control the layout of inline types in other cases beyond the Vector API. Mostly it's a joint project, to be done when we move data to native memory, off the heap. Right now there is no solution we can show you, but it is on our list to provide that for you.
Participant 4: Do you plan to support variable-length inline types? Effectively converting String, for instance, into an inline type?
Kuksenko: No. Variable-length classes, so-called fused classes like fused strings, are not really related to inline classes, because you could still allocate them on the heap as typical Java objects, just attaching the character array to the string in one allocation. So-called fused strings, or any kind of fused objects like big integers, are candidates for a project called Arrays 2.0; from that project we will probably get fused strings and so on. The other question is: is it possible to make String an inline type? String has just a few fields and it's immutable; leaving aside the character array, could you make String an inline type? The answer is yes, we could, but it looks like we won't do it, because so many applications depend on String identity. They synchronize on strings, they base functionality on String.intern and reference comparison, and if we changed String to an inline type, a large set of applications in the world would crash.
Participant 5: We are running some big servers using Java. If I turn Valhalla on, should I expect the thread-stack memory usage to increase and the heap memory usage to decrease? Is that correct? I mean at the thread level.
Kuksenko: I don't think it will be that way. Yes, the size of stacks increases; there are even examples where you may get a StackOverflowError earlier in the case of inline types, but that's mostly related to having inline types as arguments and results, not as locals. For locals, it's pretty highly optimized in the current examples. I have definitely never seen stack usage have a very large impact there. From my experience, from my data, the impact mostly comes from the heap, not the stack. In that case, I think it will be even better. In the other case, the HashMap comparison with inline entries, I saw the number of cache misses decrease significantly because of data locality. If you follow a reference, you jump to another page and still may miss. If the data is local, you may need slightly more pages, but you access them sequentially, not through random references to other pages. That provides better behavior for the TLB as well.