
Harnessing Exotic Hardware: Charting the Future of JVM Performance


Summary

Monica Beckwith discusses the world of the JVM and its evolving relationship with exotic hardware. She presents a hypothetical scenario where GPU optimization plays a pivotal role.

Bio

Monica Beckwith is a leading figure in JVM performance tuning and optimizations. Her role as a Java Champion and the author of `JVM Performance Engineering` underscores her dedication to the Java community. Monica's significant contributions and thought leadership have enriched the community and set new benchmarks in JVM performance optimization.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Beckwith: I want to show how the JVM can harness this kind of specialized hardware. I'm Monica Beckwith. I'm a Java Champion. I'm a JVM performance architect at Microsoft. I've been working with OpenJDK since before it was OpenJDK, when it was Sun JDK and it was closed source. I started my Sun JDK career at AMD, where I helped bring improvements into Sun JDK. We'll focus on using OpenJDK with these kinds of advanced hardware, such as GPUs and vector units. For that, I want to build up to the story of where Project Panama is today, and the future with Project Babylon, which was recently announced at the JVM Language Summit.

Content

Why me? Why this talk? I have a JVM Performance Engineering book that should come out soon. In that book, I deep dive into OpenJDK and focus on the HotSpot virtual machine. I talk about various projects, and one of the projects I wanted to talk about, like I said, is Project Panama. I thought it has a great intersection with my life story, which I'll share shortly. The gist is that this is a chapter from my book, the concluding, looking-forward chapter. In this particular talk, what I want to do is introduce what exotic hardware means, and how the JVM plays with this concept of specialized hardware. Because remember, the JVM is a write once, run anywhere concept. How do we blend the two? The second thing I want to talk about is why it is so important. The major reason is that most cloud vendors are now offering it. You may not know that you have specialized hardware on the cloud machine that you're using. Why not optimize for it? Why not write code for it? One of the things that I want to talk about from the JVM perspective is, how does it happen? Where does the magic happen? It's really not just one pluggable block. What was the reason that we needed Project Panama today? That's basically an evolutionary story. It starts with language design and toolchains, and how we make it happen underneath. Project Panama is a necessity now, because we have seen various things, and everything points to the challenges that have led to where Project Panama is today, to circumvent those challenges.

I'm going to talk about case studies. We'll start with the baseline. Basically, I want to talk about JNI, because of some of the work that I've done with Minecraft. I worked on the port of OpenJDK for Arm. We wanted to make sure that Minecraft works on Arm devices. We worked with the LWJGL community, which is the gaming library community, and we made sure that the JNI stuff exists for Windows on Arm, as well as on M1s. They did the M1 work themselves, actually, because they were so eager to enable it, but we did help with Windows on Arm. That's my baseline, where we are. You can write your JNI library, write your wrappers, and then cross your fingers that everything works. Hopefully it will; possibly it won't, but when it does, it's performant. That's the baseline. Then I'll talk about Aparapi, "a parallel API", basically designed to target OpenCL. You're bridging the gap between Java and OpenCL. You're thinking about a computing paradigm shift, where you're saying, I know what I want from OpenCL, so I am going to provide you everything within this API framework. Then came Project Sumatra, because there were a lot of issues with Aparapi as well. Project Sumatra tackled data interop and things like that, which set the stage for TornadoVM, as well as Project Panama. Sumatra fed into TornadoVM, which is right now the number one way we can interconnect OpenJDK and GraalVM and bring specialized hardware functionality up to the Java programming level. Project Panama was something that was developed based on the stories of JNI. That was one aspect of it. The other aspect of Project Panama, the vector API, is actually very interesting, because I talked about this. All these special-purpose units are available on your actual processor. If you look at any modern processor, even ARM64, for example, you have this vector unit; right now you have SVE2, and in the past you had Neon and SVE. These are evolving vector units that are highly capable. They're really used for vector computation. Intel actually pioneered this with their vector extensions stack: they had SSE4.2 and AVX2, and now they have AVX-512. All these things are available in your hardware. Then you have the native libraries that actually know how to speak the language of the vector unit: what's the vector length, and all that fun stuff. At the JVM level, we have something called the vector API now, which helps you actually bridge the gap. After this, I'll go towards the future, where the JVM and Project Panama will help us build on this high-level abstraction. I'll tell you the need for it. Then we'll talk about Project Babylon.

My Journey

My journey, again: why am I here? Why am I talking about this? This is my journey. I was born; I'm a Xennial. I learned to walk and crawl, but I had a different life, so I worked in a different field: basically, electronics board design and everything. There was a need at the Center for Astronomical Adaptive Optics, at Steward Observatory in Arizona, where they needed a person like me. Basically, they knew their stuff about how to make telescopes with deformable, multiple mirrors. They needed somebody to do the electronics design, and they wanted somebody to design the chiller for the actuators. These are deformable mirrors, and you have hundreds of actuators. They keep trying to readjust for the deformities in the wavefront. These actuators need chilling because they were constantly operating, so that's where I came in. I learned a lot from all the geniuses working on adaptive optics. I'm going to build up these case studies around that story of the adaptive optics. Then I got myself a permanent job at AMD. That's where my story with Sun JDK and OpenJDK started. I started working as a performance engineer for Sun JDK. Eventually, as all people that work with Sun JDK do, I went to Sun Microsystems. Then Sun Microsystems became Oracle, and I continued my journey there as well. Right around 2011, we had this polyglot VM show up on our radar, and that was called GraalVM. Everybody was talking about Graal. We were all interested in looking at Graal. I was right in the middle of it as well, as always, apparently, by my history. What happened is that at AMD, around the same time, Aparapi, "a parallel API", was born. AMD was trying to harness the GPU. If you know about general-purpose GPUs, AMD was one of the teams that started putting them on the processor design, the silicon on chip. They were thinking about how to harness Java for OpenCL, for example. Then, because Graal came in, AMD also moved to Project Sumatra, which actually used the Graal JIT to optimize the execution stack such that we could have optimized code for OpenCL, for example. Right around that time, because we were looking at the JNI issues already, Project Panama was born, originally thinking about foreign function interfaces, FFI, and how to do the linkage properly, and all those things, trying to avoid the overhead. Eventually it evolved to address the memory limitations that these projects exposed. Then, because of Sumatra, TornadoVM was born, because they also saw the opportunity with respect to GraalVM and optimizing for this exotic hardware usage. Somewhere in between, I was working at Arm as well as being independent. Then I started working at Microsoft. That's important because of, like I said, the LWJGL library, which is the baseline for all this. Then, recently, we went to the JVM Language Summit, and Project Babylon was announced there. That's where the future of the exotic hardware and JVM interconnect is headed.

Introduction to Exotic Hardware and the JVM

We have the JVM; as I mentioned, it's the traditional model. It's write once, run anywhere. That's all we know about. Then come these hardware accelerators, from GPUs to your vector units. They're available right now, everywhere. Even if you have one of those gaming boxes, you're of course using GPU offload. These are more common than you know; my kids use them, specialized graphics cards and everything. Of course, it's time for the JVM to be able to utilize those capabilities. It's always been time; like I showed through my history, these projects existed. One of the things that's more important now is that with Project Panama, we have the ability to overcome those challenges. One of the things that we have already established is that we need APIs and toolchains. We're talking about TornadoVM. We're talking about Project Panama. We're talking about the vector API. These are enablements that we need already. What they give us is portability. That's the whole concept of the JVM, Java, the whole runtime environment. You're using these APIs so that we can run from one system to the other. As long as you have a GPU, you can run these, thanks to those APIs and toolchains. They also abstract the complexity. You don't have to go learn how to program a GPU. You don't have to learn the kernels. You just focus on the application logic, and the APIs will enable you to do that, just like the vector API does. You don't have to worry whether you're running on AVX-512, or SVE, or Neon, or SSE4.2. You just use the vector API, and underneath it all, it'll choose the right stack based on your underlying hardware. Of course, there's the whole need for interoperability with native libraries and systems, where we're talking about these optimized libraries. For example, if you're working on an Intel system, they have this concept of oneAPI. You can think of intrinsics if you're coming from the JVM world. They have these specialized libraries that know exactly what the data width is, how many things you can pack together, and everything you can do. There is a need for these APIs and toolchains to talk the language of these native libraries, and of the actual underlying hardware. I mentioned the JVM and accelerators: you utilize the hardware capabilities. Then in comes this API and toolchain that helps interface with what the hardware accelerators provide. Like, "I am a GPU, I'm capable of this. I need you to write your kernels in this particular way." This is an understanding. For the end user, this is all underneath. All you need to know is what the API is, and which API you need to use for which particular purpose.

Exotic Hardware in the Cloud

I was thinking about this in the context of the cloud. I did some general research about how many cloud offerings provide different variations of specialized hardware. Prior to that, just the onset of cloud and how everybody is migrating: I just did a Google search on this. There are projections of more than 61% growth, to about $1.2 trillion by 2028. There are all these articles you can read about that. What I'm trying to say is that there is already a shift from physical to virtual. Everybody is looking for adaptability, because if you think about cloud, it provides you the scalability benefits. You can scale out, scale up. You already have GPU-equipped machines, so anything you get with AMD and the like, they have GPUs and other accelerators. I talked about the vector unit. There's the crypto unit; most processors today actually come with a crypto unit on board. It's not just about storage and scalability anymore. It's about harnessing these units. We do that in the JVM; we enable it for different units ourselves. We have, like I said, intrinsics to do that. We already provide you these enhanced capabilities. All you have to know is that they exist. All these FPGAs, ASICs, GPUs, and AI/ML processors like the ones on the M1, as you know, are available in the cloud as well. Basically, you have to understand that this is here today. We're already beyond tradition. We're already democratizing hardware. We're already pioneering the new era.

Role of Language Design and Toolchains

Let's dive into the language design and toolchain. There's a very basic stack right here. Your OS, for example, could be a guest operating system. Your JVM and JRE could be running on top of containers. Then your hardware could be heterogeneous hardware, specialized hardware, whatever you need. There's already a need for adaptive language design. That's because you have to harness the full potential of that hardware underneath. We already have that. Then there are some advanced toolchains that I'll talk about. One thing that we have to consider is how we deal with the language abstractions. What do you need to know as an application developer? There are some language abstractions, and how do we deal with them? Underneath it all, as compiler engineers, we already give you optimizations. We already do that today for general-purpose computing. We optimize for ARM. We optimize for Intel. We optimize for AMD. We optimize based on the layout of the system. We do lots of optimizations. That's how your runtime system is optimized as well. Then, of course, there's library support. I'll talk about these things in more detail.

Why do we need language abstraction? As a developer, you have your program, so we need something in between. I don't expect you to go and speak the language of every piece of hardware out there, because it's always evolving, and it's always getting better. As long as you have a programming interface, that's all you need to know. Underneath it all, we take care of the abstraction for you. Then, with respect to this specialized hardware, there's the whole concept of parallelism, which is very different from your general-purpose computing. The parallelism here is all about data-level parallelism. We're not thinking about how to optimize the cores, even though you may have 128 cores now. We're not talking about the core level or the instruction level; we're talking about data-level parallelism. That's how the mindset changes as soon as you go towards this exotic hardware. You're thinking in different terms. Sometimes a regular developer may not be able to express those abstractions and that parallelism at the code level, so we do that for you.

Every time you have a compiler, especially a dynamic compiler, you can run from one cloud provider, or on-premises, to another one. We have the optimizations there. Any new hardware that comes out, specialized or general purpose, we have those optimizations for you. Then, of course, there's the whole concept of managing the scheduling on this diverse hardware. If you had a GPU and you had some OpenCL code, how does that happen underneath? Your runtime takes care of it. Then there's the whole memory management and data transfer. Basically, there's data transfer for your GPU, and there's data transfer for your CPU. They're totally distinct; like I said, the parallelism is different, and the data access and feeding back to the CPU is very different. It's A plus B: we do compiler optimizations, and we also provide runtime systems, so your JRE will be equipped with that as we go down into Project Babylon and Panama. Of course, there's the library support as well. We want to make sure that anything existing, like JNI, keeps working, that those libraries and frameworks work. Then the other thing, looking at the future, is that we want to make sure these are not only supported but also optimized. Again, that's how Project Panama was formed, because we were trying to optimize JNI, for that matter.

Case Studies - JNI + LWJGL

I want to provide real-world examples. We'll talk about various projects, because it's like a story. Then I want to talk about unique insights and challenges with each. There was a learning opportunity with each of these projects. That's how we're getting to Project Babylon. Let's zoom in. As I mentioned, with the JNI interface and LWJGL, what we do is use that library. Anybody who has ever written game code would know that you want to enhance the graphics, and the compute operations as well. There are lots of gaming libraries out there; LWJGL happens to be what Minecraft uses for Minecraft Java Edition users. What I'm trying to do here is show how the Java gaming library and the Java application interact. On your left, you would see a regular JDK. Of course, it has the JVM, the native method interface, the libraries, and everything. With LWJGL, there's a JNI wrapper, and that's your glue code. It's how you communicate. For example, in this particular case, it could be OpenCL, OpenGL, or OpenAL, which is the audio part of it, kind of like the whole experience. We write JNI wrappers for those, and that works through the JNI interface. That's how we speak with the lower-level libraries and everything like that. Eventually, it goes down to your heterogeneous hardware. Let's look at an example. My kids are very much into gaming and into theater. In this particular example, I want to give a quick shoutout to Shrek the Musical. What I want to do is show you how I would use it with LWJGL. I'm just trying to create a simple 3D scene. First, we initialize the graphics library framework. Then we create a window, and we create an OpenGL context, because that's what you need to do with respect to your LWJGL binding. You have to create the entire thing using these libraries, the API bindings, and so on. Of course, I'm creating this screen first. I want to set it to Shrek swamp green. Then we draw out our 3D scene. We not only draw Shrek's house, we also do Lord Farquaad's castle. Then we just figure out how to update the scene. It's the very simplest rendering, but what I want to highlight is that there's a lot of process. There are a lot of steps you have to do, because you're speaking to OpenGL. LWJGL will provide you all the bindings you need, but you need to know what those bindings are. Now the Java programmer actually turns into an LWJGL guru.
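To make that flow concrete, here is a minimal sketch of that kind of scene setup, assuming LWJGL 3's GLFW and OpenGL bindings; the window title, the RGB values for "swamp green", and the two draw helpers are hypothetical stand-ins for the slide code:

import static org.lwjgl.glfw.GLFW.*;
import static org.lwjgl.opengl.GL11.*;
import org.lwjgl.opengl.GL;

public class ShrekScene {
    public static void main(String[] args) {
        // Initialize the graphics library framework (GLFW).
        if (!glfwInit()) {
            throw new IllegalStateException("Unable to initialize GLFW");
        }
        // Create a window and an OpenGL context; the LWJGL bindings
        // require a current context before any GL call.
        long window = glfwCreateWindow(800, 600, "Shrek's Swamp", 0, 0);
        glfwMakeContextCurrent(window);
        GL.createCapabilities();

        // Set the clear color to a "Shrek swamp green" (hypothetical values).
        glClearColor(0.35f, 0.55f, 0.21f, 1.0f);

        while (!glfwWindowShouldClose(window)) {
            glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
            // drawSwampHouse() and drawFarquaadCastle() would issue the
            // actual 3D draw calls here (omitted hypothetical helpers).
            glfwSwapBuffers(window);   // update the scene
            glfwPollEvents();
        }
        glfwTerminate();
    }
}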

As you may have imagined, and as those who have used JNI before know, there are challenges and limitations. We're working with native APIs, so we need to be aware of the complexities we're dealing with. There's the whole concept of the memory management mismatch. The way Java and the JVM runtime work, they do the memory management for you. If you're talking to the GPU, you have to understand that the GPU is just thinking in terms of data. Now your managed memory is talking to non-managed memory, and you're just spewing data over there while your JVM is actually executing. Now there's all of the synchronization overhead. Then with JNI, you have to think about the JNI wrapper. Now you're speaking Java and you're speaking C++, which is one of the most used languages for the JNI interface. Then you have to think about the target application as well, and the native libraries as well. How do we make them interface with the glue code that our JNI needs? There's a performance overhead. We all know the JNI issues.

Bridging Java and OpenCL: Project Aparapi

Of course, this was not new to anybody. What we did at AMD was Aparapi, "a parallel API", which does this language abstraction for heterogeneous hardware and translates Java bytecode to OpenCL. This is how it looks if you program with Aparapi today. You have the Java application, which understands and uses Aparapi. Underneath it all, you have this OpenCL-enabled runtime. Anything that has OpenCL code and bindings will go directly via the enabled runtime path and get optimized, basically from your Java bytecode. As soon as it sees the Java bytecode meant for OpenCL, it goes to the optimized path and gets executed on the GPU. Now you see two distinct paths already. Here, when I provide the next example, I want to bring back the story I told about the adaptive optics. I see a way to actually do this parallel processing of those aberrations. With your secondary mirror, when you look at your wavefront, you need to correct those aberrations to be able to get a good image. You can do it in real time by utilizing these APIs, if you are going to program in Java.

Our parallel computing task is adaptive optics wavefront correction. We have a lot of sensor data. That's true for any astronomical observation or any data gathering: it's always a lot of data. It's big data. It's rendering. It's all real-time. What we want to do is not only have the data, but adjust the data, correct the data. Remember, I talked about those actuators. In real time, we deform the secondary mirror to actually get the correct image. We're correcting the image in real time, and we're sending the signals to the actuators. I'm not going to show all the code here; it's a wish list of things. We get the wavefront data; in this particular case it's a 2D array. With Aparapi, because it provides you the API, you have to initialize the kernel. The kernel is the way we talk to the GPU. That's the language of the GPU. You initialize that, and then you say, this is my logic, this is what I want you to do, with respect to my kernel. That goes in the run method. You get the IDs, because it's a 2D array, like I mentioned. This is just a very simplistic version. This is where I'm doing my correction. What I'm saying is just multiply by 2; it's a very simplistic thing. These are not the actual calculations you would do; you would do aberration correction based on very complex logic. Then, what next? I have this correction, and I'm going to launch it on my GPU. All I have to do is make sure that I have the range, which is what is going to get executed on the GPU. You can think about your data packaging. Then you send it off to execute. I told you two different things. One was for us to understand, as with the game library, and speak literally, bit by bit, to our actual graphics unit underneath. Here, we actually sent in the data. We read the data, we did some computation with the data, and we sent it out. We are still programming in Java. We're letting the kernel and the API take care of all the GPU details. This is already a great abstraction and great progress from the JNI interface that we saw.
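Here is a minimal sketch of what that Aparapi kernel could look like, assuming Aparapi's com.aparapi package; the array sizes are hypothetical, and the multiply-by-2 "correction" is the same placeholder used in the talk:

import com.aparapi.Kernel;
import com.aparapi.Range;

public class WavefrontCorrection {
    public static void main(String[] args) {
        int width = 512, height = 512;
        // Wavefront sensor data, flattened from a 2D array (hypothetical values).
        final float[] wavefront = new float[width * height];
        final float[] corrected = new float[width * height];

        // The kernel's run method holds the per-element logic; this is
        // the part Aparapi translates to OpenCL.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int x = getGlobalId(0);   // 2D IDs, as described in the talk
                int y = getGlobalId(1);
                int i = y * getGlobalSize(0) + x;
                // Placeholder "correction": multiply by 2, standing in for
                // the real aberration-correction math.
                corrected[i] = wavefront[i] * 2.0f;
            }
        };

        // The range describes what gets launched on the GPU.
        kernel.execute(Range.create2D(width, height));
        kernel.dispose();
    }
}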

It still has similar challenges. We have the data transfer bottleneck. We are talking about CPU and GPU. We're talking about larger datasets. It's a totally different mindset, and somebody has to take care of it. Basically, as developers, we pay the penalty, because if you have that much memory pressure, other things tend to suffer. There are various ways cloud vendors try to optimize it; they'll give you a special dedicated stack for GPUs and things like that. I'm not going to go in there, but it is a problem today. Of course, you have to do explicit memory management. The Java heap and everything is on one side; OpenCL's explicit memory demands are on the other. The third is the memory limitation. That's an API shortcoming, actually: you cannot use all the different types of memory for GPU performance optimization. The other API limitation is the whole concept of the Java subset and antipatterns. Basically, when we're using the API, we're talking about intent. If you think about Java code, when do we talk about intent? With this whole thing, we were saying that it's not for JVM execution; it's actually for me to tell the GPU to do x, y, z. We're talking about intent here. The JVM is all about execution: optimizing execution, parallelizing execution. It's a totally different concept. It's a slight mindset change. It spoils the sanctity of the Java language itself.

Project Sumatra - A Significant Effort

It was a brilliant step towards the next generation. With all those problems, AMD understood and identified the issues, and then they introduced Project Sumatra. As I told you in the timeline, GraalVM was already there. They used that to their advantage. Another thing they used to their advantage was the Java 8 Streams API. Unfortunately, Project Sumatra is no longer live. Because it was an OpenJDK project, it had to keep up with the Graal JIT. If you remember the history, we had the pluggable Graal JIT in OpenJDK, and then we went our separate ways. We'll figure out Project Leyden at a later time, and maybe have pluggable stuff there. It was a great effort, and it set the stage for future projects. I was not directly contributing to this; I had left AMD by then. My mentor and close friends of mine contributed to this project. We have the Graal JIT backend on the right-hand side. Again, on the left-hand side I always show what you have, the vanilla JDK, and the right-hand side is any additions, add-ons. With Project Sumatra, you had the Graal JIT backend, and you had something called the Heterogeneous System Architecture, HSA. That's a runtime that was designed to handle anything that works on the GPU, for example. By this time, AMD had also evolved its architecture to have something called the Accelerated Processing Unit, APU, which is actually a coherency unit between CPU and GPU. You literally could send everything down the same data buses, and it would go to the CPU route as you needed it, or it would go to the GPU side. Remember the memory issues I was talking about: it was brilliant, provided you had the APU on your hardware. The other thing that was very interesting, and that's where the Graal JIT comes in, is the HSA intermediate language. As soon as the system knew that this was GPU code, it went via the IL; based on your bytecode, the IL makes it portable. Say it was in the cloud, for example: you could move it from one machine to the other, as long as they had the accelerator unit, or else it would run directly as your JVM code.

Let's look at parallel processing with Project Sumatra. We're going to do the same thing, try to render the wavefront correction. This needs AMD's APU, and it needs the support of HSA, because APUs actually support the HSA software stack. It's very simple. If you wanted to utilize that APU and the HSA stack with Project Sumatra, you could take the corrected wavefront, like I said, and just map over it using parallel streams, because we're talking about data-level parallelism here. Then you send it to the collectors, and they do the job. The project itself, with the IL and the Graal JIT, is totally abstracting anything and everything. You don't need to know anything about the GPU to be able to send these wavefront corrections down, and it'll take care of that. Again, like I said, if you don't have AMD's APU, then you will run on a traditional CPU. Basically, you're just using the Streams API, and it will work on the traditional CPU just as well. Well, maybe not quite "just as well", but I'm talking about it from the programming perspective, because, again, I talked about the memory mismatch and all those things. If you did not have a dedicated APU, it would go down the JVM path; remember, I showed that on the left-hand side. You would still be able to get the parallel work done, just on the regular processing cores.
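A minimal sketch of the Streams-based style Sumatra targeted: this is plain Java 8 parallel streams, the kind of code Sumatra aimed to offload to the APU via the Graal JIT and HSA when present, and which otherwise runs on the CPU as-is. The data and the correction are placeholders:

import java.util.Arrays;
import java.util.stream.IntStream;

public class SumatraStyleCorrection {
    public static void main(String[] args) {
        float[] wavefront = new float[1024];   // hypothetical sensor data
        double[] corrected = IntStream.range(0, wavefront.length)
                .parallel()                          // data-level parallelism
                .mapToDouble(i -> wavefront[i] * 2.0) // placeholder correction
                .toArray();
        System.out.println(Arrays.toString(Arrays.copyOf(corrected, 8)));
    }
}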

As you can see, there are still some challenges. One of the biggest is that it's dead. It's no longer working; it's been discontinued. It has a legacy and an impact on the next generation of innovation, and that's why it's so important to learn about it. It also opened the door with respect to Graal. That's where TornadoVM is; it's based on Graal. We have similar complexities in the mapping of Java's memory model. It's tied to HSA, as I said, and to the APU. Then you need to understand and speak the language, especially where hardware is concerned. Then AMD moved on from the APU anyway, so there was that.

TornadoVM - A Specialization for Accelerators

While this was happening, Juan Fumero, the architect of TornadoVM, also looked at this whole space and wanted to take it further. TornadoVM is a specialization for accelerators. This is how it looks. I mentioned Graal: TornadoVM basically uses the entire runtime. It designs this execution engine where it speaks the language of your device drivers and your schedulers. Remember, I talked about how the scheduling needs to happen on the GPU, for example; this goes beyond GPUs as well, to FPGAs and other fun accelerators, but let's stay with the GPU focus. Then, you also have to understand how to talk to the memory. There's this whole pointer thing: is this a Java object, or is it going to go and do something with the GPU? You need to understand that, and there is this whole memory space. TornadoVM takes care of that for you. It provides you with this idea of annotations and tasks. It says, this is my GPU task. As a developer, you just need to say: this is my GPU task, this is my execution plan, these are the annotations; I want it parallelized, for example. Then it generates the bytecode. It optimizes the execution, which is what the JVM does; it optimizes for the underlying hardware. TornadoVM does that for you, for the accelerators.

Let's look at an example. Again, we'll stick with the wavefront distortion. The key here is real-time processing. If you did it on the regular path, it would not be real-time, because it takes a lot of execution cycles, and everything else would come to a halt. Your actual application would come to a halt. The first thing we do is initialize the task graph. Remember, I talked about it being all about tasks and annotations. You tell it, this is my task, this is the kernel that I want to execute, and you provide it. TornadoVM also has specialized task graphs ready for you. If it's one of the supported accelerators, then you can just pick whatever execution framework, so you don't need to specify the task graph for that. Then I want to say, this is the device I want to use. TornadoVM is smart enough to pick a device for you. In this particular case, because I want to access certain wavefront distortion processing, if I have a smart processor in there that gives me the corrected wavefront, doing the calculations for me via AI or ML, I could say: transfer it to that ML device. These are the wavefronts. Then I say, this is my task; it's the concept of kernels, for example. Then you transfer back to the host that needs the corrected data. Basically, I'm saying that I want a very specialized device that is very smart and capable, and is able to correct my wavefront aberrations. Then I transfer the result to whoever needs the correction. Literally, if I had four different units on my heterogeneous hardware, I've basically made sure that my JVM and general-purpose code is executing, while my GPU and the specialized ML accelerator are able to do their work as well. It's a really advanced level of parallelism. Then, for the execution, because we had already specified the execution plan in the task plan, we say how we're going to execute that on the GPU, for example. We have this annotation called the parallel annotation. Then we're saying, this is what you need to do with the correction. Basically, I'm dedicating these units of execution. This is very simplified. I'm saying, this is how you apply it; it's like the vectorization part of it. You say, this is how you apply this corrected data.
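A minimal sketch of that task-graph pattern, assuming the TornadoVM 0.15+ API (TaskGraph, TornadoExecutionPlan, and the @Parallel annotation); the graph and task names, array sizes, and multiply-by-2 correction are hypothetical stand-ins:

import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

public class TornadoWavefront {
    // The kernel: @Parallel tells TornadoVM this loop can be
    // parallelized on the accelerator it picks.
    public static void correct(float[] wavefront, float[] corrected) {
        for (@Parallel int i = 0; i < wavefront.length; i++) {
            corrected[i] = wavefront[i] * 2.0f;  // placeholder correction
        }
    }

    public static void main(String[] args) {
        float[] wavefront = new float[1024];   // hypothetical sensor data
        float[] corrected = new float[1024];

        // Initialize the task graph: declare what to transfer to the
        // device, the task (kernel) to run, and what to bring back to the host.
        TaskGraph graph = new TaskGraph("s0")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, wavefront)
                .task("t0", TornadoWavefront::correct, wavefront, corrected)
                .transferToHost(DataTransferMode.EVERY_EXECUTION, corrected);

        // The execution plan runs the snapshot; TornadoVM picks a device,
        // or you can pin one explicitly on the plan.
        TornadoExecutionPlan plan = new TornadoExecutionPlan(graph.snapshot());
        plan.execute();
    }
}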

The challenges still exist. You have the mapping complexity. You have the data movement complexity. Then, of course, how do you keep up with evolving technology? Remember, Project Sumatra died because the hardware evolved, and OpenJDK, the Graal JIT, and everything else evolved. Software and hardware technologies are ever-evolving. TornadoVM is still here, and it's here stronger, but there is a direction it wants to head with respect to the foreign function and memory interface, and the vector API as well.

Project Panama - A New Horizon

That's where we look into Project Panama. How is Panama a bit different? I told you the two components. You have FFM, which is the foreign function and memory interface. Then you have the vector API. The vector API is of course available to you under the Project Panama umbrella. I think it's in the incubator stage right now, maybe round 4 or something. What it does underneath is that your API call will go and look at intrinsics. Then it'll look at the hardware, and it will find the right optimized code for your API call. Similarly for FFM, although you have to define the bindings and everything like that, it's able to do that as well. FFM also has various other advantages, which I'll show via code. Let's go back to our Shrek 3D rendering, image processing. I want to apply a filter to the data. It's some pixel data that I created, like it's this 3D scene. Because it's an incubator module, you need to import it, and you need to make sure that when you're compiling, you compile with the incubator module. With the vector API, remember, I talked about the vector units and their evolution through hardware; one of the things that is up to you is to specify the species. In this particular case, I know that I need a float. That's the FloatVector you see in there, and that's the species that I want. Then you talk about the chunks. Remember, I talked about processing? Whenever you're talking about processing, it's about data parallelism. How do you achieve that? You say, this is the chunk of data that I want to work with. That's what you do with the vector API. In this particular case, I call them Shrek pixels. The species length gives you the chunk size for the float species I mentioned. Then you load the pixel data, because we're trying to apply a filter. You load it from an array; again, that's an API, FloatVector.fromArray. That's all available with the vector API. Then you apply the filter. In this particular case, I'm just doing a multiplication here. There are lots of vector operations that you can perform. Then I store my result. Basically, I applied the filter; I changed the pixels the way I wanted to. Here I've achieved image processing. Very basic, but it's something. It's using the vector API. I did this with JDK 21. Of course, I was talking about data parallelism, so Single Instruction, Multiple Data; it helps with that too. It's a different way of thinking, but it's there today. You can use it. Most of the advanced hardware will support the vector API.
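A minimal sketch of that filter, using the jdk.incubator.vector module as it ships in JDK 21 (compile and run with --add-modules jdk.incubator.vector); the pixel data and the filter factor are hypothetical:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class ShrekFilter {
    // Pick the float species; the API chooses the widest vector shape
    // the underlying hardware supports (AVX-512, SVE, Neon, ...).
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        float[] shrekPixels = new float[1024];  // hypothetical pixel data
        float[] filtered = new float[1024];

        int i = 0;
        int upper = SPECIES.loopBound(shrekPixels.length);
        for (; i < upper; i += SPECIES.length()) {
            // Load a chunk of pixel data...
            FloatVector v = FloatVector.fromArray(SPECIES, shrekPixels, i);
            // ...apply the "filter" (a simple multiplication here)...
            FloatVector result = v.mul(1.5f);
            // ...and store the result back.
            result.intoArray(filtered, i);
        }
        // Scalar tail loop for any remaining elements.
        for (; i < shrekPixels.length; i++) {
            filtered[i] = shrekPixels[i] * 1.5f;
        }
    }
}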

The other way to look at this is JNI, and how FFM is going to help in that field. For that, we'll go back to a little more complicated scenario with the wavefront corrections. What I'm trying to do is precision control of the secondary mirror. Now I'm trying to incorporate everything in there. First, I need the FFM modules. There are different modules that FFM will give you; here I'm just saying I'll take everything, so java.lang.foreign.*. Also, there's this logging thing that I'm going to use, and I'll show you. The way you use FFM is that you need the native linker. You have this native program, and you just use the native linker. Here, I just look it up. Remember, I talked about the pointers. We're dealing with two different worlds: Java on one side, and this native function on the other. You need to go find: where's my native function pointer for adjustMirror? You get that. I was talking about the memory segment issues, because you have the CPU, and then you have the GPU, and they both have their own thing. FFM gives you this memory segment where you can directly talk to the array; your JVM gives you the guarantee of operations, and you can specify the lifetime. You have the function descriptor. You say, this is my downcallHandle, and it gives you a method handle right there. Then you get the logger. That's it. You're transferring the responsibility of the pointer manipulation, the downcall, and the memory to the JVM, so that it's all managed by the JVM. That's the magic of foreign function and memory, which is Project Panama.
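A minimal sketch of that downcall pattern, using the java.lang.foreign API as finalized in JDK 22 (in JDK 21 it was a preview with slightly different method names); the library name "mirrorcontrol", the adjustMirror signature, and the correction values are hypothetical, and the logging part from the talk is omitted:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class MirrorControl {
    public static void main(String[] args) throws Throwable {
        // The native linker bridges Java and the native function's ABI.
        Linker linker = Linker.nativeLinker();

        // Look up the native function pointer for adjustMirror.
        SymbolLookup mirrorLib =
                SymbolLookup.libraryLookup("mirrorcontrol", Arena.global());
        MemorySegment adjustMirrorAddr =
                mirrorLib.find("adjustMirror").orElseThrow();

        // Function descriptor: void adjustMirror(float* corrections, int count).
        // downcallHandle gives back an ordinary method handle.
        MethodHandle adjustMirror = linker.downcallHandle(
                adjustMirrorAddr,
                FunctionDescriptor.ofVoid(ValueLayout.ADDRESS, ValueLayout.JAVA_INT));

        float[] corrections = {0.1f, -0.2f, 0.05f};  // hypothetical data

        // A confined arena bounds the lifetime of the native memory segment.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment =
                    arena.allocateFrom(ValueLayout.JAVA_FLOAT, corrections);
            adjustMirror.invoke(segment, corrections.length);
        }  // native memory is freed deterministically here
    }
}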

 


 

Recorded at: Jul 19, 2024
