Hi, I'm Gil Tene. Thank you for having me here.
If we look at the common approaches to garbage collection - garbage collection has been around for decades now - you can usually see garbage collectors as doing three common things, at least in managed runtime environments. The first is that they have to find all the live objects that are out there; the second is that they have to identify the dead objects and somehow recover their space; and the third, which is common to every commercial JVM and every commercial managed runtime, is that they will relocate objects from one place to another at some point - sometimes very frequently, sometimes less so - in order to compact memory and defragment it over time.
So, those three things happen regardless of how a collector works. In some cases the three operations happen in three different steps: a mark-sweep-compact collector, for example, will mark to find the live objects, sweep to identify the dead stuff that needs to be recovered, and compact to move objects and defragment either the whole heap or parts of it. Another example would be a copying collector, which does all three things in one shot: it takes everything that is alive and copies it somewhere else, thereby doing the first and the third steps at once - identifying the live objects and compacting them away. And an immediate side effect, which takes no extra work, is that once everything alive has been copied away, the space it was copied from has been recovered in its entirety.
So, there are various combinations; mark-sweep-compact, mark-compact, and copy are the most common ways to look at collectors, and they're used in different combinations in different actual environments. For example, in generational collectors, young generation collectors usually favor copying, while old generation collectors usually favor a mark-sweep or mark-sweep-compact variation.
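To make the first of those operations a bit more concrete, here is a minimal, hypothetical sketch of a tracing collector's mark phase. It is not any particular JVM's implementation; the `Obj` and `ToyMarker` names and the use of plain Java objects and lists are invented stand-ins for real heap structures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy heap object: just a mark bit and its outgoing references.
final class Obj {
    boolean marked;
    final List<Obj> refs = new ArrayList<>();
}

final class ToyMarker {
    // Step 1 ("find the live objects"): trace everything reachable from the roots.
    static void mark(List<Obj> roots) {
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = pending.pop();
            if (o.marked) continue;      // already visited
            o.marked = true;             // paint it live
            pending.addAll(o.refs);      // follow its outgoing references
        }
    }
    // Steps 2 and 3 (sweep/compact or copy) would then reclaim every object whose
    // mark bit is still false, and optionally relocate the survivors.
}
```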
2. What about reference counting?
So reference counting is a technique that people use when they don't have a managed runtime environment and the ability to precisely track references. It's commonly used in environments like C++ - and Objective-C uses it quite a bit - for managing object pools. Reference counting has good uses, but it has a lot of limitations.
Very specifically, reference counting techniques are not able to recover cyclic object graphs, or at least not easily, and as a result the technique is limited to data structures that don't have cycles in them. So, for example, a doubly linked list won't work, and a general graph might not work if it isn't acyclic.
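As a hypothetical illustration of that limitation, consider two objects that point at each other. The `RefCounted` class and manual count updates below are invented for the example; they just show why naive reference counts never reach zero for a cycle, even once it becomes unreachable.

```java
// Invented example: naive reference counting defeated by a two-object cycle.
final class RefCounted {
    int refCount;
    RefCounted next;              // a counted reference to another object
}

final class CycleDemo {
    public static void main(String[] args) {
        RefCounted a = new RefCounted();
        RefCounted b = new RefCounted();
        a.refCount = 1;           // held by the local variable 'a'
        b.refCount = 1;           // held by the local variable 'b'

        a.next = b; b.refCount++; // a -> b
        b.next = a; a.refCount++; // b -> a  (the cycle)

        // Drop the external references; a reference-counting runtime would
        // decrement each object's count here...
        a = null;
        b = null;
        // ...but each object still holds a count of 1 on the other, so neither
        // ever reaches zero and the pair is never reclaimed, even though it is
        // unreachable. A tracing (precise) collector starting from the roots
        // simply never visits them and reclaims both.
    }
}
```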
And as a result, that limits the scope where reference counting can actually work - it only works if you don't use certain data structures. In managed runtime environments, almost invariably you don't see reference counting being used and instead, we use, across the board, what's called precise garbage collection.
The managed runtime is aware of every reference, and because it is aware of every reference, it can actually start from all the base root references and paint everything that is reachable, know what's alive and, as a result, know what's dead, without needing to count anything. And it's true for all Java virtual machines and all .NET virtual machines today that precise garbage collection is the technique used.
Azul's collector is a mark-compact collector and it is a generational collector. We actually use the exact same mechanism, mark-compact, for both generations - both our young generation and our old generation - and what distinguishes our collector from pretty much every other collector out there is that we are a concurrent mark, concurrent compact collector. So, we perform these operations without stopping the application. We mark the entire heap or the entire generation concurrently, and then we compact the entire heap or the entire generation concurrently with the application, without stopping it.
The mark side is very robust and can keep up with pretty much any application throughput you might throw at it, but the compact side is probably what's really unique here. We will move live objects while the application is running, and we don't have to stop the application to find the potentially 3 billion pointers that point to those objects and fix them. We can do it concurrently, handling application threads that run into those objects and fixing the references on the spot, without waiting. And that's what makes our collector concurrent and robust at the same time.
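A very rough way to picture that "fix it on the spot" idea is a lookup that redirects a thread to an object's new location when the object has already been relocated. This is a conceptual sketch only, not Azul's actual mechanism: real collectors do this with metadata bits and compiler-emitted barrier code rather than a hash map, and the `RelocationSketch` name and structure are invented here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch only: illustrates concurrent relocation with "self-healing"
// reference loads, not a real collector's data structures.
final class RelocationSketch {
    // Records where relocated objects have been moved to.
    static final Map<Object, Object> forwarded = new ConcurrentHashMap<>();

    // Called (conceptually) whenever an application thread loads a reference.
    static Object loadRef(Object ref) {
        Object newCopy = forwarded.get(ref);
        if (newCopy == null) {
            return ref;           // object has not moved; nothing to do
        }
        // The object was relocated while the application kept running: return
        // the new location. A real collector would also heal the memory slot
        // the stale reference was loaded from, so that slot never needs the
        // same fix-up again.
        return newCopy;
    }
}
```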
The generational hypothesis really seems to hold for virtually all applications - the vast majority of the objects that are allocated die very quickly after they are allocated. And while generational collectors are good at creating small pauses now and pushing the large pauses to later, that really was not our interest in a generational collector. We don't have pauses either way.
But generational collectors have this other important side effect of efficiency. Because you focus your collection on the young generation and you recover a lot of dead objects for very little work, the efficiency of the collector - or, translated to throughput, the ability of the collector to keep up with application allocation rates or transaction rates - ends up being at least an order of magnitude better.
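As a back-of-envelope illustration of that efficiency argument, with invented numbers: a copying young generation collector does work roughly proportional to the surviving bytes, not to the bytes allocated, so a low survival rate buys a lot of reclaimed space per unit of copying work.

```java
// Invented numbers; only the shape of the arithmetic matters here.
final class YoungGenEfficiency {
    public static void main(String[] args) {
        double allocatedGb = 10.0;      // bytes allocated into the young generation
        double survivalRate = 0.05;     // assume only 5% is still live at collection time

        double copiedGb = allocatedGb * survivalRate;   // work the collector actually does
        double reclaimedGb = allocatedGb - copiedGb;    // space recovered for that work

        System.out.printf("copied %.1f GB, reclaimed %.1f GB -> %.0fx reclaimed per unit of copy work%n",
                copiedGb, reclaimedGb, reclaimedGb / copiedGb);
        // With these assumed numbers that is roughly 19x, which is the kind of
        // order-of-magnitude efficiency gain being described.
    }
}
```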
So, we very typically see ourselves able to handle five, ten, twenty times as much allocation rate with a generational collector compared to without one - or, to look at it the other way, handling the exact same throughput with a lot less empty memory.
So, we don't need as much empty memory with a generational collector as we would if we were non-generational, if we are to keep pauseless operation going with the same application.
The answer is no. In fact, that's probably what makes us stand out. Our approach in designing the collector was almost the opposite of what most collector designs have been for the last couple of decades. Most collectors have spent a lot of effort finding ways to delay a bad thing - make it happen later and later and add filters in front of it - with generational collection being a good example, and collecting objects into free lists without compacting being another. But eventually the bad thing you've been delaying has to happen: you have to deal with all the objects; you have to deal with fragmented heaps.
Our approach has been exactly the opposite. We went and did the hardest thing in garbage collection directly. And once we do that hardest thing, which is concurrent compaction, we find that there isn't much need for the other techniques, because most of their purpose was to delay, rather than eliminate, a big problem - and we've already eliminated it.
So, as a result, we have a very simple collector. It always does what is considered the harder thing, which is mark-compact, and because it does that, it really doesn't need a fallback. We do not have any 'stop-the-world' mode of "Wait, this is wrong, I can't handle it; let me stop everything, create a nice compacted heap for you, and now we can go again." We do it all the time and we do it concurrently.
When we say that the collector in Azul's product - we call it the C4 collector - has a wide operating range, we're talking about metrics; metrics that the application might actually use to describe its range. Heap size is a good example. The same collector works just as well at half a gigabyte as it does at half a terabyte, and it works the same way across that entire range. There are no new or different techniques used at different sizes.
Similarly, it can handle slow allocation rates with very little work, and it's able to handle tens of gigabytes per second of sustained allocation while collecting behind that allocation at a sustained rate. It can also handle high mutation rates - high rates of change to the references, the pointers, in the heap. Both allocation rate and mutation rate are usually linear in application throughput: if you double the transaction rate, you'll typically see double the mutation rate and double the allocation rate.
Now, what we mean by a wide operating range is that you can take our collector and run it up to tens of gigabytes per second of smooth allocation, keeping up with the mutation rates, over very large data sets. As long as there's enough memory in the system - and memory is cheap - and as long as we have the CPU power to deal with it, which we obviously do, otherwise we wouldn't be generating all that work in the first place, we can keep up with it. We'll go to 100 GB, 200 GB of data; we'll go to very high allocation and mutation rates, without introducing a pause.
And that usually means that you don't have to worry about finding optimal points to tune for. Most collectors have a sort of optimal behavior point for garbage collection: if the heap is too small, it's not good; if the heap is too big, it's not good; if some flag is set too high on one side or too low on the other, it's not good. You have to find that exact peak, and then you're sensitive - you might fall off to one side or the other if you change your code or change your load. With the Azul C4 collector, you have more of a plateau.
Once you've got enough memory to keep up with the problem, with the rate that you're working at, more memory never hurts; so usually people will just use additional memory as a buffer, and you have this nice smooth range where behavior can change but there's no cliff to fall off - no sudden pause or sensitivity that comes up - and the range is very, very wide.
In fact, the way we tell people to tune for our collector is: "Avoid all flags and just start with a very large heap - much larger than you think you need. Get your application to work, and then find out how much you actually need by cutting the heap in half, and in half again, until your application starts misbehaving." At that point you've found the breaking point - how much you need. Double or triple that memory and you're done tuning, because you have a smooth range and you know you're far away from the breaking point. That usually translates into a lot less work in tuning and, hopefully, a lot faster time to market for applications.
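Expressed as code, the sizing procedure above might look something like the following sketch. It is purely hypothetical: the `HeapSizing` helper and its predicate are invented, and the predicate stands in for "run the application at this heap size and see whether it behaves".

```java
import java.util.function.LongPredicate;

// Hypothetical helper expressing the tuning procedure described above:
// start very large, halve until the application misbehaves, then give
// yourself two to three times that breaking point as headroom.
final class HeapSizing {
    static long recommendHeapMb(long startMb, LongPredicate behavesWellAt) {
        long sizeMb = startMb;
        while (sizeMb > 1 && behavesWellAt.test(sizeMb / 2)) {
            sizeMb /= 2;                               // still fine at half? keep shrinking
        }
        long breakingPointMb = sizeMb / 2;             // first size where it misbehaved
        return Math.max(breakingPointMb * 3, sizeMb);  // double or triple it for headroom
    }

    public static void main(String[] args) {
        // Toy stand-in: pretend the application needs at least 6 GB to behave.
        long recommended = recommendHeapMb(64 * 1024, mb -> mb >= 6 * 1024);
        System.out.println("recommended heap: " + recommended + " MB");
    }
}
```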
Yes, we do find that Software as a Service companies are one of the interesting sectors that really like to use our products - both historically with our hardware appliances over the years and now with our Zing product, a software-only JVM for Linux. There are many ways to look at why that matters in Software as a Service environments, but I would say that Software as a Service companies almost invariably don't control the load they run; they run other people's loads.
So, while in a regular enterprise you might be able to ask your marketing department, "How much load do you expect to see over the holiday season?" and, "Give me a warning when you issue a promotion," so that you know to expect a large spike in load and can maybe buy some hardware for it, in Software as a Service environments the business relationship usually does not allow for that. You're basically offering capacity - usually a service level agreement - and your customers are doing things they might not be telling you about, which means you have to deal with variable loads, spikes in load, things you might not have foreseen.
The other big part that happens in those environments, from our experience with some of our customers, is that every once in a while they'll sign up a really big customer, and when they sign up a really big customer, suddenly there's a huge spike in new functionality and new load on their system and they need that smooth operating range to accommodate that.
Obviously you can't run the load without having enough capacity to run it, but what Zing allows you to do is very easily use that capacity without having to pre-design, pre-partition, or break things into a lot of little pieces. If you have the hardware capacity, you can pretty much expand into that capacity without a lot of fear.
We've been working really hard to get exactly that done and it's taken us a few years. If you actually look at the evolution of our technology over the years, what we've always done is build scalable virtual machines, and pauseless garbage collection, the C4 collector, is part of providing that. We used to need custom hardware to do what we do: we actually built our own chips, and our own instruction sets, and our own processors. And about three years ago, we identified the roadmap trend in the x86 chip market that allowed us to transition to pure software. Other people were making chips good enough to run our stuff now.
Our physical appliances came with a Java virtualization technology that let us push a JVM from a regular operating system onto our stack, which was just better at executing things; we did that with our Vega hardware. With the transition to software, we took the first version of Zing - Zing 4, the fourth generation of our technology - and delivered a similar deployment mechanism: a virtual appliance instead of a physical appliance, running on x86 hardware, usually with a hypervisor.
And you would, again, virtualize Java onto that better stack. That stack included a specialized kernel, with specialized virtual memory and physical memory management that our collector uses, and various capabilities that the JVM actually needs but that just weren't around in the regular operating systems you might run on.
In the past year, since we introduced that first, virtualized version of Zing, we have also been working on delivering those same capabilities not as a virtualized stack sitting off to the side, but as an enhancement to a Linux stack running on plain vanilla, unmodified Linux distributions.
It's been a lot of work, and a lot of interesting things needed to be done to get there, but it's done and it's here now. As of last week, you can basically download a Zing JVM for your Red Hat Linux or CentOS - and soon for other Linux distributions - and just run the JVM straight in your environment, which makes it much easier to deploy.
Now, how were we able to do that? The detail is quite deep but we basically deliver the virtual memory and physical memory capability we needed as a pure loadable module that runs on top of vanilla, unmodified kernels.
We do it by hooking into, and working with, the underlying virtual memory and physical memory management that is there in Linux, and we adapt specifically to the different distributions we run on - there are differences between them. But what we used to need a custom kernel with specific additions for, we no longer need in the Zing 5.0 generation. In that generation, we deliver a module for each distribution we run on that basically gives us what we need to run a pauseless collector on Linux.
So, virtualized could be interpreted two ways here: there is the virtualization of running the stack on top of a hypervisor, for example; and there's the other form of virtualization, which we call Java virtualization, which was separating a JVM from the operating system it runs on by pushing its back-end into what we used to have in our physical or virtual appliances.
So, with Java virtualization, which is a very nice trick for delivering capacity and behavior into an operating system that doesn't have the features you need, we achieved full transparency. Applications actually ran very well there, and the setup was designed for high-throughput, I/O-heavy applications, but there's an inevitable, unavoidable proxy hop involved. Everything still talks to the original operating system endpoint, where we run what is effectively an I/O proxy, forwarding I/O, native operations and other operations back and forth to our back-end running in our stack.
And, while we got that to run very smoothly - buffered it, streamed it - that hop costs something. It costs some processing, it costs some latency, and it made some things, like synchronous I/O operations or JNI calls to native code on the native operating system, take a round-trip hit across a network hop that was at least hundreds of microseconds long.
All of that is avoided by the native Linux deployment, where there is no I/O hop to add. There's no virtualization of the JVM per se, so all the potential overhead that existed there - overhead that might have made it not as good for low-latency applications, for example, or for native calls and mixed-language applications - all those downsides are just gone, and all the upsides are still there.
Now, as for the other part of virtualization - running on top of a hypervisor - I wouldn't say there's anything better about running with or without one; with Zing 5.0, we run just as well on either. Since we'd already optimized the Zing stack to run on hypervisors and on modern hardware that supports the right virtual memory operations - which is pretty much any x86 hardware shipped in the last two and a half to three years - our Zing 5.0 for Linux runs just as well whether the Linux is on a physical machine or on a virtual machine, as long as the virtualization uses a modern hypervisor and the modern hardware features.
10. You also have the managed runtime initiative. Can you tell us a little bit more about that?
So, we introduced the managed runtime initiative a little over a year ago, and the purpose of the managed runtime initiative is actually to, in the long run, add capabilities to various parts of the system stack: operating systems, hypervisors, perhaps even hardware and chips, that would make managed runtimes better - that's the long term goal of that initiative.
In order to kick-start it, we actually put out two pieces in open source form. One was the kernel side of the Managed Runtime Initiative, which included hooks for the Linux kernel and a module that uses those hooks to provide the features we use in our pauseless collectors.
And since that part alone would have been an interesting intellectual exercise, but not one you could demonstrate, we also put out a proof-of-concept demonstration of a pauseless collector based on OpenJDK. We put that out in open source form so anybody could look at the source and see how the two work together.
The purpose of that was really to evangelize for new features in the operating system, by demonstrating how much value we can derive from them in a managed runtime. And the features we are looking for are specific virtual memory, physical memory and to some degree scheduling capabilities that we think operating systems should add.
It is important to note that we're not talking about just Linux here. The Managed Runtime Initiative is not only about open source - we use open source as a demonstration vehicle for what could be done - but we would like to see these features appear in any operating system: Windows, Solaris, AIX, whatever new operating systems might come around, and obviously Linux as well. But Linux is a great way to demonstrate it, and Linux alone is important enough to evangelize into.
As we introduced the set of kernel hooks, which really defined a new ABI between the kernel and these modules, we started working to influence the upstream kernel versions that might appear in a few years to include these features. There's a lot of good debate around whether or not the kernel should add these capabilities. We obviously think that, since managed runtimes now constitute the vast majority of what enterprise applications are built on, it's an important enough field to optimize a system towards.
But the managed runtime initiative, hopefully over time, will get those right features in, get the right ABIs and the right OS features, so that managed runtimes, across the board not just Java, could use them in order to deliver things like pauseless collection, scalable heaps, elastic memory and the like.
So that's a good question. It kind of depends on what you consider a weak spot, or what you look for in a weak spot. I'm sure that if you take a specific collector design, you can write a specific micro-benchmark that exercises, in a very tight loop, something that is costly in that collector and cheaper in others. That would be a plain throughput comparison, but it really only applies to micro-benchmarks. The way we measure throughput is pretty simple: we think that throughput without some sort of response time criterion is kind of irrelevant. Throughput while failing does not matter.
And if you define what you require the responsiveness of the system to be, then usually the question is: what sort of throughput can you maintain while still holding to your requirements? For example, you might say, "This is an interactive application and I can accept 90% of responses coming back in better than one second, but I can't accept any response taking longer than 10 seconds." That would be a common need for interactive applications. So, it might be some percentiles that you put the line at.
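As a hypothetical illustration of checking that kind of criterion, the sketch below evaluates a set of measured response times against the example thresholds above. The `SlaCheck` helper and the sample data are invented for illustration.

```java
import java.util.Arrays;

// Invented helper: checks the example criterion above against measured
// response times (in milliseconds).
final class SlaCheck {
    static boolean meetsCriterion(long[] responseTimesMs) {
        long[] sorted = responseTimesMs.clone();
        Arrays.sort(sorted);

        // 90% of responses must come back in better than 1 second...
        long p90 = sorted[(int) Math.ceil(0.90 * sorted.length) - 1];
        // ...and no response may take longer than 10 seconds.
        long max = sorted[sorted.length - 1];

        return p90 < 1_000 && max <= 10_000;
    }

    public static void main(String[] args) {
        long[] times = {120, 250, 300, 450, 800, 950, 1_200, 400, 600, 700};
        System.out.println("meets criterion: " + meetsCriterion(times));
    }
}
```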
And within the bounds of meeting that criterion, the question is: how fast can you run? If you can run faster but fail the criterion completely, that's usually not a very interesting data point. It might be interesting for overnight batch applications, but even an overnight batch application actually has a response time criterion: it does need to complete overnight.
So, within a given, say, interactive level of many seconds of response time but not too many, we generally see that with C4, you're able to handle a lot more throughput, not less throughput, while maintaining the same response time. For example, if we take a portal application, we can support often ten to twenty times as many concurrent users, running ten to twenty times as many concurrent transactions, meaning ten to twenty times as high a transactional rate within an instance of a JVM running this code, on exactly the same hardware, on exactly the same software - that's faster not slower.
Now, that's reality, and that's how I think things should be measured. But to honestly and directly answer the original question - is there a weakness, or is there something we pay for the ability to do what we do? - the C4 collector includes a read barrier, a type we call the loaded value barrier; it's a new class of read barrier. And that read barrier has a fast path that's executed on every Java pointer load from memory.
So, we do pay with an operation that is not done by other collectors on pointer loads. However, that operation is highly optimized and fits very well into the x86 pipeline, which has plenty of room in its 4-issue engine to do the work. With our own optimizing compilers, we normally see an overhead on a per-thread basis of a handful of percent compared to running without that barrier.
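To give a feel for the shape of a fast-path check on every reference load, here is a purely hypothetical sketch. The real barrier is a couple of machine instructions emitted by the JIT compiler, not Java code, and the `BarrierSketch` class and the bit encoding used here are invented; only the structure matters - a cheap test that almost always passes, and a rare slow path that fixes the reference when it doesn't.

```java
// Hypothetical sketch of a read-barrier fast path, with an invented encoding.
final class BarrierSketch {
    static final long EXPECTED_STATE = 0x1L;   // invented "reference is good" bit

    static long loadBarrier(long encodedRef) {
        if ((encodedRef & 0x1L) == EXPECTED_STATE) {
            return encodedRef;                 // fast path: nothing to do
        }
        return slowPathFixUp(encodedRef);      // rare: marking/relocation/remapping work
    }

    private static long slowPathFixUp(long encodedRef) {
        // In a real collector this would remap the reference to the object's
        // new location and write the corrected value back ("self-healing"),
        // so the same stale reference never triggers the slow path again.
        return (encodedRef & ~0x1L) | EXPECTED_STATE;
    }
}
```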
So, if we compare our code without the barrier and our code with the barrier, we would be a few percent faster if we didn't have the barrier. However, if we didn't have the barrier, we would pause, and if we paused, we wouldn't be able to handle the throughput at the SLA required.
So when we actually measure it in terms that people care about, which is, "How much can I do with this JVM, or this piece of hardware, or this piece of software, whilst still working", not, "How fast can I crash into a wall", we generally see much better throughput with C4 than with any other collector we can see.
So that's a very good question. I think that you pretty much need to classify a collector to answer that question and I'd classify C4 as a concurrent-mark, concurrent-compact collector; so, it performs those two operations concurrently. It's also a generational collector where both generations have that behavior. So, both the young generation collector has concurrent mark and compact, and the old generation is concurrent mark and compact.
If we compare that to G1, which is the new collector in Java 7, and to the Balanced collector, which came out in IBM's Java 7 implementation, I think, over the summer, I would say that those two collectors are actually very similar in classification: they both have incremental-compacting old generation collectors, and they both have 'stop-the-world', monolithic, copying young generation collectors.
So the young generation collector in both is a 'stop-the-world' copying collector that, in one shot, copies the entire young generation live set from one place to another. That is no different from what, say, parallel GC does, or CMS, or other previous collectors. The old generation is really where the difference is, and in the old generation they both have a mostly concurrent marker: the marker can remain concurrent up to a certain mutation rate, and beyond that mutation rate it falls behind the application. So it is concurrent for a certain range of throughput. And then they have not a concurrent compactor, but an incremental, 'stop-the-world' compactor.
You might need a little explanation of what that means. An incremental, 'stop-the-world' compactor is one that tries to not compact the whole heap in one monolithic stop, but tries to break down the compaction into a lot of small increments, in the hope that each increment could be made small enough to fit within some response time needs.
Now, if you can find the right parts of the heap where some regions could be compacted, and only a very small set of other regions need to be scanned for pointers into there, and you can do that in a fixed amount of time, and if you're lucky enough to do that, then you can compact regionally and incrementally and achieve some level of compaction.
But really that technique has two issues compared to a true concurrent compactor. The first one is that it cannot work for things that are popular. So, if any object or any region is pointed to from the vast majority of the heap, then compacting that region would be as costly as compacting the entire heap. You would have to scan every pointer out there.
So, you have to avoid all the popular things, and unfortunately the popular things accumulate. Take a common example of what sits in the old generation: a very large cache. Caches are pointed to from a lot of places; they get used by a lot of different objects, and pointers to them fly around a lot. So if you look at any region of the heap that contains cached information, that cache gets more and more popular over time, as far as references from other regions are concerned.
And unless you can at some point get rid of the popular regions, you're not going to get very far. As yet another filter, as a delay tactic to push out an inevitable full compaction, these are good techniques - but inevitably there will be a full compaction, whereas with C4 that compaction happens concurrently every time.
The other, and very important, aspect of incremental compactors for the old generation is that they have an N-squared complexity problem. To be specific, if the heap doubles, the amount of garbage collection work needed to maintain the same response time quadruples - it goes with the square of the growth of the heap. That has to do with the fact that any one region in the heap has a statistical probability of being pointed into by some percentage of the rest of the heap, and as the heap itself grows, the number of regions pointing into any one region grows with it, creating an N-squared complexity problem if you're going to compact in increments.
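Here is a simplified model of that effect, with invented region sizes and pointer densities, just to show the arithmetic: with fixed-size regions, the number of regions grows linearly with the heap, and so does the expected number of other regions pointing into any one region, so scanning for an incremental compaction of everything grows with the square of the heap size.

```java
// Simplified model (invented numbers) of the N-squared effect described above.
final class IncrementalCompactionCost {
    public static void main(String[] args) {
        double regionMb = 32;               // assumed fixed region size
        double pointingInFraction = 0.01;   // assume 1% of regions point into any given region

        for (double heapGb : new double[] {10, 20, 40, 80}) {
            double regions = heapGb * 1024 / regionMb;
            double scansPerRegion = regions * pointingInFraction;
            double totalRegionScans = regions * scansPerRegion;
            System.out.printf("heap %3.0f GB: %6.0f regions, ~%.0f region scans to compact them all%n",
                    heapGb, regions, totalRegionScans);
        }
        // Doubling the heap doubles the number of regions and doubles the scans
        // per region, so the total work quadruples.
    }
}
```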
So, the technique might work for contained size heaps, but as the heap grows to modern server capacities, say, 50, 60 gigabytes or a hundred or maybe more, the complexity would grow to the point where it's unattainable with the CPU power that you have.
That's a good question. I'm describing them from my understanding of reading the material. I've read the Garbage-First papers, and I've actually looked at the code in the OpenJDK source code, in HotSpot. I haven't read either the material or the code for the Balanced collector, but I've talked to people and attended sessions about its design, and it's remarkably similar to G1 in approach. So, for pretty much any of the parts you can point to - as I've said, they have a stop-the-world young generation copying collector, a concurrent marker that marks the entire heap in a multi-phase fashion, and then an incremental, region-based compactor - from a classification perspective, they fall into the exact same category.
We're going to find out more as we go, but we are actually very excited to have joined the JCP. I am the new representative from Azul to the Executive Committee. And we think the JCP is going through an interesting revival or, at least, we're hoping to see that revival happen. Clearly, it's been kind of stagnant for a while and there have been a lot of political issues in it - some very visible departures, some stalls of the process and attempts to stop actual Java specs from moving forward until some issues around licensing are resolved.
I think that there's been a lot of bad taste left in people's mouths from that, and we're hoping first of all to get the process to work again - to get the process to do what it was originally meant to do, which is to allow people to introduce new specs and innovations into the language and the platform, run them through the process, allow new versions to come through, and enhance or encourage them to happen.
We also think it's very important to include open source within that process. So, we've actually joined both the OpenJDK and the JCP, and we think that OpenJDK is a very good vehicle for moving implementations forward and in the open, as opposed to quietly and without being seen until a JSR comes out with stuff in it.
And not just open from a visible-feature perspective, but open in the sense that people can actually look at the code, look at the implementation, critique it, contribute to it, and help change it.
So, I think that with the OpenJDK effectively becoming the reference implementation for the Java platform, we're going to see a much more open approach to moving the platform forward. We're hoping to see that going forward, and we think we can help.
15. Do you have specific things in mind you'd like to contribute back into the OpenJDK project?
That's an interesting question. Our current products - the Azul Vega products and the Azul Zing products we just released - are all based not on OpenJDK, but on commercially licensed Oracle HotSpot code, which, luckily, is very similar to OpenJDK, almost identical in the parts we typically work on. But it is different; it's under a different license.
We continue to ship that code: it is the way we ship our products today. But we see OpenJDK as the long-term basis for our products. We expect to completely transition our products to use OpenJDK as their base code - the reference implementation we then add features to - rather than the commercially licensed source base from Oracle.
And we're obviously very confident being able to do that, because we've been working with the code base for a long time. We know both sides of it fairly well.
As far as contributions go, as we transition our product to OpenJDK we will eventually contribute all the GPL code involved, but we expect to make contributions well before that. There are specific improvements and enhancements to HotSpot, mostly to the VM itself, that we have done and would like to see put upstream so everybody can benefit from them; it would also help us reduce the amount of integration work we have to do when we take a new release with bug fixes and pull it in. For example, anything from supporting a fully 64-bit-clean implementation without warnings across the board - just clean-ups and things like that - to enhancements to locking, the addition of things like read barriers, and various enhancements to the threading model and other things in the JVM that we've put in and think would be useful for the community as a whole.
And, over time, garbage collection interfaces and garbage collection implementations are obviously areas we're very strong in, so adding the right interfaces and defining the right library APIs there is very important to us.
I would say that between now and a few years from now, we're going to see a kind of continual transition. OpenJDK 6 is now available for us to work on; OpenJDK 7 is available in source code form, but the TCKs for it are not yet out, and we're hoping to see them come out very soon. [Editor's note: they have been released since this interview was recorded.]
And as both we and other members of the community start working actively in that code base - which is a new thing, because the code base has really been led mostly by one major group - and as we add more and more serious contributions, I think we'll see interesting community growth, and ways of testing together and integrating together that will get better and better over time.