Transcript
Bakker: I'm going to talk about how Netflix is really using Java. You probably know that Netflix is really just about RxJava microservices, with Hystrix and Spring Cloud. Really, Chaos Monkeys are just running the show. I'm only half kidding here, because a few years ago, this was actually mostly true, maybe except the Chaos Monkeys. This stack was something that we were building on in the last several years. Things have changed. Quite often, I have conversations with people at conferences like this one, where they're like, yes, we were using the Netflix stack. Like, which stack exactly are you talking about? It's almost never the stack that we're actually using. These are just things that people associate with Netflix, because we've been talking about our technology for so many years, but things might have changed a little bit. We're going to bust some myths. We're going to take a look at what we're actually doing with Java. Things are ever-evolving. Things are literally just changing all the time.
Background
My name is Paul. I'm on the Java Platform team at Netflix. Java Platform is responsible for the libraries, frameworks, and tooling that we build around Java, so that all our Java developers have a good time developing Java applications. I'm also a Java Champion. I have been in the Java space for quite a long time. In the past, I wrote two books about Java modularity. I'm also one of the first authors of the DGS framework, the GraphQL framework we use for Java. We'll talk quite a bit about DGS, and how that all fits in the architecture.
Evolving Architecture
Before we start diving into JVMs and how we use Java, and the frameworks that we're using, we have to understand a little bit better how our architecture has been evolving. That explains why we did things in a certain way with Java several years ago, and why we're doing things quite differently today. What you should understand about Java at Netflix is that we have a lot of Java. We are basically a Java shop, and every backend at Netflix is basically a Java app. We have many applications. At the size of Netflix, there are lots of internal applications to just keep track of things. We're also one of the largest film studios in the world. There's a lot of software being developed just to produce films, basically, again, all Java. Then of course, we have what we call the streaming app, which is basically the Netflix app, as you probably know it. That is what we're looking at here. This screen here is what we call the LOLOMO, the list of lists of movies. That is just one example of an application that is backed by Java. You have to understand that pretty much everything that I'm talking about is true for basically every backend in Java. We use the same architecture now for pretty much all our different systems, both internal and consumer facing, and we use the same tech stack everywhere. Although I'm giving that example, because it's just a large example to play with, it's much more universal than that.
The Groovy Era
When I joined Netflix almost seven years ago, we were in what I call the Groovy era. What you probably know about Netflix, and this is still true, is that Netflix has a microservices ecosystem. Basically, every piece of functionality and every piece of data is owned by a specific microservice. There are many of them, literally thousands of them. The slide here, I just made it up, because it makes sense in my head. It's a much-simplified version of what we actually have in production. Think about this LOLOMO screen, this list of lists of movies that we just looked at on a previous slide. To render that screen, we would have to fetch data from many different microservices. Maybe there's a top 10 service that we need, because we need a top 10 list of movies. That's backed by a specific service. Then there's an artwork service that gives us the images as we show in the LOLOMO, and these are all personalized as well. There's probably a movie metadata service, which gives us movie titles and actors and descriptions of movies. There's probably a LOLOMO service which is actually giving us which lists to render, which again is personalized. Say we have maybe 10 services to call out to. It would usually be inefficient if your device, let's say your TV or your iOS device, did 10 network calls to these different microservices. It would just not scale at all. You would have a very bad customer experience. It would feel like using the Disney app. It's just not ideal. Instead, we need a single front door for the API that your device is calling out to. From there, we do a fanout to all the different microservices, because now we are in our own network, a very fast network. Now we can do that fanout without performance implications. We have another problem to solve, because all these different devices, in subtle ways, are all a little bit different.
We try to make the UI look and behave similar on every different device. All these different devices, like a TV versus an iOS device have very different limitations when it comes to memory, network bandwidth. They actually load data in subtly different ways.
Think about how you would create an API that would work for all these different devices. Let's say you create a REST API. We're probably going to get either too little or too much data. If we create one REST API to rule them all, it's going to be a bad experience for all these different devices, because we always waste some data, or we have to do multiple network calls, which is also bad. To fix that problem, we used what we call a backend for frontend pattern. Basically, every frontend, every UI, gets its own mini backend. That mini backend is then responsible for doing the fanout and getting the data that that UI exactly needs at that specific point. These used to be backed by Groovy scripts. That mini backend was basically a Groovy script for a specific screen on a specific device, or actually a version of a specific device. These scripts would be written by UI developers, because they are the only ones who actually know what data exactly they need to render a specific screen. This Groovy script would just live in an API server, which is a giant Java app, basically. It would do a fanout to all these different microservices by just calling Java client libraries. These client libraries are basically just wrappers around either a gRPC or a REST client.
Now, here we started seeing an interesting problem, because, how do you take care of such a fanout in Java? That's actually really not trivial. Because if you do this the traditional way, you create a bunch of threads, and you start to manage that fanout with manual thread management; that gets very hairy very quickly, because it's not just managing a bunch of threads, it is also taking care of fault tolerance. What if one of those services is not responding quickly enough? What if it is just failing? Now we have to clean up threads and make sure that everything comes together nicely again. Again, not trivial at all. This is where RxJava and reactive programming really came in. Reactive programming gives you a much better way to do such fanouts. It will take care of all the thread management and things like that for you. Exactly because of this fanout behavior, that is why we went so deep into the reactive programming space, and we were partly responsible for making RxJava a big thing many years ago. On top of RxJava, we created Hystrix, a fault tolerance library, which takes care of failover and bulkheading, and all these things. This made a lot of sense seven years ago when I joined. This was the big architecture that was serving most traffic. Actually, it is still a big part of our architecture, because depending on what device you're using, if it's a slightly older device, you probably still get served by this API. We don't have just the one architecture, we have many architectures, because it is nicer that way.
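To make the "traditional way" concrete, here is a minimal sketch of such a fanout using only the JDK's CompletableFuture. The service names, timeouts, and fallbacks are made up for illustration; RxJava and Hystrix layered richer operators, circuit breaking, and bulkheading on top of this basic pattern.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class Fanout {

    // Stand-in for a client-library call to one microservice.
    static CompletableFuture<String> call(String service) {
        return CompletableFuture.supplyAsync(() -> service + "-data");
    }

    // Fan out to several services, time-box each call, and fall back to a
    // default so one slow or failing service doesn't fail the whole screen.
    static List<String> fetchScreenData() {
        CompletableFuture<String> top10 = call("top10")
                .completeOnTimeout("top10-fallback", 200, TimeUnit.MILLISECONDS);
        CompletableFuture<String> artwork = call("artwork")
                .completeOnTimeout("artwork-fallback", 200, TimeUnit.MILLISECONDS)
                .exceptionally(t -> "artwork-fallback");
        CompletableFuture<String> metadata = call("metadata")
                .completeOnTimeout("metadata-fallback", 200, TimeUnit.MILLISECONDS);

        // Join only after all three calls have completed (or timed out).
        return CompletableFuture.allOf(top10, artwork, metadata)
                .thenApply(v -> List.of(top10.join(), artwork.join(), metadata.join()))
                .join();
    }

    public static void main(String[] args) {
        System.out.println(fetchScreenData());
    }
}
```

Even this toy version has to think about timeouts and fallbacks per call; multiply that by dozens of services and device-specific scripts, and the appeal of a library that manages all of it for you becomes clear.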
Limitations
There are some limitations, although this obviously works really well, because we have been able to grow our member base primarily on this architecture. One downside is that there's a script for each endpoint. Because, again, we need an API for each of these different UIs, there are just a lot of scripts to maintain and manage. Another problem is that because the UI developers are the ones who know what data they need, they have to create all those mini backends. Now they are in the Groovy and Java space, using RxJava. Although they're very capable of doing so, it's probably not a primary language that they are using on a daily basis. The main problem is really that reactive is just really hard. Speaking for myself, I've been doing reactive programming for at least 10 years. I used to be extremely excited about it, and tell everyone about how great it all is. It is actually hard, because even with that experience, if I look at a non-trivial piece of reactive code, I have no clue what's going on. It takes me quite a bit of time to actually wrap my head around, ok, this is actually what's happening. These are the operations that are supposed to happen. This is the fallback behavior. It's hard.
GraphQL Federation
Slowly, we have been migrating to a completely new architecture, one that puts things in a different perspective. That's all based on GraphQL Federation. Comparing GraphQL to REST, one very important aspect of GraphQL is that with GraphQL, you always have a schema. In your schema, you put all your operations, so your queries and your mutations, and you define them, and you say exactly which fields are available on the types that you're returning from your queries. Here we have a shows query, which returns a show type; a show has a title, and it has reviews. Reviews, again, is another type that we define. Then we can send a query to our API, which is on the right-hand side of the slide. What we have to do there, and this is, again, really important, is be explicit about our field selection. We can't just ask for shows and get all the data from shows. We have to say specifically that we want the title and the star score of the reviews of a show. If we're not asking for a field, we're not getting that field. This is super important because, compared with REST, where you basically get whatever the REST service decides to send you, you're just getting the data that you're explicitly asking for. It's more work to specify your query, but it solves the whole problem of over-fetching, where you get much more data than you actually need. This makes it much easier to create one API that serves all the different UIs. Typically, when you send a GraphQL query, you will just get the result back encoded as JSON.
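Reconstructing the slide's example as a sketch (the exact field names are illustrative):

```graphql
type Query {
  shows: [Show]
}

type Show {
  title: String
  reviews: [Review]
}

type Review {
  starScore: Int
}
```

A query then explicitly selects fields:

```graphql
query {
  shows {
    title
    reviews {
      starScore
    }
  }
}
```

The JSON response mirrors the query shape (data values here are made up):

```json
{
  "data": {
    "shows": [
      { "title": "Stranger Things", "reviews": [{ "starScore": 5 }] }
    ]
  }
}
```

Leave `starScore` out of the query and it simply isn't fetched or returned; that field selection is what makes one API workable for very different devices.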
We're not just doing GraphQL, we're actually doing GraphQL Federation to fit it back into our microservices architecture. In this picture, we still have our microservices, but now we call them DGSs. That's just a term that we at Netflix came up with: a domain graph service. Basically, it's just a GraphQL service. There's really nothing special about it, but we call them DGSs. A DGS is just a Java microservice, but it has a GraphQL endpoint. It has a GraphQL API. That also means it has a schema, because we said that for GraphQL, you always have a schema. The interesting thing is that we have, of course, many different DGSs, many different microservices. From the perspective of a device, so from the perspective of your TV, for example, there's just one big GraphQL schema. The GraphQL schema contains all the possible data that we have to render, let's say, a LOLOMO. Your device doesn't care that there might be a whole bunch of different microservices in the backend, and that these different microservices might each provide part of that schema. On the other side of the story, on the microservices side, in this example, our LOLOMO DGS is defining a type show, with just a title. The images DGS can extend that type show and add an artwork URL to it. These two different DGSs don't know anything about each other, other than the fact that there is a show type. They can both contribute parts of that schema, even on the same types. All they need to do is publish their schema to the federated gateway. The federated gateway knows how to talk to a DGS because they all have a /graphql endpoint. That's it. It knows these different parts of the schema, so if a query comes in where we ask for both title and artwork URL, it knows that it has to call out to these different DGSs and fetch the data that it needs. On a very high level, not that different from what we had previously, but there's a lot of difference in the details.
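As a sketch of how two DGSs contribute to the same type, using Apollo Federation-style directives (the field names here are assumptions for illustration, not Netflix's real schema):

```graphql
# LOLOMO DGS schema
type Show @key(fields: "showId") {
  showId: ID!
  title: String
}
```

```graphql
# Images DGS schema: extends the same type and contributes artworkUrl
extend type Show @key(fields: "showId") {
  showId: ID! @external
  artworkUrl: String
}
```

Both schemas are published to the federated gateway, which merges them; a query asking for both `title` and `artworkUrl` gets split by the gateway across the two DGSs.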
This also changed our story. First of all, we don't have any API duplication anymore. We don't need a backend for frontend anymore, because GraphQL as an API is flexible enough, because of field selection, that we don't really need to create those device-specific APIs anymore. It also means we don't have server-side development for UI engineers anymore. That's great. We do get a schema to collaborate on. That's a big deal, because now we have closed the gap between UI developers and backend engineers: they can collaborate on a schema and figure out, ok, what data do we need in what format? Very importantly, we don't have any client libraries in Java anymore, because the federated gateway just knows how to talk to a generic GraphQL service. It doesn't need specific code to call out to a specific API. It's all just GraphQL. All it needs to know is how to talk to a GraphQL service. That's all. It's all based on the GraphQL specification. We don't need specific code to call a specific microservice anymore.
What Does that Mean for Our Java Stack?
Now we get into, how does that change our Java stack? There's really no place anymore where we need Rx, or Hystrix, or such things, because previously, we needed this because we needed that specific code to call out, ok, I want to call this microservice and then this microservice, and at the same time, this other microservice. We needed an API for that. We don't need it anymore, because that's now taken care of by the GraphQL Federation specification. That's not completely true, because the federated gateway itself is actually still using a web client to call the different DGSs, and that is still reactive. However, it is not using any specific code for this microservice anymore. It's actually a very straightforward piece of web client code where it knows, ok, I have to call these three services, just go do it. It's all GraphQL, so it's very simple. All the DGSs and the other microservices in the backend, they're all just normal Java apps. There's not really anything specific about them. They don't need to do any reactive style of programming pretty much anywhere.
The Micro in Microservices
Before we dive deep into the rest of our Java stack, I want to speak a little bit about the micro in microservices, because it's another thing that people seem to be confused about in practice. It is true that a microservice owns a specific functionality or dataset. More importantly, such microservices are owned by a single team. That is a really important part of microservices. It is all even more true with this GraphQL federated architecture, because it's now even easier to just split things out into different microservices and make it all work very nicely. However, don't be fooled by the size of those microservices, because a lot of those so-called microservices at Netflix are a lot larger, just looking at the code base, than the big monoliths that I've worked on at many other companies. Some of these systems are really big. There's a lot of code there. Of course, when they get deployed, they might be deployed on clusters of thousands of AWS instances. There's really nothing small about them. That also answers the question, should I do microservices? It depends on your team size. Do you have the one team that takes care of everything, and it's just a small team? If you add microservices there, you're just adding complexity for no good reason. If you need to split your team into smaller teams, just because of team size, then it also makes sense to split up your larger system into smaller pieces so that each team can own and operate one or more of those services.
Java at Netflix
Time to actually really get into the Java side of things. We now know, on a higher level, how and where we're using Java. Now let's talk about what it actually looks like. We are now mostly on Java 17. It is about time. We are also already actively testing and rolling out Java 21. Java 21 just came out officially. We're just using a regular Azul Zulu JVM. It's just an OpenJDK build. We are not building our own JVM, and we don't have any plans to build our own JVM, although there was a very interesting Reddit thread claiming that we do. We really don't, and have no interest in doing so. OpenJDK is really great. We have about 2800 Java applications. These are mostly all microservices of a variety of sizes. Then about 1500 internal libraries. Some of them are actual libraries, and many of them are just client libraries, which basically just sit in front of a gRPC or REST service. For our build system, we use Gradle, and on top of Gradle we have Nebula, a set of open sourced Gradle plugins. The most important aspect of Nebula, and I highly recommend looking into this, is conflict resolution of libraries. As you know, Java has a flat classpath. You can only have the one version of a library at a given time; if you have more than one version, interesting things happen. To prevent these interesting things from happening, you really want to just pick one, basically, and Nebula takes care of that. The next thing that Nebula does is version locking. Basically, you get reproducible builds: you always build with the same set of versions of libraries until you explicitly upgrade. That makes it all very reproducible. We're pretty much exclusively using IntelliJ as our IDE. In the last few years, we have also invested a lot of effort in developing IntelliJ plugins, to help developers do the right thing.
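Version locking with the open-source Nebula plugins looks roughly like this; the plugin ID, version, and task names below are assumptions based on the public nebula dependency-lock project, not Netflix's internal setup.

```groovy
// build.gradle (illustrative)
plugins {
    id 'com.netflix.nebula.dependency-lock' version '13.0.0'
}

dependencies {
    // A dynamic version that would normally float between builds.
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.+'
}
```

Running `./gradlew generateLock saveLock` writes a `dependencies.lock` file pinning `2.+` to one concrete version, which every subsequent build then uses until the lock is explicitly regenerated.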
The Java 17 Upgrade
We are mostly on Java 17. That is actually a big deal, because, this is embarrassing, but at the beginning of the year, we were mostly on Java 8. Java 8 is old. Why were we still on Java 8? We had Java 11, and then Java 17, available for a very long time already. Somehow, we just didn't move. One of the reasons is that until about a year ago, about half of our microservices, especially the bigger ones, were still on our old application stack. It was not Spring. It was a homegrown thing based on Guice, and a lot of old Java EE APIs, lots of old libraries that were no longer maintained. At the very beginning, when we started upgrading to Java 11 initially, a lot of these older libraries were just not compatible. Then developers just got the impression that this upgrade is hard, and it breaks things, and I should probably just not do it. On the other hand, there were also very limited perceived benefits for developers, because if you compare Java 8 to Java 17, there are definitely some nice language features. Text blocks alone are enough reason for me to upgrade, but it's not that big of a deal. The difference between 8 and 17 is nice, but it's not life-changing. There was more excitement about moving to Kotlin than about just upgrading the JDK.
When we finally did start pushing on updating to Java 17, we saw something really interesting. We saw about 20% better CPU usage on 17 versus Java 8, without any code changes. It was all just because of improvements in G1, the garbage collector that we are mostly using. Twenty percent better CPU usage is a big deal at the scale that we're running. That's a lot of money, potentially. Speaking about G1, G1 is the garbage collector that we use for most of our workloads at the moment. We've tested with all the different garbage collectors available. G1 is generally where we get the best balance of tradeoffs. There are some exceptions, for example, Zuul, which is our proxy. It runs on Shenandoah, the low pause time garbage collector. For most workloads, Shenandoah doesn't work as well as G1 does. Although G1 isn't that exciting anymore, it is still just really good.
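For reference, selecting between these collectors is a matter of standard HotSpot flags; the lines below are illustrative, not Netflix's actual tuning.

```shell
java -XX:+UseG1GC -jar app.jar                    # G1, the default for most workloads
java -XX:+UseShenandoahGC -jar app.jar            # low pause time Shenandoah, as used for Zuul
java -XX:+UseZGC -XX:+ZGenerational -jar app.jar  # generational ZGC (the -XX:+ZGenerational flag is new in Java 21)
```

Picking a collector is only the starting point; as noted later, real gains come from measuring a workload and tweaking from there.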
Java 21+
Now that we have finally made a big push to Java 17, and we've got most services just upgraded, we also have Java 21 available. We've been testing with that for quite a few months already. Now things really get exciting. The first exciting thing is that if you're on Java 17, upgrading to Java 21 is almost a no-op. It's just super easy. You don't have the problems that we had from Java 8 to newer versions. There's also just a lot more interesting features. The first obvious one that I'm super excited about is virtual threads. This is just copy-paste, it's from the JEP, the specification from Java 21 of virtual threads. It's supposed to enable server applications written in a simple thread-per-request style to scale at near optimal hardware utilization. It sounds pretty good. This thread-per-request style, if you're using something that's based on servlets, so Spring Web MVC, or any other framework based on servlets, thread-per-request is basically what you get. A request comes in, Tomcat or whatever server you're using gives it a thread. That thread is basically where all the work happens, or starts happening for the specific request, and stays through that request until the request is done. That is a very simple style and easy to understand style of programming, and all the frameworks are based on that. It has some scalability limitations, because you can only have so many threads effectively running in a system. If you have a lot of requests coming in, which we obviously have, then the number of threads is just a limiting factor in how you can scale your systems. Changing that model is really important. The alternative to that is, of course, doing reactive again, so do something like WebFlux. That also gets you in reactive programming, again, with all the complexities that we already talked about.
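A hedged sketch of what that thread-per-request scaling point looks like with virtual threads (requires Java 21; the "requests" here just sleep to simulate blocking I/O, and the numbers are arbitrary):

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsDemo {

    // Run N blocking "request handlers", one virtual thread per task.
    static int handleConcurrently(int requests) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(Duration.ofMillis(10)); // simulate blocking I/O
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(handleConcurrently(10_000));
    }
}
```

With platform threads, 10,000 concurrent blocking tasks would need 10,000 OS threads; virtual threads keep the same simple blocking style but make each thread cheap, which is exactly the scaling property the JEP describes.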
Now, I think that virtual threads is probably the most exciting Java feature since probably lambdas. I think that down the line, it is really going to change the way we write and scale our Java code. I think that, in the end, it is probably going to further reduce reactive code, because there's just not really any need for it anymore. It just takes away that complexity. We have already been running virtual threads in production for the last month or so, experimenting with it a little bit. I'll get back to that in more detail. Then the other interesting feature in Java 21 is the new, or rather updated, garbage collector, because ZGC is not new. It was already available in previous versions. They now made it generational, which gives it more of the benefits that a generational collector like G1 has. That will make ZGC a better fit for a broader variety of workloads. It's still focused on low pause times, but it will just work in a broader variety of use cases. It's a little bit early to tell, because we haven't done enough testing with this yet, but we are expecting that ZGC is now going to be a really good performance upgrade, basically, for a lot of our workloads and a lot of our services. Again, these things are a really big deal, where we could save a lot of money on resources. Shenandoah is also now generational, but that is still in preview. Again, we're going to just run with that and see what happens. Garbage collection is really just too complex of a topic to just say, drop in this garbage collector with these flags, and it's all going to be magic and super-fast. It just doesn't work that way. It's a business where you just try things out, and then you tweak it a bit, and you try it again, and then you find the optimal state. We're not quite there yet. We are expecting to see some very interesting things there. Then, finally, in Java 21, you also just have a lot of nice language features.
We get this concept of data-oriented programming now in the Java language. It is really nice. It's the combination of records and pattern matching and things like that. Java is pretty nice right now.
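A small illustration of that combination, with a made-up domain (none of this is from Netflix's codebase): records model the data, a sealed interface closes the hierarchy, and pattern matching for switch destructures it. Requires Java 21.

```java
public class DataOriented {

    // The data model: immutable records under a sealed hierarchy.
    sealed interface Title permits Movie, Series {}
    record Movie(String name, int minutes) implements Title {}
    record Series(String name, int seasons) implements Title {}

    // Pattern matching for switch destructures the records; because the
    // interface is sealed, the compiler checks the switch is exhaustive.
    static String describe(Title t) {
        return switch (t) {
            case Movie(String name, int minutes) -> name + " (" + minutes + " min)";
            case Series(String name, int seasons) -> name + " (" + seasons + " seasons)";
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new Series("Stranger Things", 4)));
    }
}
```

Adding a third Title implementation would turn into a compile error at the switch until it is handled, which is the data-oriented programming payoff.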
Virtual Threads
Back to virtual threads. Although I said that this is a big deal, and is probably going to change the way we write our code and scale our code, it is also not a free lunch. It's not that you just enable Java 21 on your instances, and now, by the magic of virtual threads, everything runs faster. It doesn't work that way. First of all, we have to change our framework and library code, and to some extent application code, to actually start leveraging virtual threads; that's step one. There are a few obvious places where we can do that, and we already started experimenting. The first is the Tomcat thread pool; again, this is the pool of threads that gives you thread-per-request. That seems a fairly obvious place where we can just use virtual threads instead. Instead of using a thread pool, you use virtual threads. We have enabled that, and are already running some big services in production with virtual threads enabled. It doesn't automatically make things a lot faster, because you need to do other things as well to really leverage it. It also doesn't make things worse. You can just safely enable this, basically; sometimes you get some benefits out of it, sometimes it doesn't really change anything because it wasn't the limiting factor. That's something that you should probably start with. Async task execution in Spring is, again, just a thread pool, and very often you get blocking code for other network calls there anyway. It seemed to be a good candidate for virtual threads, so we enabled it there. Then a really big one that we haven't really gotten into yet, but that I expect will be game changing, is how we do GraphQL query execution. Potentially, with GraphQL, every field can be fetched in parallel. It makes a lot of sense that we would actually do that on virtual threads because, again, this is often code where you do more network calls and things like that. 
Virtual threads just make a lot of sense there, but we have to implement this and test it out, and it'll probably take a little bit of time before we get the optimal model there.
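For the Tomcat and async task executor cases above, Spring Boot 3.2 wraps all of this into a single switch; the property below is the one introduced with Boot 3.2, shown as a sketch rather than our production configuration.

```properties
# application.properties, Spring Boot 3.2+ on Java 21
spring.threads.virtual.enabled=true
```

With this set, Tomcat request handling and Spring's async task execution run on virtual threads out of the box, which matches the "framework does the work, applications just get faster" point made below.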
Then we have some other places that seemed obvious. For example, we have a thread worker pool for gRPC clients where the gRPC calls to outgoing services happen. It seemed like such an obvious place like, let's drop in virtual threads there. Then we saw that we actually decreased performance by a few percent. It turns out that these gRPC client worker pools are very CPU intensive. If you then drop in virtual threads, you actually make things worse. That's not a bad thing, necessarily. This is just something that we had to learn. It does show that this is not a free lunch. We actually have to figure out, where does it make sense, where does it not make sense, and implement virtual threads at the right points, basically. The good news is this is mostly all framework work at this point. We can do it as a platform team, and we can do it in open source libraries that we're using. Then our developers will just get faster apps, basically. It's good. In Spring 6.1, or Spring Boot 3.2, there's a lot of work being done to leverage virtual threads out of the box, that will come out next month. We will probably adopt that somewhere early next year. Then there's a really interesting discussion going on on GitHub, in GraphQL Java, about changing the GraphQL query execution, or potentially even rewriting it to fully leverage virtual threads. That is not figured out yet. It's a discussion going on. If you're in that space, that's definitely something to contribute to, I think. Then for the user code, because all this other stuff is mostly framework code, for user code, I think structured concurrency is the other place that we're going to see a lot of replacement of reactive code. Because structured concurrency is finally giving us the API to deal with things like fanouts, and then bringing everything together again. Structured concurrency is still in preview in Java 21. It seems very close to final, so I think it's at least safe to start experimenting with this and try things out. 
Then a little bit further down the line, we also get scoped values, which is another new specification coming out related to virtual threads. That is going to give us a way to basically get rid of ThreadLocal. This is again mostly framework related work. It's just a much nicer and more efficient way of something similar to ThreadLocal.
Spring Boot Netflix
I've already mentioned a little bit that we use Spring Boot. Since about a year ago, we have completely standardized on Spring Boot. Up until a year ago, about 50% of our applications were still on our own homegrown, not maintained at all, Java stack based on Guice, and a bunch of very outdated Java EE libraries. We hadn't really made a good push to get everything on Spring Boot; all the new applications were based on Spring Boot already. That became very messy, especially because that old homegrown framework just wasn't maintained very well. We made a really big effort to just get all the services migrated to Spring Boot. That migration was mostly just a lot of blood, sweat, and tears from a lot of teams. It's just not easy to go from one programming model to another one. As platform teams, we did provide a lot of tooling, for example, IntelliJ plugins to take care of, where possible, the code migrations and configuration migrations and things like that. Still, it was just a lot of work. Pretty painful. Now that we are on Spring Boot, though, we have the one framework that everyone is using, which makes things a lot nicer for everyone. We mostly just try to use the latest version of OSS Spring Boot. We're going to be using 3.1, and try to stay as close as possible to the open source community, because that's where we get the most benefit. On top of that, we need a lot of integration with our Netflix ecosystem and the infrastructure that we have. That is what we call Spring Boot Netflix, which is basically just a whole set of modules that we build on top of Spring Boot. It's developed in the same way as Spring Boot itself is built, so lots of auto-configurations. That's where we add things like gRPC client and server support that's very integrated with our SSO stack, for AuthZ and AuthN. You get observability, so tracing, metrics, and distributed logging. 
We have a whole bunch of HTTP clients that take care of mTLS and again observability and integration with the security stack. We deploy all these applications with embedded Tomcat, which is pretty standard for a Spring Boot application.
To give an idea of what those features look like: we have, for example, a gRPC Spring client. This looks very Spring-like, but it is something that we added. Basically, this references a property file, which describes the gRPC service and tells where the service lives. It configures failover behavior. That way, you can just use a Java API with an extra annotation to call another gRPC service. With that, you also get things like observability completely for free. For any request, either gRPC or HTTP, you get observability for free, with tracing, and metrics, and all these things available. Another example is how we integrate with Spring Security, so we can get our SSO caller. You basically get the user that called your service, even if there were many services in between in a call chain. As I said, we integrated with Spring Security to also do role-based authorization based on our own authentication models.
Why Spring Boot?
You might be wondering, why are we using Spring Boot, why not some other, fancier framework? Because, of course, there's been a lot of innovation in the Java space in the last few years, with other frameworks available. Spring Boot is really the most popular Java framework. That doesn't necessarily make it better, but it does give a lot of leverage when it comes to the open source community, which is really big, of course, for Spring Boot, and accessing documentation, training, and all these things. More importantly, I think, is just looking at the Spring framework: it has been so well maintained over the years. I think I started using the Spring framework 15 years ago. It is quite amazing, actually, that the framework has been so stable and has evolved so well over time, because it's not the same thing as it was 15 years ago, but a lot of the concepts are still there. It gives us a lot of trust in the Spring team that, also in the future, this will be a very good place to be.
The Road to Spring Boot 3
Almost a year ago, Spring Boot 3 came out, and that was a big deal, because Spring Boot 3 really moved the whole Java ecosystem forward, which was stuck in two different ways. The first is that the open source ecosystem in Java was stuck on Java 8, because a lot of companies were stuck on Java 8, and no one wanted to be the first one to break that. Companies didn't upgrade because everything just worked fine on Java 8 anyway. Now, finally, the Spring team has said, we are done with Java 8, Java 17 is your new baseline. That forces the whole community to say, ok, fine, we'll do Java 17, and everything can start moving again. Now we can start leveraging those new language features. It also means that although the baseline is just Java 17, we can actually start using Java 21 with virtual threads under the hood. That's exactly what they're doing. The second part is the whole mess around javax to Jakarta, thanks to Oracle. This is just a simple namespace change, but it is extremely complex for a library ecosystem, because a library can use either javax or Jakarta, and that makes it compatible with one but not the other. Because the Spring team has now said, we're doing Jakarta, the whole ecosystem can start moving, because Spring has such a big impact. We finally get past the point where everyone was stuck. It is still a big change to get onto these new versions, so moving to Spring Boot 3 isn't trivial, and we've done a lot of tooling work to make that happen. Probably the most interesting piece is that we open sourced a Gradle plugin that does bytecode transformation at artifact resolution time. When you download an artifact, a JAR file, the plugin does bytecode transformation from javax to Jakarta if you're on Spring Boot 3, so it basically fixes that whole namespace problem on the fly, and you don't have to change your library. That gets us unstuck.
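To make the rename concrete, here is a toy sketch, not the actual plugin (which rewrites class references in bytecode during Gradle artifact resolution), showing the mechanical javax-to-jakarta package rename on a few EE class names:

```java
// Toy illustration of the javax -> jakarta migration: the change is a
// mechanical package rename, which is why it can be automated.
public class JakartaRename {

    // Rewrites a few well-known EE package prefixes; the real plugin does
    // this at the bytecode level for all affected references.
    static String migrate(String className) {
        return className
            .replace("javax.servlet", "jakarta.servlet")
            .replace("javax.persistence", "jakarta.persistence")
            .replace("javax.validation", "jakarta.validation");
    }

    public static void main(String[] args) {
        System.out.println(migrate("javax.servlet.http.HttpServletRequest"));
        // prints jakarta.servlet.http.HttpServletRequest
    }
}
```

Note that non-EE packages such as java.util are untouched; only the Java EE namespaces moved to Jakarta.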
DGS Framework
Then I talked quite a bit about DGS. DGS itself is not the concept; GraphQL Federation is the concept. The DGS framework is just the framework that we use to build our GraphQL services in Java. About three or four years ago, when we started the journey to GraphQL and GraphQL Federation, there really wasn't any good Java framework out there that was mature enough for us to use at our scale. There was GraphQL Java, which is a lower level GraphQL library. That library is great, and we are building on top of it. It is completely crucial for us, but it's too low level to use directly in an application, at least in my opinion. What we needed was a GraphQL framework for Spring Boot, basically giving a programming model based on annotations, as you are used to in Spring Boot. We needed things like code generation for schema types, and support for federation and all these things. That's exactly what you're getting with the DGS framework. About, I think, almost three years ago, we decided to open source the DGS framework. It's on GitHub. There's a really large community. There are lots of companies using it now. It's also exactly the version that we use at Netflix, so we're not using a fork or anything like that. It has evolved really nicely over the last few years.
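As a flavor of that annotation-based programming model, a minimal DGS data fetcher looks roughly like this. @DgsComponent, @DgsQuery, and @InputArgument are actual DGS annotations; the Show type and the data are made up for illustration:

```java
@DgsComponent
public class ShowsDataFetcher {

    // Resolves the `shows` field on the Query type defined in the GraphQL schema.
    @DgsQuery
    public List<Show> shows(@InputArgument String titleFilter) {
        List<Show> shows = List.of(new Show("Stranger Things", 2016));
        if (titleFilter == null) {
            return shows;
        }
        return shows.stream()
            .filter(s -> s.title().contains(titleFilter))
            .toList();
    }

    // Illustrative record matching a `Show` type in the schema; in practice
    // these types are generated from the schema by DGS code generation.
    record Show(String title, int releaseYear) {}
}
```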
If you are actually in the GraphQL and Spring space, you have probably seen that in Spring Boot 3, the Spring team also added GraphQL support, which they call Spring GraphQL. That was not ideal for the larger community, because now the community would have to choose: do I bet on the DGS framework, or do I go with Spring GraphQL? Both seem interesting, both seem great. Both have an interesting feature set, but a different feature set. What do I bet on? I could go and sell you the DGS framework, how it's better and evolves faster, and all these things which are right now probably true, because we've been around for a little bit longer. That's really not the point. The point is that you shouldn't have to choose. In the last few months, we have been working with the Spring team to get full integration between those two frameworks. What you basically get with that is that you can combine the DGS and Spring GraphQL programming models and their features in the same app, and they will just happily live together. That's possible because we're both using GraphQL Java as the low-level library. That's how it all fits together. We just integrated the frameworks really deeply. We're still finishing that, and it will probably be released early 2024. At least that gives you the idea: it doesn't really matter if you pick the DGS framework today. You don't get stuck, unable to leverage features coming from the Spring team, because very soon you will just be able to combine both very nicely.
Questions and Answers
Participant 1: Are you guys still using Zuul?
Bakker: We are, yes. Zuul is sitting in front of literally every request. Zuul is just a proxy. It's doing a lot of traffic control, basically. It's not the API server that we talked about earlier. Zuul sits in front of either the DGS federated architecture or like the old architecture.
Participant 2: You talked about the upgrade for Java having a limited perceived value. I think that's interesting. A lot of enterprises tend to have this mindset of, if it isn't broke, don't fix it, [inaudible 00:44:02]. What did you do to change that perception, or was it just the Spring upgrade that pushed you guys to do the upgrade?
Bakker: No, actually, the main story was the performance benefit. The fact that we could say, you get 20% better performance. It depends a little bit on the service what that number actually looks like and what it actually means, but the number is real. The fact that you could say that made a lot of service owners more interested in it, but it also got leadership higher up to push: this is going to save money, go do it. That was actually the most helpful thing. The Spring Boot upgrade came later, and also forces the issue, but it was after the fact.
Participant 3: A lot of advancements to OpenJDK, so from 8 to 17, did it directly go from 8 to 17?
Bakker: We had services running on Java 11 because the plan was 8, 11, 17. Java 11, we had services running there, it never really took off because there just wasn't enough benefit. We mostly went from 8 to 17.
Participant 3: A lot of advancements went into OpenJDK from 8 to 17. Depending on the collectors, as you were talking about, there was some impact with respect to stop-the-world pauses and the background collection that's happening with Shenandoah and ZGC. There's a tradeoff, but a lot of improvements went into reducing memory footprints and things like that.
Participant 4: You mentioned that 20% was what you needed, but how did you even secure the time to actually experiment with that? How did you convince stakeholders to say, we're going to spend some time doing an upgrade on some services, and then we'll demonstrate the values with that?
Bakker: There is the benefit of having a platform team as we have. If I look at my own time, I can do whatever I want. If I think there is some interesting value to be had in experimenting with garbage collection (I'm actually not mostly doing performance work, there are other folks who are much better at that, it's just an example), and there is potential value in there, you can get time to just experiment and play with it, because the time of one or two people is just a drop in the ocean.
Participant 5: Did you see any difference in the memory footprint between virtual threads and traditional ones for the same number of request-responses? The second question is regarding GraphQL versus traditional SOAP. SOAP was superseded by REST back in the days when bandwidth was very precious, and your network mattered a lot if you had a large amount of data going through. Now that data is cheap, that matters less. SOAP had the disadvantage of the schema going between the client and the server, and I see that GraphQL has the same thing, now that we have the query and the schema going between the client and the server. How do you see REST, SOAP, and GraphQL in that context?
Bakker: I think SOAP had, conceptually, a few good things. For example, the fact that there is a schema, that was a good thing. But it was so incredibly hard to use and so complex that the overhead of doing the right thing was just too much. Then REST, at least the way everyone is using REST, went to the other extreme: no schema, nothing at all, nothing is defined. You just throw in some data and we're all good. I think GraphQL sits in the middle there. It doesn't have a lot of overhead for developers to implement the schema. It's very easy, much easier to use than SOAP was. You do get a schema, and that takes away a lot of the downsides of just having REST without a schema. It feels like it has found the sweet spot for APIs. Probably if I'm back here 10 years from now, I will be like, "GraphQL, a terrible idea. How did we ever get to that?" You know how that goes. Right now, it feels like a sweet spot.
There is a difference. That is why we have to be very careful about adopting virtual threads where we replace traditional thread pools. Depending on whether those thread pools are very CPU intensive or not, it does or does not make a lot of sense. The memory footprint doesn't seem to be a big factor. We haven't seen any significant bumps there at all. Again, it's all very early days, and we're just experimenting with everything. We haven't quite figured it out yet. Memory-wise, it seems to be very straightforward.
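As a rough illustration of why virtual threads change the thread-pool question for blocking, non-CPU-bound work, here is a self-contained Java 21 sketch that runs many blocking tasks without sizing a pool at all:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsDemo {

    // Runs n blocking tasks, each on its own virtual thread, and returns
    // how many completed. No pool sizing is needed: virtual threads are
    // cheap to create, and blocking calls park them instead of pinning an
    // OS thread.
    static int runTasks(int n) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                executor.submit(() -> {
                    try {
                        Thread.sleep(5); // blocking call parks the virtual thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(10_000));
    }
}
```

For CPU-bound work, by contrast, a virtual thread per task buys nothing, since the bottleneck is the number of cores, which is consistent with the caution above about where to replace traditional thread pools.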
Participant 6: Then I was just wondering about your Kotlin usage percentage, and what that is looking like?
Bakker: It is fairly low. For a while we had a bunch of teams, including my own team, very excited about Kotlin. The DGS framework itself is written in Kotlin, although it's targeting mostly Java apps. That was my choice. We have microservices written in Kotlin as well. The only downside that we see with Kotlin is that we invest a lot in developer tooling, so IntelliJ plugins and automated tooling based on Gradle, to help with things like version upgrades of Spring. That story is much harder for a platform team if you have to deal with multiple languages. For an IntelliJ plugin, even though Kotlin and IntelliJ both come from JetBrains, you need to write your inspections twice if you want to support both Java and Kotlin. It's just a lot more work. It's just a lot easier for platform teams if everyone is happily using Java. That doesn't make Kotlin bad, though. We have only seen good things with Kotlin, and it works pretty well. It's a great language.