Transcript
Vacchi: My presentation is "Your Program as a Transpiler." First of all, a couple of words about who I am. My name is Edoardo Vacchi A couple of words about me - that's my handle, you can find me on Twitter and GitHub with that handle.
After doing research at University of Milan where I did mainly programming languages, research, framework stuff, or implementing compilers, interpreters, then I went to UniCredit - which is a large Italian bank, it's a famous bank in Europe as well - in the research and development department, when I did a little bit of research for languages for distributed computation, but mainly I worked on a streaming engine. Right now, I'm at Red Hat, where I work mainly on Drools and jBPM. Because I did my PhD at University of Milan, even though I left academia, it kind of stuck with me. That's why I was told so many times, "Your talks, your papers have to start with proper motivation." So, here’s motivation.
Motivation
My first task at Red Hat was boring at the beginning. It was one of those tasks that nobody wants to do, because it was a legacy piece of code that reads basically an XML file, a BPMN diagram, which is the thing at the top, for our diagramming tool. It was presented to me as a data model mapping problem. Basically, it should have been trivial, but in fact, it was much more challenging that you could have imagined. When you read an XML file, what you get it's a tree, while the thing at the top, it's a diagram, so it's not the exact same representation. Even though the two data models are basically very similar, we can not just do trivial mapping.
I started to think, "What if I could solve this problem in a way that it's similar to how programming languages are implemented?" I started to realize that maybe this kind of approach could have been useful for other people too. Language implementation is often seen as some dark art, a dark magic, something that only very skilled people can manage to understand. In fact, the design patterns that we use to design and implement programming languages are pretty simple. Actually, it's something that all of you can understand here, and probably you might want to use it, or you might be able to use it in your own programs.
Plus, learning about language implementation can give you a very different angle to look at your problems, and so it opens to you different solution spaces. Moreover, if you get a better understanding of how compiler works, you get a better understanding of how tools like GraalVM and Quarkus work inside. You get other benefits as well for tooling that you might be using, and you will probably might want to use soon.
The goals for today - our programs, even if we sometimes do not really realize, have often a pre-processing phase when you prepare them for execution. Then, there's the actual processing phase, which is the execution phase. I want you today to try and understand how to recognize this pre-processing stage inside your programs. I want to show you how to effectively implement this pre-processing phase.
Transpilers
The title of this talk is, "Your Program as a Transpiler." How many people have ever used or know what a transpiler is? Many people. I want to give you my definition. To give you my definition, I first have to give the definition of a compiler. A compiler is a program that translates code written in some programming language into a semantically equivalent program that is written into another programming language.
Usually, when we talk about compilers, they usually say that the compilers target languages at a lower level of abstraction. While, on the other hand, the term transpiler was probably invented because a source-to-source translator didn't really roll out the tongue very nicely. It's still a program that takes a program written in some programming language and it translates it into another semantically equivalent program written in another programming language. In this case, the target language is supposed to be at the same level of abstraction as the source language. What does this mean?
Because there's this difference between level of abstraction, I believe people tend to think that transpilers are kind of simpler than compilers, because lower-level languages are hard. They actually are not, in fact, lower-level languages are simple. That's why we use compilers, we don't want to write in a low-level language, because it will make us write a lot of code. In fact, they're so simple that they have very few instructions. That's why we use compilers, we want to write little code and get some machine to expand this very little code into a lot of code. That's the work that a compiler does.
Sometimes people say, "It's not a compiler. It's a transpiler, because it just takes syntactic sugar and expands it into an equivalent program." It expands it into more code. It takes little code, and expands it into more code, like a compiler. Proper compilers do low-level optimizations. My transpiler is not doing low-level optimization - I mean, it's not even low-level.
Well, there's a proper definition for a compiler that does optimization, which is optimizing compiler. We think that proper compilers should be optimizing compilers, because many compilers that we use, like GCC, for instance, are actually also optimizing compilers. That's not necessarily true, a proper compiler is not necessarily a compiler that produces optimized code.
I believe that the distinction is moot, and sometimes people use the word transpiler just to justify that they've wrote not such a good compiler. Writing a good transpiler is just as much work as writing a good compiler, because basically they do the same thing. Now, the question will be, "How do you write a good compiler?" The title of this presentation is, "Your Program as a Compiler."
Compiler-Like Workflows
Now, the question is, "Ok, we now know what a transpiler is, we now know what a compiler is, but what does this mean for my program?" Well, I think there are at least two classes of problems that can be solved in a compiler-like fashion. The first time is boot time optimization problems, you have a program that does some stuff at the beginning and it takes a lot of time, and you want to make this time shorter. If you guys went to the Quarkus presentation yesterday, it's something that has to do with that, but it's not something that we will see today. Today, we will focus on data transformation problems.
Data transformation problems - you have map of configuration keys and you want to read them into structured data, so you transform the thing that you read from a file into something that you will use later. Or think of a Spark job, when you write a Spark job, you're describing a transformation. When the Spark job gets executed, before it gets executed, it gets turned into an execution plan. This is the same thing that happens in a database, the query is transforming to that execution plan, and then it is executed.
Running Example: Function Orchestration
Another thing from academia - you should have a running example for your paper. My running example will be function orchestration. You are building an immutable Dockerized serverless function on Kubernetes in the cloud. I put enough buzzwords in here. What you're doing is describing some computation, and you have a bunch of functions, and you want to connect the output of a function into the input of another function, and so on, and describe a diagram.
The problem is, unfortunately, there is no standard way to describe function orchestration yet, at least there's no fixed standard, there are some drafts. Every single vendor is coming up with their own implementation. What you could do is define your very own YAML format, because YAML is the future. It could be JSON, but JSON is not ready, so YAML. Congratulations, you can now enjoy attending conferences worldwide. I think there's XKCD references hidden here, but I'll let you find it. Or, you could use a natural standard, a proper standard that's already defined, because you're describing a workflow. You could actually use BPMN, the thing that we saw at the beginning, which is short for Business Process Model and Notation. It is a standard that describes a proper workflow. It is XML, so it's not that fashionable, but basically it gives you tool to describe a diagram with all its features, its type of nodes, and the edges, of course, and so on.
The downside, unfortunately, is that nobody will invite you at their conference to talk about BPM, and yet I am here. The bonuses for choosing BPM is that it's XML-based - that's not the bonus, the bonus is that XML comes with a lot of tooling, so a lot of stuff you won't have to hand code. You have validation, you have parsing tools, you have a lot of stuff already done.
Then, the other bonus is that it's a proper standard. There's already a bunch of nodes that already defined, so you can already use those that are defined in specification. There's also optional specification to diagram, to lay out your workflow in a picture, in a diagram. All of this is for free, and also you get interoperability with other tooling, so you don't have to write your own designer. You can use a designer from some vendor, so, all good.
Step 1: Recognize Your Compilation Phase
The goals for today using this example is to find a way to read a BPMN workflow, execute that workflow, and then visualize that workflow, all of these in a compiler-like fashion. Step one, recognize your compilation phase. What's a compilation phase? It's your setup phase. It's the phase that you should probably do only once, and then you can do all of the other rest of your program, which is the proper execution.
Here are a few examples - configuring an application. You have to read a file with some configuration values. Your compile-time in this case will be to read the file, to read the configuration keys, validate these keys, make sure they're not misspelled, that the types of the values are the types that you expect. Then, once you have done all of this, you have a proper data structure that's validated and you don't have to validate each key, each value, each time you use it at run-time. That will be your compile-time.
Here's another example, a data transformation pipeline. The problem is, manipulate data to produce some analytics. Your compile time will be to define some transformations on your data and then decide the execution plan, then the run-time would be to evaluate the execution plan.
What about our BPMN example? In BPMN, as we were saying at the beginning, we both have an execution and the visualization. You don't have to do both at the same time, you could do only evaluation or only visualization. Your compile-time will be to read this BPMN file, analyze it in some way, and then produce some data structure that is well-suited for visiting, well-suited for evaluating. You will probably start with a start node and then go on from there.
The first part, the analysis part, might be the same, but probably the data structure that you're going to need might not be the same as the one that you use for visiting. As for reading BPMN into a data structure, since it's a standard and its XML-based, we have a schema - yes, XML has schema. Yes, there's proper tooling with the code generation and whatnot, so you don't have to write all the boring code that has to do with parsing. The types, you can infer those directly and generate the code from the schema.
Now, there's a problem. This is not a problem with this specification, but in general, it's a problem on describing a graph on a file. You will probably define your nodes, and then you have to define the edges between those nodes. There's no particular ordering usually imposed on how you write the nodes and the edges, so you might have forward references. In this case, we have an edge definition, which is called sequenceFlow, before the node definitions. This is not something you can control, because, in this case, it's an XML format. In the case it's a hand-written format, you cannot control what the user will write, and you don't want to put too much burden on the user.
Another problem that we have, as I was telling you at the beginning, there's a layout definition for BPMN diagrams. In general, if you are describing a diagram and you want to visualize it, you visualize it with the position of the nodes and the canvas. In our case, the positioning of the nodes and of the edges is separate, it’s a separate section in your file. Even here, you have cross-references, and the diagramming section may come at the beginning, at the end, you don't really know.
Step 2: Work like a Compiler
How to solve this problem? You can solve it by working like a compiler. How does a compiler work? When you compile a programming language, you start from a text representation of a program. The text representation is then fed to a parser, and out of the parser you get a parse tree, a tree data structure. The parse tree is then refining to another data structure which is still a tree, but it discards some of the things that are not really interesting, like comments, for instance. That's called an abstract syntax tree. Then it passes through intermediate representation up until you get to the compile phase. I want you to focus on this, the fact that you get through intermediate representations.
I believe that what makes a compiler a proper compiler is not optimization. It's not generating optimized code, but it's compilation phases. You can have as many compilation phases as you like, and in each compilation phase you can analyze a different aspect of your problem. Here's an example, you have a configuration file, you read the file from a path, you unmarshall the file into typed object, you sanitize the value, you validate the values, you coerce, you transform the values that are strings - probably it's coming from a text file - into typed values into memory, and then you're ready to use them.
Another example, produce a report, so a data transformation problem. You fetch the data from some different data sources, then you probably want to sanitize those values, you want to discard the values that are not valid, then you merge them into a single data stream, and then you compute some aggregates. Notice that this is step four, but actually it's hiding that this step might be four, five, six, depending on how many averages you want to compute. It might be. Then you generate the final data structure, which is the data structure that you will use to produce the actual report.
In our case, a workflow engine, you read the BPMN file, you collect the nodes, because they could come at any particular order, then you collect the edges. Now, you have the nodes already collected and then the edges that refer the nodes that you have just collected. Then you prepare a data structure that is suitable for visiting or for laying out.
What is good about compilation phases is that you get better separation of concerns, you get better testability because you can test every single phase, and therefore you can test every intermediate result. If you do everything in one single pass, then you can only test the input and the output. If the transformation is complicated enough, you will never know where you have screwed up in the middle. You can choose when and where each phase gets evaluated. Another interesting point is if your language, if your problems phase evolves with more requirements, all you might have to do - it could be enough to just add another phase.
A distinction that I want to make is phase versus pass. I'm using these terms, and they're not necessarily the literature term, but it's useful as a distinction. The phase, to me, it's something logical, I'm reading values, I'm converting values - this is logical. The pass is the act of visiting the data structure. You could do several phases in one single pass. Still, if you reason this way, you will still have a good idea of what's going on inside your program even if they're implemented as a one single pass.
I want to convince you that there's not a real cost in doing one pass or multiple passes. In fact, this is a myth, people tend to think that if you do, in one pass, many things, it's much more efficient than doing many passes, each one that does one thing - it's not. If you just take a look at a higher level, you will see that what dominates here is the operation that you do on the values. This is our configuration file example, you sanitize, you validate, you coerce.
Basically, the cost, the complexity is still three times 'n' in both cases. Yes, there's a hidden cost, but it might be much easier to understand the code on the right side. There are some cases when you cannot do one single pass. This is our example, the ordering of the nodes and the edges does not really let us doing one single pass, unless we do some tricks, some ugly stuff that then is really hard to maintain. It's better to do it in separate passes.
Here's the code that I have simplified a bit, but it's basically the code that you can get at that address up there. It's not actually jBPM, because jBPM would be too complicated for a simple example, but it's really what jBPM does internally. First, you will read the file using our parser that we have code generated, because it's XML, we have XSD. Then you can collect the nodes, you collect the edges, and then you prepare the graph for the visit by creating some suitable data structure. Then you visit it, and therefore you evaluate the workflow.
As for the layout, the first three phases might be exactly the same, the exact same code, while the number four and number five do some things a little differently, because now you have to draw the nodes and the edges on a canvas. In our case, we have this data structure, it's code generated from the XML schema. There's a root element and then there are subclasses, a FlowElement and then StartEventNode, EndEventNode, ScriptTask, and many others.
The way you could process this, if you had a language with pattern matchings like this, you have a function that takes the root element and then you pattern match against the types of all the subtypes. This is basically Scala. There's a new feature in Java called switch expression. They will bring this sometime in the future. I think, in Kotlin, you can do something like that with the when statement.
There's also the poor man version. There are two different poor man versions. One is instanceof in cast, which is valid, it's basically what a very basic switch expression will do. Or you could use the Visitor pattern, I won't go into the details, but it's a pattern that I hate. I really hate it, because it deals with the limitation of the programming languages that we use, currently Java, C#, they have some thing called the single-dispatch, that's why the pattern was invented. It's not really interesting. The idea is that you want to define the cases and then program only against those cases. I want to describe the transformation that I do on the StartEvent, on the EndEvent, on the ScriptTask, and so on. Here's the code that you will find online. The NodeCollector collects each node in a different way and decorates it with some other type, and same thing for the edges.
Step 3: Choose a Run-Time Representation
Now, final step, choose the run-time representation, this is really important. For a compiler, this will be generating code, for an interpreter which, again, has a lot in common with a compiler, it will be to evaluate. In our case, it depends. In the case of workflow evaluation, we want to evaluate the workflow. We want to visit the graph starting from the start node, and then go on, and on, up until we reach the end.
The most convenient representation to me is adjacency lists. For each node 'p', the adjacency list for 'p' is a list of all those nodes 'q' such that there is an edge from 'p' to 'q'. I would write it like this possibly, a map from node to list of nodes. If you can see, these crossed nodes are called gateways, there are two edges going outside. From this node, you would have one and two nodes you can reach, that's basically the idea. Again, the evaluation can be implemented using another visitor. As you can see, I have highlighted the points where you visit the outgoing edges that is the next node in the order. Again, this is simplified a little bit from the code that you will find online, just for readability, but it's basically the code that you will find there.
As for workflow layout, I want to really stress that in this case, the evaluation is much different. You do not really have to visit in some precise order. In my opinion, it's probably better to first layout the edges, so the arrows, and then, on top of those, draw the nodes, but that's my choice. You don't have to really follow the order of execution. You do not have to draw, first, the start node, then the end node, then, in the middle, all the nodes that are in the middle. It's not necessary, really. There are some cases where it's necessary, you will find it online. There are sub-processes in the example online, but we don't have time to do that today. That's the code for layout.
Bonus Step 4: Generate Code at Compile-Time
Bonus step four, generate code as compile-time. If you have structured your code in such a way that you can really understand what's the pre-processing stage of your code and what's the execution, the run-time of your code, you might have the chance to actually move that code that does the pre-processing part outside your program. In the compilation phase, you generate code and then you evaluate that code. The code that you have generated will be the only code that's remaining from the pre-processing stage. Your program, at that point, will just do what it was designed to do, that is actual processing.
I'm not saying this is necessary, this is a little step above what we saw today, and we don't have time to go into the details of that, but that gives me some space to talk a little bit for what we're doing with jBPM, and Drools, and also a little bit about OptaPlanner. I will focus on Drools and jBPM, because that's the things I'm working on.
Case Study
Drools is our rule engine, and jBPM is our workflow platform. By now, you will have understood what it does. The reason for the submarine at the beginning, it's the name of the initiative, "The Submarine Initiative." It comes from this quote from Dijkstra, "The question of whether a computer can think is no more interesting than the question of whether a submarine can swim." Just a funny quote to say that machines cannot really think, submarines cannot really swim.
What has this to do with all we've seen today? Well, all of this had to do with Graal because I've mentioned Graal and Quarkus. First of all I wanted to lure you into this presentation, of course - marketing - but also because it has to do with what we are doing. As we're saying, if you can do a lot of pre-processing of your program outside, you can get benefits. What kind of benefits? You might have seen Chris’ [Thalinger] talk about GraalVM and Sanne [Grinovero] talked about Quarkus. I won't go into the details of what GraalVM does, it's the next generation polyglot platform of Oracle.
The thing that I want to tell you a little about is its native binary compilation. You can get your Java programs and compile them into native binaries and get huge performance benefit from the startup point of view, they will start up very fast. The problem is you have some restrictions, basically, they all come down to restriction to reflection and restrictions to dynamic code loading. The idea is that you can overcome those restrictions by doing code generation. This is what Quarkus does. You can do your very own code generation routines, or you could write an extension to Quarkus and improve your application by using the facilities that Quarkus gives you to generate code at compile-time. Then you will get all the benefits that you might have heard about, a very fast startup time.
What does this have to do with Drools and jBPM? Both Drools and jBPM have their configuration format, they have their own languages. JBPM is something we have seen today, and Drools has its own rule definition language, the Drools definition language, or DRL, that lets you define logical rules. The idea was, if instead of processing these files every single time we start up, we can produce some model, some bytecode, that we can load at startup, and instead move the processing of this file outside the program run-time, then we will have a very faster startup.
This is what we did for DRL. This is what we are doing for jBPM. Instead of reading the XML file each and every time, read the file, we do all the processing and then generate the representation as code. Then, when our application starts, it just has to read the representation of the code. It just has to evaluate code instead of parsing, processing, so on.
What's the benefit of this? Well, the benefit is twofold. You get both benefits on a regular JVM, because you removed processing out of your run-time and at compile-time. The startup time for Drools and a jBPM application gets around one second on a JVM. If you do binary image generation on Quarkus, you get like milliseconds - from seconds to milliseconds. That's pretty impressive if you ask me.
Conclusion
Conclusion - processing phases, that's the conclusion. Of course, do more in the pre-processing phase which is your compile-time, and try and do less during your processing phase that will be your run-time. In other words, separate what you can do once from what you have to do repeatedly. Of course, if you have the chance, if you have the time, if you want, you can move all or some of your phases to compile-time.
A couple of links, the source code, and this is another link, it's the presentation that I did in Voxxed Days Milano, which is the other side of the story, that is boot-time optimization. If you go there, you'll see the slides and hopefully soon there will be the videos online. Then there's other links to Drools, submarine, so on.
See more presentations with transcripts