Transcript
Wang: My name is Stephanie Wang. Welcome to my talk about enabling remote query execution through DuckDB extensions. As the name implies, we will be talking a lot about DuckDB extensions, and a little bit about query execution: the internals of what happens when we run a query, and how we're able to enable remote query execution through DuckDB extensions. We'll go through a bit of information on what DuckDB is and why people are so excited about it. We will zoom in on DuckDB extensions and understand how DuckDB query execution works locally, why we want to enable remote DuckDB query execution, and how we are able to achieve it through DuckDB extensions. Finally, I will be sharing a little bit of my learnings and takeaways from developing with DuckDB and DuckDB extensions over the last year.
I am a founding software engineer at MotherDuck. MotherDuck is a company that we founded around a year ago, focusing on building a serverless DuckDB. Prior to coming to MotherDuck, I worked on BigQuery developer tools, mainly focusing on BigQuery client libraries for around three years. I worked on building sales and trading applications at Morgan Stanley for three years before that.
What Is DuckDB?
What is DuckDB? DuckDB is an in-process OLAP database. OLAP stands for online analytical processing. The main use case of DuckDB is focused on analytical use cases. The magic about DuckDB is that it can run inside Python or C++ and other languages. There are no external dependencies in this library. It also offers full SQL support. It can do very high-performance aggregations with its columnar vectorized execution engine. DuckDB was created at CWI, a Dutch research institution, by Hannes on the left, and Mark on the right. It is currently maintained by DuckDB Labs, the DuckDB Foundation, and the community surrounding DuckDB. DuckDB has gained a lot of interest and popularity in the past few years. If you follow the DB-Engines score, it has really skyrocketed over the last few years. This DB-Engines score is computed based on web content, Stack Overflow, and job searches. On top of that, the GitHub repo has now reached over 10,000 stars. To give you a basic idea, the Spark GitHub repository has around 35,000 stars, and the most popular repository in GCP has around 14,000 stars. This open source project is wildly popular amongst the community. Also, the DB-Engines monthly growth is around 40%, and PyPI downloads are around 400,000. There are about 100,000 monthly unique visitors to duckdb.org.
Common DuckDB Use Cases
You may wonder, what are people using DuckDB for, this in-process OLAP database engine? There are two general ways that folks are using DuckDB. I will break them down into the first category, which is human interactions, and the second category, which is machine interactions. On the human interactions front, most users will use DuckDB in Python or Node or R notebooks. These users are mainly analysts or data scientists. An example is querying a parquet dataset that lives in S3. For example, if someone has a dataset that is already in S3, you can download this dataset locally to your computer, connect to DuckDB via Python or R, and start querying and getting query results from it. You can then move on to doing your dashboarding activities. You can share your Python notebook work in progress or results with your colleagues very easily. This is a popular use case because it allows analysts and data scientists to iteratively work on the data in DuckDB, interwoven with other libraries like pandas. The support for pandas is very advanced in DuckDB. The other type of use case is machine interactions, using DuckDB as a persistent or ephemeral execution engine, and this use case mostly covers data engineers. For example, you can spin up DuckDB in an EC2 VM, or even on-premise. You could write some Lambda functions that include DuckDB, and use it as your database. This allows you to read data directly from S3, or from other sources that you download locally. As part of a CI build, for instance, DuckDB is able to deliver some output, typically in parquet format, or in other forms of query result sets.
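To make the first use case concrete, here is a minimal sketch of querying a Parquet dataset in S3 through DuckDB's embedded C++ API (the same flow works from Python or R). The bucket path and column names are placeholders, and credential setup is omitted; it assumes the httpfs extension can be installed.

```cpp
// Sketch: query Parquet files directly from S3 with in-process DuckDB.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);        // in-memory, in-process database
    duckdb::Connection con(db);

    con.Query("INSTALL httpfs;");      // remote file system support
    con.Query("LOAD httpfs;");
    // S3 credentials (s3_region, s3_access_key_id, ...) would be SET here.

    // Bucket, files, and columns below are hypothetical.
    auto result = con.Query(
        "SELECT payment_type, AVG(fare_amount) AS avg_fare "
        "FROM read_parquet('s3://my-bucket/taxi/*.parquet') "
        "GROUP BY payment_type;");
    result->Print();
    return 0;
}
```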
What's Under the Hood of DuckDB?
What is under the hood of DuckDB? How is it able to be in-process and so fast? Why are people so happy about it? Comparing some database engines, we're familiar already with the row-based engines. These are great for transactional workloads, because it's very easy to insert new data into these row-based database engines. They are less suited for analytics, though, because you need to read each entire row at a time even when you only need a few columns, and that really kills the performance. Some examples of row-based database engines include SQLite and PostgreSQL. In contrast to the row-based database engines, we have the columnar database engines. These are much better for analytics, because you can process only the specific needed columns, and because most data in a column is likely to be very similar. This allows you to potentially use efficient compression algorithms like dictionaries and run-length encoding for these repeated values. Good examples of columnar systems are pandas and NumPy. DuckDB is built on a vectorized query engine, and that's a little bit different from what came before. This is a type of engine that is optimized for analytics workloads, aggregations being one example, and it utilizes your local CPU very efficiently.
I want to talk a little bit more about cache locality. Cache locality is important because it needs to be considered to enable high performance in query engines. The closer to the CPU the data can fit, which is the L1 cache here, the faster the data can be accessed. L2 is slower, and L3 is slower yet. Then after that come main memory and disk. The cache sizes for the machine that I'm using right now, for example, are pretty large. Imagine a vectorized database is like a big book with lots of words, and the CPU is like a reader. SIMD is like having the ability to read several words at once, instead of reading one word at a time. This helps the reader, which is the CPU, read the book, which is the vectorized database, much faster, which also means much better performance. Regarding cache locality, imagine the book, which is the vectorized database, is divided into several smaller books, which are the caches in our example. The reader, which is the CPU, can quickly grab a smaller book, which is a cache, and find the words that they need. Cache locality means that the words used together are in the same smaller book. This makes it easier for the reader, which is the CPU, to find the words they need without searching through all the smaller books, which are all your caches, and that speeds up the reading process and improves performance. This is the basic reason why vectorized databases are very fast and super snappy. DuckDB, in particular, can process data that is larger than memory. It has a native storage format, which is columnar and partitioned. It stores the entire database, not just the tables within the database, and it offers ACID-compliant updates when you need to make changes to your database. When it comes to data that is larger than memory, it spills over to disk.
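To make the larger-than-memory behavior concrete, here is a small hedged sketch that caps DuckDB's working memory and points its spill files at a directory. It assumes the memory_limit and temp_directory settings available in recent DuckDB versions; the paths and sizes are placeholders.

```cpp
// Sketch: constrain memory so large intermediate state spills to disk.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);
    duckdb::Connection con(db);

    con.Query("SET memory_limit='1GB';");                  // cap working memory
    con.Query("SET temp_directory='/tmp/duckdb_spill';");  // placeholder spill location

    // Queries whose intermediate state exceeds the limit spill to the temp
    // directory instead of erroring out.
    auto result = con.Query("SELECT COUNT(DISTINCT range) FROM range(100000000);");
    result->Print();
    return 0;
}
```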
To use DuckDB is quite simple. Here's an example of the CLI. You can go to the duckdb.org website, and download a CLI that is built for your local machine's architecture. You can simply click on it and start interacting with the DuckDB executable. You can start by running a common SQL command like CREATE TABLE, and start working on your datasets from there. Super easy: no need to set up cloud data warehouse credentials, create an account, or anything like that. It's also very well integrated into Python and R, like I mentioned before. In Python, you can simply run a pip command to install DuckDB locally. You can import DuckDB like you would normally do with any other Python library, and start writing Python commands and running SQL using DuckDB.
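For comparison, the same embedded workflow looks like this from the C++ API, this time against a persistent database file. A minimal sketch; the file name, table, and values are placeholders.

```cpp
// Sketch: open a persistent DuckDB database file and run plain SQL against it.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db("my_analytics.duckdb");   // created on first use (placeholder name)
    duckdb::Connection con(db);

    con.Query("CREATE TABLE IF NOT EXISTS weather(city VARCHAR, temp_lo INTEGER, temp_hi INTEGER);");
    con.Query("INSERT INTO weather VALUES ('San Francisco', 46, 50);");

    auto result = con.Query("SELECT city, temp_hi - temp_lo AS spread FROM weather;");
    result->Print();
    return 0;
}
```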
DuckDB Extensions
Now maybe we can talk a little bit about working with DuckDB as a developer. As you've already seen, using DuckDB as a user, maybe as a data engineer, SQL analyst, or a data analyst, is super easy. I want to jump into the next topic about DuckDB extensions, and talk a little bit about how to develop on top of DuckDB. You will see that the process is also quite simple with these DuckDB extensions. First of all, you may wonder, what are extensions and why do we have extensions? Extensions are basically the parts of the code that are not built into the main source tree of DuckDB. This allows DuckDB to start from a lighter version of a database management system. The reality of a DBMS is that in most cases, people don't need all the possible features of a DBMS. For example, most DuckDB users will eventually scan a CSV file. They probably want that to be part of the core of DuckDB. Some of them probably do not ever need to scan a Postgres database, so Postgres support is built as a separate standalone extension, away from the DuckDB core. Instead of shipping a feature that's only meant for a small percentage of DuckDB users, they're able to ship a lighter version of DuckDB that contains only what is considered to be essential for the core users of DuckDB.
There are no external dependencies for DuckDB. However, you can have whatever dependencies you want for the extensions as a developer. For example, if you want to leverage certain libraries to parse things or do other work, you can feel free to bring those in as part of your extension. For DuckDB core, there are no external dependencies. Having extensions built away from DuckDB core also allows them to keep the binary size and compilation time in check. This is quite important. If you're developing on top of DuckDB, you want to make sure that it's super fast to compile the entire DuckDB, so you can run your tests and see how things are performing. It also facilitates binary distribution, because you have very fast compilation times and very small binary sizes. This allows you to distribute this DuckDB database very easily. It also allows for private, closed source additions on top of DuckDB. That's something we do at MotherDuck: we've developed very advanced DuckDB extensions, and those changes are closed source and private. However, it leaves DuckDB core alone to be open source and completely separate. This allows us to develop without having to fork the main DuckDB repository. These are some of the motivations for why you now have extensions and what they are.
Some examples of the DuckDB extensions that already exist. First, these are distributed through the simple INSTALL and LOAD commands, which means if you want to use a new DuckDB extension that's already distributed, you can simply run INSTALL and then LOAD. The extensions listed here are the very common, DuckDB-supported extensions. fts and httpfs, for example, are very common: httpfs allows you to read and write files over an HTTP(S) connection, and you can do full-text search with the fts extension. On top of that, you can read JSON files with the JSON extension, and parquet files with the parquet extension. Similar goes for the Postgres and sqlite_scanner extensions. They even have support for Substrait, which is a separate open source project focused on query plan standardization across the board. Of course, they have extensions for the TPC-DS and TPC-H benchmarks as well. Using any of these extensions, like I said, is very simple. There are only two commands. The INSTALL command downloads and moves the extension to the DuckDB extension folder, which is shown over here. It's super easy. The LOAD command is the one that actually loads this extension into a DuckDB binary. To run it, you can just run LOAD with the extension name, or provide a full path to the extension if it lives somewhere other than the default location. It may be different if you have built a custom extension in a separate location that you want to test. For testing purposes, you could then load the extension from another local location, or from an S3 bucket that you've decided to upload your extension to.
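As a sketch, the two commands look like this when issued through the embedded C++ API; the custom extension path at the end is hypothetical.

```cpp
// Sketch: install and load a distributed extension, then load a custom one by path.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);
    duckdb::Connection con(db);

    con.Query("INSTALL httpfs;");  // downloads into the DuckDB extension folder
    con.Query("LOAD httpfs;");     // links it into this DuckDB process

    // A custom or unreleased extension can be loaded from a full path instead
    // (hypothetical build output location; unsigned builds also need the
    // configuration shown in the next section).
    con.Query("LOAD '/home/me/my_extension/build/release/my_extension.duckdb_extension';");
    return 0;
}
```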
There are two types of DuckDB extensions in general: signed versus unsigned. Any developer can create a DuckDB extension, and these can go to an unofficial Amazon S3 bucket; these are considered unsigned extensions. That means the binaries have to be distributed by the producer of these extensions. You probably need to specify a different endpoint for installation, and the same for load. There are no checks on the source code of these extensions from the DuckDB team. However, they do offer some boilerplate CI that does distribution of the extension for you. These are the unsigned DuckDB extensions. The signed DuckDB extensions, on the other hand, are used for several core DuckDB functionalities. These are maintained and primarily developed by the DuckDB team. Like I mentioned earlier, they add support for reading from remote file systems like S3 over HTTP, and support for scanning other data sources such as parquet, SQLite, and Postgres. They also add new scalar, aggregate, window, and table functions, through extensions such as JSON, fts, and ICU. Through the ICU extension, for instance, you get time zone support and collation support. These are some of the examples of DuckDB extensions that are supported. The last one is the most recent addition, which is a geospatial extension that adds support for spatial data.
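For the unsigned case, here is a hedged sketch of how a client might opt in to unsigned extensions and point installation at the producer's own endpoint. It assumes the allow_unsigned_extensions option and the custom_extension_repository setting present in recent DuckDB versions; the repository URL and extension name are placeholders.

```cpp
// Sketch: opt in to unsigned extensions and install from a non-default endpoint.
#include "duckdb.hpp"

int main() {
    duckdb::DBConfig config;
    config.options.allow_unsigned_extensions = true;   // required for unsigned binaries

    duckdb::DuckDB db(nullptr, &config);
    duckdb::Connection con(db);

    // Point installation at the producer's own bucket (placeholder URL).
    con.Query("SET custom_extension_repository='http://my-extensions.s3.amazonaws.com';");
    con.Query("INSTALL my_extension;");
    con.Query("LOAD my_extension;");
    return 0;
}
```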
We just covered how to use DuckDB extensions as a user. How do you actually build a DuckDB extension? It's actually super easy as well. Similar to how you would use DuckDB extensions, which is through the INSTALL and LOAD commands, there are also two key things to building a new DuckDB extension. DuckDB exposes two critical functions to allow you to start building an extension. The first one is version, and the second one is init. As a developer, you would then just add your logic in init, and make sure that your version is compatible with whatever DuckDB binary it is that you're building with. The version function verifies that compilation and linking were done against the same DuckDB version. This is critical because DuckDB is used as a library, so when you're developing on top of it by building an extension, you want to make sure that everything can link correctly and is using the exact same version to avoid any conflicts or compilation errors. Init is the method that actually loads your extension, and allows all your logic to be processed after that. All the extensions are written in C++. Most likely, your extension will be written in C++ as well. The beauty of these extensions is that you get access to all of DuckDB's internal code, and you really start to appreciate how DuckDB is built like a library that is plug and play, and is very powerful.
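A minimal skeleton, following the pattern of DuckDB's extension template, might look like the sketch below. The extension name "quack" is hypothetical, and the body of init is just a placeholder for real registration logic; exact macros and signatures should be checked against the DuckDB version you build with.

```cpp
// Sketch of a minimal DuckDB extension exposing the two required entry points.
#include "duckdb.hpp"

extern "C" {

// Reports the DuckDB version this extension was compiled against, so the loader
// can verify compilation and linking used the same DuckDB version.
DUCKDB_EXTENSION_API const char *quack_version() {
    return duckdb::DuckDB::LibraryVersion();
}

// Called by LOAD: this is where you register functions, parser/optimizer hooks,
// storage extensions, and any other custom logic.
DUCKDB_EXTENSION_API void quack_init(duckdb::DatabaseInstance &db) {
    duckdb::Connection con(db);
    con.Query("CREATE SCHEMA IF NOT EXISTS quack;");  // placeholder for real registration
}

} // extern "C"
```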
Remote Query Execution, and Local Query Execution
Now let's jump into an example of how we're able to enable remote query execution through DuckDB extensions. We will first go over some basic query execution ideas and architecture. Then we'll talk about why we need to run queries remotely, now that DuckDB is able to run queries very efficiently locally. Then we'll take a look at what the architecture looks like for remote query execution, and jump into a particular example for a deep dive, before we share some takeaways and learnings. Before we zoom in on the remote query execution part of things, we can talk about the typical client-server architecture for query execution. Typically, you would have a client-server architecture for running a query. You would have a client that is your application. Then you would have your database that lives maybe on your server side, which hosts your data. In between, you have the network that transports your API call and whatever you serialize into binary data to the server. The database itself hosts the data, and also the query engine. The application logic is split between the application and the query engine. An issue with this particular model is that the network can be slow and untrusted. Sometimes you will run into various issues by executing every single query over the wire. An in-process database, on the other hand, does not have those types of issues with the network, because everything lives in the application's process. SQLite is an example of an in-process database, where the database along with the data lives right inside some application's process. They all live in the same process. You don't have trust boundaries to traverse, and that allows you to execute queries super quickly. You have high bandwidth between all your components. Of course, the limitation is that the application and everything else is done in-process, and you have to worry a lot about how to utilize the threads and the resources within your application. DuckDB is very similar in architecture to SQLite. The difference is that SQLite is catered towards OLTP workloads, which are transactional, while DuckDB is suited and built for analytical workloads.
How does local query execution work in DuckDB? Maybe it helps to understand a little bit of the under-the-hood mechanism of how DuckDB executes a query. DuckDB, in general, uses a very typical pipeline for query processing. First, a SQL query is input into DuckDB as a string. Then, the lexer and parser take this input string and convert it into a set of statements, parsed expressions, and table references. In the case of DuckDB, they use the Postgres parser, and they have a custom transformer which transforms Postgres parse nodes into classes DuckDB understands. Then, these statements go into the planner, which uses them to create a logical query plan, which is also a query tree. These logical query trees contain the logical operations of a query execution. They basically describe what to do, and not necessarily how to do it. An example is, you would do a join, but at the logical planning step, you will not yet specify whether you're doing a hash join or a merge join. Then, this unoptimized query plan goes into the query optimizer, which rewrites the logical query tree into another equivalent, more optimized query tree, also called an optimized logical plan. How does the optimizer do that? It uses a mix of rule-based and cost-based optimizers. Finally, this optimized query plan goes into the physical planner, which converts it into a physical plan for query execution, and from there on, you would build pipelines and execution would follow.
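You can see these stages for yourself by asking DuckDB to print its plans. A hedged sketch follows; it assumes the explain_output setting (with its 'all' mode) available in recent DuckDB versions.

```cpp
// Sketch: inspect the logical and physical plans DuckDB builds for a query.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);
    duckdb::Connection con(db);

    con.Query("CREATE TABLE t AS SELECT range AS i FROM range(1000);");
    con.Query("SET explain_output='all';");  // show unoptimized, optimized, and physical plans

    auto plans = con.Query("EXPLAIN SELECT i % 10 AS bucket, COUNT(*) FROM t GROUP BY bucket;");
    plans->Print();
    return 0;
}
```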
Using DuckDB, which is in-process, we're able to create a hybrid database architecture where you can have an application that holds a local cache, and then you can send requests over the network occasionally, if that's necessary, to the server, which also has an instance of DuckDB that can run queries there, hold data, or decide to send query results back to the client. In other words, the data and query engine live on both sides of the network boundary. You don't have to pick and choose. You don't have to only have the database live on the application side, or only on the server side; you can have the database live on both the application side and the server side. This client-side query engine, DuckDB in this case, can enable better integration with the application itself because it knows the application better. It also allows you to better utilize the cache locally. The server-side DuckDB can run queries, just like the client side would. This opens up opportunities for query plan level optimization. To zoom in a little bit on how we can utilize DuckDB extensions in the architecture I just mentioned: you can have a client, and this could be your Jupyter Notebook, it could be your CLI, it could even be Wasm, something else. This is where you can have your DuckDB instance running. Another example is the web UI; you can have a DuckDB client extension running there as well. They can talk to the DuckDB instance on the server that lives in the cloud. We can allow data persistence through another storage system, and then you can coordinate and orchestrate all the operations through a control plane. All of these clients can be doing similar stuff, sending requests to the cloud. You can cache things locally in one of these clients, and only make those over-the-wire RPC calls when necessary.
How can we enable remote query execution? DuckDB offers many different extension entry points. There is a parser extension, for example, that allows you to intercept the parsing stage. There are other extension points as well, so you can intercept anywhere you find appropriate. The parser extension allows you to intercept and parse grammar that is not parsable by DuckDB. This is nice, because if it's something that DuckDB can already parse, you typically do not need to parse those things again. If you want to add additional grammar support for your particular use case, then you can simply intercept at the parsing stage and introduce your own parser. An example could be if you want to enable a multi-user DuckDB scenario; then you may want to introduce a CREATE USER grammar. For that to work, you'd have to parse the grammar yourself and transform it into something that DuckDB can understand, so we can go back to the query execution workflow. To do that, it's very simple: you can just introduce your parser extension. Like I mentioned before, you just need to implement the version method and the init method, and your other logic goes into the init step for loading your actual extension. After that, there is an optimizer extension that DuckDB offers where you can intercept the query plan and introduce your own optimization logic. For example, you could tell DuckDB that if it sees a type of logical operator that requires data remotely, it should not execute that query locally, but instead execute it remotely using the server-side DuckDB instance.
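As a rough sketch of that optimizer entry point: an extension can register a callback that runs after DuckDB's own optimizer and gets a chance to inspect or rewrite the logical plan. The callback signature has changed across DuckDB releases, so this assumes the OptimizerExtension interface from roughly the DuckDB 0.8/0.9 era; the remote-data detection itself is left as a placeholder, and a parser extension would be registered analogously via config.parser_extensions.

```cpp
// Sketch: register an optimizer extension that can rewrite the logical plan.
// Assumes the OptimizerExtension interface from around DuckDB 0.8/0.9; the exact
// callback signature differs in newer releases.
#include "duckdb.hpp"
#include "duckdb/optimizer/optimizer_extension.hpp"

using namespace duckdb;

// Runs after DuckDB's built-in optimizer, before physical planning.
static void RemoteOptimize(ClientContext &context, OptimizerExtensionInfo *info,
                           duckdb::unique_ptr<LogicalOperator> &plan) {
    // Placeholder: walk `plan`, and when an operator needs remote data, replace it
    // with a custom operator that executes on the server-side DuckDB instead.
    (void)context;
    (void)info;
    (void)plan;
}

// Called from the extension's init function.
void RegisterRemoteOptimizer(DatabaseInstance &db) {
    OptimizerExtension remote_ext;
    remote_ext.optimize_function = RemoteOptimize;
    DBConfig::GetConfig(db).optimizer_extensions.push_back(remote_ext);
}
```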
To enable this custom optimization logic, it is quite likely that you'll be introducing your own operators as well, and that is achievable through the DuckDB operator extension. This allows you to create customized execution code, and you can also patch any serialization gaps, because, like I said, DuckDB's architecture is in-process and fully local. It does not normally need to send requests over the wire, which means it does not have to serialize its data. For you to actually be able to run a query on the server side, you will need to perform this serialization step. You may then have to introduce your custom logic to serialize something into a binary data format, so you can send it over the wire. You can use this to create local or remote logical operators, for example. What happens after that is that you can use the local and remote physical operators, which are produced after the logical planning step, to build pipelines for actual query execution. In our case, we couple pretty tightly with the DuckDB pipeline construction. What we need is to really intercept the DuckDB scheduling process and achieve a low-level execution workflow. This allows us to, for example, parallelize operators as we wish, and also handle data flow, basically sending data across the wire as part of the DuckDB query execution flow, which they didn't have before. You can also fetch data, get data, as part of the physical planning step. You can make direct RPC calls over the wire to fetch data chunks or perform any create, update, or delete operations using your custom physical operator. Finally, there's also a DuckDB storage extension entry point where you can virtualize your catalog, and that can simplify the bind stage quite significantly, because you can confine everything to your virtual catalog when you create new database objects.
The motivation, really, what this allows us to achieve is, first, faster execution speed. Because the vast majority of data lives in the cloud, it makes sense to run computation closer to where the data resides. This can significantly reduce the amount of data that needs to be transferred across the network, and it shields us from issues like unstable networks and slow I/O. On top of that, part of the remote query results can be cached locally. If you're smart about it, you can reuse these query results locally, to leverage the local computation resources as opposed to having to do everything in the cloud. An example of using this is a CTE. You can also CREATE TABLE, and then reuse the table, or maybe make that into a materialized view, so you do not have to manually invalidate the table when the underlying data gets updated. There is some cost reduction possible as well with enabling remote query execution through DuckDB. Because we're already able to leverage the local compute resources, we don't have to run everything in the cloud, and that can potentially save some money in managing cloud resources. There could be some benefits regarding compliance, where sometimes users have data locally that they may not want to leave their laptops. DuckDB enabling this type of hybrid execution allows them to just keep data where it is. The same idea applies in the cloud scenario: if you have your data in the cloud, you can keep it just there. You don't have to move it around in order to execute a query.
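Picking up the caching idea from above, here is a small sketch of materializing part of a (conceptually remote) result in a local table and reusing it. A locally attached database stands in for the remote side here, and all names are placeholders.

```cpp
// Sketch: cache an aggregated slice of remote data locally and reuse it.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);
    duckdb::Connection con(db);

    con.Query("ATTACH 'remote_standin.duckdb' AS cloud;");  // stand-in for server-side data
    con.Query("CREATE TABLE IF NOT EXISTS cloud.orders(order_date DATE, amount DOUBLE);");

    // Materialize the expensive part once into a local table...
    con.Query("CREATE TABLE daily_totals AS "
              "SELECT order_date, SUM(amount) AS total FROM cloud.orders GROUP BY order_date;");

    // ...then later queries reuse the local copy with no network round trip.
    auto result = con.Query("SELECT * FROM daily_totals WHERE total > 1000;");
    result->Print();
    return 0;
}
```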
Example
Maybe it helps to do another deep dive on an example of running a remote query using this DuckDB query engine. The example offered here is detaching a database. DuckDB recently introduced support for multi-database interactions, where you can attach to and use multiple databases at once using a single DuckDB connection. From our end, we wanted to make sure that we can also detach a database which was previously attached, both locally in the DuckDB instance and remotely in our DuckDB instance that is in the cloud, because you want to make sure that you do this in both places. What happens is, first, we want to execute this SQL command locally. That basically comes for free, because DuckDB already supports this syntax. All we have to do is check in our extension whether this database is attached in the local catalog, and if so, we just detach this database locally. If it's not, then we make an RPC call over the wire to detach this database remotely, when it is found in the remote catalog.
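For reference, the user-facing side of this feature is plain SQL. A minimal sketch with placeholder file and table names:

```cpp
// Sketch: DuckDB's multi-database support, attaching and detaching a database.
#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);
    duckdb::Connection con(db);

    con.Query("ATTACH 'analytics.duckdb' AS analytics;");   // attach a second database
    con.Query("CREATE TABLE IF NOT EXISTS analytics.events(id INTEGER, name VARCHAR);");
    con.Query("INSERT INTO analytics.events VALUES (1, 'page_view');");

    // The extension described above would additionally detach it remotely when needed.
    con.Query("DETACH analytics;");
    return 0;
}
```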
To make this work, there are two ways. The first is to hook into the DuckDB storage extension. The second is to go through the DuckDB optimizer extension. To go through the DuckDB storage extension, we would essentially have to modify the DuckDB code, which is not very difficult, because DuckDB is very friendly to community contributions. If anything makes sense, they usually accept your contribution. In this particular case, we figured it might take more cycles for that to happen, so we went with implementing it using the DuckDB optimizer extension, which essentially allows you to achieve the same effect. What you do is intercept the query plan when you see a logical detach operator. Then you can substitute this logical detach operator with your custom local, remote, or, say, hybrid detach operator by extending the logical extension operator. This logical extension operator is something that we actually introduced to allow this level of flexibility, where you can introduce any type of logical operator that makes sense for your execution workflow. In this logical operator, you can then make a CreatePlan call to produce a custom physical local, remote, or hybrid detach operator. As part of this physical operator, you can then perform the database detach logic by making that remote RPC call. If it makes sense, you can fetch data as part of get data, or you can receive back a Boolean indicating whether the detach happened successfully or not. Or maybe not: maybe the server throws some type of exception and you just handle it on the client side. As you can see, it sounds complicated, but, really, because there are so many entry points into DuckDB, it is very flexible for you to intercept the logic, do your thing, and then fall back to the normal DuckDB execution flow.
Learnings and Takeaways
Finally, some learnings and takeaways. Hopefully, you've gotten the idea that DuckDB extensions are very powerful. You can do really crazy and advanced things with these extensions. There are lots of entry points where you can enter into the DuckDB code base, and really use it as a library to compose whatever you need, achieve whatever workflow you would like, and then fall back to DuckDB execution and get the results that you want. It's very easy to get started. What I did to get started was to check out some popular extensions. An example is the sqlite_scanner extension. This is an out-of-tree extension, which means the code lives outside the core DuckDB source tree. You can also check out the httpfs extension, which is an in-tree extension. I think at this point, most likely, you will be developing an out-of-tree extension, so the SQLite or Postgres scanners are probably more relevant, but truly, both in-tree and out-of-tree extensions are very similar. They both have to implement the init and version methods. Then you would have your loading of the extension, and then you can do anything you like within your extension, and just call DuckDB code wherever appropriate. There's also an extension template that DuckDB Labs recently worked on, which helps you get started quickly with building an extension. You can also check out the automatic multi-platform build and distribution CI that DuckDB offers, so you can get your binary distributed very easily.
A note: DuckDB is evolving very quickly, and has a very vibrant developer community. If you're developing your extension and you need something changed in DuckDB, you can contribute that change upstream; we've done this many times over. For example, we needed to add more serialization logic for making RPC calls, and we've done that with them. Or, if you want to make a method virtual or public so that you can get access to it, if it makes sense, if it's a common utility method, you can just go into the code base and make that change as appropriate. Really, anything else too. Adding this extension operator, for example, is something else that we did that made sense for the open source community, as well as for what we were doing at the time. Feel free to make those contributions to the core DuckDB code base. With that said, DuckDB is evolving very quickly, which means your extension may no longer work if you don't keep up with the DuckDB versions as they keep upgrading. A way to counteract that could be to build out CI that tests against DuckDB versions, and make sure that your extension stays up to date with the code. Recently, for example, they've introduced their own unique pointer type, which replaces the standard C++ library unique_ptr, and you have to update your code accordingly. It also helps to keep an eye on the DuckDB release notes; they will usually point out these critical changes. That helps you keep your extensions up to date. Finally, really, the sky is the limit. There's so much you can achieve with these extensions. It's super fun to play around with them, and see how you can leverage the entire DuckDB code base to achieve anything, really, that you would like to achieve with your extension. Let your creativity soar.