Transcript
Ilyushchenko: In reality, I'm not going to talk about QuestDB specifically. I just wanted to talk about the space of time series databases, what's going on in the space, the history of it and stuff like that. Maybe I'll show you QuestDB as well, and how it fits in the whole scenario. What are we going to learn? I wanted to go through the evolution of OLAP and time series databases, in terms of roughly the last 20 years, what's been going on. Maybe we're going to find out what data model the databases seem to converge on. I'll mention QuestDB a little bit in the end, how it basically fits in this ecosystem.
My name is Vlad Ilyushchenko. I'm from London. I've been here 20 years. I am co-founder and CTO of QuestDB. My responsibilities are that we build things that people want. We went to Y Combinator a few years ago, and that's their mantra. We try not to go on a tangent and make sure that we build things that you guys want. It's not always easy to find out what you want, but that's what we're trying to do. The other thing is, building databases, as it turns out, is very hard, and we need to do it as a team. That's what I do at QuestDB, make sure everybody is enjoying the process, because it's quite hard work.
Introduction to Time Series Database
This is maybe a bit of an unorthodox introduction to time series databases. From what we see, organizations use time series databases to help acquire and extract information from nascent data. Nascent data is data that has just come into existence. Some people call it real-time data, or new data, but nascent is a good description of it. Those are pictorial examples of what nascent data is. Ultimately, we see the role of the database as delivering data from the source to the user. That's basically what a database does. The sources of this data, in the top left corner, could be financial data, like trades, prices, events that are going on in the market. In the top right corner is industrial IoT. These are more hardware-oriented applications.
People in this space, what they're trying to do is monitor the process and implement predictive maintenance, or see what's about to break, or has broken, or may break in the future. Quite a different thing about this particular approach is that usually the data is used in custom software. It's pulled out of the database, and they run custom algorithms to basically infer something about the production process. Data can be used in transportation or mobility. We see these use cases arise, for example, when people develop rocket engines. These engines are dyno tested, they're tested from time to time. During the test a lot of data is collected from a test run, and then these tests are compared.
From run to run, you compare the tests. Maybe you detect anomalies in a test. The characteristic of this data is very high volume. There's a lot of data. Literally, some airplane engines produce on the order of 140 million data points a second, that kind of stuff. It's very tricky to collect this data. The data we are probably all most familiar with is logs, application metrics, and whatnot. This is also time series data, or new data that originates from these places.
OLAP and TSDB Design Journey
Time series data is part of the big data scene: if you accumulate a lot of time series data, it turns into big volume, big data. There are other databases in this space, and those are OLAP databases. People typically use OLAP databases to extract information from existing data. There's a subtle difference between what time series databases do and what OLAP databases do, but I'll get to that later. Now, the design journey of both time series and OLAP databases. Very early databases had a quite monolithic design to them. What that means is, the database looks after its own data. It defines ingress and egress of the database, and just manages the whole scenario. You don't see how the data looks in the storage layer of the database, and you don't have good insight into how the database does things.
Then from monoliths, databases started to move to the shared disk DBMS engine. This is slightly different. I'll talk about all of these things in a little bit more detail, but I'll leave it there; I'll talk about shared disk in a moment. Then the next step was Lakehouse DBMS engines. Let's dive into it. This is what a monolith database is. A user treats the query engine as a black box, so the database owns the disk and the database also owns the catalog of your data. In this scenario, if you were to say, I want to store a lot of data across my databases, you have to split the data into shards so you can host data on different servers. Doing that limits your CPU capability, because your CPU capability is linked to the number of shards. If you need more CPU, you have to reshard the data. This is a painful process to do, depending on how much data your database has.
This limits elasticity, because you cannot add compute resources on the fly, since they don't have access to your data. Another potentially limiting thing with databases of this type is that they handle all of the ingress and egress points to the database. If you want to load data into the database, you have to go through it. If you need to query data from the database, you also have to go through it. Basically, all the protocols and code the database has could be a limiting factor, depending on what you want to do. Obviously, data in such a database is vendor locked, so you can get your data out from one database to another, but it's a painful process, more often than not.
In the 2010s, time series databases, TSDBs, were all monoliths. They came into existence by replacing OLTP. OLTP databases, Postgres, Oracle, and stuff like that, were perhaps not quick enough on the ingress side, and people wanted to get more data into a database faster, so people invented TSDBs. At the same time, in the same decade, the OLAP databases had already moved from monolithic to shared disk engines. They moved from what the data warehouse was to another type of architecture, the shared disk engine architecture. The shared disk engine, what is it? The difference from the previous slide is that the data is no longer on a disk of the actual physical computer. This data is in an object store. You guys are probably familiar with object stores, such as S3, every cloud provider has one, or HDFS and stuff like that.
The advantage of the shared disk architecture is that it unlocks the compute capacity of the database. If you want to add another server to help you with aggregating data and stuff like that, you can easily do that, because they all share the object store. There's a difficulty: the things that databases do have changed as well. The object store is not your local disk, so it doesn't have random access. It's restrictive to get to and from, so you cannot update it in place if you wanted to. The databases need to build layers on top of the object store to manage access to the object store itself, to avoid hitting throttling limits or to reduce cost. The databases perhaps have to have a local disk to back the object store, a local disk cache.
At the same time, elastic compute is an advantage. However, with the shared disk engine, the egress and ingress are still handled by the database itself. All of the endpoints to get data in and out are still in the database. Data on S3 is not necessarily stored in an open format. It's going to be in proprietary formats. Every vendor does it differently. Some vendors also make it difficult to get data out of the shared store. Implementations of the shared disk engine still force you to go through the database. OLAP databases in the 2010s were already on the shared disk model, but TSDBs were still behind, still monoliths.
The next engine after shared disk is what people now call the Lakehouse engine. The fundamental difference between the Lakehouse and the shared disk infrastructure is that the object store now holds the catalog itself, the data catalog. Data in the object store is stored in open formats. OLAP databases, left, right, and center, are able to read Parquet files from S3. That's pretty much what it is. The database architecture changes yet again, because the catalog is owned by something else, something external. The Parquet files are different from internal formats. They're slightly less optimized. The upside is that data can be put directly into shared storage, so you can upload a Parquet file to S3 and have the database read it, should you need it to. The other upside is that if you want to download data from the system, you can just pull Parquet files back, so you don't need to go through the database.
The role of the database becomes a little bit different in this environment: the database becomes more of a helper to the user. For example, if you were to upload 100 files to object storage, you might want the database to help you find information in these files, rather than download the files and find this information yourself. Should your data access granularity be the same as your files, then you don't need to go through a database at all.
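As a rough sketch of that bypass (not from the talk; the bucket, path, and column names below are made up), a client can read Parquet files straight from object storage with a generic library such as pyarrow, with no database in the path:

    # Sketch: read Parquet straight from S3 with pyarrow, bypassing any
    # database. The bucket, prefix, and columns are hypothetical.
    import pyarrow.dataset as ds

    # Point a dataset at the object-store prefix holding the Parquet files.
    dataset = ds.dataset("s3://my-bucket/trades/2024/", format="parquet")

    # Load only two columns and push a simple filter down to the reader.
    table = dataset.to_table(
        columns=["timestamp", "price"],
        filter=ds.field("price") > 100.0,
    )
    print(table.num_rows)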
In the 2020s, ultimately, a lot of OLAP databases, such as Snowflake, moved to the Lakehouse engine architecture. Time series databases are really far away from this. Also, in the past 20 years, it seems as if OLAP databases have doubled down on query performance, whereas time series databases implemented ingress performance better than OLAP databases. OLAP databases at the same time have much more advanced query engines. These query engines are effectively able to distribute SQL execution for workloads that need it. Time series databases don't. There is no time series database with a distributed query engine. At the same time, time series databases are a bit simpler to operate, because when you have a distributed query engine, that system is either cloud bound, so it's in the cloud, or it's very difficult to set up, operate, and monitor, and stuff like that.
QuestDB - Introduction
There's a story about QuestDB. The company was founded just on the cusp of the 2020s, in 2019. Broadly speaking, we are a column store with a SQL engine that can do JIT compilation and some other ancillaries. It makes sense for a time series database to look towards Lakehouse engines. There are multiple advantages to it. Storing data on S3 is incredibly cheap, so you can store a lot of data. Storage is tiered. It costs very little. Open formats such as Parquet are highly compressing formats. The advantage for database users is that, in some cases, both egress and ingress can bypass the database entirely. You can have a scenario where you upload Parquet files and have the database query them. You don't need to go through the database to do it. You don't need to convert formats. You don't need to have special drivers or anything like that. You put Parquet files in, and the database reads them.
The database's role becomes a little bit of a helper, where the database is able to search your files should you need it to, or just stand aside if you're happy to get files in and out yourself. One thing that Lakehouse engines also tend to do is focus on unstructured and semi-structured data as well, which time series databases don't typically do. This is JSON data, array data, and stuff like that. Our goals change because we've been in this business five years, and we need to look around to see what our goals should be. It makes sense for us to try and address the gap between time series databases and OLAP databases, and that is by having high performance ingress into what is essentially a column store. A column store is a data structure that is built for queries. It's not built for writing data to, it's built for queries.
Having high performance ingest into such a data structure is a fairly unique thing to have. We want to build a distributed SQL engine so we can run business intelligence workloads across multiple servers should we need to. We want to have high performance egress. I'll talk about that a little bit more, but it becomes quite important in the age of AI in particular, when data is extracted from a database into Python and processed in bulk somewhere else, outside of the database. We also wanted to make things quite simple for the user, so eliminate needless complexity. The database itself we built from scratch, so it has no dependencies. It's built from the ground up to be a high-performance database. I want to keep it this way, even though we're going into the Lakehouse era and distributed queries and multi-nodes and stuff like that.
QuestDB - Ingress APIs
How do we do that? The ingress APIs. What's quite interesting about ingress APIs is that most new data is row first. Data that originates from sensors contains multiple columns on one line; it's row first, the opposite of the column first workloads that OLAP databases favor. We want to provide maximum ingress performance on this type of workload. The ingress should also balance hot and cold partitions. By that I mean automatic partitioning, and I'll get into that. It provides schema elasticity. What schema elasticity is, is that you can, for example, add a column on the fly to the database. You don't necessarily need to have a separate out-of-hours script that adds a column before you can change the publishing code that publishes values for this column.
The database effectively guarantees consistency if you've got 5 or 10 publishers trying to add the same column to the same table. They're all going to succeed, and one column is going to be there in the end. The ingress should provide data concurrency and high availability. What that means in a high availability scenario is that we want our ingress to be highly available in the sense that you can send data to more than one node. If you fail to send data to one, you can send data to the other. There are some scenarios here. For example, you send data to one box and this box doesn't reply, so you basically have a timeout. There is no certainty whether the data that you sent is actually in the database or not. Instead of your application failing and waking somebody up in the middle of the night, you can just send data to another node.
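A minimal sketch of what that schema elasticity looks like on the wire (assuming an InfluxDB line protocol listener on TCP port 9009, which is QuestDB's usual default; the table and field names are invented): a publisher can simply start sending a field that did not exist before, and the column is expected to be created on the fly.

    # Sketch: send InfluxDB line protocol over a raw TCP socket.
    # Port 9009 and the 'sensors' table are assumptions for illustration.
    import socket
    import time

    ts = time.time_ns()  # ILP timestamps here are in nanoseconds
    lines = (
        # Existing schema: a 'location' tag and a 'temperature' field.
        f"sensors,location=line1 temperature=21.5 {ts}\n"
        # A new 'humidity' field appears mid-stream; the database should
        # add the column on the fly rather than reject the row.
        f"sensors,location=line1 temperature=21.7,humidity=0.43 {ts}\n"
    )

    with socket.create_connection(("localhost", 9009)) as sock:
        sock.sendall(lines.encode("utf-8"))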
The guarantee the database provides is that this data will be inserted only once, even though you didn't wait for the previous batch on one machine and you sent the same batch to the other machine. Data will be there just once, pretty much, and you don't have to worry about fixing the process, should this happen. How does ingress work? It's a simplistic diagram. What I wanted to emphasize here is that we use a write ahead log. The database writes data to the write ahead log first. The write ahead log, for all intents and purposes, is an append-only structure.
There's only basic validation performed when data is ingested. This write ahead log is basically fronted by different network protocols. You can put data into the write ahead log from CSV, from the Influx line protocol, we could send Arrow in, basically different data formats. Some data is row first. Some data is column first. Ultimately, all of this data goes into the WAL, and the WAL is asynchronously applied to the column store.
Ingress/Egress Paradox
The reason I mentioned row first and column first is because a lot of databases experience this ingress/egress paradox, where they're trying to insert row first data into a column first datastore. This might not be a problem when you just do basic benchmarks, but it manifests itself as a problem in scenarios where your tables become quite wide, if the table structure has got 500 columns, or maybe you've got a lot of tables with a small number of columns. Ultimately, what this converges to is that you do a lot of random writes to disk. These random writes start to bog things down quite rapidly. To avoid this, what we do is have a mixed WAL, write ahead log, format: one is for row first data, one is for column first data. The row first workloads are written row by row, and column first workloads are written column by column. We don't entirely remove random writes, but we mitigate them quite well.
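As a toy illustration of the difference (my own sketch, not QuestDB's actual on-disk format): a row first segment appends whole rows back-to-back, while a column first segment appends each column's values contiguously, so a wide table still turns into a handful of sequential writes rather than many small random ones.

    # Toy sketch of row first vs column first WAL segments; this only
    # illustrates the idea, not QuestDB's real file layout.
    import struct

    rows = [(1_700_000_000_000, 101.5, 7), (1_700_000_000_100, 101.6, 3)]

    # Row first: each row is packed and appended as one contiguous record.
    row_segment = bytearray()
    for ts, price, qty in rows:
        row_segment += struct.pack("<qdq", ts, price, qty)

    # Column first: each column's values are appended back-to-back.
    col_segment = {
        "ts":    b"".join(struct.pack("<q", r[0]) for r in rows),
        "price": b"".join(struct.pack("<d", r[1]) for r in rows),
        "qty":   b"".join(struct.pack("<q", r[2]) for r in rows),
    }

    print(len(row_segment), {name: len(buf) for name, buf in col_segment.items()})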
QuestDB - Balancing Hot and Cold
Balancing hot and cold data. What typically happens during heavy ingress workloads is that disk space is consumed quite rapidly, basically. That's what happens if you send a lot of data to the database. The job of a time series database, and what QuestDB does, is to automatically partition your data on timestamp as you put data in, offload these partitions to the object store as you create them, and recompress them as Parquet.
The files become available for reads, but ultimately the database is able to sustain a high volume of load without running out of disk, costing too much, or having to firefight the disk space as data is ingested. The new data comes in a write ahead log format. These write ahead logs can be replicated to S3 as well. That enables another compute node to have visibility of the old data that is already in Parquet files on S3, plus be able to reconstruct the hot partition that is local to the machine, so you can have multiple compute nodes looking at the same recent data.
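A rough sketch of that idea (my own illustration with pyarrow; the file names and columns are made up): bucket incoming rows by day on the designated timestamp, then write each partition out as a Parquet file that other readers can pick up directly.

    # Sketch: partition rows by day on their timestamp and write each
    # partition as Parquet. Names are hypothetical; this only shows the
    # partition-then-offload idea, not QuestDB's implementation.
    from collections import defaultdict
    from datetime import datetime, timezone
    import pyarrow as pa
    import pyarrow.parquet as pq

    rows = [
        {"ts": datetime(2024, 6, 1, 9, 30, tzinfo=timezone.utc), "price": 101.5},
        {"ts": datetime(2024, 6, 1, 9, 31, tzinfo=timezone.utc), "price": 101.7},
        {"ts": datetime(2024, 6, 2, 9, 30, tzinfo=timezone.utc), "price": 102.1},
    ]

    partitions = defaultdict(list)
    for row in rows:
        partitions[row["ts"].date()].append(row)   # one bucket per day

    for day, day_rows in partitions.items():
        table = pa.Table.from_pylist(day_rows)              # columnar form
        pq.write_table(table, f"trades-{day}.parquet")      # cold partition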
QuestDB - Data Concurrency
Data concurrency. You guys are familiar with MVCC? It actually wasn't apparent to me for a number of years that databases do not necessarily overwrite data in place; they are copy-on-write systems. Every time you modify something, the database copies data. Then the database tries to manage the size of the block that it copies. Incidentally, this is where the term write amplification comes from: you write a little bit, but because the database needs to copy-on-write bigger blocks, you can have far more physical disk writes than the amount of data you've actually written. The same goes the other way for reads.
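As a back-of-the-envelope example of that write amplification (the numbers are invented purely for illustration): appending a 1 KB row that forces a 1 MB copy-on-write block to be rewritten costs roughly a thousand times the logical write.

    # Illustrative arithmetic only: write amplification when a small
    # logical write forces a larger copy-on-write block to be rewritten.
    logical_write = 1 * 1024         # 1 KB row appended
    cow_block = 1 * 1024 * 1024      # 1 MB block copied and rewritten

    amplification = cow_block / logical_write
    print(f"write amplification ~ {amplification:.0f}x")   # ~ 1024x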
Like any other database, QuestDB can read data that's being written, quite transactionally. The column store itself is also ordered by time. The database will be sorting data as it writes it. It also keeps it consistent. It provides concurrent updates to the Parquet files themselves, so for all partitions that have been deposited to the object store, if you send data to them, the system will update them as well. It also provides concurrent metadata changes. Data republishing after network glitches is what I mentioned before: you can republish the same block without duplicating data in the database.
QuestDB - Metadata Concurrency
How does metadata consistency work? Data is written to write ahead logs, and when a transaction is about to be committed, we have an entity called the sequencer. The sequencer is very similar to what other databases have. They store it in ZooKeeper, and stuff like that. The sequencer has the metadata of the table, so it has transaction numbers. It has the number of columns. It has a list of partitions, and where those partitions are. The idea here is that if a column is added concurrently by two actors to different WAL files, there's version control.
Basically, before the WAL is concluded, we check the version of the write ahead log versus what the sequencer has. If the write ahead log version is lower than the sequencer's, that means there was a concurrent update to the metadata, and the WAL has to reconcile the entire transaction. The advantage of doing things in this order is that if reconciliation were to fail, it can be reported back to the user as an error, or the write ahead log can successfully reconcile the data. For example, if a column is added which is irrelevant to this transaction, it will populate the column with nulls and insert it into the database. This mechanism effectively provides the ability to add columns pretty much on the fly, without locking the database or taking the database into an off-peak mode of some sort.
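A minimal sketch of that version check (my own illustration of the idea described above, not QuestDB's implementation): the WAL transaction carries the metadata version it was written against, and if the sequencer has moved on, the transaction is reconciled, for example by padding a newly added column with nulls.

    # Toy sketch of the sequencer's metadata version check and the
    # reconciliation step; names and structure are invented.
    class Sequencer:
        def __init__(self):
            self.metadata_version = 0
            self.columns = ["ts", "price"]

        def add_column(self, name):
            # Concurrent DDL bumps the version; adding the same column
            # twice still leaves exactly one column behind.
            if name not in self.columns:
                self.columns.append(name)
                self.metadata_version += 1

        def commit(self, wal_version, wal_columns, rows):
            if wal_version < self.metadata_version:
                # Metadata changed concurrently: pad columns this
                # transaction never saw with nulls before applying it.
                for missing in (c for c in self.columns if c not in wal_columns):
                    for row in rows:
                        row[missing] = None
            return rows  # reconciled transaction, ready for the column store

    seq = Sequencer()
    seq.add_column("venue")   # concurrent metadata change by another actor
    print(seq.commit(wal_version=0, wal_columns=["ts", "price"],
                     rows=[{"ts": 1, "price": 101.5}]))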
QuestDB - Egress API
Now, the egress API. This is what we saw throughout the years, what people do with time series data. One use case for egress is to power BI software. It's analytics. It's aggregations. It's top of this, bottom of that kind of pie charts, and stuff like that. These analytics are sometimes zoomed out as well, so a user can zoom out, and that causes the database to go over a larger volume of data. This is a type of workload that effectively requires distributed queries, because everybody wants data fast, no matter how much data you have. If somebody zooms out to five years, then the database needs to do a lot of calculations right away. That's the prime case for OLAP and distributed query engines. Some other workloads are time series workloads.
A time series workload is basically time series charting. People plot a chart of something that is more likely to be very recent. Trips into the past from charting applications are quite rare. What's important there is to optimize query execution on a single CPU, because going across multiple machines could be prohibitively expensive, depending basically on the query. Other types of workloads, such as Spark or Python: sometimes these workloads definitely need to execute SQL on the data. A lot of other times, these workloads just want to access the object store directly. They don't need to go to the database. This is where database bypass becomes interesting and important in this scenario.
The workloads that offload quite a lot of data from the database actually necessitate a different type of egress API, such as Arrow drivers, ADBC they're called. The problem these drivers solve is getting a large volume of data in or out of the database over the network. It requires multiple network connections. A database is not necessarily going to scale linearly as you add more network connections to it, but doing it over a single TCP socket is not going to happen either. This is what the ADBC driver provides. Your output data is split into column sets. Those items are still columns, they're just smaller in size. These columns are sent back-to-back over the wire to the client, and then they are reassembled into larger columns.
All of this is abstracted via SQL execution. Typical examples here: for a time series workload, somebody would just query data on an interval, say they want data for the past month. Or maybe some data filtered on a predicate: they want data for the past month for this particular time series. It's common in a time series database to put multiple time series in the same table, so you'll be filtering a time series out of a table to a client. That could be quite a large dataset. Sending it over multiple network connections implicitly makes a lot of sense, and this is what ADBC facilitates.
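The underlying idea can be sketched with Arrow record batches (pyarrow is used here purely as an illustration; it is not the QuestDB driver itself): a large columnar result is cut into smaller column chunks, shipped batch by batch, and reassembled into full columns on the client.

    # Sketch of the chunked column transfer behind Arrow-style drivers,
    # using pyarrow only as an illustration of the idea.
    import pyarrow as pa

    # Pretend this is a large query result held column first on the server.
    result = pa.table({
        "ts":    list(range(100_000)),
        "price": [100.0 + i * 1e-4 for i in range(100_000)],
    })

    # Split into smaller record batches, as if sent over several connections.
    batches = result.to_batches(max_chunksize=10_000)

    # The client reassembles the batches back into one columnar table.
    reassembled = pa.Table.from_batches(batches)
    assert reassembled.num_rows == result.num_rows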
QuestDB - SQL Engine
The SQL engine is effectively a reflection of the workloads that are put on it. It's typical for OLAP, and it's what we do as well, to execute group-by, order-by, and where clauses in parallel. Data is split into chunks. Those chunks are evaluated in parallel. Data is merged back. To optimize this, we also compile some of the predicates into assembly, AVX2 assembly, basically to avoid costly function executions. This helps with analytical workloads. It helps with time series charting, and it helps with BI workloads. We offer SQL language extensions, such as sample by and limit. I'll show examples later on, but these are simplifications that take advantage of data being ordered by time. Sample by is a group-by on a time interval. Limit allows you to pick particular data in a time range quite easily. I'll show you how to do that. The SQL engine also leverages metadata and data layout for query optimization.
A good example of this is that, if you have data that is, for example, run-length encoded in a column, it's not necessary to materialize this data back in memory to perform a search on it, so you can search compressed data without uncompressing it. This is what the optimizer tries to figure out: the data layout and what kind of execution algorithm to pick. Interval search is used to send bulk data either to Python or maybe to a chart. It also leverages the layout of the data itself. The data internally is represented as column first for query engine purposes. Such data makes it relatively simple to offload to the ADBC driver, because it's a column first driver.
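As a toy illustration of evaluating a predicate on run-length encoded data without materializing it (my own sketch, not QuestDB's optimizer): the comparison is done once per run rather than once per row.

    # Toy sketch: count matches for an equality predicate directly on
    # run-length encoded data, without expanding the runs into rows.
    runs = [("green", 500), ("yellow", 1200), ("green", 300)]  # (value, run length)

    def count_matches_rle(runs, wanted):
        matches = 0
        for value, length in runs:
            if value == wanted:
                # The whole run matches: account for all of its rows in
                # one step instead of checking them individually.
                matches += length
        return matches

    print(count_matches_rle(runs, "green"))   # 800, without decompressing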
QuestDB - SQL Extensions
These are examples of sample by queries. You could potentially write a group-by statement here by truncating the date to the hour. This is a simplification, and it does the same thing. In this example there's also an interval query: instead of expressing it via a date between x and y, this is the seven-day interval starting on the 1st of June, 2018. This is semantically easier SQL to write. This is an example of an AS-OF join. The AS-OF join is a time series join. It joins the data on a fuzzy time interval. A typical example would be to find a value from another table that happened before. Or, you've got tables A and B, and you find the value in table B that happened just before the value in table A. That's what the AS-OF join does.
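Since the slide contents are not reproduced in the transcript, here is a hedged sketch of what such queries can look like in QuestDB's SQL dialect, held as query strings in Python (the 'trades' and 'quotes' tables and their columns are invented, and the exact slide examples and syntax may differ):

    # Sketch of the query shapes described above; table and column names
    # are hypothetical and the real slide examples may have differed.

    # SAMPLE BY: a group-by over fixed time buckets, here hourly averages.
    sample_by = """
        SELECT timestamp, avg(price)
        FROM trades
        SAMPLE BY 1h
    """

    # Interval syntax: seven days starting on 1st June 2018, instead of
    # spelling out a BETWEEN over two explicit dates.
    interval = """
        SELECT * FROM trades
        WHERE timestamp IN '2018-06-01;7d'
    """

    # ASOF JOIN: for each trade, the most recent quote at or before it.
    asof_join = """
        SELECT t.timestamp, t.price, q.bid, q.ask
        FROM trades t
        ASOF JOIN quotes q
    """

    for query in (sample_by, interval, asof_join):
        print(query.strip())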
This is travel along the time axis that also takes advantage of the order of the data. The first example just shows the last 100 rows from a dataset. This illustrates the negative limit you can use in queries. The second example uses the latest keyword. It finds the latest version of the data on that particular day, the 5th of April, 2018. This is the taxi trips dataset, and cab_type has one of two values. Ultimately, this would go into the middle of the dataset and take the latest data for a given time interval, so you can version your data, essentially. This is quite quick, and takes advantage of the ordered dataset.
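Again as a hedged sketch of the shapes just described (the 'trips' table mirrors the taxi dataset mentioned above; the exact slide queries and clause order may differ):

    # Sketch of the time-axis queries described above; the real slide
    # examples may have used slightly different syntax.

    # Negative LIMIT: take the last 100 rows of the time-ordered table.
    last_rows = """
        SELECT * FROM trips
        LIMIT -100
    """

    # Latest row per cab_type within 5th April 2018, effectively
    # versioning the data along the time axis.
    latest = """
        SELECT * FROM trips
        WHERE pickup_datetime IN '2018-04-05'
        LATEST ON pickup_datetime PARTITION BY cab_type
    """

    for query in (last_rows, latest):
        print(query.strip())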
Conclusion
A Lakehouse seems like the engine everything is converging to. OLAP and time series databases converge on the same target. OLAP databases are improving their ingestion pipelines. They're trying to implement streaming workloads in OLAP databases, and TSDBs are trying to improve data storage, make queries distributed, and deal with large volumes of data at the same speed as OLAP databases do. Ultimately, we don't know the future, but they might become one, because they're all moving in the same direction. In the meantime, time series databases and OLAP databases co-exist in the way that a time series database ingests new data quite well, and this data is then offloaded into a bigger database and analyzed for posterity in an OLAP database.
Questions and Answers
Participant: How resilient is QuestDB to handling out of sequence events? You obviously have streaming ingress, but do you also have streaming egress on Quest?
Ilyushchenko: It's not streaming egress, but we're building this ADBC driver for large volume egress. It's not quite streaming. The capability of streaming egress can be achieved through change data capture. These write ahead log files can be shipped somewhere else and analyzed. It can be done like that.
In terms of out of sequence events, internally we call them out of order. It's the same thing. Our database is not an exception; it does copy-on-write stuff. You are allowed to send data as far back as you want. Some patterns will perform quite well. It depends how much data we need to rewrite. There are some degenerate patterns that, basically, are going to go in very slowly. It depends.
For example, if you're sending small transactions into, like, the top of a partition, and we rewrite a big partition over and over, you can get into these kinds of scenarios. Broadly speaking, yes, you can send out of sequence events. In some cases, we try to split partitions if we detect this kind of event, so the big partition gets split and we rewrite only smaller chunks. Ultimately, there's a possibility to get the database into a corner with these things.