
Why a Hedge Fund Built Its Own Database


Summary

James Munro discusses ArcticDB and the practicalities of building a performant time-series datastore, and why transactions, particularly the Isolation in ACID, are just not worth it.

Bio

James Munro is Head of ArcticDB at Man Group. ArcticDB is a high-performance data-frame database that is optimized for time-series data, data-science workflows and scales to petabytes of data and thousands of simultaneous users. James was previously CTO at Man AHL between 2018 and 2023.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Munro: I'm James Munro. I'm going to talk about why a hedge fund built its own database technology. I run ArcticDB at Man Group. Man Group is an asset manager. I was a physicist originally. I spent some time doing plasma physics, electron-molecule scattering, things like this. That ended up being useful for doing simulations of plasmas for semiconductor processing. This is marginally relevant to a later bit in the talk.

Eventually, I gave that up and joined Man AHL, which is a very systematic, quantitative, mostly hedge fund manager. I worked as a Quant Dev on a bunch of their asset class teams, strategies, and portfolio management. I became a manager in 2016, then CTO in 2018. Really, that's where I feel a lot of what I'm going to talk about starts to come in, where I was focusing on particular things for Man AHL. I did that for 5 years, before becoming head of ArcticDB. I actually moved from being the biggest user of ArcticDB, running the team that was using ArcticDB and demanding features left, right, and center, to being the owner of it, which is an interesting twist.

Technology Empowered Asset Manager

I'm going to give some context to sketch out the story, and then talk about the real whys of building your own database technology. Man Group is a large alternative asset manager. Alternative means not the usual massive markets, but alternative investments designed to give uncorrelated returns. We're quite large for that, so over $160 billion of assets under management. We've been doing that for a while, 35 years.

Actually, Man's much older. It's named after someone called James Man, who founded it in 1783. It's a very old company, and has been doing brokerage and merchanting and stuff for most of its history. For the last 35 years, it's been predominantly asset management. Lots of clients. Mostly sophisticated clients. In fact, you have to legally be a sophisticated client for most of the stuff. Things like pension funds who are looking for other sources of alpha for the people with those pensions. This is a really broad range of investment strategies.

Things like macro funds, trend following funds, multi-strategy things, discretionary things, credit, loans, real assets, all sorts of things. There's a whole diversity of things going on. Anything in alternative asset management has been something that Man Group's interested in. All of that is now supported by a single technology and operating platform. That's where the story starts to get interesting for us, I think. Because of the size, there's a scale challenge immediately. We trade an awfully large amount for an asset manager, $6 trillion or more a year. It depends on the vol of the markets and the level of risk we're taking, but that kind of scale.

How do you get from $160-odd billion to $6 trillion? That seems like quite a jump. The first several-fold comes from leverage. If you're running hedge funds, you're normally running leverage. That's the first factor. The other several-fold, from a trillion to $6 trillion or $7 trillion, is because you're an active asset manager. You're using data and information on the markets to decide whether you should be long or short, and which markets you should hold.

That's what you're using to trade. That means you have to be coming in and out of positions, and you're doing that several times a year, on average. That's where that scale comes from. Then, we're looking for a whole diverse range of ways to find alpha for our clients, which is why you need to trade lots of markets. Anything liquid, we will trade, basically. Then we've got all these different customers with different problems, and we're trying to design it for all of them as well. This is a real alpha-at-scale challenge. Then, trading, you've got to do this cheaply and efficiently. It's a real drag on this activity if you're not getting into the markets really cheaply.

Ten Years of Investment and Refinement

Another little bit of context is about the story for Arctic itself, so just at a high level. This is something we actually started back in about 2011 when we were looking at Python as a platform as well, and we were looking for data solutions that would work for us for that. We built a first version of Arctic. I'm going to try and explain the why for both simultaneously, so this won't come up that much again. We made a first version of this Arctic database, actually purely in Python. It was backed by Mongo at the time, which is a fast document storage database. We did that.

Then we actually open sourced that in 2015, which got reasonably good usage by our peers in the financial industry. That was the first half of the story. Really, we started to hit limits with that. Mongo became a bit of a scaling obstacle for us. We were running hundreds of Mongo servers. Apparently, we had the second largest Mongo installation for a while. We needed another way to level up on performance as well, so we rewrote the core in C++. We actually got rid of the separate database layer completely, and now it connects just to storage. That's one of the stories I'm going to talk about. This was therefore a way that, by changing the way we do storage, connecting directly to object stores like S3, we could get better performance and better scale.

Then also, by changing the runtime to C++ instead of Python, we could get better performance in what we're doing there as well. That was a way to multiply our performance and our scale many times over. That is now used for almost everything at Man Group, in the front office at least: for the market data and all the data we use to predict markets, or risk, or cost, or whatever we're doing, and for all of the AUM that Man Group trades. It's also used across finance now by banks, other asset managers, and data providers.

Why Build a Database?

Let's get into the why. Why build a database? At one level, it sounds crazy. There are thousands of databases out there, and they've already been built by people who wanted to make a successful database for some reason. Is it really the case that none of them fit what you're trying to do? That's the core of the question. At some level, this sounds a bit crazy. It reminded me of this quote, "the distance between insanity and genius is measured only by success", which after a bit of Googling I found out was coined by someone called Bruce Feirstein, an American author who apparently wrote some of the James Bond stories.

This is the DALL·E interpretation of that. You've got the slightly crazy person on the left and then maybe the more genius person on the right. That was the way around I had it in my head anyway. A younger looking Bruce Feirstein in the middle, and he's measuring his head size for some reason. I got it. Then you get these weird things like the arm coming in from the left on the table. I don't really understand. Is it crazy? I don't think it really is because, for one thing, there's some level of specialism here. We're dealing with a lot of high frequency data. I'm going to talk to you about shapes and sizes of data too. That's a normal thing for people to go and solve with specialist tools. There are proprietary databases you can buy for tick data.

You'll find that most tier-1 banks, at least, have built their own proprietary database solutions for tick data, for similar reasons to the ones I'm going to talk about. There's a specialism story there. Actually, we built something that's quite general. I think another part of it is that most of those thousands of databases out there have already been built for a purpose. There are very few databases that are built in a vacuum and then address a purpose. Actually, you should really consider it normal that when there's a breeding ground for some innovation, someone comes up with something that's a little bit different.

Getting into the why, I wanted to try and connect the dots from what Man Group is trying to achieve with this alpha-at-scale story back to the technology pressure for doing something like ArcticDB. This is quite a nice way to explain it. This is an economics article from the American Economic Review in 2020, where they were looking at research productivity. You might think that maybe low latency trading systems are the real challenge of running a systematic quant hedge fund. It turns out, not really.

We're not a high frequency trader. That's one element where that would be more of a pressure. The performance of that does matter, and so does the technology you need for high frequency trading and really good execution. Actually, you'll find that a lot of the time you're competing on this research productivity, for the trade ideas themselves and how you manage risk and optimize portfolios, rather than on the low latency aspects. I'll explain that a bit more. One step back from generating the alpha itself is the research to generate the alpha from the data. It's the research productivity of your quant team that you're focused on, for those ideas and for that portfolio construction. This chart shows that for Moore's Law. This felt familiar to me, because I was working on semiconductor processing in the mid-2000s.

I'm a little part of that green line. What it shows is that for Moore's Law, in that sector at least, we know semiconductor densities have doubled every couple of years. Somehow that's a 35% compound growth rate, according to the economics, though I didn't quite get that. That's been an exponential growth rate for the computing industry. It's been a massive revolution: the IT revolution, the technology revolution. The cost of doing that research has been going up steadily. I was definitely acutely aware of that, because of the level of detail needed to simulate those plasmas etching those semiconductors. It was atomic-level simulation, back in the mid-2000s.

That was very much at the tail end of those concerns. The number of people needed just to eke out smaller feature sizes on these things was getting huge. That works when your payoff is exponential. It doesn't work in every market. If that was one person in 1971, it's like 18 people in 2014, and probably a lot more now. Research productivity is a key challenge in that sector. The paper's conclusion is that it's a key challenge across sectors, and some of those sectors don't have those exponential payoffs. It's a key challenge in quant as well, for markets. For one thing, you've got more data coming in. Another thing is you've got competition, and you've got market efficiency going up. These edges become harder to find. This research productivity ends up being your biggest challenge.

What is Alt Data?

Data's going up, no need to explain that. In asset management, in finance, generally, though, it's been a slightly more nuanced story. High frequency data, low latency market data ticks, that frequency, that's been going up. You can collect trillions of rows per day, if you want. Billions of rows per day is typical for asset managers like ours. Also, there's been another part to the story, which is all this other data starting to get used to predict markets and risk. Generally, we refer to that as alt data or alternative data.

That's all the data that consumers and people are generating, which comes in all sorts of different types. It could be weather data, images, or data on how green you are. All of this stuff has been growing. There's a dual-sided story here, where the volume of this data has been growing. This plot here is by a data catalog company.

They're called Eagle Alpha, and they basically go and index datasets for people, to try and connect you to the valuable data you might need. They've got these nice plots of all the datasets in their catalog over time. Actually, it's also a diversity-of-data challenge for a place like us, where you're dealing with all sorts of different data types. You really need to be able to get through that as a process quickly, as well. You need agility with data, not just the ability to handle it performantly.

Choosing Python for Data Science in 2011

Then, another part is that I wanted to place this story in time as well, because if I think back to 2011, when I joined Man Group, they were just beginning a move to Python. Everyone uses Python for data science now. People might be on other languages at the moment, but anyone who is choosing today is choosing Python for data science. That's what you get taught at school and everything. That wasn't really the case in 2011. I think that if you were in a quant hedge fund in 2011, you were looking at other languages like R, or things like MATLAB, sometimes C++ if you were doing lots of data.

These were the typical choices. These were the skills that people had in the building. Python wasn't the obvious choice for data science at that point. Also, you were presented with the challenge of wanting something you could use in prod, and it wasn't really a choice for that at the time either. Obviously, lots of people have been choosing Python since then. This is a Stack Overflow language popularity chart, and you can see lots of people chose Python; it became very popular on Stack Overflow, and obviously very popular globally. Back then, also, Python 2 was the norm; Python 3 was out, but it was still hard, still a pain.

TensorFlow didn't exist. PyTorch didn't exist. Lots of tools weren't that popular yet. pandas really got popular in the mid-teens; it was originally open sourced in 2008. This was still a bold choice: you're going to move everything into Python, you're going to do all this data science there. We spent time supporting that community, particularly PyData in London, and working on our tooling on top of what was available, including ArcticDB.

Conceptual Journey

I want to explain another part of the conceptual journey, which is that even if you're an individual managing your own investment portfolio, this is the process you follow. Even if you're a very sophisticated asset manager, this is the process you follow, whether you like it or not. You're bringing data in: as an individual, I might just look at my phone for the price of something; as an asset manager, I might be bringing in billions of market data ticks. You're trying to onboard that data. You're coming up with some decision, some idea about what to trade.

Then you need to decide how to trade it, the portfolio construction. Maybe it's not the only position you hold. A simple example is that you want to hold American stocks, but maybe you don't want to be exposed to the dollar, so you might need to hedge something. In a complicated case, you've got to decide on your weighting across all the assets you might hold. Then you want to trade it cheaply. That might affect the instrument you trade, but also how you trade. Whether you're an individual or a global corporation, this is the process you follow. The other part of the story is that this gets complicated extremely quickly. I mentioned the thousands of datasets that Eagle Alpha had indexed there.

There's a variety of data as well. You might be looking at company reports, like documents. You've got tick data. You might be looking at consumer transactions. You might be looking at environmental and social things. You're also using all sorts of statistical methods. It's not like you're choosing one machine learning algorithm that's on trend. You're doing basic statistics and science. You might be doing deep learning.

You might be using ChatGPT. You're almost using every method under the sun and you're getting hands-on. You need the tools that let you do that. You need to be doing all sorts of portfolio construction. I mentioned that we've got lots of solutions. You need to be doing risk. You need to also do the high frequency end of execution, and you need to do that well. You've just got this really diverse set of problems, and almost teams devoted to all of them. There are hundreds of quants within Man Group, for instance, and they all want to be solving different problems. If you're all solving the same problem, you're probably doing something wrong.

DataFrame Use-Cases Across a Systematic Trading System

Now we're getting into the nuts and bolts of it, the detail of it. If you're a systematic trader, which means that you've designed an algorithm that's going to do the work for you, you're not choosing stocks yourself. You're designing the algorithm, and you're probably very automated, from data through to execution. I think this is the architecture for the ages, called the lambda architecture. Invariably, you invent this lambda architecture, where you have a streaming data pipeline and a batch data pipeline. Streaming data is all about the high frequency stuff.

You're bringing in tick data. You're saving it into tables or DataFrames. You're also downsampling it into bars, typically one-minute bars, though they could be lower frequency or higher frequency, and doing analytics on it for later use. Then, for the kind of work that requires connecting to non-streaming APIs, for everything else, you've got a batch workflow. That's the other side of the lambda architecture, where you're bringing in either fundamental or alternative data for all the things that you care about. You're putting that into DataFrames or tables. You need to catalog that, because there are thousands of things. We've got an internal datalake architecture we call codex.
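As a rough illustration of that downsampling step, here is a minimal pandas sketch. The tick DataFrame, its column names, and the frequencies are entirely hypothetical, not anything from Man Group's actual pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical tick data: timestamped trade prices and sizes for one instrument.
ticks = pd.DataFrame(
    {
        "price": 100 + np.random.randn(10_000).cumsum() * 0.01,
        "size": np.random.randint(1, 500, 10_000),
    },
    index=pd.date_range("2024-01-02 08:00", periods=10_000, freq="250ms"),
)

# Downsample to one-minute bars: OHLC on price, summed volume per bar.
bars = ticks["price"].resample("1min").ohlc()
bars["volume"] = ticks["size"].resample("1min").sum()
print(bars.head())
```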

The goal is to bring this stuff into your model, your strategy, where your algorithm is going to execute, and it's going to use this. It's going to be able to do backtests to tell you how this thing evolves over time. It's going to do a portfolio risk and trade optimization. That's where the science is focused on building that algorithm. Then you'll send that instruction out to trading, the eventual decision. That could happen in a variety of ways.

Then you're going to need to do analytics on that. You're going to need to do analytics on all of this to see what's going on. You're bringing this stuff into tables. What's happened at Man Group, and actually at many alternative asset managers, is that the table has become the DataFrame. The DataFrame's become the unit of moving data around, almost like a document, in that you care about the unit as a whole. You're reading and writing new ones every day. You're just shuffling an awful lot of data around, and you're doing analytics on all of it every day. That's a key part of the story: your unit of operation is the whole DataFrame.

Scalable and Accessible

Then from the technology side, back in 2011, but this is still something we care about, we were thinking about how to get past the server bottlenecks that all of this stuff causes. We tried a number of proprietary databases and open source databases at the time. We found that all of that often required creating hundreds of servers. I talked about the Mongo story a bit. The reality is that a single user can generate enough load with some of these data science models to swamp your entire system, to swamp potentially dozens of servers, which is the inverse of the relationship you might have with a website, where you've got thousands, potentially millions of users for a single server.

That's a huge cost and a huge operating burden as well, just the cognitive load and the work required to maintain that, even in a modern serverless setup. Then, also, we were moving to Python. That was a decision we had made. We wanted to make a really trivial API for people. We wanted to make it as easy as possible to work with this data, and with DataFrames as a unit. Something that is almost like OneDrive, where you can just share data trivially and work on it like that. We also wanted a high-performance time series database: not an idiosyncratic tool for this, but a simple Python API that let people work with data like this. All of these things coalescing drove us in this direction.

Real-World Data

Then, is there anything that could still have stopped us building our own database? I'll explain a bit of that now, with some of the real-world data challenges, and my opinions on them. Time for a DALL·E picture: shapes that are too wide, too long, or ragged. I wasn't liking the images, so I added cubism at the end to try and get a cubist image. I'm not sure this is cubism, but it's got cubes in it. Here's someone struggling with real-world data. A real example: bond data. I think it's an interesting example, because people might not be so familiar with it.

Bonds are actually a bigger market than equities, multiple times bigger, three times globally as of when I pulled this data, and actually a higher proportion in the U.S., because most bond trading is done in the U.S. It's a huge market, but it's a lot less liquid than equities. We all know about equity trading. Bonds are credit, basically, and they're much slower and harder to trade as an OTC market. This has been fertile ground for quant, because it's less liquid but really big; it's something that's hard to do and hard to get right. It's a really good thing to be on the front edge of.

The data behind this ends up being somewhat challenging as well. At the top, I've got how you might typically think about normalized data. At the bottom, I've got one of the ways we actually work with this data. In fact, we have many shapes for all the things we do. At the top, this is a Python pandas DataFrame, and I'm pulling out the bond data. I've got dates on the left, so it's a time series. I've also got IDs of the bonds; those are the CUSIPs and ISINs, the common IDs. You've got things like price, but you've got lots of measures for the bonds, the fundamental data, things like duration. This is a normalized way of bringing in data. This is how you often get data. It's not actually how you want to work with data. That's one of the steps I think you've got to take.

At the bottom is a pivot of that. You're just pulling out price now. It's one measure. You're just working with one measure, because you want to do calculations on price, and a calculation on price is not the same calculation you're going to do on volume. You've got IDs along the top, and you've still got times down the left. Now you can do time series analytics. If that's stored in a columnar fashion, then that's incredibly fast. You've also arranged your data naturally for doing cross-sectional analysis, because everything's a portfolio now.

Trading one asset rarely makes sense. You're really arranging yourself for the kind of work you want to do on this data. This ends up being 400,000 historically tradable bonds from this dataset, which is many gigabytes. It's five rows on the slide, because I pulled out the tail. It's a few thousand rows of daily data in total, but it's many gigabytes of data. It wouldn't fit in RAM on this laptop, for instance. You also need the tools to deal with this. You need something which is happy to have that many columns, which is not a typical SQL schema. It's more like something you need to treat as a block.
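To make the two shapes concrete, here is a tiny, hypothetical version of that pivot in pandas. The bond IDs and values are just placeholders standing in for the dataset on the slide:

```python
import pandas as pd

# Hypothetical normalized (long) bond data: one row per (date, bond ID),
# with several measures as columns, which is how data often arrives.
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-04", "2024-03-04"]),
        "cusip": ["BOND_A", "BOND_B", "BOND_A", "BOND_B"],
        "price": [98.2, 101.5, 98.4, 101.3],
        "duration": [4.1, 6.3, 4.1, 6.2],
    }
)

# Pivot out a single measure (price) into a wide, time-indexed matrix:
# dates down the left, bond IDs along the top. This is the shape that suits
# columnar storage and cross-sectional time-series analytics.
price_wide = long_df.pivot(index="date", columns="cusip", values="price")
print(price_wide)
```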

This is really the story of the normalization tradeoffs, and maybe a personal opinion as well. Generally, people have been taught to design schemas like the one on the left. I'm not saying they're wrong, but I'm saying you need to think. Schemas on the left are typically normalized. The reason you're doing that is because you're looking at the way the data is structured, and you realize not every timestamp exists. You realize that sometimes assets come and go. Apple didn't exist forever. Many companies go bust.

Measures change, because the schema of your data often changes, so you've got to normalize it all, because you think that's the right thing to do. It's the thing you get taught to do at university, if you do computer science. Actually, your users often care about something else. I don't want to be too general. Here, I've given another case where it's not asset IDs along the top; instead, you've broken up your problem into assets, because maybe you've got an application where people are just pulling in individual assets. You don't want the performance cost of building the table on the left, or the performance cost of scanning the whole table every time you want one asset out of it. Suddenly, you want the columnar storage per asset.

Also, your measures might change, or you might have something else; you might be doing calculations producing the columns, but you want that stuff aligned. You don't want that to be distributed across your large table. Actually, go to where the user is as much as possible, and then do the normalization you think is appropriate for that. That's a real part of it, and it has made us design for the flexibility of DataFrames that are the shapes the user wants them to be.

There are two aspects of that. One, we can have big tables, but let's make sure those tables are columnar on the things that people want to read. That could be billions of rows, it could be trillions of rows for tick data, or it could just be decades of daily data. Also, users are going to want to do cross-sectional things, so support hundreds of thousands of columns. We have a user on ArcticDB doing a million columns. Make sure it works for that.

Then, rather than rely on your normalization to deal with missing data, just accept the missing data. Let your data be ragged. Let your data be sparse in places. Build a tool that's performant for that. Again, this is just an example, because you're going to have many use cases, but this is asset IDs along the top. The idea is that some of these assets have gone away, maybe they went bust, and new ones are coming along. If it's bonds, they've expired and new bonds are being issued, or maybe data is missing because the server went down. Just build a tool that lets you work with that, because that's how the quants are going to work with it. I think that's probably quite true of a lot of data science as well.
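A small, hypothetical pandas sketch of what "let your data be ragged" looks like in practice: the wide frame simply carries NaNs where a bond hasn't been issued yet or has matured, and the analytics tolerate the gaps rather than forcing a normalized schema. The bond names and prices are made up:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="B")

# Hypothetical wide frame: BOND_A matures mid-way, BOND_C is only issued later.
prices = pd.DataFrame(
    {
        "BOND_A": [99.1, 99.2, np.nan, np.nan, np.nan],    # matured / delisted
        "BOND_B": [101.0, 101.2, 101.1, 100.9, 101.3],
        "BOND_C": [np.nan, np.nan, np.nan, 100.2, 100.4],  # issued later
    },
    index=dates,
)

# The gaps just flow through the calculations.
daily_returns = prices.pct_change()          # NaNs propagate harmlessly
cross_sectional_mean = prices.mean(axis=1)   # averages over whatever exists that day
print(cross_sectional_mean)
```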

Architecture

If you believe the why, why build a new database, then how do you go about solving for all this? Time for another DALL·E picture: our mascot, the polar bear, is looking for a way through a complex architecture. My take on this is that you can mostly give up one of the letters in ACID. We know atomicity is the idea that something either happens or it doesn't happen; it fails completely or happens completely. Renaming a file on a file system is a common example of that. Consistency means all the versions are good versions.

You don't see half-complete versions, or things which break things we expect to be immutable. Durability means there are copies of the data and you've got good uptime. One hundred percent, you need these things. The one that's left is isolation. If you want to give up anything for the sake of agility and performance, I think you can do a lot by giving up on some of the isolation. Isolation means that you're helping the user coordinate transactions, and they might be doing it from multiple places. A very classic example of this is your shopping basket. You've got stock. You're buying something online.

That shop has stock of that thing, and it can't sell it twice, so it needs to check ordering when it puts things into shopping baskets, so it doesn't oversell it. In trading, this matters. I've got positions in something; if I want to change that table, I need to know that I saw the latest value, and that no one else is operating based on that value until I've done my change. That's the serializability of individual users transacting on the database. If you're moving to a world where you're actually doing analytics and data science, instead of trying to operate an order book in a trading system, then that's something you can easily give up, because you're actually asking the user, who's already got millions of these DataFrames (they're like documents to the user), to do the coordination.

They just have to think about how they change that DataFrame, rather than trying to write to it from multiple places. Really, the point here is that isolation needs coordination. That can be a serious cost and maintenance problem, because suddenly you need database servers, because you need a queue, or locks, or you're limiting yourself to these obscure data types that are eventually consistent. You can do this without separate database servers. You could implement it on AWS S3, but it would be really slow and inefficient, because you're asking a lot of the abstraction. You could implement a queue. You could implement locks. There are many databases out there that do this without giving up the database server. If you're happy to do without that coordination, then you can get a lot of performance, flexibility, and efficiency by running a database without servers.

Actually, this is not a new idea. There was a 2008 paper on building a database on S3. AWS S3 came out in 2006, so just a couple of years later people were already excited about the technology. The authors look at how you could implement databases on S3, and what you'd have to give up, given the constraint of being reasonably efficient. The diagram on the left is basically offering up the chance to run it without a server, because although there's a line between the client and the database logic, the record manager and the page manager, that line can go away.

You could run all the database logic in the client. The conclusion is that atomicity and all the client-side consistency levels can be achieved, whereas isolation, as strict consistency, cannot. What they mean by strict consistency here is that writes follow reads. It's the challenge of having your write depend on the data you just read. In SQL language, this would be SELECT FOR UPDATE: you're locking the read until you've updated. That level of strict consistency is a challenge to do performantly. The level just below that, that reads follow writes, meaning that if you've done a write and that write is complete, everyone gets the data at that point, is possible in AWS S3 and many S3 implementations, and lets you do a lot.

This is not really rocket science, once you're happy to take this approach. It's stuff that we probably all know about a little bit. Here's a little bit about how this is achieved in ArcticDB with its data structures. If we're going to respect the atomicity and consistency, and just delegate durability to the storage, not even worry about that, just use good storage technologies, then we need a data structure that's going to support that. Ideally, an immutable data structure, where you just add versions rather than modifying previous versions.

User 1 is reading version 5, and then you're making an update. What you actually need to do is create a new version of the data. Then you need to re-point the reference, so you've got a symbolic link here. I'm pretending that you've got a piece of data called Apple. You symbolically link that to the new version as an atomic operation, after you've completely written version 6. Then any new users are getting version 6. You can start to build database semantics on top of this.
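Here is a minimal, in-memory sketch of that copy-on-write versioning idea. This is not ArcticDB's actual implementation or storage format; in the real thing the versions and the reference live as keys in object storage, and the atomic flip relies on the storage's guarantees rather than a Python lock:

```python
import threading

class VersionedStore:
    """Toy copy-on-write store: versions are immutable, a ref points at the latest."""

    def __init__(self):
        self._versions = {}   # (symbol, version number) -> data
        self._refs = {}       # symbol -> latest version number
        self._lock = threading.Lock()  # stands in for the storage's atomic ref update

    def write(self, symbol, data):
        with self._lock:
            new_version = self._refs.get(symbol, -1) + 1
            # 1. Write the new version in full; readers of old versions are untouched.
            self._versions[(symbol, new_version)] = data
            # 2. Only then flip the reference, as a single atomic step.
            self._refs[symbol] = new_version
        return new_version

    def read(self, symbol, as_of=None):
        version = self._refs[symbol] if as_of is None else as_of
        return self._versions[(symbol, version)]

store = VersionedStore()
store.write("AAPL", {"close": 191.2})   # version 0
store.write("AAPL", {"close": 192.8})   # version 1
print(store.read("AAPL"))               # latest
print(store.read("AAPL", as_of=0))      # time travel to the old version
```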

Beyond having a reference name for things and a versioning structure, you're going to have some indexes in your database, and you're going to have some data. I'll come to that. Really, I've said this a number of times, but there are no database servers. They're all gone. No job queues. Everyone can operate independently on the storage system, but it's shared storage. It could be a shared file system, it could be S3, and people are therefore able to work together, as long as they figure out how to update individual DataFrames together. Typically, there's ownership here, because there are millions of DataFrames and thousands of libraries.

What does this mean for where your concerns sit in your database? I think this is an important point, because it's really not just about isolation, it's about where you put your concerns. The things you care about, where do they sit? Traditionally, you have quite a light API, because a lot of the work is happening on your database server. The security is definitely happening there, and executing your queries, managing your transactions, and building and updating your indexes is a lot of the work.

Then you've got to split responsibility for capacity, both for performance and for how much data you've got, between the servers and the storage. The resiliency, the durability, is split across both. You don't have any single point of failure in a modern distributed system. What happens when you make this truly serverless, where you've given up a little bit on transactions, but just that isolation concern? You do have a heavier-weight client. The API is simple, but it's doing a lot. It's actually doing the indexing and the execution of your queries. Then the security, capacity, and resiliency are all delegated to your storage. If you're using a world-class storage system, like many of the S3 ones, you've got incredibly good capacity and scalability possibilities, security possibilities, and incredible resiliency. That stuff just works.

Then, you don't have to worry about these database servers. Then the other nice feature, which you might think is a bad feature, but I think it's actually a good feature at an organizational level, is that the work you're doing at your database scales with your clients. If this is a web server, that would seem like a terrible idea. Like, I'm doing all my work in my web browser. If I'm trying to do data science, then the reality is the users doing the most work are running on the biggest machines, and they can afford the most database work as well. You're actually naturally scaling your workload with your user's workload, which is not just about reading or writing data, but the science in the middle. You've got your S3 caring about everything else.

Benefits

Another way to look at the benefits is what you end up with because you're not running these servers. Not having to care about your servers, in the normal meaning of serverless, is probably a good trend; not having them at all seems like an even better option. Just have somewhere to store your data. In this case, in my example, AWS S3: make an S3 bucket, then configure your credentials. Install ArcticDB as your Python client. It's incredibly simple to set up an almost infinitely scalable database.

Then connect to it. That's much the same as you might connect to any database, just a slightly different API. Then you've got database semantics on top of that. You're reading, writing, updating, and deleting data from DataFrames. A win on the back of this is that you're using these immutable data structures. I'm allowing for deletes, but even deletes create new versions. Even though you've based your design on immutability, you've got the ability to change things.
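A minimal sketch of that setup with the open-source arcticdb Python client. The bucket name and URI options here are placeholders for whatever your environment needs (a local `lmdb://` path also works for trying it out), so treat the connection string as an assumption rather than a recipe:

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic

# Placeholder S3 URI: endpoint, bucket, and auth options depend on your setup.
ac = Arctic("s3://s3.eu-west-2.amazonaws.com:my-bucket?aws_auth=true")
lib = ac.get_library("demo", create_if_missing=True)

df = pd.DataFrame(
    {"close": 100 + np.random.randn(5).cumsum()},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

lib.write("AAPL", df)       # each write creates a new immutable version
item = lib.read("AAPL")     # latest version by default
print(item.data.tail(), item.version)
```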

You've also got the ability to rewind time because all these old versions still exist normally. You get this nice feature, this actually ends up being really critical for data science as well, because you need to go back to different models all around. You need to look at how the data changed. You need to look at how your outputs changed, and how that related to the model you're running to do good science. This ends up being a necessary and nice output of this architecture.

Summary

For all the reasons discussed, we ended up with this fully client-side database engine that does all the work of deduplicating, compressing, tiling, and indexing data, working with the storage system to create a fully featured shared database infrastructure on what could be shared file systems, cloud storage, or very performant flash drives in your local data center, which is the way we run it. I said there was more to the data structures: obviously, you're building your indexes, and you're compressing what might be tick data or alternative data, so there's columnar storage in your data layer, and you're chunking it. In this way, you can use the indexes to fetch just the bits you want.

Demo

I actually VPN'd into the corporate infrastructure here, so I'm just going to do some imports, and then explain a few terms. This is the idea of the namespaces, the bucket level in S3 parlance. You've made some storage bucket that you've got permissions for. Then there's the dataset level we call libraries. You might put your U.S. equity data in one and your European equity data in another, or your weather data in another. Then you've got lots of DataFrames in there, could be millions, for all the things you individually care about.

Each of those items is a DataFrame. I'll connect to the research cluster here; there are 30,000 libraries in the research cluster. There's a separate production cluster where we run our stuff in real life for trading. Then I'm going to get our toy example to show you how things work. I'm getting a library. I'm going to read from one library and write to another, so there's a source library and a destination library. I'm listing the symbols; these are the DataFrames in the library, and there are 4.

These are tiny examples, just so you can see what's happening. Basically, you read and write DataFrames. I'll read Amazon 1. I'll write Amazon, and I've added in bits of metadata, just so I can keep track of things. You've got some database functionality, so it's not just a document store. You can append data. Let's do that. I've made my DataFrame a bit longer. I've read a piece there, written it, and appended it to Amazon. There's the slightly longer DataFrame.

You can update the middle, and it will update the indexing and the versions for all of this. When I did the original write, you can see that I've run this demo a few times, version 625. Here, I'll do an update in the middle of the data. If you've been watching very carefully, very quickly, you can see that the middle of the data has changed. Then there's the time travel. All of these things are new versions. I can get the last version, or I can go back to a version number, or I can put in a timestamp and it just finds the version at that time. That was the original version. All these are toy examples.
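Roughly what the demo is doing, sketched with placeholder data against the `lib` handle and "AAPL" symbol from the earlier snippet (the demo itself uses Amazon data in a corporate library, so the names and dates here are made up):

```python
import pandas as pd

# Append rows after the end of the existing index.
more = pd.DataFrame(
    {"close": [190.1, 191.0]},
    index=pd.date_range("2024-01-06", periods=2, freq="D"),
)
lib.append("AAPL", more)

# Update a slice in the middle: overlapping index rows are replaced,
# and a new version is written rather than mutating the old one.
patch = pd.DataFrame({"close": [188.8]}, index=pd.to_datetime(["2024-01-03"]))
lib.update("AAPL", patch)

# Time travel: read the latest, or any previous version by number.
latest = lib.read("AAPL").data
original = lib.read("AAPL", as_of=0).data   # the very first version of the symbol
# as_of also accepts a timestamp, returning whichever version was live at that moment.
```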

This is running in a JupyterHub notebook. It's two virtual cores, so it's actually very small. I'm not doing myself any favors by running on a tiny VM. I'm going to get an example with lots of floats in it, 100,000 columns and 100,000 rows. I'm just going to pluck out three columns and a few months of data. That's the return of the DataFrame there. Then it works with tick data as a clear use case. Here, this is actually Bloomberg data. Level 1 means Bid and Ask data from Bloomberg.

Our level 1 dataset there for equities has 66,000 equities. I'll pull out some columns, just one day of data. It took about a third of a second for 1.3 million rows, just to prove the data came back. Then there's query functionality. This is the classic New York taxi dataset for yellow cabs. Here, I'm trying to find out who the really heavy hitters are. This is pandas-style query functionality. Where have people tipped so much that it's 95% of the total amount? They've tipped $100 on a $5 fare? There are those; people do that. There you go. It feels and works like pandas, which matches the skills that people have learned in Python to work with data.
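A hedged sketch of that kind of query with arcticdb's QueryBuilder, assuming a library handle `lib` as before and a hypothetical symbol name holding the standard yellow-cab columns; the simple filter is pushed into the read, and the 95%-of-total condition is finished off in pandas:

```python
from arcticdb import QueryBuilder

# Filter rows on the way out of storage: only trips with very large tips.
q = QueryBuilder()
q = q[q["tip_amount"] > 100.0]
trips = lib.read("nyc_yellow_taxi", query_builder=q).data

# Heavy hitters: tips that make up at least 95% of the total amount,
# e.g. a $100 tip on a $5 fare.
heavy = trips[trips["tip_amount"] >= 0.95 * trips["total_amount"]]
print(heavy[["fare_amount", "tip_amount", "total_amount"]].head())
```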

Just to give you a little bit of a feel for how ArcticDB works in practice for users. This is used in [inaudible 00:47:45]. Here, I'm running 40 gigabytes per second out of this flash storage on the right, no database servers, straight to Python users. Typically, they're running clusters of machines to do their calculations. If you're using fast networking, like 40 Gigabit Ethernet, then that's over 10 of those links. You're using a lot of networking to do this. This is just every day; this is multiple days sustained. It's billions of rows per second. It works.

Quick shoutout to D-Tale, which is a nice pandas visualizer, and actually our most popular open source tool. We're not doing this alone. Bloomberg have been helping us with this. They're a user of it in BQuant, which is the quant data science Python tool they sell, which has data already in it. There's also QuantStack, a French company who've done a lot of work on Jupyter and conda-forge, mamba, micromamba, things like this. They're a pretty cool open source focused company. With their help, we're building this out.

Questions and Answers

Participant 1: What's the optimization happening on the pandas DataFrames, which we obviously know are not very good at scaling up to billions of rows? How are you doing that? On the pandas DataFrames, what kind of optimizations are you running under the hood? Are you doing some Spark?

Munro: The general pattern we have internally, and the users have, is that the pandas DataFrames you return are usable: they fit in memory. You're doing the querying, so you limit your results to that. Then, once people have got their DataFrame back, they might choose another technology like Polars or DuckDB to do their analytics, depending on whether they don't like pandas or think it's too slow.

Participant 1: Are they stored in a single node of the cluster, or are they distributed across it?

Munro: They're distributed across the storage, which is your storage provider's problem, and something they're actually pretty good at solving. Generally, AWS will dynamically distribute your data to make sure it's meeting demand.

 


 

Recorded at:

Aug 20, 2024
