
Stateful Cloud Services at Neon: Navigating Design Decisions and Trade-Offs


Summary

John Spray discusses the complexities of stateful cloud service design, using Neon Serverless Postgres as a case study, where to put data and how many copies, ensuring availability, scaling a service.

Bio

John Spray leads the storage team at Neon, building the persistent storage services behind Neon's serverless Postgres platform. His background is in high scale data-intensive systems, spanning filesystems (Ceph, Lustre), streaming (Redpanda) and databases (Neon). He enjoys writing performant Rust and C++, focusing on the tradeoffs in turning high performance designs into robust systems at scale.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Spray: I want to talk about stateful services, because they're different. The considerations that we apply to deploying a usual application on Kubernetes don't apply in the same way when you're deploying something stateful. For me, stateful usually means involving storage. You can have a stateful application which is an in-memory cache or something like that. Typically, when we say stateful, that's a euphemism for a database, or a file system, or a queue, something that stores data. You have to make a choice of storage tech.

I'll talk about how to choose what should go in object storage, what shouldn't. Deploying is different. I'll talk about how deploying in Kubernetes works when you have to deal with a stateful service. You also have to think more about throughput with a stateful service. A typical storage service has a user expectation that you saturate the disk or saturate the network, and that plays into your choices as well.

Background: Neon

My background is in file systems like Lustre and Ceph. Streaming systems like Redpanda, which is a Kafka equivalent message queue. Most recently, Neon, where we build a serverless database product. This talk isn't about Neon, but I will use it as an example. This is a very high-level block diagram of what our service is. On the left-hand side, there's a Postgres database, the open-source database that you know and love. Its writes and reads go through our bespoke storage layer, which is also open source, but a separate piece of software written in Rust.

In fact, two separate pieces of software. We have something called a Safekeeper, which is a distributed, replicated log of incoming writes, and pageserver, which lays out your data and makes it available for read at a per page granularity, from the pageserver to Postgres. Behind all of this is S3. You can think of our storage layer as a mapping between very fine-grained, small, latency sensitive online transaction processing IO, and bigger, high throughput, but higher latency storage in S3.

Physical Trends

I'll talk a little bit about the background to our technology choices. The Economist had a few articles about cloud computing, and I thought this was an interesting juxtaposition of headlines: that users don't need to think about the physical underpinnings of the cloud, but advances in physical storage are what made the cloud possible. The takeaway is that while your users don't need to think about hardware, you as a developer probably do. Everything we do as software developers sits on top of a hardware development. We're used to things getting faster every year. Drives get bigger. Drives get faster. What's interesting is tipping points. That's what affects our choices of architecture.

The tipping points I've seen recently in my world are that the density of servers has become so great that the need to scale out storage systems is much less than it used to be. We used to have to lash together thousands of spinning disks to get a good, high performance file system. Now you can have a 1U server with a petabyte of flash in it, and a couple of 400 gigabit network interfaces on the back. That's a change to your architecture. If your business hasn't kept up with the rate at which storage has improved, which for most businesses it hasn't, the rate of improvement in storage is explosive. Even if your data is big, the hardware is probably growing faster. The other tipping point is drive performance itself in terms of the number of IO operations, that's 4-kilobyte reads or writes, we can do per terabyte of storage. It's got to the point where we have spare IOPS.

The little M.2 gumstick in your PlayStation probably has hundreds of thousands of IOPS, which you don't really need. Even big enterprises with a lot of work to do will often struggle to find true needs for millions of IOPS on a storage product. They might use millions of IOPS due to some inefficiency in an application, but it's unusual for there to be a true business need for that many IOPS. What do we do with the spare IOPS? If you work for a storage company or a database company, you might do one of several things.

Something like Pure Storage or VAST Data, selling hardware appliances, uses those surplus IOPS to do compression, to do dedupe, to get more data onto a given piece of storage. We at Neon use those extra IOPS to give versioning, so we let users store a continuous history of all of their transactions and roll back to an arbitrary point in time. That means that the products built on top of the hardware are also having qualitative changes: we're not just getting bigger and faster, we're getting smarter as well.

Durability

What is durability? What is persistence? It means different things to different people. The obvious example is that RAM is not durable, but a drive is. A drive isn't really durable either. If I write something to my laptop, I might leave it on the bus, the drive might fail. Usually, we're talking about multiple drives. Historically, that might have been a RAID array. More often in the cloud, we're talking about multiple drives, perhaps in the same availability zone with some replication on top of them. For me as a database person, I don't really think of something as persistent until it goes to multiple AZs, so, ideally, multiple physical buildings, to the point that if you have a fire in your data center, which is a real thing that does happen, that your data would survive that incident.

Finally, cross-region replication, which is the picture of the planet Earth at the bottom, where you go across regions, either for regulatory compliance reasons, or because you want to be closer to your users, or because you are just super paranoid about loss of a region. There's an economic consequence to that choice of durability. Physical drives at the top of the table have a super low cost per terabyte today. That $30 per terabyte is over a 3-year lifetime of a drive. Those are retail prices. That's an enterprise grade NVMe drive.

The number is so low that you almost don't think about it, compared with what you pay your developers or what you pay for your infrastructure. The numbers for cloud services are higher. As you jump from a physical drive to S3, you get more for it. You get region level durability because you're replicating across AZs, or rather, Amazon's doing that on your behalf. If you want a cloud instance that has a fast drive in it, the number jumps up even more.

The cost per terabyte of an i3en instance is substantially higher than the cost per terabyte of S3. If you're going to use that as the basis for a storage system, you had better have a good reason. S3 Express is a new product, but I think it shows the direction things are going in. That's a version of S3 which is cheaper to use for large numbers of tiny operations. It's much faster. You can almost use it like it was a disk. The interesting thing I think about the pricing is that it's very similar to what it costs to just buy a storage instance. You're not really getting S3 pricing. You're getting pricing as if it was a drive, but you're getting an S3 interface on top of it. EBS isn't that interesting for future architectures, just because it's so expensive per terabyte and because it's not durable across AZs.

If you have a fire or an incident in an AZ and it kills your EBS block device, you wonder why you were paying so much money to have it replicated behind the scenes. This isn't everything. Of course, there are other cloud providers, there are other services within each one. There are things like Glacier. Those, of course, are for colder storage, longer-term storage. What I'm interested in is the storage for online applications and for platform services that run in a cloud native environment. This isn't just relevant to people like me building these services. This is relevant to you if you're buying a service like this, if you're trying to choose between products. It's relevant to you if you're operating a service like this, if you're an SRE. If you want to look critically at a product like the product that I make, you should do it within this cost context.
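To put those tiers in perspective, here's a rough cost-per-terabyte-per-month comparison. All of the prices are illustrative assumptions in the spirit of the numbers above, not quoted AWS rates:

```python
# Illustrative cost-per-terabyte-per-month comparison. Every price here
# is a rough assumption for the sake of the arithmetic, not a quote.

def per_tb_month_drive(price_per_tb=30.0, lifetime_months=36):
    # Retail enterprise NVMe: ~$30/TB amortized over a 3-year lifetime.
    return price_per_tb / lifetime_months

def per_tb_month_s3(price_per_gb_month=0.023):
    # S3 Standard is on the order of $0.023/GB-month.
    return price_per_gb_month * 1000

def per_tb_month_i3en(hourly=0.452, tb=2.5):
    # An i3en.xlarge-class instance: ~$0.45/hour with ~2.5 TB of NVMe.
    return hourly * 24 * 30 / tb

print(f"raw NVMe drive  : ${per_tb_month_drive():7.2f} /TB-month")
print(f"S3 Standard     : ${per_tb_month_s3():7.2f} /TB-month")
print(f"storage instance: ${per_tb_month_i3en():7.2f} /TB-month")
```

The ordering is the point, not the exact figures: a raw drive is roughly an order of magnitude cheaper than S3, and S3 is several times cheaper per terabyte than renting a storage-optimized instance.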

There's one more aspect of cost that often gets overlooked, and that's replication egress costs. If you have copies of data in three different data centers, and you're paying a per gigabyte cost for the traffic between them, it sometimes comes as a surprise that if you saturate a 10-gig network link, the amount you pay for egress can be more than you paid for the servers that you're renting. That doesn't mean you shouldn't ever do it, but it means that when you're thinking about an architecture or maybe somebody is trying to sell you a product that answers the durability question by replicating everything between AZs, you should always ask, who's going to pay the egress fees?
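The back-of-envelope arithmetic is worth doing explicitly. Assuming a saturated 10-gigabit replication link and an illustrative $0.02/GB cross-AZ transfer rate (the real rate depends on provider and direction):

```python
# Back-of-envelope: egress cost of saturating a replication link.
# The $/GB rate is an assumed cross-AZ transfer price, not a quote.

GBPS = 10               # saturated network link, gigabits/second
EGRESS_PER_GB = 0.02    # assumed $/GB for cross-AZ traffic

gb_per_month = GBPS / 8 * 86_400 * 30      # GB transferred in 30 days
monthly_egress = gb_per_month * EGRESS_PER_GB
print(f"{gb_per_month:,.0f} GB/month -> ${monthly_egress:,.0f}/month in egress")
```

At roughly $65,000 a month under these assumptions, the egress line item really can rival the rent on the servers doing the replicating.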

A couple of other things to watch out for if you're building or buying cloud products that use a lot of storage: many tiny object writes can add up to AWS bill shock quite easily. The example I use from my streaming days is if you had tens of thousands of streams and you wanted a relatively low recovery point objective for getting that data securely into S3, and you're writing each one every second. Not every millisecond, every second. It's modest. S3 will let you do that many writes per second. You would end up with a million dollar plus bill just for your PUTs, not even for the capacity, not for the instances you are running on, just for the PUTs.
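A sketch of the arithmetic behind that bill, with assumed numbers: 50,000 streams standing in for "tens of thousands", one PUT per stream per second, and an assumed $0.005 per 1,000 PUTs:

```python
# Back-of-envelope: what per-second flushes of many small streams cost
# in request fees alone. The stream count and PUT price are assumptions.

STREAMS = 50_000
PUT_PRICE_PER_1000 = 0.005     # assumed S3 Standard PUT pricing

puts_per_month = STREAMS * 86_400 * 30     # one PUT per stream per second
monthly_put_bill = puts_per_month / 1000 * PUT_PRICE_PER_1000
print(f"{puts_per_month:,} PUTs -> ${monthly_put_bill:,.0f}/month")
```

About $648,000 a month in request fees, before storing a single byte, with these assumptions.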

You can't optimize your way out of that. The architecture just has to avoid doing it. The other thing to avoid is a service that just transplants an on-prem product and runs it in a VM in the cloud with an EBS volume attached to it. Unless that's some super irreplaceable piece of software, the cost of a dedicated instance and attaching a large EBS volume to it is likely to wipe out any profit that you might make building and selling a cloud service, if you're doing that. That's great news for your cloud provider. They'll make money, you might not.

Architecture Trend

This builds into a trend in architectures for stateful systems. I'm starting from before cloud native here, but building up to where we are today. In the 2000s, what I would call a Gen X storage system, we had lots of hard disk drives. We were building huge clusters, things like Hadoop FS, things like Lustre, which I used to work on. It's not that you needed thousands of drives, it's just that they were so slow that you had to use that many to get the performance that you wanted. In the 2010s, we had SSDs. They're much faster. We also started to see people build hybrid systems that could run on-prem or in the cloud, and they would typically have integration with object storage as an option. You could tier to object storage for higher capacity, but your local drives were still your primary storage.

Today, the Gen Z storage systems are using object storage as their primary storage. Yes, they still use SSDs sometimes, but those are like a cache or a buffer, and your primary storage is S3, or your choice of object store. That's necessary to have a cost optimal design, which, of course, we all care about cost, but especially people building Software as a Service products, who are trying to make a profit on top of their infrastructure costs, care about cost even more. Here's Neon as an example of these architectural trends. On the left-hand side, we get SQL statements from a user. They're using us just like a normal Postgres database.

Those are getting translated into streaming writes when the user makes a change, which we write to a triplet of nodes with local NVMe storage. We are taking that egress cost, but only for the initial ingest of the data. It gets written onwards to our pageservers, which use NVMe drives as a cache. These are the two worthwhile uses of local drives in a modern architecture, as a write buffer, where you're willing to eat the cost in order to get lower latency, and as a cache where you're translating between small IOs that your application wants, and bigger IOs which you can make to S3. Of course, when I say IO, I mean a PUT or a GET.

Kubernetes (cloud native computing)

What does all this have to do with cloud native computing? I would use the more general definition of cloud native computing, which is, anytime you're building something to run in the cloud. It doesn't have to be on Kubernetes. The question will always come up like, do I run these services as part of my Kubernetes deployment? Should they only be platform services that live outside of it? Is it safe to run these in Kubernetes rather than bare metal? I am going to question whether one should use Kubernetes for services like these. Kubernetes has a default bias towards stateless workloads, so you have to make an extra effort to persuade it to do things like pinning a pod to a particular machine with a local drive. You have to make an extra effort to understand what a persistent volume claim is. That kind of thing. Your pods don't have drives by default.

Some of the operational aspects are also biased toward a stateless workload. Kubernetes is trigger happy when it comes to killing pods. For a typical application, if something fails a status check, it's a good idea to kill it. It's a stateless pod, it'll come right back up again. That's a good thing to do. For a storage system, like a clustered database or a file system, it is not smart to go around killing unresponsive nodes. You can have a cascading failure. What you ideally want to do if a node is unresponsive is to talk to your clustered storage system and say, would you please move some workload away from it? Your workload isn't pods, and that's where the real impedance mismatch comes in: a typical stateful system is a distributed system, usually, with some concept of workload of its own, whether that's a database or a Kafka partition, or whatever it is, while Kubernetes thinks in terms of pods.

For a storage system, you'll often just have one process or one pod per physical node or per drive. You're not scheduling huge numbers of pods. You're scheduling some other workload on top of a smaller number of pods. Kubernetes also has a runtime overhead, which I think doesn't get talked about enough. It's not that it's slow. It's a good, efficient system that makes good tradeoffs to provide the level of abstraction that it provides. There is an overhead. If you have an application that would otherwise saturate the network link or an application that would otherwise have very low tail latency, and you run it inside GKE, and outside GKE, you can measure the difference. In some cases, I've seen it go as high as 10%. That might not be a literal 10% loss of bandwidth, but it might be, you can only run it up to 90% of the capacity before your tail latency starts degrading because you've got other things running on the box that are competing for your CPUs.

Again, this is me coming from a background in lower-level systems, but this is absolutely something you should consider when running some stateful service on top of Kubernetes, because this costs you money. If you've got a million-dollar AWS bill, 10% is $100,000. That's potentially an extra member of staff. Measure it, and if it is going to cost you more to run inside Kubernetes, make an informed decision about whether that money is buying you something. Are you getting enough advantage from Kubernetes to justify the overhead?

Kubernetes has a system for running stateful services, it's called StatefulSet. It works. It does what it says on the tin. It lets you have the equivalent of a ReplicaSet, but with a tighter binding to some persistent volumes, where those persistent volumes might be the drives on physical nodes. It's not quite as smooth as you might hope. This is just a little screen grab from the official Kubernetes docs for StatefulSet. I think it's great that they have this level of detail in the documentation. I'm not trying to criticize this.

This is a legitimately hard problem that they have to solve. The docs warn that when using rolling updates with the default policy, it's possible to get into a broken state that requires manual intervention to repair. Why is that part of the docs rather than being some bug that was fixed years ago in Kubernetes? It's because it's difficult to map a system that cannot be torn down and put back up again, because you would lose data, into a pure declarative way of managing a system.

If you're doing rolling upgrades, if you have some newer nodes, some older nodes, maybe they have different formats of data, you can get into corner cases where the naive version of a rolling upgrade is hard to recover from. Upgrades are the hard part. A system running in a steady state is easy to operate, but a rolling upgrade is harder. Here's an example with three nodes. Let's imagine this is a replicated log across three nodes. Maybe it's running Paxos. Maybe it's running Raft: well-known consensus algorithms, and I'm restarting the top node. It's going to go offline for a bit. When it comes back, it's going to be behind. Assuming there have been some writes in the meantime, it's got a less than up to date copy of the data. Fine. My system is still online. I restart the node in the bottom left, and now I've got one node that has the latest data and another node that doesn't quite have the latest data.

Depending on the underlying consensus algorithm, this might put your system into an unwritable state, because you might need to have the latest data on two nodes in order to append to that latest data. Not the case for all systems, but the case for some systems, and not an unreasonable requirement for the underlying storage system to have. Why did we shoot forward to restart the second node before the first one had fully replicated? The restart of nodes in a StatefulSet is driven by readiness checks.
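The restart scenario above can be modeled in a few lines. This is a toy model of a majority-quorum rule in the spirit of Raft or Paxos, not an implementation of either:

```python
# Toy model of the restart ordering problem: a 3-node replicated log
# where an append needs the latest entry present on 2 nodes (a majority).
# A simplification of Raft/Paxos-style quorum rules, not a real system.

class Cluster:
    def __init__(self):
        self.lsn = {"a": 10, "b": 10, "c": 10}   # last entry on each node
        self.online = {"a", "b", "c"}

    def can_append(self):
        latest = max(self.lsn.values())
        up_to_date = [n for n in self.online if self.lsn[n] == latest]
        return len(up_to_date) >= 2

    def append(self):
        assert self.can_append()
        latest = max(self.lsn.values()) + 1
        for n in self.online:
            self.lsn[n] = latest

c = Cluster()
c.online.remove("a")     # restart node a
c.append()               # writes continue on b and c; a is now behind
c.online.add("a")        # a comes back, still at the old LSN
c.online.remove("b")     # restart b before a has caught up...
print(c.can_append())    # -> False: only c holds the latest entry
```

Restarting the second node before the first had fully caught up leaves only one online node with the latest data, so the toy cluster refuses further appends.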

The readiness checks that you might write are usually, can I serve an IO? Am I up? Can I accept user traffic? It's not, do I have the latest copy of data in a way that satisfies the requirements of the distributed system of which I am a part? Kubernetes doesn't have a hook for, "Hello. I'm a member of a distributed system. Please don't restart me until my cluster says it's ok." It can cause an availability blip if you haven't very carefully written a readiness check that does exactly the right thing.
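Here's a sketch of what a cluster-aware readiness check might look like. Both helper functions are hypothetical stand-ins for calls into your own storage system, not a real Kubernetes or Neon API:

```python
# Sketch of a readiness check that encodes cluster awareness, not just
# "can I serve an IO?". Both helpers are hypothetical stand-ins for
# calls into your own storage system's control plane.

def local_health_ok(node: dict) -> bool:
    # e.g. can open the data directory, can serve a read
    return node["io_ok"]

def cluster_says_safe(node: dict) -> bool:
    # Ask the cluster: is this node caught up, and would restarting it
    # still leave a quorum of peers holding the latest data?
    return node["caught_up"] and node["peers_with_latest"] >= 2

def ready(node: dict) -> bool:
    return local_health_ok(node) and cluster_says_safe(node)

# A node that serves IO but is still replaying the log is NOT ready:
print(ready({"io_ok": True, "caught_up": False, "peers_with_latest": 2}))
```

The point is the second check: readiness becomes a question the cluster answers, not just the individual pod.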

In the case of node replacement, it can be worse. I've seen a real-life incident in which there was data loss on a storage system that was running inside Kubernetes, where the cloud provider required a node replacement to take place. In order to stay on a supported version of Kubernetes, you had to replace the nodes. That included dropping whatever was on local storage on those nodes, which was the customer's data. It didn't do it all in one, it did a graceful rolling restart.

In this case, imagine if, when restarting the top node, we didn't come back with 80% of the data, we came back with an empty disk. Then we proceeded to restart the second node to an empty disk. We restart the third node to an empty disk. Great, you've just blown away all your data. That's not Kubernetes' fault. It's possible to configure it in a way that it doesn't do that, but you must be very careful. You must know exactly what you're doing.

None of the engineers involved in the incident that I'm thinking of were foolish people. Another thing to look for when building a solution that involves stateful services is to make sure the people who understand the distributed system within it are talking to the people who are deploying into Kubernetes. For folks who live and breathe Kubernetes, there can be a perception that the system you're deploying should just do the right thing. You should be allowed to kill nodes any time, because it's cattle, not pets. Yes, until you blow away all three copies of your data because you had to do a node replacement, because GKE forced you to.

One option to deal with these problems is to use an operator. It's a good idea. There are operators like Strimzi for Kafka, Rook for Ceph, and other file systems. They, more or less, solve this problem: someone else has gone out and figured out how to safely do a rolling restart of these systems in a way that is smarter and more tailored than a generic StatefulSet. If you're building your own system, you probably don't have an operator. You might have to build one. That's extra work. Kubernetes isn't serving you out of the box. Even if you're adopting a product that comes with an operator, that's one more thing for your SREs to learn. Their generic Kubernetes knowledge isn't going to teach them exactly how to use the operator for a particular stateful service. None of these concerns mean you can't use Kubernetes, but they should feed into your decision.

This is a summarized matrix of how I make the decision for my team whether we should use Kubernetes for storage. I say, we want multi-cloud portability. Kubernetes offers that. Great. Definite source of value. We want to go run a binary on a node somewhere. Kubernetes has a convenient way of doing that. Great. We want to schedule our tenants, which are our end user databases, onto our pageservers, which are our storage nodes. Kubernetes knows how to schedule pods onto physical nodes, which we don't need at all because we just have one pod per node.

Kubernetes knows how to run a StatefulSet, but we want to do our upgrades and our operations in a way which is aware of our workload, and does things like migrating things between nodes during an upgrade, which you can do with Kubernetes, but it takes extra effort. The bottom line for us, for our team, is that we don't have a strong motivation to adopt Kubernetes. We might do anyway eventually, if it provides sufficient operational convenience for our team, because we have a lot of other services that do run in Kubernetes. It's not something that we would rush because customer data is the most precious thing we have. It's something that we should think carefully about before doing. Neon loves Kubernetes. We use it for the vast majority of our services, but our storage services, not right now.

Object Storage in Practice

I talked about the trend toward object storage, and this actually really helps with running stateful services in a cloud native environment. If you are not using replication between local disks, you don't have that fraught moment of restarting the nodes in a replica group and making sure you don't lose any data. It's better if you can just write data to object storage. What does that look like in practice? That's what our pageserver does. Our incoming writes have a concurrency of about 10,000, so we have 10,000 streams of writes coming into one server. The entries in those streams are about 100 bytes big. We translate those to S3 PUTs, which are up to about 256 megabytes big, and we write perhaps between 1 and 10 objects per second.

On the read side, we're serving sub-millisecond reads based on what's on local storage, and promoting layers from S3 to local storage, on-demand. There is a whole discipline around caching that I'm glossing over here, and figuring out how to decide what sits on local disk and what's safe to offload to S3.
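A minimal sketch of that small-write-to-big-PUT translation; the buffer threshold and the `put_object` hook are illustrative, not the pageserver's actual interface:

```python
# Minimal sketch of the small-write -> big-PUT translation: buffer
# ~100-byte entries and flush one large object once a size threshold is
# reached. `put_object` is a stand-in for an S3 PUT.

class WriteBuffer:
    def __init__(self, flush_bytes=256 * 1024 * 1024, put_object=None):
        self.flush_bytes = flush_bytes
        self.put_object = put_object or (lambda data: None)
        self.pending = []
        self.pending_bytes = 0
        self.objects_written = 0

    def append(self, entry: bytes):
        self.pending.append(entry)
        self.pending_bytes += len(entry)
        if self.pending_bytes >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.pending:
            self.put_object(b"".join(self.pending))   # one big PUT
            self.objects_written += 1
            self.pending, self.pending_bytes = [], 0

buf = WriteBuffer(flush_bytes=1024)     # tiny threshold for the demo
for _ in range(20):
    buf.append(b"x" * 100)              # 100-byte log entries
print(buf.objects_written)              # -> 1: entries coalesced into one PUT
```

Twenty small writes became a single object write; at production scale the same shape turns 10,000 streams of 100-byte entries into a handful of multi-hundred-megabyte PUTs per second.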

Fundamentally, that's what it means to write a service like this. You can have an application that just consumes S3 directly, that's fine, especially for analytics workloads. If you're using something like DuckDB, which is a fantastic piece of software, you don't have to have an online transaction processing database in front of your S3. If your application does require that, and almost all businesses have at least some applications that do require a fast OLTP database like Postgres, and you want the cost efficiency of S3 storage, you end up wanting some component like this. My claim is that, this isn't just what we happen to do. This is a reusable pattern that I see more products using in order to deliver cost effective stateful services in the cloud.

In this model, local disk is just a cache, and that avoids all of the sleepless nights that come with orchestrating a restart of a bunch of nodes. Of course, it also keeps you on your cloud native train of treating nodes as cattle rather than pets. It's not as simple as that. The story doesn't end there. If you're building a system like this, or again, buying a system like this, you have to bear in mind that while S3 might be practically infinite, the nodes you're using as caches aren't.

In between the user and S3, I have these physical pageservers, and they are real-life computers with real-life finite sizes, so you need some mechanism to distribute that caching work across multiple physical nodes in order to let the user take advantage of the capacity they expect. This isn't just for systems that are explicitly built for very large capacities. This is really a table stakes thing. Your users are used to S3. They're used to being able to store as much data as they want.

If you have a size limit, it's perceived as a bug, so you have to solve that problem. The other problem you have to solve to meet expectations is that while your node's disk might be disposable and ephemeral, because it's just a cache, a sufficiently cold start is perceived by users as an outage, and rightly so. It is a de facto outage if they have tens of seconds or a minute or something like that, where they can't read their data. Avoiding a cold start on a system that keeps an 8-terabyte disk full of cached data requires more than just writing reasonably fast code. You have to design it into your system. You have to have some concept of keeping a secondary location warm so that if something happens to one of your nodes, you can point your users at the other one.

Case Study: Neon Pageserver

Again, to use us as a case study, we started out with a prototype pageserver, which I would very much call a pet. It was principally a local storage system that knew how to store this Postgres-aware data structure, which is a little bit like a modified LSM tree with history. It tiered to S3 but the S3 was an afterthought. That was my 2010s storage system example. One tenant lived on exactly one pageserver, which meant that large tenants were out of luck if they wanted to scale beyond a certain point. What we had to do to get this ready for production, for a generally available product, is turn it into cattle, which means S3 is the primary storage. Local disk is just a cache. Node death is handled by cutting over to a warm secondary cache. This is not as hard as building a true replicated distributed system, because you don't have any strict consistency guarantee requirements.

You just have to have another node that has approximately the same set of objects from S3 cached on local disk. We share the data from one tenant across many of these cache nodes. That's not just necessary in the case of super large tenants, which are bigger than a single physical disk, it's also necessary to balance load across machines. Users expect to get more than their fair share of hardware. They expect bursting. If I actually divided the number of IOPS of all my disks by my users, they would have a very disappointing level of performance. It's essential that they can burst. In order to enable that, we have to spread the users out in a somewhat statistically uniform way, and that means sharing out the data. While S3 is one big monolithic store, the way we provide a gateway to it isn't.
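One way to sketch that sharing: hash each (tenant, shard) pair over the node list, so a single large tenant lands on several physical nodes. The placement scheme here is illustrative, not Neon's actual logic:

```python
# Sketch of spreading one tenant's data across many cache nodes: shards
# are placed by hashing (tenant, shard index) over the node list. This
# scheme is illustrative, not Neon's actual placement logic.

import hashlib

NODES = [f"pageserver-{i}" for i in range(8)]

def place(tenant: str, shard: int, nodes=NODES) -> str:
    digest = hashlib.sha256(f"{tenant}:{shard}".encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

# One large tenant, 16 shards: its data (and its burst IOPS) lands on
# several different physical nodes instead of a single one.
placements = {s: place("tenant-a", s) for s in range(16)}
print(len(set(placements.values())) > 1)
```

A real system would layer rebalancing and failure handling on top, but the core idea is the same: placement is deterministic and spreads both capacity and burst load.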

The same thing really has to be true of any system which implements this model. It's the right model to get the right level of storage efficiency from a cost point of view, but you have to look for answers to the questions: how do you handle a node failure? What happens when a node goes offline? Do I have a cold cache? If I have a cold cache, how long does it take to refill it? Is it minutes? Is it hours? If you're trying to refill a modern drive, you're probably talking about hours, because although drives are very fast, it still takes a long time to refill them. You should also ask, what happens as I scale? In theory, my storage is infinite because it goes in S3, but is it really scalable once you factor in whatever frontend service you're using on top of S3?
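The refill-time arithmetic is easy to check. Assuming an 8-terabyte cache and a sustained 1 GB/s restore rate from S3 (an assumption; real rates depend on GET throughput and network):

```python
# Back-of-envelope: how long to rewarm an empty cache node. The restore
# rate is an assumption (bounded by S3 GET throughput and the network).

cache_tb = 8
restore_gb_per_s = 1.0    # assumed sustained refill rate from S3

seconds = cache_tb * 1000 / restore_gb_per_s
print(f"{seconds / 3600:.1f} hours to refill {cache_tb} TB")
```

Even at a healthy gigabyte per second, an 8 TB cache takes over two hours to rewarm, which is why a warm secondary has to be designed in rather than relying on on-demand refill.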

S3 is also shared storage, which is a term that has almost fallen out of use. In the days of storage area networks and physical data centers, we had what we call dual ported hard drives, where you really could have two servers writing to the same disk at the same time. You had all kinds of interesting mechanisms for solving that problem and avoiding corruption if two wrote to the same piece of storage at the same time. We forgot about this for a while, because in cloud native environments, we typically had local disk or we were using some higher-level managed storage service. The problem comes back when you build a scalable system on top of S3.

Briefly, the multi-writer problem typically occurs where you have a node fail, but only fail from the point of view of the rest of your system. Let's say it's failing at status checks. You've scheduled a replacement node for it, but this zombie pod (I'm saying node, but it's synonymous with pod, if you imagine one pod per node in a storage context) is still physically running. Nothing has cut it off from the network. Again, in traditional infrastructure, you might have what's called a fencing mechanism to cut off a node from the network or cut its power, something like that. You don't have that in Kubernetes. You often don't have that in the cloud in general.

This failed node can rise from the dead. Let's say it comes out of some pathological driver bug that made it pause for 20 seconds, long enough to get a replacement, comes back and does a write to S3. If these two nodes were responsible for keeping the same index object up to date in S3, this could corrupt your data, or at the very least make it unavailable while you unpacked all this and figured it out. Again, real example. I've seen this happen in a production system where somebody had made an over-optimistic assumption that if a node wasn't responding to heartbeats, that meant it was definitely dead.

Fortunately, there's a pretty general solution, which is to cheat and rely on there being some other component in your system that can hand out a monotonic number, which you might call a term or a generation number, and you include it in your object names. In Neon, for instance, you might have an index.23 and an index.24, and when our software comes up and figures out which one is true, we just take the one with the highest number.

I'm glossing over a lot of detail in the arguments for why this is correct, but the key points are that it's necessary. If somebody is presenting you a design that has failover among nodes that write to S3, and they're not giving you an answer to how they solve this problem, you should probably probe that. Secondly, you don't have to be a full-on distributed system to do this. I've seen this done in systems that had Raft or an etcd equivalent built into them, and they would use that for generating the numbers.

I've also seen it done, and actually the way we do it at Neon is with a tiny relational database sitting off to the side, and the ACID semantics of that database are sufficient to hand out the generation numbers that we use to make multiple writers safe. We don't usually aim to have multiple nodes writing to the same object at the same time, but at scale we have to solve that problem, because a one-in-a-million race is not rare when you have hundreds of thousands of tenants.
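As a minimal sketch of that approach, here is a generation counter backed by a small relational database. SQLite stands in for the database, and the table and column names are assumptions for illustration, not Neon's actual schema; the transactional increment-and-read is what guarantees no two nodes are ever handed the same generation.

```python
import sqlite3

# A tiny relational database sitting off to the side.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenant_generation (tenant_id TEXT PRIMARY KEY, gen INTEGER NOT NULL)"
)
conn.execute("INSERT INTO tenant_generation VALUES ('tenant-a', 0)")
conn.commit()

def acquire_generation(conn, tenant_id):
    """Atomically increment and return this tenant's generation number.
    The ACID transaction ensures every caller sees a distinct value."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "UPDATE tenant_generation SET gen = gen + 1 WHERE tenant_id = ?",
            (tenant_id,),
        )
        (gen,) = conn.execute(
            "SELECT gen FROM tenant_generation WHERE tenant_id = ?",
            (tenant_id,),
        ).fetchone()
    return gen

print(acquire_generation(conn, "tenant-a"))  # 1
print(acquire_generation(conn, "tenant-a"))  # 2
```

A node that restarts or takes over a tenant acquires a fresh, strictly higher generation before it writes, which is what makes a stale zombie's writes distinguishable from the current owner's.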

What's Inside the Objects?

I've talked about objects in a very general sense, so it's worth talking about what's actually in those objects. You might start out with systems that write JSON or some other format. Parquet and Arrow are columnar data formats which do columnar compression. They're very efficient. They're very good. They're primarily used in OLAP, analytics processing, at the moment, but that doesn't mean they don't have applications elsewhere. If you're building a storage system that has a lot of objects containing your data, and you have indices that point to that data, that index could probably be a Parquet file. We're increasingly seeing use of Parquet-like formats outside of the analytics space, outside of the data science space.

You don't have to be writing a data lake for this to be the right choice of format for your data. Iceberg is a newer technology that is typically used on top of Parquet or another similar format, and it basically provides an indexing layer over multiple files that you can query using traditional SQL. Again, it's an OLAP technology. It's not a replacement for a traditional OLTP database, but it's something that you should think about. If you have a developer on your team who's saying, I've got this great idea: if we're going to put our data in S3, we're going to have an index, it's worth saying, have you thought about using Iceberg? Because it's a pretty neat, packaged, generic way of doing that stuff.

Which Storage Technology?

For your technology choice, you should be looking at using object storage as your primary storage. Or if you're analyzing a product that you're thinking about using, you should be asking how it uses object storage. It's generally the right thing for cost. We talked about egress fees across AZs, and remember, Amazon doesn't charge itself those egress fees. Building something yourself that does replication across AZs is generally going to be less cost effective than using a service that your cloud provider provides.

In the runtime environment, think carefully before adopting StatefulSets. Be careful how you write a readiness check. Test it. I'm not claiming that there is a really compelling alternative to Kubernetes other than bare metal. Running on bare metal to begin with, is a good conservative starting point. I would take the guidelines about places to look for risk in Kubernetes as steps on your journey to moving your stateful services into Kubernetes, rather than as a reason not to do it.

Questions and Answers

Participant 1: If I understood the product of Neon correctly, you're running Postgres with an S3 backend. Why would I do that? What's the reason? What problem does it solve?

Spray: The reason for using a disaggregated storage backend, whether it's S3 or something else, is so that you don't have to decide at the time you provision your database how big it's going to be. When somebody creates a database on our platform, which takes under a second, we're spinning up a tiny VM which has access to a big shared pool of storage. As they write more data into that, there is no step at which they have to say, I want to switch instance types or I want to attach a different disk to it to accommodate more storage. It's about making it dynamically scale and providing a serverless experience to the user. We have a pretty fast cold start on the compute side as well.

By default, we'll shut off the Postgres database after about 5 minutes. You can disable that if you don't want it. Then, again, we'll do a sub-second cold start to Postgres later. A big part of enabling that is that you don't have a local disk cache that has to be warmed up on Postgres. It's all sat there as part of our storage backend. It also enables us to do things that Postgres can't. We have a full history of all the transactions that somebody writes within a set time window, such as 30 days. If you drop a table that you regret dropping, you don't have to roll back to last night's snapshot. You can go back to precisely the point before you did that drop and recover to that point. That could be built inside of Postgres as a feature.

The reason for building it in our backend is so that we can couple it with the serverless experience and the autoscaling that comes with that. Again, our backend services are open source as well.

Participant 2: Is the cost saving primarily from the fact that you have the middle layer, the NVMe disk, batching up large numbers of tiny writes?

Spray: That's the necessary element that you'll see in most modern products like this. They act as a conversion layer between the low latency and fine granularity that the user needs, and the much coarser granularity that we need to send into S3 to have a cost-effective storage story. It's a little bit like, to use a more traditional example, you used to have systems that would buffer things up before issuing a big write to a hard disk drive. You might have a disk controller that has a battery-backed cache, builds up a load of stuff, and writes it to a slower but cheaper piece of storage.
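The batching idea can be sketched as a simple write buffer: many fine-grained writes are absorbed locally and flushed as one coarse object. This is an illustrative toy, not Neon's actual ingest path; the threshold and the upload callback are assumptions.

```python
class WriteBuffer:
    """Absorb many small low-latency writes, then flush them as one
    coarse object, the way a local NVMe layer batches writes before
    issuing a large PUT to S3."""

    def __init__(self, flush_threshold, upload):
        self.flush_threshold = flush_threshold  # bytes per coarse object
        self.upload = upload                    # callable taking bytes
        self.buf = bytearray()

    def write(self, record: bytes):
        self.buf.extend(record)
        if len(self.buf) >= self.flush_threshold:
            # One large upload instead of many tiny ones.
            self.upload(bytes(self.buf))
            self.buf.clear()

uploads = []
wb = WriteBuffer(flush_threshold=8, upload=uploads.append)
for _ in range(5):
    wb.write(b"abc")   # five tiny writes...
print(len(uploads))    # ...trigger just one coarse upload so far
```

The cost argument follows directly: S3 charges per request and performs best with large objects, so converting thousands of tiny writes into a handful of coarse PUTs is where the savings come from.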

Participant 3: When you write to Neon, is the operation complete when the data is in the Safekeeper, which would mean that on a read we need to merge data from the Safekeeper and from the pageserver? Or is the write complete when it reaches the pageserver, or basically S3? If the latter, it means that the latency is basically no less than the latency of S3.

Spray: That's exactly right. If we waited for the object to hit S3 then we wouldn't be providing an improvement to the user. The ACK to the user, the completion of the transaction happens when the data is on two out of three of these servers. That's a sub-millisecond latency. It's the time it takes to go to one Safekeeper, hop to another AZ, and hit two NVMe writes, and then for the ACK to make it back to the Postgres.
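That quorum also determines the latency profile: a write commits at the speed of the second-fastest safekeeper, not the slowest. A hedged sketch, with purely illustrative latency figures:

```python
def quorum_commit_latency(node_latencies_ms, quorum=2):
    """With a quorum of k out of n replicas, a write is acknowledged
    once the k-th fastest replica has made it durable, so commit
    latency is the k-th smallest per-node latency."""
    return sorted(node_latencies_ms)[quorum - 1]

# One safekeeper in the local AZ, one a hop away, one momentarily slow:
print(quorum_commit_latency([0.3, 0.8, 5.0]))  # 0.8, i.e. sub-millisecond
```

This is why a temporarily slow or failed third replica doesn't stall the write path: the ACK only ever waits for the majority.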

How does a read work if we're ACKing as soon as it's hit here, but we're not waiting for it to hit the part where we serve reads from?

The answer to that lies in Postgres, which has its own page cache. Even if you're running vanilla Postgres on a local hard drive, it has the concept of a cache of recently written pages, and we benefit from that as well. The latency for something to be ingested from the Safekeeper to the pageserver is something like low seconds usually.

In principle, it's sub-second, but we don't monitor for that. It's like a few seconds. Within that few seconds, it's highly likely that that page that was just written is still in memory on Postgres. It's very rare for our storage backend to see a read of a page which was just written a few milliseconds ago. That's just not how Postgres behaves. That's something I'm very grateful for as a storage person. It provides a great deal of smoothing to the user experience that we have the excellent caching that's built into standard open-source Postgres.

Participant 4: Do you think there's a gap in the market for a stateful based orchestration model or bare metal is just fine?

Spray: I've asked myself, what would good look like? What tool would I want? I don't have a great answer for that. I think you absolutely could write something that would be perfect for systems like this. I think it would struggle commercially. I'm not sure I see a market for that, given that most products like these have already built what they need. I think it would be super interesting to have an opinionated Kubernetes derivative that knew how to provision onto bare metal and had an even more conservative version of StatefulSet in a canned way. A lot of the issues that I see with teams adopting Kubernetes for stateful services are not because it fundamentally can't do it, but because you have to know how to configure it just the right way.

It could be that just an opinionated version of Kubernetes combined with a performance-oriented way of provisioning would be a super useful tool. The other thing that I would love is a more generic scheduler. Within our product, we have a scheduler which is somewhat similar to what is going on inside Kubernetes, but we're not scheduling pods onto nodes. We're scheduling tenants onto pageservers. I think that maybe there is scope for a library-ized version of a high-quality scheduler with things like soft constraints. Clearly, the code exists. I'm not saying no one's ever written that, but making it valuable across all of these different projects doing similar things is the challenge.

Participant 5: Regarding the problems with the StatefulSet, could this be solved by using other storage engines instead of local storage? For example, with NVMe over Fabrics, or the older iSCSI, it can be very fast, even on latency. Storage would then not be local storage, and that removes the problem if you change the node.

Spray: The reason that I don't recommend that is cost. If you're building your own infrastructure, if you're on bare metal in your own data center, then that can be a good way of doing it. If you're paying list price on a mainstream cloud provider, then you hit this EBS number. If you're using a replicated system, you potentially end up paying twice, because there's built in replication in your shared storage backend, but also you're running multiple replicas.

Participant 5: Also, there are some projects for storage on Kubernetes, such as Longhorn, which is planned to be used in SUSE and Red Hat solutions, so OpenShift and currently Harvester. I think it can also be very interesting to look at, because it's about sharing local storage and doing replication across Kubernetes nodes. If a node goes out, the operator, I think it's an operator, replicates the data onto other local storage. Also, for Postgres, there is a tool called CloudNativePG, which creates its own resource. It's not a StatefulSet; it uses custom resources for the purpose of handling database problems.

Spray: You've just called out a couple of really good projects. The buzzword for sharing local storage and doing that under the hood is hyperconverged infrastructure. Ten and a bit years ago, that's the problem I was working on solving with Ceph. It's a super hard problem to get right, to figure out how to share the resources on a node between what's running on that node and the work you want to do for your peers. I don't want to labor this too much because it's a very opinionated take.

There are financial aspects to it as well, in that you can buy storage more cheaply per terabyte from somebody who sells you a storage solution. EMC doesn't pay the same for drives as we pay for drives. Amazon doesn't pay the same for drives as we pay for drives. If you build this type of storage infrastructure out of retail servers, it's hard to make it cost competitive. That's not a disadvantage of the software projects you're pointing out. It's more just an observation that I see limited adoption of that at scale for financial reasons.

 


 

Recorded at:

Oct 17, 2024
