Transcript
Kreger: I want to get started with a true story of something that happened to me. Back in 2017, my manager messaged me and said, can you support an event in the Midwest? I responded as any engineer would, what, when, where? He responded with, the International Collegiate Programming Contest in Rapid City, South Dakota in two weeks.
I groaned, as most people do, because two weeks' notice on travel is painful. I said, ok, I'm booking. Two weeks later, I landed in Rapid City, two days early. Our hosts at the School of Mines, who were running the International Collegiate Programming Contest, wanted us to meet each other, so they actually asked us to be there two days early. They served us home-style cooking in a conference hall for like 30 people. It was actually awesome, a great way to meet people. I ended up sitting at a random table. As a conversation starter, I asked the obvious non-professor to my right, what do you do? He responded that he worked in a data center in Austin, which immediately told me he was an IBM employee as well.
Then he continued to talk about what he did. He said that he managed development clusters of 600 to 2000 bare metal servers. At which point I cringed, because I had a concept of the scale and the pain involved. Then he added the bit that these clusters were basically being redeployed at least every two weeks. I cringed more. The way he was talking, you could tell he was just not happy with his job. It was really coming through.
It was an opportunity to have that human connection where you're learning about someone and gaining more insight. He shared how he'd been working 60-hour weeks. He lamented how his girlfriend was unhappy with him because they weren't spending time together, and all those fun things. All rooted in having to deploy these servers with thumb drives. Because it would take two weeks to deploy a cluster, and then the cluster would have to get rebuilt.
Then he shifted gears; he realized that he was making me uncomfortable. He started talking about a toolkit he had recently found, one that allowed him to deploy clusters in hours once he populated all the details required. Now his girlfriend was no longer upset with him, he was actually happy, and his life was actually better. Then he asked me what I did for IBM. As a hint, it was basically university staff, volunteers, and IBM employees at this gathering.
I explained I worked on open source communities, and that I did systems automation. I had been in the place with the racks of hardware, and I knew the pain of bare metal. A few moments later, his body language suddenly shifted to pure elation. He was happy beyond belief. The smile on his face just blew me away, because he had realized something. He was sitting next to the author of the toolkit that made his life better.
Personally, as someone who made another person's life better and got to hear that story firsthand, the look on his face will always be in my mind. It will always be an inspiration for me, which is partly the reason I'm here, because I believe in automating these aspects to ensure that we don't all feel the same pain. We don't have to spread it around.
Context
The obvious question is, who am I? My name is Julia Kreger. I'm a Senior Principal Software Engineer at Red Hat. I'm also the chair of the board of directors at The OpenInfra Foundation. Over the past 10 years now, I've been working on automating the deployment of physical bare metal machines in varying use cases. That technology is used in Red Hat OpenStack and Red Hat OpenShift to enable deployment in our customers' use cases.
It's actually really neat technology and provides many different options. It is what you make of it, and how much you leverage it. I'm going to talk about why bare metal, and the trends driving the market at this point. Then the shift in computing technology that's underway. Then I'm going to talk about three tools you can use.
Why Bare Metal, in the Age of Cloud?
We're in the age of cloud. I don't think that's disputed at this point. Why bare metal? The reality is, the cloud has been in existence for 17 years, at least public cloud as we know it today. When you start thinking about what the cloud is, it's existing technologies, with abstractions and innovations which help make new technologies, all in response to market demand, on other people's computers.
Why was it a hit? It increased our flexibility. We got self-service, on demand. We weren't ordering racks of servers, waiting months to get the servers, and then having to do the setup of the servers anymore. This enabled businesses to shift from a Cap-Ex operating model to an Op-Ex model. How many people actually understand what Cap-Ex and Op-Ex are? They are shorthand for capital expense and operational expense.
Capital expense is an asset that you have, that you will maintain on your books in accounting. Your auditors will want to see it occasionally. You may have to pay taxes on it. Basically, you're dealing with depreciation of value. At some point, you may be able to sell it and regain some of that value or may not be able to depending on market conditions. Whereas Op-Ex is really the operational expense of running a business. Employees are operational expenses, although there are some additional categories there of things like benefits.
In thinking about it, I loaded up Google Ngrams, just to model it mentally, because I've noticed this shift over time. One of the things I noticed, looking at the graph of the data through 2019, which unfortunately is all that's loaded in Google Ngrams right now, is that we can see delayed spikes from the various booms in the marketplace and shifts in the market response, where businesses are no longer focusing on capital expenditures, which I thought was really interesting, actually.
Not everyone can go to the cloud; some businesses are oriented for capital expenses. They have done it for 100 years. They know exactly how to do it, and how to keep it in such a way that it's not painful for them. One of the driving forces for keeping things local, on-prem or in dedicated data centers, is that you might have security requirements. For example, you may have a fence and the data may never pass that fence line. Or you may not be able to allow visitors into your facility because of high security requirements. Another aspect is governance.
You may have rules and regulations which apply to you, and that may be just legal contracts with your vendors or customers that prevent you from going to a cloud provider. Sovereignty is a topic which is interesting, I think. It's also one of the driving forces in running your own data center, having bare metal. You may do additional cloud orchestration technologies on top of that, but you still have a reason not to trust another provider, or not to let data leave the country. That's a driving reason for many organizations. Then latency. If you're doing high-performance computing, models of fluid dynamics, you can't really tolerate the latency of a node that might have a noisy neighbor. You need to be able to reproduce your experiment repeatedly, so, obviously, you're probably going to run your own data center if you're doing that sort of calculation.
The motives in the marketplace continue to shift. I went ahead and just continued to play with Google Ngrams. I searched for gig economy and economic bubble. Not that surprising: gig economy is going through the roof because the economy we have is changing. At the same time, economic bubble is starting to grow as a concern. You can actually see a slight uptick in the graph, which made me smirk a little bit.
I was like, change equals uncertainty. Then, I searched for data sovereignty, and I saw almost an inverse mirror of some of the graphing that I was seeing with Cap-Ex and Op-Ex. Then I went for self-driving and edge computing. Because if you can't fit everything onto the self-driving car, and you need to go talk to another system, you obviously want it to be an edge system that's nearby, because you need low latency.
Because if you need to depress the brakes, you have 30 milliseconds to make that decision. There are some interesting drivers visible in the literature published over the last 10 years, where we can see some of these shifts taking place.
The Shift in Computing Technology Underway Today
There are more shifts occurring in the marketplace. One of the highlights, I feel, that I want to make sure everyone's mentally on the same page for, is that evolution is constant in computing. Computers are getting more complex every single day. Some vendors are adding new processor features. Some vendors are adding specialized networking chips. Yet a computer in a data center hasn't changed that much over the years. You functionally have a box with a network cable for management, a network cable for the actual data path, and applications with an operating system.
It really hasn't changed. Except, it is now becoming less expensive to use purpose-built, dedicated hardware for domain-specific problems. We've been seeing this with things like GPUs and FPGAs, where you can write a program, run it on that device, and have some of your workload calculate or process there to solve specific problems in that domain. We're also seeing a shift toward diversifying architectures, except this can also complicate matters.
An example is that an ARM system might look functionally the same until you add a GPU with x86-based firmware, and all of a sudden, your ARM cores and your firmware are trying to figure out how to initialize that. The secret, apparently, is they launch a VM quietly in the substrate that the OS can't see. There's another Linux system running alongside the one you're actually running on that ARM system, initializing the card.
There's also an emerging trend, which is data processing units, or infrastructure processing units. The prior speaker spoke of network processing units, and ASICs that can be programmed for these same sorts of tasks. Except in this case, these systems are much more generalized. I want to model a network card mentally. We have a PCIe bus. We have a network card as a generic box. It's an ASIC; we don't really think about it.
The network cable goes out the back. The devices we're starting to see in servers, which can be added for relatively low cost, may have an AI accelerator ASIC. They may have FPGAs. They may have additional networking ASICs for programmable functions. They have their own operating system that you can potentially log in to, with applications you may want to put on that card, and a baseboard management controller just like the host. Yes, we are creating computers inside of computers.
Because now we have applications running on the main host, using a GPU, with a full operating system. We have this DPU or IPU plugged into the host, presenting PCIe devices such as network ports to the main host operating system. Meanwhile, the actual workload and operating system can't see into the card, nor have any awareness of what's going on there, because the card is being programmed individually and running individual workloads. It gets even more complicated, because now you need two management network connections per host, at least. That is, unless the vendor supports the inbound access standards, which are a thing, but it's going to take time.
To paint a complete picture, I should briefly talk about the use cases for which these devices are being modeled. One concept that is popular right now is to use these devices for load balancing, or request routing, and not necessarily a web server load balancer; it could just be a database connection load balancer, or database sharding. It could actually decide, I am not the node you need to talk to, I'll redirect your connection.
At which point, the actual underlying host that the card's plugged into and receiving power from never sees an interrupt from the transaction. It's all transparent to it. Another use case that is popular is as a security isolation layer, so you run something like an AI-enabled firewall on the card and have an untrusted workload on the machine. Then there's also second-stage processing. You may have a card or port where you're taking data in, and you may be dropping 90% of it. Then you're sending it to the main host if it's pertinent and makes sense to process.
This is mentally very similar to what some of the large data processing operations do, where they have an initial pass filter, and they're dropping 90% of the data they're getting because they don't need to act upon it. It's not relevant, and it's not statistically useful for them. To what does get through, they then apply additional filtering, and they only end up with like 1% of that as useful data. This could also be the case with cell networks.
You could actually have a radio on the card, and the OS only sees Ethernet packets coming off this card as if it's a network port. The OS is none the wiser to it. These hidden computers that we're now creating in our infrastructures, and I can guarantee these cards exist in deployed servers in this city today, need care and attention as well. There are efforts underway to standardize the interfaces and some of the approaches and modeling for management. Those are taking place in the OPI project. If you're interested, there's the link, https://opiproject.org.
Automation is important because, what if there's a bug that's been found inside of these units, and there's a security isolation layer you can't program from the base host? Think about it for a moment, how am I going to update these cards? I can't touch them. I'll go back to my story for a moment. That engineer had the first generation of some of these cards in his machines.
He had to remove the card, plug it into a special card, and put a USB drive into it to flash the firmware and update the operating system. To him, that was extremely painful, but it was few and far between that he had to do it. What we're starting to see is the enablement of remote orchestration of these devices through their management ports, and through the network itself. Because, in many cases, people are more willing to trust the network than they are willing to trust an untrusted workload that may compromise the entire host.
Really, what I'm trying to get at is that automation is necessary to bring any sanity at any scale to these devices. We can't treat these cards as one-offs, especially because they draw power from the base host. If you shut down the host, the card shuts down. The card needs to be fully online for the operating system to actually be able to boot.
Tools You Can Use (Ironic, Bifrost, Metal3)
There are some tools you can use. I'm going to talk about three different tools. Two of them actually build on the first. What I'm going to talk about first is the Ironic project. It's probably the most complex and feature-rich of the three. Then I'll talk about Bifrost, and then Metal3. Ironic was started as a scalable Bare Metal as a Service platform in OpenStack in 2012. Mentally, it applies a state machine to data center operations, and the workflows that are involved, to help enable the management of those machines. If you think about it, if you wheel racks of servers into a data center, you're going to follow a certain workflow.
Some of it's going to be dictated by business process, and some of it is going to be dictated by what's the next logical step in the intake process. We have operationalized as much of that into a service as possible over the years. One can use a REST API to interact with the service and its backend conductors. There's a ton of functionality there. Realistically, we can install Ironic without OpenStack. We can install it on its own. It supports management protocols like DMTF Redfish and IPMI. It supports flavors of the iLO protocol, the iRMC interface, Dell iDRAC, and has a stable driver interface that vendors can extend if they so desire.
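To give a rough idea of what standalone use can look like, here is a hypothetical enrollment file for a single Redfish-managed node, of the kind that can be handed to Ironic's baremetal client. Every name, address, and credential below is a placeholder, and the exact driver_info fields depend on the driver and the Ironic release you run.

```yaml
# Hypothetical standalone enrollment file for one Redfish-managed node.
# All values are placeholders; adjust driver_info to match your BMC.
nodes:
  - name: example-node-0
    driver: redfish
    driver_info:
      redfish_address: https://192.0.2.10
      redfish_system_id: /redfish/v1/Systems/1
      redfish_username: admin
      redfish_password: example-password
    resource_class: baremetal
    ports:
      - address: "52:54:00:aa:bb:01"   # the node's provisioning NIC MAC
```

With a standalone Ironic running, a file like this can be enrolled through the client, and the node is then driven through the state machine described below.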
One of the things we see a lot of is the use of virtual media to enable deployments of these machines in edge use cases. Think of a cell tower on a pole, as a single machine, where the radio is on one card, and we connect into the BMC and assert a new operating system. One of the other things that we do as a service is ensure the machine is in a clean state prior to redeployment, because the whole model of this is lifecycle management. It's not just deployment; it's enabling reuse of the machine.
This is the Ironic State Machine diagram. This is all the state transitions that Ironic is aware of. Only those operating Ironic really need to have a concept of this. We do have documentation, but it's quite a bit.
Then there's Bifrost, which happened to be the tool that the engineer I sat next to in Rapid City had stumbled upon. The concept here was, I want to deploy a data center with a laptop. It leverages Ansible, with an inventory module and playbooks, to drive a deployment of Ironic, and to drive Ironic through command sequences to perform deployments of machines. Because it's written, basically, in Ansible, it's highly customizable.
For example, I might have an inventory payload. This is YAML. As an example, the first node is named node0. We're relying on some of the system's defaults here. Basically, we're saying, here's where to find it. Here's the MAC address, so that we know the machine is the correct machine and we don't accidentally destroy someone else's machine. We're also telling it what driver to use. Then we have this node0-subnode0 defined in this configuration with what's called a host group label.
There's a feature in Bifrost that allows us to narrow down what the execution runs against. When the inventory is processed, it can apply additional labels to each node, as you may request. If I need to deploy only subnodes, or perform certain actions on subnodes, say, apply certain software to these IPU or DPU devices, then you can do that as a subnode in this configuration. It's probably worth noting that Ironic has work in progress to provide a more formalized model of DPU management. It's going to take time to actually get there. We just cut the first release of it, actually. Again, because these IPUs and DPUs generally run on ARM processors, in this example, we provide a specific RAM disk and image to write to the block device storage of the IPU.
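To make that concrete, a Bifrost inventory along those lines might look something like this. It is only a sketch: the hostnames, MAC addresses, BMC details, image URL, and the subnodes host group name are invented for illustration, and the exact schema depends on the Bifrost version in use.

```yaml
# Hypothetical Bifrost YAML inventory: one host plus a DPU/IPU subnode.
# Every address, credential, and image URL below is a placeholder.
node0:
  name: node0
  driver: redfish
  driver_info:
    power:
      redfish_address: https://192.0.2.10
      redfish_username: admin
      redfish_password: example-password
  nics:
    - mac: "52:54:00:aa:bb:01"    # confirms we deploy the machine we intend to
  ipv4_address: 192.0.2.110

node0-subnode0:
  name: node0-subnode0
  host_groups:
    - subnodes                    # host group label so plays can target only subnodes
  driver: redfish
  driver_info:
    power:
      redfish_address: https://192.0.2.11
      redfish_username: admin
      redfish_password: example-password
  nics:
    - mac: "52:54:00:aa:bb:02"
  instance_info:
    image_source: http://images.example.com/aarch64-disk.qcow2   # ARM image written to the IPU's storage
  ipv4_address: 192.0.2.111
```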
Then we can execute a playbook. This is a little bit of a sample playbook. The idea here is we have two steps. Both nodes in that inventory fall into the baremetal group, in this case. When it goes to process these two roles, it will first generate configuration drives, which are metadata that gets injected into the machine so that the machine can boot up and know where it's coming from, where it's going to, and so on. You can inject things like SSH keys, or credentials, or whatever. Then, after it does that first role, it will go ahead and perform a deployment. It's using variables that are expected in the model, shared between the two roles, to populate the fields and then perform the deployment with the API. Then there's also the subnode here, where, because we defined that host group label, we are able to execute directly upon that subnode.
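A playbook in that spirit might look roughly like the following. The bifrost-configdrives-dynamic and bifrost-deploy-nodes-dynamic role names come from upstream Bifrost; the second play, targeting a hypothetical subnodes host group, is invented here purely to illustrate scoping work to the DPU/IPU subnodes.

```yaml
# A minimal sketch of a Bifrost deployment playbook: build config drives,
# deploy the machines, then run extra steps only against the DPU/IPU subnodes.
---
- name: Deploy machines enrolled in the dynamic inventory
  hosts: baremetal              # group populated by Bifrost's dynamic inventory
  connection: local
  gather_facts: false
  roles:
    - role: bifrost-configdrives-dynamic   # generate metadata (SSH keys, network config)
    - role: bifrost-deploy-nodes-dynamic   # drive Ironic's API to image each node

- name: Apply DPU/IPU-specific configuration
  hosts: subnodes               # the host group label defined in the inventory
  connection: local
  gather_facts: false
  tasks:
    - name: Placeholder for card-specific software or firmware steps
      ansible.builtin.debug:
        msg: "Run IPU/DPU provisioning tasks here"
```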
Then there's Metal3. Metal3 is deployed on Kubernetes clusters and houses a local Ironic instance. It is able to translate Cluster API updates, and bare metal host custom resource updates, into provisioning new bare metal nodes. You're able to do things like set BIOS and RAID settings, and deploy an operating system with this configuration. You can't really customize it unless you want to go edit all the code that makes up the bare metal operator in Metal3.
This is what the payload looks like. This is a custom resource update, where we're making a secret, which is the first part. Then the second part is, we're creating the custom resource to say, the machine's online. It has a BMC address. It has a MAC address. Here's the image we want to deploy to it, and its checksum. For the user data, use this defined metadata that we already have in the system, and go deploy. Basically, what will happen is the bare metal operator will interact with the custom resource, find out what we've got, take action upon it, and drive Ironic's API, which it houses locally in a pod, to deploy bare metal servers for that end user. You're able to grow a Kubernetes cluster locally if you have one deployed, utilizing this operator, and scale it down as you need, with a fleet of bare metal.
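For illustration, a payload of that shape might look like the following. The names, BMC address, MAC address, image URL, and the pre-existing user data secret are placeholders, and the fields shown are only a subset of what a given Metal3 release's BareMetalHost resource supports.

```yaml
# Hypothetical Metal3 payload: BMC credentials Secret plus a BareMetalHost.
---
apiVersion: v1
kind: Secret
metadata:
  name: worker-0-bmc-secret
type: Opaque
stringData:
  username: admin
  password: example-password
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
spec:
  online: true                                  # power the machine on
  bmc:
    address: redfish://192.0.2.20/redfish/v1/Systems/1
    credentialsName: worker-0-bmc-secret
  bootMACAddress: "52:54:00:aa:bb:03"           # provisioning NIC on the host
  image:
    url: http://images.example.com/worker-image.qcow2
    checksum: http://images.example.com/worker-image.qcow2.sha256sum
  userData:
    name: worker-user-data                      # metadata already defined in the cluster
    namespace: metal3
```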
Summary
There's a very complex future ahead of us with bare metal servers, in terms of servers with these devices in them. We have to be mindful that they are other systems too, and they will require management. The more of these sorts of devices appear in the field, the more necessary bare metal management and orchestration will be.
Questions and Answers
Participant 1: I work with a number of customers who do on-premises Kubernetes clusters. The normal play is to spend a truckload of money on VMware under the hood, and that's how you make it manageable. It always felt to me kind of overkill, given all the elastic capabilities Kubernetes gives you, if we could manage the hardware better. Do you really need that virtualization layer? Do you have any thoughts on that, with the way these tools are evolving?
Kreger: It's really, in a sense, unrelated to the point I want to get across regarding IPUs and DPUs. What we're seeing is that Kubernetes is largely designed to run in cloud environments. It's not necessarily designed to run on-prem. Speaking with my Red Hat hat on, we put a substantial investment in to make OpenShift, which is based on Kubernetes, operate effectively, and as we expect, on-prem, without any virtualization layer. It wasn't an easy effort by any stretch of the imagination. Some of the expectations that existed in some of the code were that there's always an Amazon metadata service available someplace. That's not actually the case.
Participant 2: What I understood from one of the bigger slides was that either Redfish, or IPMI, or one of the existing management protocols was going to be interfacing to the DPU or IPU for management, through an external management interface facilitated by the server? Is there any thought at all to doing something new instead of sticking with these older protocols that [inaudible 00:30:57]?
Kreger: The emerging trend right now, by consensus, is to use Redfish. One of the things that also exists and is helpful here is a consensus around maybe not having an additional onboard management controller, a baseboard management controller style device, in these cards. We're seeing some consensus around maybe having part of it, and then having NC-SI support, so that the system BMC can connect to it and reach the device.
One of the things that's happening in the ecosystem with 20-plus DPU vendors right now, is they are all working towards slightly different requirements. These requirements are being driven by market forces, what their customers are going to them and saying, we need this to have a minimum viable product or to do the needful. I think we're always going to see some variation of that. The challenge will be providing an appropriate level of access for manageability. Redfish is actively being maintained and worked on and improved. I think that's the path forward since the DMTF has really focused on that. Unfortunately, some folks still use IPMI and insist on using IPMI. Although word has it from some major vendors that they will no longer be doing anything with IPMI, including bug fixes.
Participant 2: How do you view the intersection of hardware-based security devices with these IPU, DPU platforms, because a lot of times they're joined at the hip with the BMC. How is that all playing out?
Kreger: I don't think it's really coming up. Part of the problem is, again, it's being driven by market forces. Some vendors are working in that direction, but they're not talking about it in community. They're seeing it as value add for their use case and model, which doesn't really help open source developers or even other integrators trying to make complete solutions.