I like to think about it in terms of two separate issues. You can get benefit from virtualization - that’s a no-brainer, everybody knows that virtualization is a way to make more out of less. You make more efficient use of the resources you have, so you don’t have to spend a lot more money on hardware. There is no question that virtualization is the way we’re going to compute in the future. On that level, it makes sense to simply look at Cloud as an extension of virtualization. At some point you are going to go into virtualization one way or another, so when we start talking about Private Cloud architecture we’re really asking "How much further beyond simple virtualization am I going to go, and what benefits am I going to get from that?"
Assume that all of the normal benefits of virtualization apply: you can run virtual machines on physical hardware - 10, 20, 50 virtual servers on a single box - and there are obvious benefits there, without degrading the performance of the individual services running on those virtual machines. Then, when you take the next step and go into Private Cloud, I try to explain it as a worldview. Cloud computing is not a moniker attached to existing technologies to make them easier to sell and more trendy. Cloud computing is a new way to express applications; it’s a new way of thinking, a new worldview on how we look at systems and architectures and applications and how they all work together.
I can’t say what benefits you are going to get out of it, but I can say the benefit that we got out of it is decreased downtime, because you don’t have to worry about servers that can’t really handle the load they are given - the fat application servers you run because you don’t have the scalability to keep adding copies, because they’re too difficult to configure for whatever reason. Those big fat application servers are a model we’re moving away from, and I see that as a thing of the past. With the Private Cloud, which is built on top of a virtualized environment, I can add machines when the load goes up, and I can change things around if I need to. If I need to test a new feature, new software or whatever, I can spin up new instances that are based on the existing ones, so that I have a good base to work from.
There are benefits, obviously - monetary benefits, saving time, more efficient use of resources - but it’s those intangibles in Cloud computing that are really where I’m interested, where I’m focused. That’s what I’m looking at. Obviously the practical, real-world benefits pay the bills; those make managers say "Yes" to you. But the more philosophical benefit, to me, is the ability to express applications in new ways that I never had the capability to do before: to process things in parallel, to spin up new instances, to do new things that I hadn’t thought about, without the cost of anything but the time to set it up, because I don’t have to buy anything to do it. I can just spin it up and away I go.
You can do the latter. You can jump straight from "I’m running on physical servers" to running on Private Cloud architecture, but to be honest, I don’t think that many people are willing to take that jump. So it’s an absolutely common progression; it’s the one we took, and it’s the one I hear from everyone who talks about it. We’ve got all these rack-mounted servers, they each do individual things, they’ve each got an operating system on them, they’ve each got servers on them doing different things. "Let’s buy a big mongo server - 8-way, 64 GB of RAM - let’s back it with some SAN storage, let’s put VMware on it. Let’s play with it! Let’s put the mail server on it! Let’s put the file server on it!" Now that we’ve done that, I feel comfortable that it runs well and I can do some cool stuff.
Now let’s look at something that’s maybe production, but maybe not mission critical; let’s put that on there. So you get some familiarity with how virtualization works and the benefits you get out of it, and then you start looking at those other servers and saying "I could put VMware on that one." Then you think, "Let’s bunch these together; if they are all the same architecture and they all run Intel chips of the same family, I can just vMotion a running server between them. I can move it to another host and then do something with that piece of hardware."
Once you start thinking like that and you start really using virtualization, it is just like in the movie "Robin Hood: Men in Tights" when they’ve got the little creek and he’s like "I’m on one side, I’m on the other!" That’s what it is: you are on one side, you’re on the other. Virtualization and Cloud aren't that far away from each other.
The short answer is that you are not going to know that until you test it. There is no way to really predict. There is no way for me to give you a suggestion because I don’t know what applications you run and I don’t know what loads you have on them. But let’s assume that the servers you have now are reasonably adequate to perform the tasks you demand of them; then it’s safe to assume that virtualization is not going to require anything more, because the whole idea is that virtualization makes more efficient use of what you have. Just as a general rule of thumb: you’ve probably got enough servers now. If you can handle the workload you have now, then you’ve got enough. In fact, you can probably do 50% more at least, maybe twice that - I don’t know, these are just rough estimates.
I don’t want to oversell it. You have to add the caveat - use it at your own risk - and it’s an area of the industry that’s still being debated: "How many virtual machines do I put on a physical box?" I hate to sound like I don’t really want to answer the question, but it depends on what kind of virtual machines you are running. Are you running a mail server? A file server? Application servers? Are you trying to do a database server? They all have different demands on CPU and RAM, and you can mix them. You could take a pretty I/O-intensive application like a database server, virtualize it, and it will run just as well as it would on a bare-metal box.
Then maybe put other machines on that same physical hardware that aren’t so I/O intensive. Maybe they’re RAM intensive - maybe a NoSQL store or a messaging server. Or maybe you don’t want to put anything else on it; maybe you just want a database server and a standby backup. You can do that. It’s just going to depend on how it behaves when you run it and load test it. I’d like to think that a general rule of thumb bubbling to the top of the industry is somewhere between 20 and 40 virtual machines per physical box, and that’s servers - application servers, messaging servers, whatever you want to put on it.
You can run at least 20 of those on a modern 8-way, 64 GB box and probably, depending on what you’re doing and the size of the virtual machines, you could run 50. How many servers are you running now in your rack? Just look at that and think: "If I can run between 10 and 20 heavily used virtual machines on a single 8-way physical box, how many physical boxes do I actually need?" Maybe you could sell some of those on eBay and make some money back to buy some support or something. It’s perfectly reasonable to think in those terms.
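Just to make that arithmetic concrete - and these numbers are made-up examples, not anything from our shop - the back-of-the-envelope math looks like this:

    # Rough consolidation estimate; every number here is an illustrative placeholder.
    physical_servers_today = 30                    # whatever is sitting in your rack now
    vms_per_host_low, vms_per_host_high = 10, 20   # heavily used VMs per 8-way host

    hosts_worst_case = -(-physical_servers_today // vms_per_host_low)    # ceiling division
    hosts_best_case = -(-physical_servers_today // vms_per_host_high)
    print(f"{physical_servers_today} physical servers -> roughly "
          f"{hosts_best_case}-{hosts_worst_case} virtualization hosts")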
Again, that depends on the hypervisor, the virtualization software you are using. I use VMware vSphere, so that’s what I’m familiar with: I can set a maximum and I can set a reservation, which means I’ll always give this VM at least that much. Then I can keep a VM from going crazy on me; I can force it into its little box if I need to. Most of the VMs that I run are unlimited on their maximum, just because - let them take it! If they need that much resource, let them take it. Hopefully, by the time my stuff has got into production I’ve put it through enough testing paces that I’m not going to get into some kind of random situation where things get out of control - that’s part of your testing process.
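To sketch what that looks like in practice - and this is just a minimal sketch assuming the pyVmomi Python SDK, with the vCenter address, credentials, VM name and reservation value all made up as placeholders:

    # Minimal sketch: give one VM a CPU reservation and leave its maximum unlimited.
    # Assumes pyVmomi; the host, credentials and VM name below are hypothetical.
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")
    content = si.RetrieveContent()

    # Find the VM by name with a simple walk through a container view.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "app-server-01")

    spec = vim.vm.ConfigSpec()
    spec.cpuAllocation = vim.ResourceAllocationInfo(
        reservation=500,   # always give this VM at least 500 MHz
        limit=-1)          # -1 = no maximum: "let them take it"
    vm.ReconfigVM_Task(spec=spec)

    Disconnect(si)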
But if you play a little fast and loose with your development process - maybe you don’t do all the testing, don’t account for all the weird situations that happen (it happens) - and you have a VM that’s unlimited in the amount of resources it can take, obviously that’s going to be a concern. I’ve not really had that as much of a concern, because the way I approach Private Cloud is that I don’t want to go to a model where I have one box doing a lot of work. Since I can run multiple VMs very easily, I split up the workload so that no one box ever gets more work than it can do. In fact, I try to keep it down to 10-15% usage maximum, so no one VM is ever going to be much more than 10% utilized.
That may seem excessive - "Why don’t you let it be used more?" I look at it this way: if the CPU is at less than 10%, then the response time for that VM is going to be immediate. If it’s at 50%, the response time is going to be noticeably slower. So, as far as I’m concerned, performance-wise, I don’t want any one box to be utilized more than just barely, because if it’s just barely utilized, the response is going to be immediate and I’m going to get a noticeable increase in performance for the user, which is important to me. I want that application to be available, but also that perceived performance - the fast response times, the page comes up immediately. Our users are all internal employees; we don’t have any public-facing websites or anything, so they are kind of a captive audience and I can make them wait if I need to, which I do sometimes, just out of sheer necessity.
But if I can build up a relationship with the employees where my application works well and they don’t mind using it, then they can get on with what they are supposed to be doing much better, and they are not going to bring me up in a meeting saying "I can’t do my job because this application is too slow."
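A rough way to see why that low-utilization target pays off - and this is just a textbook approximation, not a measurement from our environment - is the single-server queueing formula for mean response time:

    T = S / (1 - U)

where S is the raw service time and U is utilization. At 10% utilization that is about 1.1 x S, effectively immediate; at 50% it is already 2 x S; at 90% it is 10 x S. That is why keeping any individual VM barely utilized keeps the perceived response snappy.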
Probably your best indicator is going to be whether a particular function that these virtual machines perform is getting more load than others. Remember, we’re not necessarily talking about resource usage on your box: if the point of virtualization and cloud architecture is to get more usage out of the existing hardware, then obviously we’re going to have higher usage all the time, because that’s the idea. We want to use up those spare CPU cycles that were previously sitting there wasted because the application running on the box couldn’t make use of them. Now we’ve got 10 copies of that application running, so our CPU usage has gone way up - and that’s where it should be, because it means we’re making better use of the box.
So you can’t just say "My CPU usage is going up. I need to add more." That’s not necessarily true, because the whole idea of cloud and virtualization is that I don’t need as much. Before, I would get concerned when I saw usage rising and think "I need to add another box there." It’s a different metric now. For me it seems to be related to response time. If my response times are consistent even though my load changes, I’m fine - why do I need to add anything? It hasn’t impacted my application yet. Now, if it’s running at 90% of capacity, I’d like to have a little more headroom, obviously - who wants to run at 90% all the time? That doesn’t give you much room. But if you are running between 60 and 70% CPU and RAM usage, that’s fine; I have a VMware host running virtual machines that use up almost 90% of my RAM.
I’d like to have more, but for the particular boxes we’re running, the RAM's a little more expensive than in other places, and it’s running, it’s not affecting response time, so I’m OK with that. Maybe you won’t be and maybe you’ll want to add - that’s perfectly OK. Stimulate the economy, buy as many servers as you want, that’s completely up to you! But keep in mind it’s not a driving factor anymore when you get up to some of these higher usage levels, and I’m talking about the host level, the physical box level. I’m not talking about the virtual machine level, because like I said before, I want low usage on an individual virtual machine. I don’t mind having high usage on a host, because that means I’m making the most of the money I‘m putting into that box - buying those fast hard drives, buying that expensive SAN, getting all those processors and all that RAM.
If I’m getting 60-70% usage out of that, I’m getting a lot more for my money than by buying more than I need and having other $10,000 pieces of physical hardware sitting in a rack. It may not be a lot to some companies, but to us $10,000 is part of a year’s salary for a developer. We’re from the Midwest, so we don’t make nearly what folks do in the city; that’s a pretty good chunk of somebody’s salary. These become really critical issues when you are dealing with tight budgets. We don’t have a lot to spend, so we want what we do spend to get a lot of use. We want to make sure that we’re spending it on just the best that we can.
Dealing with spikes in traffic is never easy, and part of the problem with Cloud architecture as it exists today is that if you are not using a public provider like Amazon, which has auto-scaling, then you’ve got to do that yourself. You have to scale manually and make sure you have enough headroom over the high-water mark to handle that traffic. If you can predict it, obviously spin up a few more instances. Most of the virtualization software out there, whether it’s VMware, VirtualBox or Xen or whatever, can be manipulated via the command line. That opens up whole new possibilities. Maybe you write a bash script that runs on a cron, checks top every few minutes and, if it sees something it doesn’t like, spins up another instance.
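Here's roughly what that might look like - a minimal sketch of the idea in Python rather than bash, where the load threshold and the "spin up" command are placeholders standing in for whatever your hypervisor's command-line tooling actually is:

    #!/usr/bin/env python3
    # Minimal sketch of "cron + load check + spin up another instance".
    # LOAD_THRESHOLD and CLONE_COMMAND are hypothetical; substitute your own
    # high-water mark and your hypervisor's real CLI.
    import os
    import subprocess

    LOAD_THRESHOLD = 6.0
    CLONE_COMMAND = ["/usr/local/bin/spin_up_appserver.sh"]   # hypothetical wrapper script

    def main():
        one_min, five_min, fifteen_min = os.getloadavg()      # the same numbers top shows
        if five_min > LOAD_THRESHOLD:
            # Load has stayed high for a few minutes; bring up another instance.
            subprocess.run(CLONE_COMMAND, check=True)

    if __name__ == "__main__":
        main()

Dropped onto a cron entry that runs every five minutes, that is the whole trick.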
These are things that you can do, and it would probably take 4 hours to write a script to do that. It’s not really complicated, but you’d have to be a little familiar with the command-line interfaces and how this stuff works. It’s not button-click straightforward all the time; it can be something you just have to figure out on your own how to manage. If I’m going to use 70% of my resources all the time, what happens with a spike in traffic? That’s just something you’re going to have to mull over for your own situation. In our case, I don’t have to worry about that a whole lot, because we serve internal customers - employees - and they don’t react to random situations.
They come the day after a check run, because they know that the check stubs are going to be posted online. We know that, getting close to the end of the month, they have to cram in all the training they’ve been putting off until the last minute. These are trends that we know happen. We know at 3 o’clock in the morning there is nothing going on, because obviously no one is in a restaurant at 3 o’clock in the morning. I don’t have to worry about that - I have no unknown spikes in traffic - but your situation may not be that straightforward. It may be exposed to the public and you might have people coming in at 3 o’clock in the morning because they are getting on from Germany. It’s hard to say.
The thing you have to keep in mind with Cloud is (going back to what I said originally) that Cloud is a worldview. So if you think in Cloud terms internally, then you’re going to be able to make use of Public Cloud resources. I wouldn’t say it’s super easy, because obviously there are some problems - you are going to be doing it differently than Amazon does, so there are going to be differences. But one of the essential parts of a Private Cloud architecture is the idea of a proxy, a gateway, a load balancer - something that sits up front that all traffic goes through, that knows about what services are available in the back, whether they are internal to your datacenter or external on a Public Cloud, and can direct traffic accordingly.
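As a toy illustration of that front-door idea - the addresses and the capacity figure here are invented for the example, not a description of our setup - the routing decision itself is simple:

    # Toy sketch of a "front door" that knows about back ends in two places and
    # prefers the private datacenter, spilling over to public cloud under load.
    # All addresses and the capacity number are illustrative placeholders.
    import itertools

    INTERNAL_BACKENDS = ["http://10.0.1.21:8080", "http://10.0.1.22:8080"]   # private cloud VMs
    CLOUD_BACKENDS = ["http://ec2-198-51-100-7.example.com:8080"]            # burst capacity

    class FrontDoor:
        def __init__(self, internal, cloud, internal_capacity=100):
            self.internal = itertools.cycle(internal)
            self.cloud = itertools.cycle(cloud)
            self.internal_capacity = internal_capacity   # in-flight requests we trust internally
            self.in_flight = 0

        def pick_backend(self):
            self.in_flight += 1
            if self.in_flight <= self.internal_capacity:
                return next(self.internal)
            return next(self.cloud)

        def finish(self):
            self.in_flight -= 1

A real proxy (HAProxy, nginx, whatever you already run) does the same thing with health checks and proper bookkeeping; the point is only that the traffic decision lives in one place in front of everything.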
From that perspective, it’s ubersimple, especially if you are talking about serving application servers and being able to spin up application server instances - let’s say you suddenly need 500 of them in North America. You can do that: requests can come into your datacenter through your proxy, and your proxy can go back out to Amazon or some other provider. Or it could go to Amazon first and then come to your private datacenter. That’s the flexibility you get in being able to do it that way. I would say, though, there is an issue with this, because we chose Private Cloud because of compliance issues - PCI compliance, SOX compliance. These things are covered under specific guidelines: we have to have physical security, we have to protect that data and we’re responsible for it.
That is kind of scary, and there are not good answers from the Public Cloud community on how to deal with it. The Public Cloud providers don’t come out and say "If you’ve got PCI compliance issues, here’s our certification, don’t worry about it." I’ve not had anybody say that to me yet. Maybe they are out there - people running PCI-compliant solutions on the Public Cloud, with Amazon’s datacenters compliant in some way; I’m sure they’ve got to be compliant with something, given all the applications that they run - but is that covered under my auditing? When an auditor comes into our company and says "OK, show us how you’re doing all of this," they are checking all the boxes on the PCI compliance checklist - are those being fulfilled by our process if we’re running on a Public Cloud?
The fact that I don’t know and can’t answer that very well is enough for me to say "As much as I would like to get into that space, I can’t right now." I can’t risk it; I can’t lose however many millions of dollars a day we would lose if we weren’t PCI compliant and couldn’t take credit cards. The management wouldn’t even want to think about that possibility! In our situation it’s Private Cloud, and if we need to go beyond that, we’re going to have to buy more boxes. That’s just all there is to it.
8. How do monitoring and management change when you switch from racks of servers to Private Cloud?
Monitoring and management is one of those things that’s almost personality based. Some people are absolute control freaks and they have to know what’s going on on those servers all the time. I can empathize with that; I understand "I need to know what the box is doing." To me, if it’s running at 10% or 60%, I don’t care. If it’s within parameters and it’s serving the application and page response times are staying consistent, I don’t care about the specifics of where it is. I care whether it’s up or down, I care whether I need to do something about it, I care whether specific metrics go out of bounds - response time, etc. - but beyond that I just let it run.
I’ve got too much to do during a day with a small shop - we’ve got a lot of stuff that we’re responsible for: we do operations, we do development, we talk to the customer, we have all these other responsibilities. I just don’t have time to sit around and watch the CPU histogram on various virtual machines all day. I’m not a good one to ask about that, because I definitely have opinions about monitoring and management, but from the perspective of the person who wants to do the monitoring and management and not so much from the perspective of the tools used to provide it. I’m a minimalist, that’s my mantra. I preach that all the time: absolute minimalism. If I don’t need to run it, I don’t, because I know how I am. I know how I am as a Java developer.
I like to complicate things if I let myself, and my applications, if they go to seed, will just become absolute beasts. Look at J2EE - the same thing is going to happen with every framework that comes out. It starts out simple and gets more complex. The same thing happens with monitoring and management: you want to be able to do a few simple things, then eventually you’ve got big, bloated software that does all kinds of stuff that adds no value to your day, to your business, to your operations. If you really don’t need to be collecting that metric, don’t! That’s the bottom line. If you need to run an agent because you have to know what’s going on, then make that decision and go with it, but keep in mind that - now that you’re using CPU cycles much more efficiently - every cycle you spend on monitoring and management tools is a cycle you cannot spend in your application.
Consider the cost of that. I know from running agents on things to do monitoring all the time, so that I can look at usage histograms and see all this neat stuff, that they eat up tons and tons of CPU time - the agent does! And if the agent collecting your statistics is the highest-usage process on your box, then maybe you need to revisit how you are doing monitoring and management. It probably wasn’t as much of an issue on bare-metal boxes, because you just thought about things differently. Now that you are thinking about things in a minimalist way - I want to get as much as I can out of a single box - every little bit counts. In VMware they measure a box’s CPU capacity in MHz: if it’s a 2 GHz CPU times 8 cores, you’ve got roughly 16 GHz to work with.
If that virtual machine is only using a small slice of the box - say 100 MHz out of a 3 GHz allocation - do you really need to be spending another 200, 300, 400 MHz of those CPU cycles on monitoring tools? Maybe you do, because maybe your management is a control freak who has to see those histograms and feels like things are not being done if they don’t see them. But that doesn’t have anything to do with value to the business. That has to do with "I personally need a good warm fuzzy feeling that things are running," and that’s not a problem of the system; that’s another issue entirely to deal with.
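To put illustrative numbers on that trade-off (these are example figures, not measurements from anybody's host):

    # Example budget check for monitoring overhead; every number is made up.
    host_mhz = 8 * 2000      # 8 cores at 2 GHz ~ 16,000 MHz available on the host
    app_vm_mhz = 300         # what a lightly loaded application VM might actually draw
    agent_mhz = 300          # what a chatty monitoring agent can easily consume
    print(f"agent = {agent_mhz / host_mhz:.1%} of the host, "
          f"{agent_mhz / app_vm_mhz:.0%} of the app it watches")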
I think that, as far as developers are concerned, they’d probably like to think that, because of the rivalry - they don’t always get what they need out of operations. Of course, the operations guys forget that their job is to serve the developers who write the code that runs the applications. The purpose of an operations department is not just to make sure that the computers are running; they are there to make sure that the applications running on those computers actually do the work of the business. I’ve been on both sides and crossed between them. I understand "Hey, we could have fewer ops guys!" That may be true, because you’re using things more efficiently and there is less overhead and fewer maintenance tasks. That’s a good thing; it means less to worry about.
That doesn’t mean that you need fewer people; it just means that you need to do things differently. If you’ve got a couple of ops guys pulling their hair out trying to keep things running right and up to date, with all these other problems, and you put in a Private Cloud, virtualization, all these neat things, and suddenly your downtime is being eliminated and things are a lot easier, that doesn’t mean you need to get rid of any of those ops guys. It just means that their lives are going to be better, and then they can start thinking "If we could do this, if we wrote this script, we could actually make this better" or "We could deploy things differently. We could contribute to progress rather than always reacting." That’s the thing that really gets me about this business climate that we’re in right now.
Things are tight, we don’t have all the people we need - everybody knows that - there is lots to be done, more all the time, and fewer resources to do it with. We understand that, but what if you could make it easier? What if you could make it better? You wouldn’t have to be always reacting and always getting yourself into situations where you’re making work for yourself because you don’t have time to do it better. That’s a contribution to the company that could be extremely valuable. But if you are spending all of your time just on the mundane "I’ve got to do this" type of stuff, you’re never going to think about those sorts of things. I like to think about it like that. Not necessarily that you are going to give people free time so they can go check their Facebook more often, but that you’re going to give people the ability to make things better and mull it over a little bit.
Give them some time to work through some of these problems instead of "We’ve got this right now! We have to get this out! We have to get these servers up and we got to get these changes done!" Just calm down a second! Is it really that important? Is there somebody circling the Earth that needs to be retrieved before they run out of oxygen? If that’s the case, then yes, you need to have that kind of "Hey, we need to focus on this!" If it’s just this manufactured frenzy that we find ourselves getting into too easily, then maybe we need to take a look at "How do I interact now with this new cloud virtualization architecture as an operations person?" because it’s going to be different. Hopefully it’s going to be less stressful because we’re making things better, we’re making things more efficient, there is less to worry about.
All of these things are going to make our lives utopian. Maybe we should look at it like that rather than "Are they going to fire me? Am I the guy with the red stapler in Office Space?" We’re not looking at them like that guy; we’re looking at them as critical to the infrastructure, critical to the processes, and they’re probably not going to go anywhere.
10. With the Private Cloud on premises, how do the day-to-day tasks of ops change?
Getting back to this theme of efficiency and minimalism: the less you have to worry about, the easier it’s going to be for everyone - fewer headaches, more uptime; these are all good things. With a Private Cloud implemented, the problems you do have to deal with are probably going to be easier. It’s definitely going to change the way you look at your architectures, for one thing. You might actually be fond of them rather than looking at them with total fear and trepidation. So it’s change - it’s inevitable, it’s going to happen, it’s going to be different. How different, and whether or not you personally as an operations person can grok what’s going on and deal with things that are changing, is going to depend on the team and the person and the management, and on getting people actually excited about doing it rather than "Are you serious, we have to do this too?"
It’s going to depend a little bit. I’m more on the developer side than strictly operations, and coming from a development background on the AS400 and whatnot (the AS400, iSeries, i5, whatever they are calling it these days), that world has been so static for 20-25 years: things have not changed, they are not going to, and if you try to change that you are going to be met with some fierce resistance. Operations can’t be that way! It’s never been allowed to, because things have been changing too fast. That’s going to be the issue right now: things are going to change even more quickly in the next year than they did in the last year, because momentum is getting to the point where it’s reaching critical mass. Cloud is reaching critical mass.
At some point, enough people are going to say "This is a worldview change! This is a difference in the way we construct applications!" that you’re going to start thinking like that. It’s like when somebody learns a new language: they say that you are really getting proficient when you start dreaming in the new language. In essence, when you start dreaming in Cloud (not to be too silly; hopefully you are not that much of a nerd), when you start thinking like that and it’s second nature, then you’re going to start realizing "I can do this." You can see things differently, you can see opportunities, you can see how you can make things better in ways you probably didn’t see before. And it’s not just simple change. I know things always change, but there is something more than just change here.
There is a change to a new worldview, so it’s not just that the technology is different while the way we see things stays the same. There is technological change and there is philosophical change as well, and it’s the combination of the two. I can’t say much more than "I apologize" if it’s changing too quickly for you.