Cloud computing is more than just fast self-service of virtual infrastructure. Developers and admins are looking for ways to provision and manage at scale. This InfoQ article is part of a series focused on automation tools and ideas for maintaining dynamic pools of compute resources. You can subscribe to notifications about new articles in the series here.
How should organizations plan for a cloud environment and optimize it for efficiency at scale? InfoQ reached out to three leading cloud practitioners to uncover practical advice on this topic.
Each panelist has dealt with the sometimes-messy transition from on-premises to cloud, and coached organizations through their journey.
The panelists:
- Conor Brady - Director of Enterprise Cloud Architects (ECA)
- Michael Collier - Principal Cloud Architect and a Microsoft Azure MVP
- Brian McCallion - Founder of Bronze Drum Consulting
InfoQ: What existing customer processes need the most streamlining to function at a broader cloud scale? Where do you see companies struggling to adapt to the self-service and rapid scale that cloud can offer?
Conor: The prerequisite(s) to streamlining existing business processes:
#1 Education, stakeholder involvement in key design decisions and “buy-in” from C-Level down at the early stages of cloud adoption
#2 Approach the cloud with an “operational checklist” of incumbent technologies and processes, and try as best you can to adopt these with your cloud service provider (CSP). If existing operational tools (e.g. monitoring) can be utilized and maximized, then existing processes work best. Where gaps are identified (and they inevitably are), you need to fill them with vendor tooling, e.g. the CSP operations portal, application performance monitoring (APM) or a move to a Cloud Management Platform (CMP). Aim for the “single pane of glass” where possible.
Self-service is the end goal of most enterprises, but in reality I have not seen it widely adopted in enterprise IT. Some of the challenges are the arcane business processes, policies, financial management, and lack of governance that exist today. Arguably, the right tooling can solve nearly all of this: a CMP that fits your enterprise's needs for self-service, monitoring, end-user requests, provisioning, metering, policies, etc. In my experience, when an enterprise starts having pain points around scale and management, there is an inflection point where it should adopt a CMP to try to solve or circumvent these issues.
Michael: I think the customer processes that need to be streamlined the most to function at a broader cloud scale are primarily related to network security/management, and identity management. This is especially the case in medium to large enterprises. Such organizations tend to (in my experience) have a fairly well entrenched set of tools - both software and hardware based - and processes which don't often translate easily to a public cloud environment. I doubt if cloud providers will ever match legacy on-premises tools one-to-one, and that's probably a good thing. Organizations will need to re-evaluate their on-premises processes to see if they really need to have the same in the cloud, or if there are new, modern ways to accomplish the desired goal.
The self-service and rapid scale the cloud provides seem to cause many organizations challenges related to how to best govern the usage of the cloud. Organizations like the speed and agility the cloud provides, but are also very concerned that the control once easily enforced on-premises is not readily available in the public cloud. Could a developer expose sensitive data without realizing it and without any organizational oversight? How does an organization best control data access? How does an organization best allocate and manage costs? All interesting questions and challenges that often don't have a straightforward answer.
Brian: I see a great deal of difficulty letting go of existing processes, and also difficulty executing even when the objective is to speed up time to market. I work with really smart, great people in the enterprise who seem to know what to do, yet experience challenges getting multiple departments to cooperate to effect change.
To me, these firms need to abolish spreadsheets, manual configurations, and the delays of requesting resources from multiple silos such as storage, networking, and virtual machines. While these firms work very hard on process, people spend a large amount of time on documentation and then on manual implementation. Even with Cloud infrastructure, it takes a supreme effort to get everyone to pull in the same direction. An emphasis on education and a focus on automation would accelerate the rate of learning and could enable continuous improvement in speed and agility.
InfoQ: Michael mentioned some (unexpected?) challenges when organizations make the shift to using cloud resources. What challenges do you see customers least prepared for, and most prepared for? Are there things that organizations should have figured out before adding cloud resources to their portfolio, or, have you seen a "learn as you go" strategy work well?
Brian: I thought I had the answer to this question about six months ago. I work with the CIO of a midmarket firm. As I didn't have the Fortune 500 "process" to contend with, I was able to correct some fundamental aspects of Cloud account configuration and access control and get the basics of the Cloud accounts set up. Where I was surprised was in the resistance to automation encountered at the staff level.
While the CIO and I are on the same page as to how the infrastructure needs to be managed and provisioned, I was surprised at how the "DevOps" or perhaps "SysOps" folks approach Cloud. Essentially they explained to me how they wanted to use AWS so as to "cut out the middleman" and to get virtual machines when they needed them.
Now I try to avoid dogma and to approach each case with a fresh perspective. However what I find is that unless teams have a real desire to form new habits and do things differently people continue to do things manually. Now these guys had been using Chef and some other tools in the data center and felt those tools didn't provide the kinds of flexibility they needed. They believed that for the Cloud they would switch to a new tool, and that this would give an opportunity to redo and revamp the process.
My assessment was that prior to moving to the Cloud the team needed to figure out the automation. In my experience if teams don't have a "carrot" it's hard for them to dig deep and automate things. I also believed that unless some kind of automation was working internally, the team would not have the time or focus to get the automation working in the Cloud.
I think the takeaway here is that teams need to understand that moving to the Cloud requires learning the Cloud platform in depth and doing things differently. My assessment is that if all you need is to get virtual machines faster, that's not something Cloud delivers. It's not necessary to resolve all issues prior to moving to Cloud; however, the more candidly you can assess what's going on in your organization and why you are migrating to Cloud, the faster you can define the behavior and practices that align with what Cloud offers.
Conor: I've been fortunate enough to work with enterprises who have identified a need for guidance (consulting, vendor, "in-house" training) and advice more or less from the inception of adding cloud resources. They have had the maturity and vision to start with the application architecture rather than trying to shoehorn an application into the cloud for the sake of it. As with any new technology there will always be an element of "learn as you go", but a good architecture should set a foundation in place to build on. I can't say I have seen a "learn as you go" strategy work well on its own outside of skunkworks projects; there are too many unknowns in that approach, which can lead to project failures.
Nearly every enterprise client I have seen has overlooked or been ill-prepared for cloud 'connectivity' with their on-premises datacenter, be it network (Direct Connect/VPN) or integration (SOAP/REST, etc.). The big challenge is working with the various internal stakeholders and deciding on the best approach (factoring in budget, timescales, effort, resources, etc.). On the flip side, data sovereignty due diligence and data classification (for cloud workloads) have been executed very well; it's a hot topic in my geographical region, and it is refreshing to see enterprises (InfoSec) on board with this at an early stage, executing risk assessments.
Michael: I continue to be surprised by the number of organizations, both large and small, that appear to have forgotten to account for how to manage and support their new cloud resources. The new solution may follow all the "best practices" in terms of cloud architecture and development, yet then just gets thrown over the wall to a support team that is relatively clueless about how to support such an application. There continues to be that old wall between IT and Operations departments. Those in the Operations department then try to use legacy tools and/or practices to support the cloud resources. Often that doesn't work, and blame falls back on IT - and then the vicious cycle repeats. Many customers still seem to be unprepared for how to best support their new solutions.
One thing that customers seem to be coming to terms with more in recent times is the idea that cloud resources will fail. While they might not necessarily like the idea of failure, it is a fact of life in the cloud. With that understanding, customers are willing to have those discussions about what a failure scenario could be, and how it'll impact the business. Now we can start having conversations about how to mitigate those risks. Those get to be some fun architectural discussions!
InfoQ: Where do you see customers get bit by "premature optimization" when planning out their cloud deployments and subsequent management? What advice do you offer to those who are getting ready for a major cloud investment?
Conor: I can't say I have seen much in the way of "premature optimization" with regard to cloud deployments; if anything, some have been a bit lax in thinking about "automation" for cloud provisioning, deployment, configuration, scale, etc. I have seen some SMBs purchase a CMP far too early in their cloud lifecycle, to manage fewer than 10 VMs. Nothing wrong with that, and they might have a direct need for a fully-fledged CMP, but the dollar costs should reflect the investment; I got the feeling a good salesperson had been at play …
To sound cliché, Cloud Service Provider selection due diligence is paramount for a major cloud investment. The benefits of cloud are well known, but I also like to see the numbers stack up; do a business case, focus on TCO/ROI, be pragmatic, and make sure your enterprise has a CSP "fit" - security, workload migration, agility, price, innovation, roadmap, operating model, sovereignty, etc. IMO, research and advisory firms (analysts) are your best friends; utilize their papers and wealth of experience to make informed decisions on technology selection. CIOs like to see concrete evidence on why a decision was made; back it up with empirical evidence and a solid cloud architecture and strategy.
Brian: I see premature optimization in planning Cloud deployments on a few levels:
- IT wants to use a single OS image across multiple use cases: Cloud, VMware, bare metal. This "one size fits all," Product Tower approach doesn't serve well in the Cloud, because such images tend to be "kitchen sink" builds and lack the software necessary to customize an instance after it boots.
- Restrictions regarding which Cloud services can be used, in an attempt to maintain parity with a "Private Cloud" that may one day be built but hasn't been started just yet. I've actually listened to people earnestly and authoritatively articulate this anti-pattern and kept a straight face. My role is to listen and figure out how to modify behavior, so I'm more interested in developing a deeper understanding of how people think than in immediately correcting the "wrong." I'm quite candid with folks; nonetheless, folks with long enterprise backgrounds in Corporate IT have had experiences that shape their approach. My approach is to focus on creating new experiences in the Public Cloud, which may serve to collapse a series of discussions into a "Eureka" moment. The premature optimizations I see stem from IT's lack of control over datacenter infrastructure. To get beyond bad habits learned in the datacenter, it helps to define new experiences and new habits around the public cloud.
- I don't see many strictly technical "premature optimizations" of interest to me. However, I do see many attempts to "force" cost savings or "compensate" for a perceived performance delta between datacenter and cloud hardware. Sometimes people try to run production workloads in AWS micro instances just "because." I also see managers and project managers attempt to cash the "time-to-market" voucher the mass media has pinned on Cloud, only to find that these kinds of gains aren't automatically dispensed by the great infrastructure vending machines in the sky.
The best remedy for premature optimization is to hose down cloud-eager executives with a little cold water: a candid discussion of the kinds of gains actually available through architecture, automation, and a potentially simplified support model.
Michael: Like Brady, I can't say I've seen much in terms of "premature optimization" when it comes to cloud deployments. One case I did see, though, was when a customer wanted to force the use of a particular deployment/operations management tool before any cloud-based solutions were more than ideas on a PowerPoint slide. The goal of using this particular tool was to ensure symmetry across both on-premises and cloud environments, which, on the surface, makes perfect sense. Yet, in this case, I think the problem probably needed to be defined before the solution was dictated.
Moving to the cloud provides a perfect excuse to break old habits and jump into a new way of accomplishing the objectives. I've seen the situation many times where groups within an organization will throw up roadblocks when a new, cloud-friendly way of accomplishing some task is proposed. The reason is often quite simple - "that's not the way we've done it before". Making a major cloud investment provides the perfect excuse to challenge the status quo. At least ask the questions to find out if there is a new way to accomplish the objective that may provide some yet-to-be-realized benefit.
InfoQ: How are you advising customers to architect solutions that will scale properly in the cloud? Does the cloud make it easier to pitch a "layered approach" with independently scaling components?
Conor: Educating, advising and architecting scalable cloud solutions is usually a 'three-pronged attack':
#1 Discussions with 'the business' on the differences between vertical & horizontal scaling and what it means to them. I've had this discussion with a number of business unit general managers and it resonates well with them.
#2 Pets and cattle discussion with key IT stakeholders (wonderful analogy)
#3 Come prepared with a cloud application reference architecture that opens up discussions with architects, developers and vendors.
The layered approach is an easier sell when discussing cloud application architectures with enterprise IT. Armed with the aforementioned reference architecture for guidance, developers et al. can see the benefits of building applications in a layered approach, with outcomes that lend themselves to loosely coupled, composite, event-driven, scalable and elastic applications built for cloud scale. I also find that having a set of cloud design principles and patterns accelerates the development ramp-up, producing high-quality cloud-scale applications.
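To make "loosely coupled and event-driven" concrete, here is a minimal sketch of two layers exchanging work through a queue so each can scale independently. It assumes AWS SQS and the boto3 Python SDK, neither of which the panel names; the queue name and message shape are purely illustrative.

```python
# Sketch: decoupling two application layers through a queue (AWS SQS via
# boto3 - an illustrative choice, not one named by the panel). The web
# layer enqueues work instead of calling the processing layer directly,
# so each layer scales independently.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

def submit_order(order):
    """Web layer: publish an event and return to the user immediately."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(order))

def process(order):
    print("processing", order)  # placeholder for the real business logic

def worker_loop():
    """Processing layer: run as many of these workers as the backlog demands."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)  # long polling
        for msg in resp.get("Messages", []):
            process(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Because the web layer only ever enqueues, a backlog is handled by adding workers; neither layer needs to know how many instances of the other exist.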
Michael: I will often try to direct customers to adopt as much of a platform-as-a-service approach as possible. Even though infrastructure services get most of the attention these days, there is a lot of real value to be had in platform services. This is often not an easy conversation - although not as hard as it was a few years ago. The architecture/development differences that PaaS often imposes are still there, yet maybe don't seem quite as crazy as they once were. This isn't necessarily because the technology/platform itself has considerably evolved, but because the concepts are (slowly) starting to resonate with people. If platform services don't fit the problem space for whatever reason, then take a look at infrastructure services.
The "layered approach" with cloud solutions is certainly an easier discussion. Developers/architects are much more open now to composing solutions built of multiple services. Doing so is becoming the norm. With a compositional compute model, we have to take into account how the various aspects of a solution may scale, or not.
Brian: I advise customers to design independent, single-function web services and to scale these services out horizontally. Often in the data center I find a bias towards running many services or applications on a couple of nodes. I think this reflects the complexity of working with and managing infrastructure in the datacenter. I've also found that in the datacenter the internal chargeback for VMs seems to encourage fewer nodes. I've had teams state that running many applications on a few nodes is driven by some sense of thrift, even while they struggled for days with patches and code releases to their cluster. In the Cloud, automation and pricing that reflects resources consumed (approximated by instance size), rather than the number of nodes, make it easier for customers to accept these practices and build independent scale-out services.
InfoQ: What skills do you consider essential for developers and system administrators who are (or will be) working with cloud assets at scale? What skills are "nice to have" but can be matured as you go along?
Conor: Understanding distributed systems and SOA is one of the key essential developer skills for the cloud; developers should create discrete services (separation of concerns) and also build robust systems that cater for failure, including the latency constraints of the distributed service components. Developers also need to become security savvy, both with data in transit (SSL) and data at rest (cryptographic APIs). Automation is also key when working at scale and is an essential best-practice skill.
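As one way of reading "cater for failure" in code, here is a minimal sketch of a bounded-latency remote call with retries and exponential backoff; the function name, URL, and limits are hypothetical, not from the panel.

```python
# Sketch: calling a remote service component with a bounded timeout,
# retries, and exponential backoff, so a slow or failing dependency
# degrades gracefully instead of hanging the caller.
import time
import urllib.request

def call_with_retries(url, attempts=3, timeout=2.0, backoff=0.5):
    """Return the response body, or raise after the final failed attempt."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:  # covers connection errors, HTTP errors, and timeouts
            if attempt == attempts - 1:
                raise  # surface the failure to the caller after the last try
            time.sleep(backoff * 2 ** attempt)  # back off: 0.5s, 1s, ...

# Hypothetical usage:
# stock = call_with_retries("http://inventory.internal/stock/42")
```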
Sys Admins (or DevOps) need to become proficient with CSP operations tools (performance monitoring, etc.) to understand how cloud applications are running, and have deep enough insight to support the application in production. One of the other key essential tasks for Sys Admins is tracking and monitoring cost and monthly spend.
Skills that can be matured in the future include the deployment of a CMP, Application Performance Monitoring (APM) systems, or third-party vendor products for deeper insights into CSP platforms. Enterprises should also strive to become more familiar with their CSP APIs, which can automate monitoring and deployment as they mature their processes. Lastly, the security posture of the cloud can always be enhanced, e.g. third-party products (firewalls, intrusion detection systems) or a focus on advanced IAM capabilities.
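As an illustration of driving monitoring through a CSP API rather than a console, the sketch below pulls an hour of CPU metrics. It assumes AWS CloudWatch and boto3, which Conor does not name, and the instance ID is a placeholder.

```python
# Sketch: pulling an hour of CPU metrics through the CSP's API instead of
# its console - here AWS CloudWatch via boto3, purely as an example.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=end - datetime.timedelta(hours=1),
    EndTime=end,
    Period=300,                 # one datapoint per five minutes
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "% CPU")
```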
Michael: First, I agree with Brady in that building for distributed systems and SOA is a vital skill. I would also add PowerShell (or some other scripting or command-line type of interface) and network/system administration as two skills that are crucial for developers working in the cloud. Once you get past the simple single service or VM deployment, and do it more than once, automation and the ability to quickly repeat a process are vital - and scripts enable that automation. Being able to script out a complex, multi-service and/or multi-region deployment is a vital time saver. For that matter, being able to create a VM or web site with a single PowerShell or CLI script is a huge time saver as well.
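Michael's single-script VM example would typically be PowerShell against Azure; as a hedged equivalent in Python (keeping one language for the sketches in this piece), here is the same idea with boto3 against EC2. The AMI, key pair, and tag values are placeholders.

```python
# Sketch: the "create a VM with a single script" idea, expressed with the
# AWS SDK for Python (boto3) rather than PowerShell.
import boto3

ec2 = boto3.client("ec2")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder image
    InstanceType="t2.micro",
    KeyName="dev-keypair",            # hypothetical key pair
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]
# Tagging at creation time makes later cost allocation far easier.
ec2.create_tags(Resources=[instance_id],
                Tags=[{"Key": "team", "Value": "web"}])
print("launched", instance_id)
```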
In an on-premises world, developers could easily get by without really needing to understand networking/system administration at anything more than a 100-level. But in the cloud, developers are often involved in a lot more than just writing code. Today, working in the cloud is a DevOps model. Developers need to understand more about networking, server software configuration (e.g. Active Directory, SQL Server, etc.), and the like. Likewise, system administrators need to have a tighter relationship with the developers. System administrators no longer control the "keys to the kingdom". They need to be proactive in helping the developers build the solutions the right way.
As for a skill that can grow as maturity in the cloud increases, I'd include architecting/developing for cost in that list. Every architectural and development decision has a direct, measurable cost in the cloud. Bad code has a true monetary cost. We can measure how much a solution really costs to operate, then figure out if there are efficiencies to be gained either through changing parts of the architecture, using different services, or optimizing code paths. It's easy to start out with a high bill, and then the real work begins - bringing down that cost as one understands more about the solution and platform.
Brian: Developers and Systems folks working with Cloud applications at scale need to understand the following to succeed.
- Automation in infrastructure provisioning. I don't think it's a nice-to-have.
- Availability of storage endpoints.
- Client-side encryption
- Single Sign-on / SAML 2
- Caches
- The difference between building an application and building a web service
- The difference between building a shared identity service and consuming a third-party security service
- Load balancers and session state
- Application Design and Scale-out
- Design new applications as scale-out, single-function web services; where legacy web applications are involved, decompose and recompose them into the same.
- Design separate "strings", each running in a parallel availability zone, so as to minimize cross-zone requests. Otherwise a backlog of timeouts can cause the failure of a service in one zone to cascade to unaffected zones, because the unaffected zones continue to make requests to services in the impaired zone.
- While there are probably fancier ways to route around failures, we've been very successful with a simple web service that returns an HTML page if all the services it checks are healthy. The load balancer requests that page, and if it doesn't respond, traffic is directed instead to another "string" in another zone (a minimal sketch of such a health check follows this list). Session state stored in a cache cluster that spans zones, something like DynamoDB, or cookie-based session affinity/stickiness can make this work. Both developers and infrastructure folks need to work closely together and communicate effectively.
- Role-based access controls and group entities with assigned roles are necessary.
It's possible not to have everything at once but if the data is sensitive I think all of the above is essential.
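Here is a minimal sketch of the health-check service Brian describes, assuming a load balancer polls a /health path; the dependency URLs and port are hypothetical, and a real deployment would check its actual downstream services.

```python
# Sketch: a health-check web service that returns a small HTML page only
# when every dependent service answers, so the load balancer can pull
# this "string" out of rotation and send traffic to another zone.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

DEPENDENCIES = [
    "http://localhost:8001/ping",  # app tier (placeholder)
    "http://localhost:8002/ping",  # cache tier (placeholder)
]

def all_healthy(timeout=2):
    """True only if every dependency responds with HTTP 200 in time."""
    for url in DEPENDENCIES:
        try:
            with urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    return False
        except OSError:  # connection refused, timeout, HTTP error, ...
            return False
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and all_healthy():
            body = b"<html><body>OK</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(503)  # load balancer stops routing here
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```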
About The Panelists
Conor Brady is Director of Enterprise Cloud Architects (ECA), based out of Sydney, Australia. He is a cloud architect helping enterprises adopt, transform and migrate to the “Cloud”, with a focus on architecture and strategy consulting. Find Conor on Twitter at @conorbrady.
Michael Collier is a Principal Cloud Architect and a Microsoft Azure MVP residing in Columbus, OH. He has a vast amount of experience in helping companies determine the best strategy for adopting cloud computing, and in providing the insight and hands-on experience to ensure they’re successful. Find Michael on Twitter at @michaelcollier.
Brian McCallion is the founder of Bronze Drum Consulting, a New York City-based consultancy and a member of the Amazon Partner Network. He can be reached at bmccallion@bronzedrum.com or followed on Twitter at @BrianMcCallion.