Leverage the Cloud to Help Consolidate On-Prem Systems

Key Takeaways

  • A cloud model can be used to architecturally validate the possibility of consolidating multiple application servers or instances into a smaller number of physical resources that will ultimately remain on-prem.
  • The cloud is an ideal R&D "sandbox" where engineers can test key design assumptions much more quickly and complete the entire server consolidation project on a shorter timeline.
  • The cloud offers solutions to many of the classical problems of on-prem physical resources, such as creating multiple identical test environments, resetting test environments that have become corrupt, and providing fast self-service of environments.
  • The key to successful testing is to create multiple "clones" of working environments that replicate the same network topology as the final target system.
  • This method can be used for R&D, classical QA automation testing, integration testing, chaos engineering and disaster recovery.

This article discusses how to use a cloud model to architecturally validate the possibility of consolidating multiple application servers or instances into a smaller number of physical resources that will ultimately remain on-prem. It proposes a process for doing so and explains several scenarios where that process allows IT to perform more QA testing, and to apply advanced testing concepts such as chaos engineering, while planning for the migration.

It is important to note that this post is not advocating for the reengineering of applications from on-prem to the cloud, though that is a possibility. Instead, the focus of this post is to describe how to leverage the cloud to help validate the design of re-organizing a large number of physical on-prem servers down to a smaller number of resources also hosted on-prem. In this case, the cloud is used as the R&D "sandbox" for key design assumptions.

Key questions for consolidation projects

Undertaking a server consolidation project of this type requires addressing a foundational set of questions:

  1. Is it even possible to consolidate multiple resources to a more efficient target number? How can this basic assumption be proven without investing significant additional costs in existing infrastructure if it turns out that the basic premise is false?
  2. How can the basic assumption be proven out in the fastest time possible? If it takes weeks or months to acquire and provision the resources required to perform experiments against the basic premise, how will that impact the project timeline?
  3. How do we quickly adapt the final target model as new information is learned? For example, if initial calculations for CPU, memory, or storage are incorrect, how can adjustments be made quickly and easily without incurring massive unexpected costs?
  4. How can we provide R&D, test, and lab environments for all the constituents involved in the project? If the project is to consolidate servers, how can all the various application, database, networking, administrative, and QA testing groups gain access to a production-like representation of the final target architecture? Will all the groups have to schedule time on a limited set of testing resources? How will integration testing from all the teams be done?

Questions like these and others can be addressed by leveraging the cloud as a "placeholder" for the final target system that will eventually exist on-prem. The flexibility of the cloud offers solutions to many of the classical problems of working with traditional on-prem physical resources, such as creating multiple identical test environments, resetting test environments that have become corrupt, and providing fast self-service of environments so testing isn’t delayed for weeks or months. Here is how the cloud addresses these issues.

Creating identical test environments

How can identical R&D and test systems exist without colliding with each other? For example, QA environments "Test1", "Test2", and "Test3" should all have a host called "Database" with an IP address of 192.168.0.1. How can these identical systems also communicate with shared resources without colliding at the network layer? Traditional enterprise thinking would require that hosts go through a "re-IP" process so that no IP addresses are duplicated on the same network address space. The downside of this approach is that the test systems are no longer exact duplicates, and there is a greater possibility of configuration problems that need to be debugged by hand. The cloud model offers "environment cloning," where multiple environments with the same network topology can exist in harmony.
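
To make this concrete, here is a minimal Python sketch of the idea. The class and field names are purely illustrative (not any provider's API): three cloned environments all contain a "Database" host at 192.168.0.1, each clone lives in its own isolated software-defined network, and only a unique NAT address is visible to the outside.

    from dataclasses import dataclass, field
    from copy import deepcopy

    @dataclass
    class Host:
        hostname: str
        ip: str                      # private address inside the environment

    @dataclass
    class Environment:
        name: str
        nat_ip: str                  # the only address visible to the on-prem network
        hosts: list = field(default_factory=list)

    # Reference topology captured once, then cloned verbatim.
    template_hosts = [Host("Database", "192.168.0.1"), Host("AppServer", "192.168.0.2")]

    clones = [
        Environment("Test1", nat_ip="10.20.0.11", hosts=deepcopy(template_hosts)),
        Environment("Test2", nat_ip="10.20.0.12", hosts=deepcopy(template_hosts)),
        Environment("Test3", nat_ip="10.20.0.13", hosts=deepcopy(template_hosts)),
    ]

    for env in clones:
        # Every clone sees "Database" at the same private IP -- no "re-IP" step required.
        db = next(h for h in env.hosts if h.hostname == "Database")
        print(f"{env.name}: {db.hostname} at {db.ip}, reachable from on-prem via {env.nat_ip}")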

Resetting a corrupt test or R&D environment

Once an application environment has been used extensively, the test data may become stale, or automated test cases may have created data that must be reset or removed before the next test run can be executed. Another variation is that multiple, slightly different test datasets are needed to validate all configurations of the target system. Traditional enterprise thinking would use scripts and possibly other automation techniques to delete database data, reset configuration files, remove log files, etc. Each of these options can be time-consuming and error-prone. The cloud approach leverages "image templates," where complete, ready-to-use VMs along with their network topology and data are saved into templates. If a database becomes corrupt, or heavily modified from previous testing, instead of resetting the data via scripts, the cloud model replaces the entire database VM along with all of its data in one step. The reset process can often be done in seconds or minutes versus hours or days. Complete, ready-to-use environments containing dozens of VMs can be saved as templates and reconstituted in very short amounts of time. For example, instead of taking weeks or months to rebuild a complex multi-VM/LPAR environment from scratch, what if it took only a few hours?
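
As an illustration only, the sketch below contrasts the reset-by-scripting approach with reset-by-replacement. The REST endpoints, IDs, and token are hypothetical placeholders for whatever "redeploy from template" operation your cloud exposes; the point is that the reset becomes a single replace operation rather than a pile of cleanup scripts.

    import requests

    CLOUD_API = "https://cloud.example.com/api/v1"   # placeholder endpoint
    TOKEN = {"Authorization": "Bearer <token>"}      # placeholder credentials

    def reset_environment(env_id: str, template_id: str) -> str:
        """Replace a corrupted environment with a fresh copy of its template."""
        # Tear down the modified environment in one call...
        requests.delete(f"{CLOUD_API}/environments/{env_id}", headers=TOKEN, timeout=60)
        # ...and reconstitute it, data and network topology included, from the template.
        resp = requests.post(f"{CLOUD_API}/environments",
                             json={"template_id": template_id},
                             headers=TOKEN, timeout=60)
        resp.raise_for_status()
        return resp.json()["id"]   # id of the clean environment, ready in minutes

    # new_env = reset_environment("qa-2", "tpl-consolidation-rc3")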

Providing self-service of new test environments

Traditional enterprise thinking would limit control of cloud resources so that only a select few have direct access, and those few then create environments for others. The historical reasons given for this approach are steeped in antiquated cultural models, and it can cause a project sprint that requires a new, clean environment to be delayed for weeks or months. The cloud approach delivers "self-service with IT control and oversight." Users and groups are given direct access to the cloud but with restrictions that limit their consumption, so the overall system is protected from a runaway script that incurs excessive charges or consumes all available resources. Users and groups have a "Quota" that limits the amount of consumption possible at any one time. Users are assigned "Projects" that define which environments they can see: a QA user sees environments "QA1" and "QA2", while a developer might see both of those as well as "R&D1" and "R&D2". Finally, the cloud system provides universal "Auditing" so that user activities are tracked and available for reporting. The question of "Who deleted that AIX LPAR?" is no longer a mystery.
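
Here is a minimal sketch of those guardrails, assuming a simple in-house policy layer rather than any specific provider's feature set: users deploy on demand, but a project-visibility check and a quota check run first, and every action lands in an audit log.

    from datetime import datetime, timezone

    QUOTAS = {"qa-team": {"max_vms": 40, "in_use": 32}}          # illustrative limits
    PROJECTS = {"alice": ["QA1", "QA2"], "bob": ["QA1", "QA2", "R&D1", "R&D2"]}
    AUDIT_LOG = []

    def deploy(user: str, group: str, project: str, vm_count: int) -> bool:
        """Self-service deploy with IT oversight: project visibility, quota, audit trail."""
        if project not in PROJECTS.get(user, []):
            raise PermissionError(f"{user} cannot see project {project}")
        quota = QUOTAS[group]
        allowed = quota["in_use"] + vm_count <= quota["max_vms"]
        if allowed:
            quota["in_use"] += vm_count
        # Every attempt, allowed or not, is recorded -- "who deleted that LPAR?" is answerable.
        AUDIT_LOG.append({"when": datetime.now(timezone.utc).isoformat(),
                          "user": user, "action": "deploy", "project": project,
                          "vms": vm_count, "allowed": allowed})
        return allowed

    print(deploy("alice", "qa-team", "QA2", vm_count=5))   # True: within quota
    print(deploy("alice", "qa-team", "QA2", vm_count=20))  # False: quota would be exceeded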

How to leverage the cloud to consolidate on-prem systems

With the above points laid out, here is my recommended cloud-based model for projects involving the consolidation of multiple on-prem servers/LPARs that will remain on-prem after consolidation:

1. "Lift and shift" existing-working resources from on-prem

The recommended approach is to "create or recreate" a representation of the final target system in the cloud, but not re-engineer any components into cloud-native equivalents. The same number of LPARs, the same memory/disk/CPU allocations, the same file system structures, the same exact IP addresses and hostnames, and the same network subnets are created in the cloud to represent, as closely as possible, a "clone" of the eventual system of record that will exist on-prem. The benefit of this approach is that you can apply "cloud flexibility" to what was historically a "cloud stubborn" system. Fast cloning, ephemeral longevity, software-defined networking, and API automation can all be applied to the temporary stand-in running in the cloud. As design principles are finalized based on research performed on the cloud version of the system, those findings can be applied to the on-prem final buildout.
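
One way to keep the cloud stand-in honest is to drive it from a declarative spec copied directly from the on-prem inventory, so nothing gets "re-engineered" along the way. The sketch below is illustrative only; the field names and the provisioning functions are assumptions standing in for whatever your provider's API actually offers.

    # One entry per on-prem LPAR, copied verbatim: same hostname, IP, subnet, and sizing.
    TARGET_SPEC = {
        "subnets": ["192.168.0.0/24", "192.168.1.0/24"],
        "lpars": [
            {"hostname": "db01",  "ip": "192.168.0.1", "cpu": 8, "ram_gb": 64, "disk_gb": 500},
            {"hostname": "app01", "ip": "192.168.0.2", "cpu": 4, "ram_gb": 32, "disk_gb": 200},
            {"hostname": "app02", "ip": "192.168.0.3", "cpu": 4, "ram_gb": 32, "disk_gb": 200},
        ],
    }

    def create_subnet(cidr: str) -> None:
        print(f"create subnet {cidr}")               # stand-in for the provider's API call

    def create_lpar(hostname: str, ip: str, cpu: int, ram_gb: int, disk_gb: int) -> None:
        print(f"create LPAR {hostname} {ip} ({cpu} CPU / {ram_gb} GB RAM / {disk_gb} GB disk)")

    def build_cloud_standin(spec: dict) -> None:
        """Recreate the on-prem topology 1:1 in the cloud -- no cloud-native rewrites."""
        for subnet in spec["subnets"]:
            create_subnet(subnet)
        for lpar in spec["lpars"]:
            create_lpar(**lpar)

    build_cloud_standin(TARGET_SPEC)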

To jump-start the cloud build-out process, it is possible to reuse existing on-prem assets as the foundation for components built in the cloud. LPARs in the cloud can be based on existing mksysb images already created on-prem. Other alternatives like ‘alt_disk_copy’ can be used to take snapshots of root and data volume groups and move them to LPARs running in the cloud.
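
As a minimal sketch of automating the capture step on each source LPAR: the mksysb invocation below is standard AIX (the -i flag regenerates the image.data file before the backup), while the upload step is a placeholder, since the import mechanism is specific to your cloud provider.

    import subprocess
    from pathlib import Path

    def capture_mksysb(output_dir: str, hostname: str) -> Path:
        """Create a bootable rootvg backup (mksysb) that can seed a cloud LPAR."""
        image = Path(output_dir) / f"{hostname}.mksysb"
        # -i regenerates /image.data so the backup reflects the current volume group layout.
        subprocess.run(["mksysb", "-i", str(image)], check=True)
        return image

    def upload_to_cloud(image: Path) -> None:
        # Placeholder: each provider has its own import path (object storage, SFTP, etc.).
        print(f"upload {image} to the cloud image library")

    # image = capture_mksysb("/export/mksysb", "db01")
    # upload_to_cloud(image)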

2. Save to Template

Once a collection of LPARs representing an "Environment" has been created in the cloud, the environment is saved as a single object called a Template. The Template is used to "clone" other working environments. Clones are exact duplicates of the Template, down to the hostnames, IP addresses, subnets, and disk allocations. Creating ready-to-use environments from a template is the most powerful component of the cloud-based approach. It allows multiple exact copies of the reference system to be handed out to numerous ENG/DEV/TEST groups, all of which can run in parallel. There is no need to change the IP address of individual servers or their hostnames. Each environment runs in a virtual data center in harmony with the others. If environments have to communicate with other on-prem resources, they are differentiated via an isolated NAT mechanism, as described below. Many of the environments contain the same VM clone base image(s) with the same hostnames, IP addresses, etc.
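
The sketch below shows the save-then-clone flow as two hypothetical API calls. The endpoints and field names are placeholders, and the clone request deliberately carries no network overrides, because each clone keeps every hostname, IP address, and subnet of the Template.

    import requests

    CLOUD_API = "https://cloud.example.com/api/v1"   # placeholder endpoint
    TOKEN = {"Authorization": "Bearer <token>"}      # placeholder credentials

    def save_as_template(env_id: str, name: str) -> str:
        """Snapshot a working environment (VMs, disks, network topology) as a Template."""
        resp = requests.post(f"{CLOUD_API}/environments/{env_id}/template",
                             json={"name": name}, headers=TOKEN, timeout=120)
        resp.raise_for_status()
        return resp.json()["template_id"]

    def clone(template_id: str, count: int) -> list:
        """Stamp out exact copies -- same hostnames, same IPs -- one per team."""
        ids = []
        for n in range(1, count + 1):
            resp = requests.post(f"{CLOUD_API}/environments",
                                 json={"template_id": template_id, "name": f"ENV-{n}"},
                                 headers=TOKEN, timeout=120)
            resp.raise_for_status()
            ids.append(resp.json()["id"])
        return ids

    # tpl = save_as_template("reference-env", "consolidation-rc1")
    # environments = clone(tpl, count=4)   # e.g. DEV, QA, ENG, INTEGRATION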

3. Make Templates available to users

Once Templates are created, they can be made available to users. Each cloud provider has a different mechanism for allowing users to access and deploy assets within its cloud. The idea is that if you are a developer, you don’t need access to components that are part of production, and if you are part of ENG, you might not need access and visibility into the core QA environments. Your cloud should allow you to assign roles to various types of consumers, and based on that role, users can only perform certain operations. For example, if you are "Restricted", you might be able to view VMs and connect to them, but you can’t change them in any way.

How to create cloned environments with duplicate address spaces

Step 2 above involves creating multiple working environments that replicate the same network topology as the final target system. This has several benefits, but can be difficult to understand for people used to working with on-prem networks. Let’s dig into this in more detail.

"Replicate" meaning re-using the same host-names, IP addresses, and subnets within each environment. To achieve this, some form of isolation must be implemented to avoid collision across duplicate environments. Each environment must exist within its own software-defined networking space not visible to other environments that are also running. Using this mechanism, it is possible to create exact clones of multi-VM architectures with multiple subnets containing replicated address spaces. Each environment becomes a virtual private data-center.

Cloned environments communicate back to upstream on-prem resources via a single focal point called an "environment virtual router" (VR). The VR hides the lower VMs containing duplicate hostnames and IP addresses and exposes a unique IP address to the greater on-prem network. This mechanism creates a simple and elegant way for multiple duplicate environments to exist in harmony without breaking basic network constructs. By allowing duplicate hostnames and IP addresses to exist, individual hosts do not have to go through a "re-IP" process, which is error-prone and time-consuming. The VR becomes the "jump-host" that allows operations like SSH into each unique environment. From on-prem, users first SSH to the jump-host, which exposes a unique IP address to on-prem, and the jump-host then relays the connection down to the VM within an individual environment.
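
In practice, reaching a VM that has the same private IP in every clone can look like the sketch below: an ordinary OpenSSH ProxyJump (-J) through the environment's VR, whose address is the only thing that differs between environments. The addresses and the user name are illustrative assumptions.

    import subprocess

    def ssh_via_vr(vr_ip: str, inner_ip: str, user: str = "labuser") -> None:
        """SSH to a cloned VM by hopping through its environment's virtual router."""
        # -J (ProxyJump) relays through the VR; the inner IP (192.168.0.1) is the
        # same in every clone, so only the VR address changes per environment.
        subprocess.run(["ssh", "-J", f"{user}@{vr_ip}", f"{user}@{inner_ip}"], check=True)

    # Same "Database" host, three different clones:
    # ssh_via_vr("10.20.0.11", "192.168.0.1")   # Test1
    # ssh_via_vr("10.20.0.12", "192.168.0.1")   # Test2
    # ssh_via_vr("10.20.0.13", "192.168.0.1")   # Test3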

Use Cases

The following use cases apply to many server consolidation projects where different groups of participants need access to a representation of the final target system or some subset of the final design. Using the cloud-based process explained above makes all of these scenarios faster, more efficient, and more productive. The key is to quickly deliver the infrastructure needed for a specific task. Waiting weeks for infrastructure delivery should be considered an "anti-pattern" and avoided if at all possible, since the cumulative time spent waiting over the course of the project would be considerable. Building internal resource delivery processes with slow delivery times goes against the concepts described in works such as "The Phoenix Project" and "The Goal."

1. R&D Sandbox

Developers need a way to create their own "environment" of components that represent the target system (see Gene Kim, "DevOps for High Performing Enterprises"). Providing "representative" environments, instead of mock environments running on local workstations/laptops, is key to reliably proving out many of the architectural assumptions being made in the project.

2. Classical QA Automation testing

Instead of sharing a limited number of environments that commonly develop "configuration drift" from the current versions of components, QA should be able to easily create ‘n’ QA environments: "QA-1", "QA-2", "QA-3", … "QA-n." Leveraging cloud representations of the target system allows QA to completely destroy and rebuild the correct target environment from scratch within minutes or hours. No more scripts to back out or reset test data. No more "reset" scripts to return configurations to a starting state. The QA environment is completely ephemeral (short-lived) and may only exist for the duration of the test run. If tests fail, the entire environment is "saved" as a complete snapshot, aka ‘Template,’ and attached to the defect report so it can be reconstituted by ENG when diagnosing the problem. For the next test run, a completely new environment is generated from the Template, separate from the previous environment used in past test runs that may contain defects.
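
Here is a sketch of that ephemeral test loop, with the provisioning calls stubbed out (they would map to whatever deploy/snapshot/delete operations your cloud exposes, and the template and run IDs are illustrative): the environment exists only for the duration of the run, and on failure the whole environment is captured as a Template and referenced in the defect report.

    import subprocess

    # Stubs standing in for the cloud provider's deploy / snapshot / delete operations.
    def deploy_from_template(template_id: str) -> str:
        print(f"deploy environment from {template_id}")
        return "env-1234"

    def save_as_template(env_id: str, name: str) -> str:
        print(f"snapshot {env_id} as {name}")
        return f"tpl-{name}"

    def destroy(env_id: str) -> None:
        print(f"destroy {env_id}")

    def run_qa_cycle(template_id: str, run_id: str) -> None:
        env_id = deploy_from_template(template_id)          # fresh, known-good environment
        try:
            result = subprocess.run(["pytest", "--maxfail=1"])   # or any test runner
            if result.returncode != 0:
                # Freeze the broken environment and attach it to the defect report.
                snapshot = save_as_template(env_id, f"failed-{run_id}")
                print(f"attach {snapshot} to the defect for ENG to reconstitute")
        finally:
            destroy(env_id)                                  # nothing to reset for next time

    # run_qa_cycle("tpl-consolidation-rc3", run_id="nightly-42")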

3. Integration testing

Traditional enterprise thinking creates a limited number of integration test environments shared among many groups. Because of this, the integration environment is often broken, misconfigured, out of date, stale, or unusable in some way. Environment "drift" becomes a barrier to regular testing.

Applying cloud thinking to the building of on-prem systems allows different integration testing approaches to be used in an economical and efficient manner. The cloud can create multiple integration environments that are all "identical," based on the current target design. R&D and ENG subgroups can each have a dedicated integration environment that combines work from multiple squads of the same discipline without colliding with other system components. For example, all the teams working on database changes can first integrate their work into a localized integration testing environment. Once successful, the combined modifications can be promoted to the higher-level environment where all system components are brought together.

4. Chaos Engineering, "What if we take server X offline?"

"Chaos Engineering for Traditional Applications" documents the justification and use cases where cloud-native Chaos engineering theory can be applied to traditional applications running on-prem. Server consolidation projects have the manifest need to apply "Chaos Testing" to the intermediate release candidates being built. A new level of "What if XYZ happens?" will be achieved when combining multiple systems down into a smaller number. Some categories of problem areas that need "random failure testing" would include:

Resource-based failures, such as:

  • Low memory
  • Not enough CPU
  • Full disk volumes
  • Low network bandwidth, high latency
  • Hardware failures like a failed disk drive, failed server, disconnected network

And not so obvious ones could be:

  • Database/server process down
  • Microservice down
  • Application code failure
  • Expired certificate(s)

And even less obvious:

  • Is there sufficient monitoring, and have alarms been validated?

Each category requires running multiple "experiments" to understand how the overall system reacts when a chaotic event is introduced. The cloud can then be used to recover the system back to a stable state. Quick environment recovery allows for the execution of multiple, potentially destructive experiments.
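
Here is a sketch of one such experiment, fault injection plus recovery, using the same clone-and-discard mechanics. The fault is a full disk volume (one of the resource-based failures listed above), induced with an ordinary dd over SSH through the environment's VR, and recovery is simply redeploying the clone. The hostnames, addresses, and provisioning stubs are illustrative assumptions, not any provider's API.

    import subprocess

    def deploy_from_template(template_id: str) -> str:
        print(f"deploy chaos environment from {template_id}")
        return "chaos-env-1"

    def destroy(env_id: str) -> None:
        print(f"destroy {env_id}")

    def fill_disk(vr_ip: str, inner_ip: str, mount: str = "/var") -> None:
        """Induce a 'full disk volume' fault on one host inside the cloned environment."""
        # dd fills the volume until it errors out; '|| true' keeps the remote exit code clean.
        cmd = f"dd if=/dev/zero of={mount}/chaos.fill bs=1M || true"
        subprocess.run(["ssh", "-J", f"labuser@{vr_ip}", f"labuser@{inner_ip}", cmd], check=True)

    def experiment(template_id: str, vr_ip: str) -> None:
        env_id = deploy_from_template(template_id)
        try:
            fill_disk(vr_ip, "192.168.0.1")          # e.g. the consolidated database host
            # Observe: did monitoring alarm? did dependent services degrade gracefully?
        finally:
            destroy(env_id)                          # recovery is a redeploy, not a cleanup

    # experiment("tpl-consolidation-rc3", vr_ip="10.20.0.15")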

5. Disaster Recovery

A consolidated server solution creates a whole new application architecture that did not exist before, which raises the question of how Disaster Recovery (DR) will be implemented for the new system. The DR requirements for a consolidated system are even higher than those of the individual, disparate systems that existed pre-consolidation. Consolidation creates an "all or nothing" situation for DR, since the DR mechanism now holds all of the previously individual components as a single unit of failure. Before consolidation, one of the individual components could fail without impacting the others; post-consolidation, all components share a single unit of failure, so a DR event may cause more "ripples in the pond" than before.

But once again, a cloud-based thinking model allows for experimentation and trial-and-error during the design of the DR implementation. Viable approaches can be tested in production-like mock environments running in the cloud that will eventually be replaced by traditional on-prem systems. The cloud becomes the DR "sandbox," so the right approach can be validated without purchasing non-disposable fixed assets.

A faster, safer path to server consolidation

This article outlines an approach that provides a "safe path" for server consolidation projects. The temporary use of the cloud as a placeholder for eventual on-prem resources allows for experimentation in the design, greater amounts of QA testing, and advanced testing concepts like chaos engineering. The outlined approach "unblocks" traditional enterprise construction models by giving "production-like" environments to all groups that need them, and eliminates the constraint of environment resources as described in "The Goal." The cloud-to-on-prem design model described here is the solution to the anti-patterns historically created on-prem, where Agile sprint teams and squads wait "weeks or months" for needed environments.

About the Author

Tony Perez is a Cloud Solution Architect at Skytap. Perez has deep experience as a solution architect with a demonstrated history of working in the information technology and services industry in engineering roles in Sales, Customer Success, Cloud, Monitoring Performance, Mobile Applications, Professional Services, and Automated Software Testing. He began his career at Sequent Computer Systems and Oracle and has worked at Netscape, Mercury Interactive, Argogroup and Keynote.
