This article addresses a topic that is not yet well covered in the IT world: live migration of containers, how it works behind the scenes, and what problems it solves. Demand for this technology is growing because it unlocks new possibilities and gives more freedom in application lifecycle management.
What is live migration?
Live migration of containers is the process of moving an application between different physical machines or clouds without disconnecting the client. The memory, file system, and network connectivity of containers running on bare metal hardware are transferred from the original host machine to the destination, preserving state and avoiding downtime.
Problems live migration solves
There are several problems that live migration can address:
- Downtime during hardware maintenance. When a system administrator needs to upgrade hardware, it is very painful to migrate all the customers from one hardware node to another, and in many cases it is simply impossible without downtime.
- Unbalanced cluster load. When one hardware node becomes overloaded, rebalancing can require applications to implement specific patterns, which narrows the choice of workloads that can be hosted in the cluster.
- Trouble within a cloud. There are many clouds on the market today, and sometimes they have downtime, change their pricing policy, or degrade in quality of service. In most cases it is hard to migrate an application from one cloud provider to another.
Alternative Solutions
The problems mentioned above can be solved in different ways. Let's analyze how people deal with these issues by means other than live migration.
- Planned downtime. To perform maintenance on a cluster, application owners have to be notified in advance about the maintenance window and possible downtime. The hardware is then shut down and brought back only when all required changes are done. The issue here is the relatively long downtime.
- Traffic rerouting. To do the maintenance, a copy of each application is restored on another hardware node, traffic is rerouted to this new copy, and the previous one is shut down. The issue here is complexity: applications have to be specifically designed for high availability and data synchronization. In addition, it may require more hardware resources.
- Microservices. Dividing the application services granularly into separate containers and distributing them across different physical servers helps to avoid downtime in case of hardware failure. The affected containers are automatically restored on an active hardware node. However, the issue here is again complexity, as the applications in the cluster have to be properly designed to manage high availability and the restore process after a failure.
How Live Migration Works
Let's see how the live migration process works from the technical side, using the scheme below.
- Source Node - where a container is placed before live migration
- Destination Node - where a container will be placed after live migration
To perform the migration, the platform freezes the container on the source node, blocking its memory, processes, file system, and network connections, and captures the state of the container. That state is then copied to the destination node, where the platform restores it and unfreezes the container. Finally, there is a quick cleanup on the source node.
It is pretty straightforward: you get the state, you copy the state, and you restore the state. Note, however, that there is a freeze window, which has to be considered during application architecture design because it can be an issue for some applications.
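The freeze, copy, restore, and cleanup sequence can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Container`, `Node`, and `live_migrate` are hypothetical names invented for illustration, the "state" is just a Python dictionary, and no real container engine or checkpoint tool is involved.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy stand-in for a container: a name, some state, and a frozen flag."""
    name: str
    memory: dict = field(default_factory=dict)
    frozen: bool = False


@dataclass
class Node:
    """Toy stand-in for a hardware node that holds containers by name."""
    name: str
    containers: dict = field(default_factory=dict)


def live_migrate(name: str, source: Node, destination: Node) -> Container:
    """Freeze, get the state, copy it, restore, unfreeze, then clean up."""
    container = source.containers[name]

    # 1. Freeze: from here on the container's state no longer changes.
    container.frozen = True

    # 2. Get the state (here simply a deep copy of the dictionary).
    snapshot = copy.deepcopy(container.memory)

    # 3. Copy the state to the destination node and restore it there.
    restored = Container(name=name, memory=snapshot, frozen=True)
    destination.containers[name] = restored

    # 4. Unfreeze the container on the destination node.
    restored.frozen = False

    # 5. Quick cleanup on the source node.
    del source.containers[name]
    return restored


source = Node("source", {"web": Container("web", {"sessions": 3})})
destination = Node("destination")
migrated = live_migrate("web", source, destination)
print(migrated.memory, list(source.containers))
```

In this model the freeze window is everything between steps 1 and 4, which is why the amount of state copied while frozen determines how long clients have to wait.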
There are two approaches to live migration. The first is pre-copy memory. When a container is to be migrated, the platform turns on memory tracking on the source node and copies memory to the destination node in parallel, while the container keeps running, until the difference becomes minimal. After that, it freezes the container, gets the rest of the state, migrates it to the destination node, restores it, and unfreezes the container.
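A rough simulation may help to show why pre-copy shortens the freeze. The sketch below is an assumption-laden toy: memory is a dictionary of pages, the `dirty_rounds` argument is a made-up workload describing which pages the running application rewrites during each copy round, and `threshold` is a hypothetical cutoff for "the difference becomes minimal". It does not reflect how any particular engine tracks memory.

```python
def pre_copy_migrate(source_pages, dirty_rounds, threshold=2):
    """Simulate pre-copy: copy while running, freeze only for the last delta.

    source_pages: dict page_id -> content on the source node (updated in place).
    dirty_rounds: list of dicts; each one holds the pages the still-running
                  application rewrites during one copy round.
    threshold:    freeze once the number of dirty pages is at or below this.
    Returns (destination_pages, number_of_pages_copied_while_frozen).
    """
    destination = {}
    dirty = set(source_pages)      # initially every page has to be sent
    rounds = iter(dirty_rounds)

    while len(dirty) > threshold:
        # Copy the currently dirty pages while the container keeps running.
        for page in dirty:
            destination[page] = source_pages[page]
        # Meanwhile the application modified some pages; track them as dirty.
        writes = next(rounds, {})
        source_pages.update(writes)
        dirty = set(writes)

    # Freeze: no more writes happen, so the remaining delta is final.
    frozen_delta = len(dirty)
    for page in dirty:
        destination[page] = source_pages[page]
    return destination, frozen_delta


pages = {i: "v0" for i in range(8)}
workload = [
    {0: "v1", 1: "v1", 2: "v1", 3: "v1"},   # writes during the first round
    {0: "v2", 1: "v2"},                      # writes during the second round
]
destination, frozen_delta = pre_copy_migrate(pages, workload, threshold=2)
print(frozen_delta, destination == pages)    # prints: 2 True
```

Here only 2 of the 8 pages are transferred during the freeze. An application that dirties pages faster than they can be copied would never reach the threshold, so a practical implementation would presumably also cap the number of rounds.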
The other approach is post-copy memory, also known as lazy migration. The system freezes the container on the source node at the very beginning, gets the state of the fastest-changing memory pages, moves that state to the destination node, restores it, and unfreezes the container. The rest of the state is copied from the source node to the destination in the background.
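Post-copy can be pictured as destination memory that starts almost empty and fills itself in later. The following is a minimal sketch, assuming a hypothetical `LazyDestination` class in which a plain dictionary lookup stands in for fetching a page over the network; the choice of `hot_pages` is also an assumption made for the example.

```python
class LazyDestination:
    """Destination-side memory that pulls missing pages from the source."""

    def __init__(self, source_pages, hot_pages):
        # The source node stays reachable until all pages have been moved.
        self._source = source_pages
        # Only the hot pages are transferred while the container is frozen.
        self.local = {page: source_pages[page] for page in hot_pages}
        self.faults = 0

    def read(self, page):
        """Return a page, fetching it from the source on first access."""
        if page not in self.local:
            self.faults += 1
            self.local[page] = self._source[page]
        return self.local[page]

    def background_copy(self):
        """Copy every page that is still missing; return how many were moved."""
        missing = [page for page in self._source if page not in self.local]
        for page in missing:
            self.local[page] = self._source[page]
        return len(missing)


source = {"a": 1, "b": 2, "c": 3, "d": 4}
dest = LazyDestination(source, hot_pages=["a"])   # freeze window moves 1 page
value = dest.read("c")                            # on-demand fetch after unfreeze
moved = dest.background_copy()                    # remaining pages: "b" and "d"
print(value, dest.faults, moved)                  # prints: 3 1 2
```

The trade-off visible even in this toy is that the freeze is short, but the container depends on the source node until the background copy completes, and every on-demand fetch adds latency to that access.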
The freeze usually takes from 5 to 30 seconds per container, depending on the application. That is a very small time frame compared to the hours of downtime possible during cluster maintenance.
Use Cases of Live Migration
- Hardware maintenance without downtime
During a maintenance window, containers can be live migrated from one physical hardware node to another inside one data center, so no downtime occurs.
- Load rebalance
With live migration, load can be rebalanced by migrating containers from one hardware node to another. This can even be automated by implementing a specific scheduling algorithm and triggers.
- High-availability within hardware zones and data centers
A cloud service provider can preconfigure and offer a set of hardware availability zones inside one or several data centers. As a result, end users get more options for high availability by live migrating containers without involving system administrators.
- Change of cloud vendor
Live migration frees end users from being locked in to a specific cloud infrastructure vendor. Applications can be migrated to an alternative cloud service provider without any reconfiguration or redeployment during the migration process.
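For the load rebalance use case, the scheduling algorithm mentioned above could be as simple as a greedy planner. The sketch below is one possible illustration, not a description of any product: `plan_rebalance`, the `max_gap` trigger, and the load numbers are all hypothetical, and load is expressed in arbitrary units.

```python
def plan_rebalance(nodes, max_gap=10):
    """Greedily plan live migrations until node loads are within max_gap.

    nodes: dict node_name -> dict container_name -> load (arbitrary units).
    Returns a list of (container, from_node, to_node) moves.
    """
    nodes = {name: dict(containers) for name, containers in nodes.items()}
    moves = []
    while True:
        load = {name: sum(c.values()) for name, c in nodes.items()}
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        gap = load[busiest] - load[idlest]
        if gap <= max_gap:          # trigger condition no longer holds
            break
        # Moving a container lighter than the gap always improves the balance.
        candidates = {c: l for c, l in nodes[busiest].items() if 0 < l < gap}
        if not candidates:
            break
        # Prefer the container whose load is closest to half of the gap.
        name = min(candidates, key=lambda c: abs(candidates[c] - gap / 2))
        nodes[idlest][name] = nodes[busiest].pop(name)
        moves.append((name, busiest, idlest))
    return moves


cluster = {
    "node1": {"a": 40, "b": 30, "c": 20},
    "node2": {"d": 10},
}
plan = plan_rebalance(cluster)
print(plan)   # prints: [('a', 'node1', 'node2')]
```

Each planned move would then be carried out as a live migration, so the per-container freeze time is the main cost of every step in the plan.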
Bottlenecks and Hidden Issues
There is a set of challenges to take into consideration when choosing live migration as a solution to the problems mentioned above:
- During live migration you may notice some performance degradation while the container is frozen. For some applications this is a critical issue because they cannot accept any performance degradation (for example, heavily loaded monolithic real-time applications). But for the majority of applications on the internet, web applications in particular, a short freeze does not matter.
- Another challenge relates to big data and fast-changing data, which are not easy to migrate from one cloud provider to another. Network latency and data volume may prevent a live migration from completing successfully.
- Public IPs within multi-cloud. It is impossible to migrate a container with a public IP address from one cloud vendor to another, because the IP address stays inside the original vendor's network.
- If an application inside a container uses native APIs or native cloud services of a specific cloud service provider, live migration across clouds may be harder to implement or even impossible.
Live Migration Market Offerings
Who is offering live migration today? Several products offer live migration of containers.
- Virtuozzo: this team actually created the live migration technology for containers. They were pioneers in this direction and today offer a production-ready container engine with live migration.
- runC, from the Open Container Initiative, is another upcoming container engine solution with live migration based on CRIU.
- Jelastic offers a container orchestration platform that provides live migration of production applications across hardware regions, data centers, and cloud vendors.
Demo: Migrating Minecraft in Live Mode
The video below demonstrates the migration of the Minecraft application from AWS to Azure in a live mode without downtime.
Live migration for containers is still a relatively new technology on the market. However, the benefits for businesses are obvious: no maintenance downtime and no need to spend a lot of effort on preparing, verifying, or double-checking everything. That is why this solution is a good option for improving high availability and gaining more flexibility. Feel free to share your experience of live migrating containers from one instance to another or across data centers.
About the Author
Ruslan Synytsky is CEO and co-founder of Jelastic, a company that delivers PaaS for DevOps and hosting businesses. With over 15 years in the IT industry, Ruslan is an expert in large-scale distributed Java applications and enterprise platforms. Before starting Jelastic in 2011, Ruslan was one of the key engineering leads at the National Space Agency of Ukraine and worked on various innovative projects. Ruslan Synytsky has a rich scientific background and is actively involved in various tech conferences for developers, hosting providers, integrators, and enterprises.