Description
Application containers are widely used in contemporary cloud computing environments. Migrating containers across hosts enables cost-effective cloud management through improved server consolidation, load balancing, and enhanced fault tolerance. One of the primary objectives of container migration is to reduce the service downtime of applications hosted in containers. Service downtime depends on how efficiently the migration activities are performed, specifically between the time the container is stopped on the source host and the time it is restored and fully functional on the destination host.
In this paper, we show that the state-of-the-art pre-copy migration strategy for containers using checkpoint and restore techniques (e.g., CRIU) inflates the downtime due to inherent limitations in its restoration procedure, particularly for containers with a large memory working set. We propose PCLive to address this bottleneck using a pipelined restore mechanism. Compared to the baseline CRIU pre-copy migration, PCLive achieves up to a ~38.8x reduction in restoration time, which leads to a reduction in service downtime of up to ~2.7x when migrating a container hosting the Redis key-value store over a 1 Gbps network. We also present a comprehensive comparative analysis of the resource cost of the proposed solution, along with additional optimizations, to demonstrate that PCLive can reduce application downtime in a resource-efficient manner by leveraging its flexible and efficient design choices.
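To give some intuition for why pipelining the restore can shorten the critical path, the following is a minimal toy model, not PCLive's actual implementation, which the abstract does not detail. It assumes the container state arrives at the destination in chunks and compares restoring only after all chunks have arrived against restoring each chunk as soon as it is available. The function names, the chunking, and the timings are all hypothetical.

```python
def sequential_restore_time(transfer_s, restore_s):
    """Total time when restore starts only after every chunk has arrived.

    transfer_s, restore_s: per-chunk transfer and restore durations in
    seconds (hypothetical values, not measurements).
    """
    return sum(transfer_s) + sum(restore_s)


def pipelined_restore_time(transfer_s, restore_s):
    """Total time when chunk i is restored as soon as it has arrived
    and chunk i-1 has finished restoring (a generic two-stage pipeline)."""
    arrived = 0.0   # time at which the current chunk finishes arriving
    restored = 0.0  # time at which the previous chunk finishes restoring
    for t, r in zip(transfer_s, restore_s):
        arrived += t
        # Restore cannot start before the chunk arrives or before the
        # restore stage is free.
        restored = max(arrived, restored) + r
    return restored


if __name__ == "__main__":
    # Made-up example: four chunks, 1 s to transfer and 0.5 s to restore each.
    transfer = [1.0, 1.0, 1.0, 1.0]
    restore = [0.5, 0.5, 0.5, 0.5]
    print("sequential:", sequential_restore_time(transfer, restore))
    print("pipelined: ", pipelined_restore_time(transfer, restore))
```

With these made-up inputs the sequential model takes 6.0 s and the pipelined model 4.5 s, because all but the last chunk's restore overlaps with transfer. The numbers are illustrative only and are unrelated to the measured results reported above.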