A team of researchers at the RISELab at UC Berkeley has recently released Skyplane, an open source tool to optimize the transfer of large datasets between cloud providers, reducing transfer times and costs.
Designed to be cloud agnostic, Skyplane currently supports transfers between object stores on AWS, Google Cloud, and Azure. Paras Jain and Sarah Wooders, PhD students at UC Berkeley, and Joseph Gonzalez, associate professor at UC Berkeley, write:
Data transfers in the cloud are slow and expensive. Transfers using typical CLI tools like aws s3 cp or rsync lead to transfer speeds as low as 20MB/s (slower than the US average broadband speed). Moreover, cloud transfers are very expensive; it can cost up to $14 to copy a 100GB dataset due to the cloud’s egregious egress fees.
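To put the quoted figures in perspective, the short calculation below works out how long a 100GB copy takes at 20MB/s and what the quoted $14 comes to per gigabyte. It is only back-of-the-envelope arithmetic on the numbers in the quote; the function names are illustrative, and decimal units (1GB = 1000MB) are an assumption.

```python
def transfer_time_minutes(size_gb, throughput_mb_per_s):
    """Minutes needed to move size_gb at a sustained throughput (decimal units assumed)."""
    size_mb = size_gb * 1000
    return size_mb / throughput_mb_per_s / 60


def cost_per_gb(total_cost_usd, size_gb):
    """Effective price per gigabyte for a transfer."""
    return total_cost_usd / size_gb


# Figures taken from the quote: 100GB dataset, 20MB/s, up to $14 in total.
print(f"{transfer_time_minutes(100, 20):.1f} minutes")  # 83.3 minutes
print(f"${cost_per_gb(14, 100):.2f} per GB")            # $0.14 per GB
```

At the quoted worst-case speed, a 100GB dataset takes well over an hour to copy.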
Skyplane creates an overlay network on top of the cloud providers so it can route around congested cloud network links. It also uses LZ4 compression and parallelizes transfers by striping data over many pipelined TCP connections and multiple VMs. Jain, Wooders and Gonzalez explain the goals of the project:
For the last decade as researchers at Berkeley, our research to optimize the performance of data-intensive workloads led to Apache Spark and the Ray project. These systems were designed at a time when datasets predominantly lived in a single region in a single cloud. Increasingly, the bottleneck is data transfer between cloud regions and cloud providers. We think data transfer should be faster, cheaper and universal across any cloud.
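The idea of striping a transfer over several parallel connections can be sketched in a few lines. The example below is a toy illustration and not Skyplane's implementation: threads stand in for TCP connections and VMs, `send_chunk` is a hypothetical placeholder for a real network send, and `zlib` from the standard library stands in for LZ4.

```python
import hashlib
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_chunks(data, chunk_size):
    """Return (index, chunk) pairs that together cover the whole payload."""
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def send_chunk(indexed_chunk):
    """Hypothetical stand-in for sending one chunk over one connection.

    The chunk is compressed before the "transfer" and decompressed on arrival.
    """
    index, chunk = indexed_chunk
    compressed = zlib.compress(chunk)       # assumption: zlib replaces LZ4 here
    received = zlib.decompress(compressed)  # what the receiving side would do
    return index, received


def striped_transfer(data, workers=4):
    """Send all chunks through a pool of workers and reassemble them in order."""
    chunks = split_into_chunks(data, CHUNK_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send_chunk, chunks))
    results.sort(key=lambda pair: pair[0])  # reorder by chunk index
    return b"".join(chunk for _, chunk in results)


payload = b"skyplane striping demo payload"
received = striped_transfer(payload)
# A checksum comparison confirms the reassembled data matches the source.
assert hashlib.sha256(received).digest() == hashlib.sha256(payload).digest()
```

In a real transfer each worker would hold its own connection, so a slow link delays only the chunks assigned to it rather than the whole file.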
The researchers claim that Skyplane is up to 110x faster than services like AWS DataSync and up to 3.8x cheaper than existing free tools. Skyplane profiles network cost and throughput across cloud regions and borrows ideas from MIT's Resilient Overlay Networks (RON) project to identify optimal transfer paths across regions and cloud providers, transparently to the end user.
Source: https://medium.com/@paras_jain/skyplane-110x-faster-data-transfers-on-any-cloud-8f0165c1d711
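The path selection described above, choosing between a direct link and a relay through another region, can be illustrated with a small sketch. This is not Skyplane's planner: the link table, region pairs, throughput figures and prices below are all invented for illustration, and the selection rule (fastest path within a cost budget) is a simplifying assumption.

```python
# Hypothetical profile: (throughput in Gbit/s, egress price in $/GB) per link.
LINKS = {
    ("aws:us-east-1", "gcp:us-west1"): (1.0, 0.09),
    ("aws:us-east-1", "aws:us-west-2"): (5.0, 0.02),
    ("aws:us-west-2", "gcp:us-west1"): (4.0, 0.09),
}


def path_metrics(path):
    """Throughput is limited by the slowest hop; cost adds up over all hops."""
    hops = list(zip(path, path[1:]))
    throughput = min(LINKS[hop][0] for hop in hops)
    cost = sum(LINKS[hop][1] for hop in hops)
    return throughput, cost


def best_path(src, dst, relays, max_cost_per_gb):
    """Pick the fastest direct or one-relay path within the cost budget."""
    candidates = [[src, dst]] + [[src, relay, dst] for relay in relays]
    feasible = []
    for path in candidates:
        if not all(hop in LINKS for hop in zip(path, path[1:])):
            continue  # skip paths that use an unprofiled link
        throughput, cost = path_metrics(path)
        if cost <= max_cost_per_gb:
            # Prefer higher throughput, then lower cost.
            feasible.append((throughput, -cost, path))
    return max(feasible)[2] if feasible else None


src, dst = "aws:us-east-1", "gcp:us-west1"
print(best_path(src, dst, ["aws:us-west-2"], max_cost_per_gb=0.12))
print(best_path(src, dst, ["aws:us-west-2"], max_cost_per_gb=0.10))
```

With these made-up numbers, a budget of $0.12 per GB selects the relay through `aws:us-west-2` because it is faster, while a tighter budget of $0.10 falls back to the cheaper direct link.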
For example, to move a single large file between two regions in AWS, the Skyplane CLI command is:
$ skyplane cp s3://srcbucket/64gb.tar.gz s3://dstbucket/
The syntax is similar to that of the AWS CLI, but the benchmarks suggest significant improvements in parallelism and cost. A separate article describes the architecture of the tool, covering security, firewalls, transfer integrity and checksumming.
Skyplane is not the only option available for high-throughput transfers. Developers can use data transfer services like AWS DataSync or GCP Storage Transfer Service, but these charge service fees and are usually optimized to move data into the provider's own cloud, so they lack portability.
Skyplane currently handles local-to-cloud transfers only by falling back to awscli for AWS and gsutil for Google Cloud. Support for Oracle Cloud, IBM Cloud and Alibaba Cloud is on the roadmap, and additional providers can be added by integrating with their authentication and object store APIs.