The Amazon Web Services (AWS) team has announced a limited preview of Amazon Redshift, a cloud-hosted data warehouse whose cost and capabilities are poised to disrupt the industry. In addition, AWS revealed two massive new EC2 instance types and a data integration tool called Data Pipeline. Taken together, these services begin to chip away at enterprises’ concerns about whether it is cost-effective and efficient to gather, store, and analyze business data in the public cloud.
Introduced at the first annual AWS re:Invent conference in Las Vegas, Redshift was described by AWS CTO Werner Vogels as “a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud.” Vogels explained how Redshift was built to analyze large data sets quickly.
Amazon Redshift uses a variety of innovations to enable customers to rapidly analyze datasets ranging in size from several hundred gigabytes to a petabyte and more. Unlike traditional row-based relational databases, which store data for each row sequentially on disk, Amazon Redshift stores each column sequentially. This means that Redshift performs much less wasted IO than a row-based database because it doesn’t read data from columns it doesn’t need when executing a given query. Also, because similar data are stored sequentially, Amazon Redshift can compress data efficiently, which further reduces the amount of IO it needs to perform to return results.
Amazon Redshift’s architecture and underlying platform are also optimized to deliver high performance for data warehousing workloads. Redshift has a massively parallel processing (MPP) architecture, which enables it to distribute and parallelize queries across multiple low-cost nodes. The nodes themselves are designed specifically for data warehousing workloads. They contain large amounts of locally attached storage on multiple spindles and are connected by a minimally oversubscribed 10 Gigabit Ethernet network.
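To illustrate the columnar-storage point above, here is a minimal Python sketch. It is a conceptual model rather than Redshift’s actual storage engine, and the table, column names, and run-length encoding scheme are invented for illustration; it shows why a query that touches one column scans far less data when columns are laid out contiguously, and why runs of similar values compress well.

```python
# Conceptual model only (not Redshift's storage engine): compare what a query
# must scan under row-oriented vs. column-oriented layouts.

# Row layout: each record's fields are stored together, so answering
# SUM(amount) means reading order_id and region off disk as well.
row_store = [(i, "us-east", 10.0 + i % 3) for i in range(1_000_000)]

# Column layout: each column is stored contiguously, so the same query
# only needs to read the 'amount' block.
column_store = {
    "order_id": [r[0] for r in row_store],
    "region":   [r[1] for r in row_store],
    "amount":   [r[2] for r in row_store],
}
total = sum(column_store["amount"])   # the other columns are never touched

# Because similar values sit next to each other, a column also compresses
# well; run-length encoding collapses the 'region' column to a single entry.
def run_length_encode(values):
    encoded, prev, count = [], object(), 0
    for v in values:
        if v == prev:
            count += 1
        else:
            if count:
                encoded.append((prev, count))
            prev, count = v, 1
    if count:
        encoded.append((prev, count))
    return encoded

print(run_length_encode(column_store["region"]))   # [('us-east', 1000000)]
```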
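The MPP architecture can be sketched in the same spirit. The scatter/gather pattern below is a simplified illustration with an invented node count and dataset, not Redshift’s internals: rows are partitioned across worker “nodes”, each node computes a partial aggregate in parallel, and a leader combines the results.

```python
# Simplified scatter/gather sketch of an MPP aggregation (illustrative only).
from concurrent.futures import ProcessPoolExecutor

def node_partial_sum(partition):
    """Aggregation work done locally on one compute node."""
    return sum(partition)

def mpp_sum(values, node_count=4):
    # Scatter: distribute rows across the nodes (round-robin partitioning).
    partitions = [values[i::node_count] for i in range(node_count)]
    with ProcessPoolExecutor(max_workers=node_count) as pool:
        partials = pool.map(node_partial_sum, partitions)
    # Gather: the leader combines the per-node partial aggregates.
    return sum(partials)

if __name__ == "__main__":
    print(mpp_sum(list(range(1_000_000))))   # 499999500000
```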
The AWS team blog described the impressive resilience capabilities of Redshift.
Amazon Redshift is designed to retain data integrity in the face of disk and node failures. The first line of defense consists of two replicated copies of your data, spread out over up to 24 drives on different nodes within your data warehouse cluster. Amazon Redshift monitors the health of the drives and will switch to a replica if a drive fails. It will also move the data to a healthy drive if possible, or to a fresh node if necessary. All of this happens without any effort on your part, although you may see a slight performance degradation during the re-replication process.
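The behavior described above amounts to mirrored writes plus failover reads and re-replication. The sketch below is a simplified model of that idea under stated assumptions, not Redshift’s replication code; the class and method names are invented.

```python
# Simplified model of keeping two copies of each block on drives attached to
# different nodes, reading from the replica on failure, and re-replicating.
class Drive:
    def __init__(self, node, healthy=True):
        self.node, self.healthy, self.blocks = node, healthy, {}

class MirroredStore:
    def __init__(self, primary, replica):
        assert primary.node != replica.node, "copies live on different nodes"
        self.primary, self.replica = primary, replica

    def write(self, block_id, data):
        self.primary.blocks[block_id] = data
        self.replica.blocks[block_id] = data     # second copy on another node

    def read(self, block_id):
        drive = self.primary if self.primary.healthy else self.replica
        return drive.blocks[block_id]

    def handle_failure(self, spare):
        """Re-replicate onto a healthy spare drive after a failure."""
        if not self.primary.healthy:
            spare.blocks.update(self.replica.blocks)
            self.primary = spare

store = MirroredStore(Drive(node=1), Drive(node=2))
store.write("blk-1", b"column block bytes")
store.primary.healthy = False                    # simulate a disk failure
print(store.read("blk-1"))                       # served from the replica
store.handle_failure(Drive(node=3))              # data moved to a healthy drive
```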
Redshift follows the standard AWS “pay as you go” pricing model, and Amazon claims that customers will see massive cost savings by using the service. Based on Amazon’s research, a typical on-premises data warehouse costs anywhere from $19,000 to $25,000 per terabyte per year to license and maintain, whereas the Redshift service will cost less than $1,000 per terabyte per year. According to Barb Darrow at GigaOm, this service stands to “siphon business from Oracle (Redshift, get it?), IBM and Teradata,” and AWS isn’t done “building higher-level services that compete not only with old-school IT vendors but with some of Amazon’s own software partners.”
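Using the per-terabyte figures cited above, the gap is easy to put in concrete terms; the snippet below works the arithmetic for a hypothetical 100 TB warehouse (the warehouse size is an arbitrary example, not an AWS figure).

```python
# Back-of-the-envelope cost comparison using the per-terabyte figures above.
warehouse_tb = 100
on_prem_low, on_prem_high = 19_000, 25_000   # USD per TB per year (Amazon's estimate)
redshift = 1_000                             # USD per TB per year (upper bound cited)

print(f"On-premises: ${warehouse_tb * on_prem_low:,} - ${warehouse_tb * on_prem_high:,} per year")
print(f"Redshift:    under ${warehouse_tb * redshift:,} per year")
# On-premises: $1,900,000 - $2,500,000 per year
# Redshift:    under $100,000 per year
```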
While cloud vendors like AWS provide effectively limitless storage, there still remains the challenge of getting the data to the cloud and then consolidating it for analysis by tools like Redshift. The new Data Pipeline product is meant to tackle the latter, while solutions continue to emerge to address the former. Data Pipeline provides a graphical drag-and-drop user interface for modeling flows between data sources. The AWS team blog explains that a Pipeline consists of a data source, a destination, processing steps, and an execution schedule. Pipeline data sources may be AWS services such as RDS, DynamoDB, and S3, or databases running in EC2 virtual machines or even in on-premises data centers. Data Pipeline has not yet been released and is available only to a select set of beta partners.
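Because Data Pipeline has not shipped, its API is not public. Purely as an illustration of the four parts the AWS blog lists (source, destination, processing steps, schedule), a pipeline definition might be represented along the following lines; every field name and value below is invented and is not the product’s actual syntax.

```python
# Hypothetical illustration only: this dict mirrors the four parts the AWS
# blog describes, with invented field names, not Data Pipeline's real API.
nightly_clickstream_pipeline = {
    "source":      {"type": "DynamoDB", "table": "clickstream-events"},
    "destination": {"type": "S3", "bucket": "analytics-staging", "prefix": "clicks/"},
    "steps": [
        {"name": "filter-bots",     "script": "s3://analytics-scripts/filter_bots.py"},
        {"name": "aggregate-daily", "script": "s3://analytics-scripts/daily_rollup.py"},
    ],
    "schedule": {"run_every": "24h", "start": "2012-12-01T02:00:00Z"},
}
```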
Transferring big data efficiently requires significant bandwidth. In an interview with GigaOm, AWS Chief Data Scientist Matt Wood explains how AWS and its partners are actively addressing this concern.
The bigger the dataset, the longer the upload time.
Wood said AWS is trying hard to alleviate these problems. For example, partners such as Aspera and even some open source projects enable customers to move large files at fast speeds over the internet (Wood said he’s seen consistent speeds of 700 megabits per second). This is also why AWS has eliminated data-transfer fees for inbound data, has turned on parallel uploads for large files and created its Direct Connect program with data center operators that provide dedicated connections to AWS facilities.
And if datasets are too large for all those methods, customers can just send AWS their physical disks. “We definitely receive hard drives,” Wood said.
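A quick back-of-the-envelope calculation shows why shipping physical disks remains attractive: the snippet below estimates transfer time at the 700 megabits per second figure Wood cites, for arbitrarily chosen dataset sizes.

```python
# Estimate how long an upload takes at a sustained 700 Mbps.
def transfer_days(terabytes, megabits_per_second=700):
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86_400

for tb in (1, 10, 100):
    print(f"{tb:>3} TB at 700 Mbps ≈ {transfer_days(tb):.1f} days")
#   1 TB at 700 Mbps ≈ 0.1 days
#  10 TB at 700 Mbps ≈ 1.3 days
# 100 TB at 700 Mbps ≈ 13.2 days
```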
Continuing the theme of “big”, AWS also revealed the two newest instance types for EC2 virtual machines. The “Cluster High Memory” instance has an enormous 240 GB of memory and a pair of 120 GB solid state drives. Given that the largest amount of RAM available in EC2 today is 60.5 GB, this represents a significant increase. The second instance type, called “High Storage”, has a generous 117 GB of memory along with 48 TB of storage. Neither of these instance types is available yet, but both are aimed squarely at customers doing MapReduce and big data processing in the cloud.