AWS has recently announced S3 Tables Bucket, a new bucket type that provides managed Apache Iceberg tables optimized for analytics workloads. According to the cloud provider, the new option delivers up to 3x faster query performance and up to 10x higher transaction rates than Apache Iceberg tables stored in general purpose S3 buckets.
In one of his final posts on the AWS Blog, Jeff Barr, vice president and chief evangelist at AWS, writes:
Table buckets are the third type of S3 bucket, taking their place alongside the existing general purpose and directory buckets. You can think of a table bucket as an analytics warehouse that can store Iceberg tables with various schemas.
Originally developed at Netflix, Apache Iceberg is a high-performance, open-source format for large analytic tables. It allows the use of SQL tables for big data, enabling engines like Spark, Trino, Flink, Presto, and Hive to access and work with the same tables simultaneously.
S3 Tables compete with services like Databricks Delta Lake and Snowflake’s external Iceberg tables. They are designed to perform continuous table maintenance, automatically optimizing query efficiency and storage costs. Additionally, they integrate with AWS Glue Data Catalog, enabling data engineers to leverage analytics services such as Amazon Kinesis Data Firehose, Athena, Redshift, EMR, and QuickSight.
In a separate article, the cloud provider details how Amazon S3 Tables use compaction to improve query performance. Aliaa Abbas, Anupriti Warade, and Jacob Tardieu explain:
Customers often choose Apache Parquet for improved storage and query performance. Additionally, customers use Apache Iceberg to organize Parquet datasets to take advantage of its database-like features such as schema evolution, time travel, and ACID transactions.
To illustrate the benefits of automatic compaction, the team compares the query performance of an uncompacted Iceberg table in a general purpose bucket with that of a table compacted by S3 Tables in a table bucket. They write:
Our results revealed significant performance improvements when using datasets compacted by S3 Tables. With compaction enabled on the table bucket, we observed query acceleration up to 3.2x, (...) overall, we saw a 2.26x improvement in the total execution time for all eight queries.
"Is S3 becoming a data lakehouse?" was a common question in the community when the new storage option was announced, with many developers expressing excitement. Andrew Warfield, VP and distinguished engineer at Amazon, summarizes the three main benefits:
First, tables are an important primitive for analytics on S3, and second they are quickly changing how we integrate other services with data in S3. The third one is a little more subtle and speculative but in some ways it's the one that I think is the most interesting. It's the idea that S3 Tables, if we get them right, may turn into a much more general primitive outside of analytics engines like Spark.
John Kutay, director of product & engineering at Striim, offers a different perspective, writing:
As a data platform vendor, I demand AWS stop building high-level S3 table APIs/catalogs, and instead build low-level convenience features for me to sell a managed data lake service.
Javi Santana, co-founder at Tinybird.co, questions the pricing:
Storage and operation costs are almost the same as regular S3. But, the main point (...) is the cost of compaction "$0.05 per GB processed". Seems like not much but I'm checking some of our customers they process around 1PB (...) That means it's a no-go for real-time workloads when you also want to have fast reads.
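To put the quoted compaction price in perspective, the following minimal sketch works through the arithmetic. It models only the per-GB figure cited in the comment above and assumes decimal units (1 PB = 1,000,000 GB). The function name and constants are illustrative, and the quote does not state the period over which the 1 PB is processed, so the result is a cost per petabyte compacted, not a monthly bill.

```python
from decimal import Decimal

# Per-GB compaction price as quoted in the comment above;
# actual AWS pricing may vary by region and change over time.
COMPACTION_USD_PER_GB = Decimal("0.05")

# Assumption: decimal units, 1 PB = 10^6 GB.
GB_PER_PB = 1_000_000


def compaction_cost_usd(processed_gb):
    """Return the compaction charge in USD for the given number of GB processed."""
    return COMPACTION_USD_PER_GB * Decimal(processed_gb)


if __name__ == "__main__":
    cost = compaction_cost_usd(GB_PER_PB)
    print(f"Compacting 1 PB: ${cost:,.2f}")  # prints "Compacting 1 PB: $50,000.00"
```

Under these assumptions, every petabyte run through compaction costs about $50,000, which is why a workload that continuously rewrites small files can see compaction dominate its bill.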
Other developers highlight missing functionality. Francesco Mucio, owner and BI/data architect at Untitled Data Company, concludes:
To be fair, this is not the first time that AWS released a half-baked feature/tool... and some of them stayed like that. But it's also true that, despite the marketing announcements, not all tools are for everybody.
Further extending S3 capabilities, AWS announced at re:Invent the preview of S3 Metadata, a new feature that automatically updates object metadata on S3. Read more on InfoQ.
S3 Tables Bucket is currently available in only three U.S. regions. While S3 Tables are generally available, the integration with AWS Glue Data Catalog is still in preview.