Amazon Web Services (AWS) has unveiled Amazon S3 Metadata, a new feature designed to simplify data discovery and management for Amazon S3 users. Currently in preview in the US East (Ohio and N. Virginia) and US West (Oregon) regions, S3 Metadata lets users query and analyze metadata about their S3 objects, with near real-time metadata updates and integration with AWS analytics services.
Amazon S3 Metadata automatically captures and organizes metadata for S3 objects, offering insights into system-defined properties (such as object size, storage class, and encryption status) and user-defined tags. This capability allows businesses to curate, identify, and utilize their data more effectively for a wide array of applications, including:
- Business Analytics
- Real-time inference applications
- AI model training
Metadata is updated within minutes of changes to S3 objects, keeping it accurate in near real time. It is stored using S3 Tables, which introduces a new bucket type, the table bucket, that stores tables as subresources.
S3 Metadata employs Apache Iceberg, allowing users to store metadata in fully managed Iceberg tables. This compatibility facilitates high-performance querying at scale using Iceberg-compatible tools such as Apache Spark, Amazon Athena, and Amazon QuickSight.
With Iceberg, each update generates a new row in the table, providing a historical record of object changes that can be easily retrieved and analyzed.
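To illustrate the append-only behavior described above, here is a minimal Python sketch that models the metadata table as a list of rows and replays it. It does not call AWS; the `MetadataRow` class, the `sequence` ordering field, and the record type values `CREATE`, `UPDATE_METADATA`, and `DELETE` are assumptions made for illustration rather than the confirmed table schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRow:
    """One appended row per object change (hypothetical, simplified schema)."""
    key: str
    record_type: str  # assumed values: "CREATE", "UPDATE_METADATA", "DELETE"
    sequence: int     # assumed ordering field; higher means more recent
    size: int


def history(rows, key):
    """Return every recorded change for one object, oldest first."""
    return sorted((r for r in rows if r.key == key), key=lambda r: r.sequence)


def current_objects(rows):
    """Replay the log in order to recover the latest row of each live object."""
    latest = {}
    for row in sorted(rows, key=lambda r: r.sequence):
        if row.record_type == "DELETE":
            latest.pop(row.key, None)
        else:
            latest[row.key] = row
    return latest


rows = [
    MetadataRow("a.csv", "CREATE", 1, 100),
    MetadataRow("b.csv", "CREATE", 2, 50),
    MetadataRow("a.csv", "UPDATE_METADATA", 3, 100),
    MetadataRow("b.csv", "DELETE", 4, 0),
]

for row in history(rows, "a.csv"):
    print(row.sequence, row.record_type)
print(sorted(current_objects(rows)))
```

Because rows are only ever appended, the same data answers both "what happened to this object?" and "what exists now?", which is the property the historical record relies on.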
Amrutha Gujjar, CEO of Structured Labs, concluded in a blog post:
"By embracing Iceberg, AWS aligns itself with the industry’s move toward open table formats. This not only ensures interoperability with tools like Apache Spark and Flink but also future-proofs investments in S3-based architectures."
S3 Metadata tables integrate seamlessly with AWS analytics tools, enabling robust data processing and visualization. Key integrations include:
- AWS Glue Data Catalog (currently in preview)
- Amazon Athena, Redshift, EMR, and QuickSight for streaming and querying metadata
- Amazon Bedrock, which annotates AI-generated videos stored in S3 with metadata like origin, creation timestamp, and the model used.
The metadata schema includes over 20 elements, from bucket names and object keys to encryption details and user-defined tags. Users can enrich this data further by joining it with application-specific tables.
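As a rough illustration of joining object metadata with an application-specific table, the following Python sketch uses an in-memory SQLite database as a stand-in for the Iceberg metadata table. The table names `s3_metadata` and `app_assets`, the `project` column, and the sample rows are all hypothetical; in practice the same kind of join would run in an engine such as Athena or Spark.

```python
import sqlite3

# In-memory SQLite stands in for the managed Iceberg table (assumption for
# illustration only); the key, size, and storage_class columns mirror names
# used in the article's example query.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE s3_metadata (key TEXT, size INTEGER, storage_class TEXT);
    CREATE TABLE app_assets (object_key TEXT, project TEXT);
    INSERT INTO s3_metadata VALUES
        ('img/1.png', 2048, 'STANDARD'),
        ('img/2.png', 4096, 'GLACIER'),
        ('tmp/x.bin', 10, 'STANDARD');
    INSERT INTO app_assets VALUES
        ('img/1.png', 'catalog'),
        ('img/2.png', 'catalog');
    """
)


def bytes_per_project(connection):
    """Total stored bytes per application project, via a join on the object key."""
    return connection.execute(
        """
        SELECT a.project, SUM(m.size) AS total_bytes
        FROM s3_metadata AS m
        JOIN app_assets AS a ON a.object_key = m.key
        GROUP BY a.project
        ORDER BY a.project
        """
    ).fetchall()


print(bytes_per_project(conn))
```

Objects without a matching application row (here `tmp/x.bin`) drop out of the inner join; a `LEFT JOIN` from the metadata side would keep them for spotting untracked data.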
Enabling S3 Metadata involves three simple steps:
- Create a Table Bucket: Use the create-table-bucket command, the AWS Management Console, or an API call to create a bucket for storing metadata.
- Attach Metadata Configuration: Specify a configuration file to link your data bucket with the metadata table.
- Run Queries: Use tools like Apache Spark or AWS analytics services to query the metadata, enabling insights into object storage, updates, and other critical details.
(Source: AWS News blog post)
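The first two steps can be sketched in Python as a small helper that builds the configuration file contents and the matching AWS CLI commands without running them. The bucket names, account ID, and table name are placeholders, and the command and field names (`create-table-bucket`, `create-bucket-metadata-table-configuration`, `S3TablesDestination`, `TableBucketArn`, `TableName`) reflect the preview-era CLI as best understood here; they should be checked against the current AWS CLI reference before use.

```python
import json
import shlex


def metadata_table_config(table_bucket_arn, table_name):
    """Build the configuration document that links a data bucket to a metadata table."""
    return {
        "S3TablesDestination": {
            "TableBucketArn": table_bucket_arn,
            "TableName": table_name,
        }
    }


def setup_commands(data_bucket, table_bucket, config_path="config.json"):
    """Return the CLI commands for steps 1 and 2 as strings (nothing is executed)."""
    return [
        f"aws s3tables create-table-bucket --name {shlex.quote(table_bucket)}",
        "aws s3api create-bucket-metadata-table-configuration"
        f" --bucket {shlex.quote(data_bucket)}"
        f" --metadata-table-configuration file://{shlex.quote(config_path)}",
    ]


# Placeholder names and account ID, chosen only for this example.
config = metadata_table_config(
    "arn:aws:s3tables:us-east-2:123456789012:bucket/my-table-bucket",
    "my_table",
)
print(json.dumps(config, indent=2))
for command in setup_commands("my-data-bucket", "my-table-bucket"):
    print(command)
```

Writing the printed JSON to `config.json` and running the two commands in order would correspond to the create and attach steps; querying then proceeds as in the third step.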
An example query looks like this:
spark.sql("SELECT key, size, storage_class, encryption_status FROM mytablebucket.aws_s3_metadata.my_table ORDER BY last_modified_date DESC LIMIT 10").show(false)
Ian Mckay, cloud principal at Kablamo and AWS Community Hero, tweeted:
"S3 buckets now support queryable metadata (Iceberg tables) functionality, allowing for a live queryable view of the creations, updates, and deletions of objects using tools like Athena. Check pricing before usage, as the cost increase is non-trivial."
Users can also configure and manage S3 Metadata through the Metadata tab in the Amazon S3 console. Pricing is based on the number of updates (object creation, deletion, and metadata changes), plus storage costs for the metadata table. Detailed pricing information is available on the S3 Pricing page.