Recently, Amazon announced the general availability (GA) of AWS Lake Formation, a fully managed service that makes it much easier for customers to build, secure, and manage data lakes.
With AWS Lake Formation, customers can simplify and automate many of the complicated manual steps usually required to create a data lake, including collecting, cleaning, and cataloging data, and securely making that data available for analytics. Amazon first introduced AWS Lake Formation at the AWS re:Invent conference, held last November in Las Vegas, and allowed customers to sign up for a preview. Now customers can use AWS Lake Formation as a generally available service.
By using AWS Lake Formation, customers can bring data into a data lake from a range of sources using pre-defined templates. Next, they can define policies to govern access by different groups within the organization while the data is automatically classified and prepared. Finally, they can analyze this data using their choice of AWS analytics and machine learning services, including Amazon Redshift, Amazon Athena, and AWS Glue, with Amazon EMR, Amazon QuickSight, and Amazon SageMaker following in the next few months.
Source: https://aws.amazon.com/lake-formation/
Suphatra Rufo, senior product marketing manager at Amazon Web Services, said in a tweet:
Setting up your own data lake just got a lot easier. Today we launched AWS Lake Formation! Now you can store unlimited types of data and use multiple analytics services to process your data, whenever you want.
Users can create a data lake by using the Lake Formation console. Next, they can register S3 buckets as part of the data lake, create a database and grant permission to the IAM users and roles necessary to manage it. According to the blog post on the GA release of AWS Lake Formation, the database is registered in the Glue Data Catalog and holds the metadata required to analyze the raw data, such as the structure of the tables that are going to be automatically generated during data ingestion.
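The same setup steps can also be scripted rather than clicked through in the console. The following is a minimal sketch, assuming Python with boto3 and credentials that carry data lake administrator rights. The function names are hypothetical, the ARNs and database name passed in are placeholders, and the permissions granted are only an illustrative choice.

```python
def build_setup_requests(bucket_arn, database_name, principal_arn):
    """Return (service, operation, parameters) for each setup step, in order."""
    return [
        # Step 1: register the S3 location as part of the data lake.
        ("lakeformation", "register_resource", {
            "ResourceArn": bucket_arn,
            "UseServiceLinkedRole": True,
        }),
        # Step 2: create the database in the Glue Data Catalog.
        ("glue", "create_database", {
            "DatabaseInput": {"Name": database_name},
        }),
        # Step 3: grant an IAM user or role permissions on that database.
        # The permission list here is an assumption chosen for illustration.
        ("lakeformation", "grant_permissions", {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": {"Database": {"Name": database_name}},
            "Permissions": ["CREATE_TABLE", "ALTER", "DROP"],
        }),
    ]


def apply_setup(requests):
    """Send each request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without it

    clients = {}
    for service, operation, params in requests:
        if service not in clients:
            clients[service] = boto3.client(service)
        getattr(clients[service], operation)(**params)
```

Keeping the request parameters separate from the calls makes it possible to review or log what will be sent before `apply_setup` touches an account.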
Users can also simplify data ingestion by using blueprints, which create the necessary workflows, crawlers, and jobs on AWS Glue for everyday use cases. These workflows orchestrate the data loading workloads by building dependencies between Glue entities such as triggers, crawlers, and jobs. They also let users track the status of the different nodes in a workflow on the console, making it easier to monitor progress and troubleshoot issues.
Log file blueprints allow users to ingest the logging formats used by Application Load Balancers, Elastic Load Balancers, and AWS CloudTrail. With database blueprints, users can load data from operational databases, either as a full snapshot of an existing database or incrementally as new data arrives. Users can run workflows created by the blueprints:
- On demand, through the console or programmatically, for example, using any AWS SDK or the AWS Command Line Interface (CLI)
- On a schedule ranging from hourly to monthly, with the option to choose the day of the week and the time
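Both ways of running a workflow can be sketched in a few lines. The example below is an illustration only, assuming Python with boto3 and an existing Glue workflow. The helper names `glue_cron` and `run_workflow` are hypothetical, and the workflow name is whatever the blueprint created. The first helper builds a schedule in the six-field cron format that AWS Glue scheduled triggers accept. The second starts a run on demand and polls until it leaves the running state.

```python
import time


def glue_cron(minute=0, hour="*", day_of_week=None, day_of_month=None):
    """Build a Glue schedule: cron(min hour day-of-month month day-of-week year)."""
    if day_of_week is not None and day_of_month is not None:
        raise ValueError("set day_of_week or day_of_month, not both")
    if day_of_week is not None:
        # When a weekday is given, the day-of-month field must be '?'.
        dom, dow = "?", day_of_week
    else:
        dom, dow = ("*" if day_of_month is None else day_of_month), "?"
    return f"cron({minute} {hour} {dom} * {dow} *)"


def run_workflow(name, poll_seconds=30):
    """Start a Glue workflow on demand and return its final status."""
    import boto3  # imported lazily so glue_cron works without it

    glue = boto3.client("glue")
    run_id = glue.start_workflow_run(Name=name)["RunId"]
    while True:
        run = glue.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] not in ("RUNNING", "STOPPING"):
            return run["Status"]
        time.sleep(poll_seconds)
```

For example, `glue_cron(0, 8, day_of_week="MON")` describes a weekly run on Mondays at 08:00 UTC, and `glue_cron(30)` describes an hourly run at half past the hour.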
Several AWS customers, such as Panasonic Avionics Corporation, Zalando, Amgen, and Alcon, are using AWS Lake Formation. Alberto Miorin, engineering lead at Zalando SE, said in an Amazon press release about the GA of AWS Lake Formation:
AWS Lake Formation gave us a scalable central point of control for data access through Amazon Redshift that not only simplified the process, but improved it through granular control over how our data is being used. Now we can discover, access, and analyze data in our data lake with our preferred tools, and leverage it for business intelligence and data science. This streamlined workflow helps our executives make the right decisions on time, and fosters innovation through machine learning.
According to a recent market study by Advanced Market Analytics, the global data lake market is anticipated to reach $12.01 billion by 2024 at a compound annual growth rate (CAGR) of 27.8%. Amazon is not the only public cloud vendor with a data lake offering. Microsoft has been offering Azure Data Lake for quite some time, and Google has a suite of data lake processing and analytics tools in Cloud Datalab, Dataproc, and Dataflow.
AWS Lake Formation is currently available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) regions, and Amazon stated in the press release that additional regions would come soon. Amazon also said that there is no extra cost for using AWS Lake Formation; customers only pay for the use of the underlying services such as Amazon S3 and AWS Glue.