Data Lake Content on InfoQ
-
How Data Mesh Platforms Connect Data Producers and Consumers
A challenge that companies often face when exploiting their data in data warehouses or data lakes is that ownership of analytical data is weak or non-existent, and quality can suffer as a result. A data mesh is an organizational paradigm shift in how companies create value from data: responsibility returns to the hands of producers and consumers.
-
Netflix Uses Metaflow to Manage Hundreds of AI/ML Applications at Scale
Netflix recently described how its Machine Learning Platform (MLP) team provides an ecosystem around Metaflow, an open-source machine learning infrastructure framework. With various integrations built for Metaflow, Netflix already has hundreds of Metaflow projects maintained by multiple engineering teams.
-
Data Solutions Framework: an Open Source Project for Building Data Solutions on AWS
AWS recently released the Data Solutions Framework (DSF), an opinionated open-source framework designed to accelerate the creation of data solutions on AWS. Built using the AWS CDK, the framework exposes abstractions and patterns as building blocks for constructing data solutions and is available in TypeScript (npm) and Python (PyPI).
-
B2B Data Interchange: Managed Electronic Data Interchange (EDI) on AWS
AWS recently introduced B2B Data Interchange, a platform allowing organizations to automate and monitor the transformation of EDI-based business transactions. The service provides a low-code interface for managing trading partners and translating EDI documents into JSON and XML formats.
-
Netflix Creates Incremental Processing Solution Using Maestro and Apache Iceberg
Netflix created a new solution for incremental processing in its data platform. The incremental approach reduces the cost of computing resources and execution time significantly as it avoids processing complete datasets. The company used its Maestro workflow engine and Apache Iceberg to improve data freshness and accuracy and plans to provide managed backfill capabilities.
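The idea of avoiding complete reprocessing can be sketched in a few lines. The following is a minimal illustration, not Netflix's implementation: it assumes a table modeled as a list of append-only snapshots with increasing ids, and a hypothetical `JobState` that remembers the last snapshot a job has consumed. The names `JobState` and `run_incremental` are invented for this example and do not come from Maestro or Apache Iceberg.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    """State carried between runs (hypothetical; not Maestro's actual model)."""
    last_snapshot_id: int = 0
    total: float = 0.0


def run_incremental(snapshots, state):
    """Fold only snapshots newer than the last processed one into the total.

    `snapshots` is a list of (snapshot_id, rows) pairs with increasing ids,
    standing in for the append-only commits of a table.
    Returns the number of rows actually read in this run.
    """
    rows_read = 0
    for snapshot_id, rows in snapshots:
        if snapshot_id <= state.last_snapshot_id:
            continue  # already processed in an earlier run
        state.total += sum(rows)
        rows_read += len(rows)
        state.last_snapshot_id = snapshot_id
    return rows_read


state = JobState()
table = [(1, [10.0, 20.0]), (2, [5.0])]
first_run = run_incremental(table, state)   # reads all three existing rows
table.append((3, [1.0, 2.0]))               # new data arrives
second_run = run_incremental(table, state)  # reads only the two new rows
```

The second run touches only the newly committed rows, which is where the savings in compute and execution time would come from under these assumptions.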
-
Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture
Grammarly adopted the medallion architecture while migrating from its in-house data lake, which stored Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6000 event types from 40 internal and external clients and, in the process, improved data quality and reduced the data-delivery time by 94%.
-
Databricks Unveils Lakehouse AI and MosaicML Acquisition at Data + AI Summit
The data and AI company Databricks recently unveiled Lakehouse AI, a suite of tools for building and governing generative AI models, including large language models (LLMs), within the Databricks platform. Among the tools was LakehouseIQ, a "knowledge engine" that uses AI to understand a company's unique data, culture, and language in order to improve natural language interfaces like chatbots.
-
Amazon Security Lake for Centralized Security Data Management Now GA
AWS recently announced the general availability of Security Lake, a managed service to automate the sourcing, aggregation, normalization, and data management of security data. The new service centralizes data from AWS environments, SaaS providers, on-premises, and cloud sources into a data lake stored in an AWS account.
-
Unified Analytics Platform: Microsoft Fabric
At its recent annual Build conference, Microsoft introduced Microsoft Fabric, a unified analytics platform that brings together all the data and analytics tools that organizations need.
-
Amazon Athena Now Supports Apache Spark Engine
Amazon Athena now supports the open-source distributed processing system Apache Spark to run fast analytics workloads. Data analysts and engineers can use Jupyter Notebook in Athena to perform data processing and programmatically interact with Spark applications.
-
AWS Announces Preview Release of Amazon Security Lake
At re:Invent, AWS announced the preview release of Amazon Security Lake. This managed service automatically centralizes an organization’s security data from cloud and on-premises sources into a purpose-built data lake stored in the organization’s own account.
-
Google Launches a New Cross-Platform Data Storage Engine BigLake in Preview
At the recent Cloud Data Summit, Google announced the preview of BigLake, a new data lake storage engine that makes it easier for enterprises to analyze the data in their data warehouses and data lakes.
-
AWS Introduces HealthLake and Redshift ML in Preview
AWS introduced preview releases of the Amazon HealthLake service and an Amazon Redshift feature called Redshift ML during re:Invent 2020 in December. Amazon HealthLake is a data lake service that helps healthcare, health insurance, and pharmaceutical companies derive value from their data with the help of NLP. Redshift ML gives Redshift users a gateway into SageMaker.
-
The Distributed Data Mesh as a Solution to Centralized Data Monoliths
Instead of building large, centralized data platforms, corporations and data architects should create distributed data meshes.
-
Amazon Releases AWS Lake Formation to General Availability
Recently, Amazon announced the general availability (GA) of AWS Lake Formation, a fully managed service that makes it much easier for customers to build, secure, and manage data lakes.