Big Data Content on InfoQ
-
Datadog Integrates Google Agent Development Kit into LLM Observability Tools
Datadog recently announced that its LLM Observability platform now provides automatic instrumentation for applications built with Google's Agent Development Kit (ADK), offering deeper visibility into the behavior, performance, cost, and safety of AI-driven agentic systems.
-
Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg
Etleap has recently launched the Iceberg pipeline platform, a new managed data pipeline layer designed to let enterprises adopt Apache Iceberg without building or maintaining a complex custom stack.
-
Pinterest's Moka: How Kubernetes Is Rewriting the Rules of Big Data Processing
Digital pinboard provider Pinterest has published an article explaining its blueprint for the future of large-scale data processing with its new platform Moka. The company is moving core workloads from ageing Hadoop infrastructure to a Kubernetes-based system on Amazon EKS, with Apache Spark as the main engine and support for other frameworks on the way.
-
How Data Contracts Support Collaboration between Data Teams
Data contracts define the interface between data providers and consumers, specifying things like data models, quality guarantees, and ownership. They are essential for distributed data ownership in data mesh, ensuring data is discoverable, interoperable, and governed. Data contracts improve communication between teams and enhance the reliability and quality of data products.
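To make the idea concrete, here is a minimal sketch of what a data contract might capture, expressed here as a Python dictionary; the product name, fields, and guarantees are hypothetical and not taken from any specific team or standard.

```python
# Hypothetical data contract expressed as a plain Python dict.
# Field names and values are illustrative only.
orders_contract = {
    "name": "orders",                       # the data product this contract covers
    "owner": "checkout-team",               # accountable producer team
    "schema": {                             # the agreed data model
        "order_id": "string, required, unique",
        "customer_id": "string, required",
        "amount_eur": "decimal(10,2), required",
        "created_at": "timestamp, required",
    },
    "quality": {                            # guarantees consumers can rely on
        "freshness": "updated at least every 15 minutes",
        "completeness": "order_id is never null",
    },
    "support": "data-checkout@example.com", # where consumers raise issues
}
```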
-
QCon SF 2024 - Incremental Data Processing at Netflix
Jun He gave a talk at QCon SF 2024 titled "Efficient Incremental Processing with Netflix Maestro and Apache Iceberg". He showed how Netflix used the system to reduce processing time and cost while improving data freshness.
-
Setting up a Data Mesh Organization
A data mesh organization consists of producers, consumers, and the platform. According to Matthias Patzak, the mission of the platform team is to make the lives of producers and consumers simple, efficient, and stress-free. Data must be discoverable and understandable, trustworthy, and shared securely and easily across the organization.
-
Measuring and Reducing the Environmental Impact of Software
Software applications often manage large amounts of data; most of them are internet-based and incorporate artificial intelligence. According to Coral Calero, these three aspects improve the capabilities and functionality provided by software, but they have also increased the amount of energy it needs. We need to measure the energy consumption of software to control its environmental impact.
-
Uber’s Journey to Modernizing Big Data Infrastructure with Google Cloud Platform
In a recent post on its official engineering blog, Uber disclosed its strategy to migrate its batch data analytics and machine learning (ML) training stack to Google Cloud Platform (GCP). Uber runs one of the largest Hadoop installations in the world, managing over an exabyte of data across tens of thousands of servers in each of its two regions.
-
How Data Mesh Platforms Connect Data Producers and Consumers
A challenge that companies often face when exploiting their data in data warehouses or data lakes is that ownership of analytical data is weak or non-existent, and quality can suffer as a result. A data mesh is an organizational paradigm shift in how companies create value from data, one in which responsibility moves back into the hands of producers and consumers.
-
Uber Migrates 1 Trillion Records from DynamoDB to LedgerStore to Save $6 Million Annually
Uber migrated all its payment transaction data from DynamoDB and blob storage into a new long-term solution, a purpose-built data store named LedgerStore. The company was looking for cost savings and had previously limited DynamoDB to hot data no more than 12 weeks old. The move resulted in significant savings and simplified the storage architecture.
-
QCon London: Lessons Learned from Building LinkedIn’s AI/ML Data Platform
At the QCon London 2024 conference, Félix GV from LinkedIn discussed the AI/ML platform powering the company’s products. He specifically delved into Venice DB, the NoSQL data store used for feature persistence. The presenter shared the lessons learned from evolving and operating the platform, including cluster management and library versioning.
-
Spotify's Approach to Leveraging Recursive Embedding and Clustering to Enhance Data Explainability
One of the main challenges for any online business is getting actionable insights from its data for decision-making. Spotify shares its methodology and experience in solving this problem by clustering diverse datasets through a unique method involving dimensionality reduction, recursion, and supervised machine learning.
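As a rough sketch of the general technique (not Spotify's actual pipeline), recursive clustering can be expressed as: reduce dimensionality, cluster, then recurse into each cluster that is still large enough to split further. The library choices and thresholds below are assumptions, and the supervised step used to explain the resulting clusters is omitted.

```python
# Hypothetical sketch of recursive dimensionality reduction + clustering.
# Library choices (scikit-learn) and thresholds are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def recursive_cluster(X, depth=0, max_depth=3, min_size=50, n_clusters=4):
    """Return a list of (depth, index_array) leaf clusters over rows of X."""
    if depth >= max_depth or len(X) < min_size:
        return [(depth, np.arange(len(X)))]
    # Reduce dimensionality before clustering so distances are more meaningful.
    X_low = PCA(n_components=min(10, X.shape[1])).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_low)
    leaves = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Recurse into each cluster, remapping indices back to the original rows.
        for d, sub in recursive_cluster(X[idx], depth + 1, max_depth, min_size, n_clusters):
            leaves.append((d, idx[sub]))
    return leaves
```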
-
Netflix Creates Incremental Processing Solution Using Maestro and Apache Iceberg
Netflix created a new solution for incremental processing in its data platform. The incremental approach reduces the cost of computing resources and execution time significantly as it avoids processing complete datasets. The company used its Maestro workflow engine and Apache Iceberg to improve data freshness and accuracy and plans to provide managed backfill capabilities.
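For illustration, the sketch below shows the underlying Iceberg capability that incremental processing builds on: reading only the rows appended between two table snapshots with PySpark. The table name and snapshot IDs are hypothetical, and this is not Netflix's Maestro integration itself.

```python
# Hypothetical sketch of incremental processing over an Apache Iceberg table
# with PySpark. Table name and snapshot IDs are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-example").getOrCreate()

# Snapshot IDs would normally come from the workflow engine's state,
# recording where the previous run stopped.
last_processed_snapshot = 1234567890  # assumed checkpoint from the prior run
latest_snapshot = 1234567999          # assumed current snapshot of the table

# Read only the rows appended between the two snapshots instead of the full table.
changes = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_processed_snapshot)
    .option("end-snapshot-id", latest_snapshot)
    .load("analytics.db.events")
)

# Downstream aggregation then touches only the incremental slice.
changes.groupBy("event_type").count().show()
```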
-
QCon San Francisco 2023 Day 1: Architectures, Data Engineering, Infra Languages, Staff+ Skills
The 17th annual QCon San Francisco conference was held at the Hyatt Regency in San Francisco, California. This five-day event, organized by C4Media, consisted of three days of presentations and two days of workshops. Day one, held on October 2nd, 2023, included a keynote address by Suhail Patel and presentations from four conference tracks and two sponsored tracks.
-
Grammarly Replaces Its In-House Data Lake with Databricks Platform Using Medallion Architecture
Grammarly adopted the medallion architecture while migrating from its in-house data lake, which stored Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6,000 event types from 40 internal and external clients and, in the process, improved data quality and reduced data-delivery time by 94%.
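As a minimal sketch of the medallion pattern (not Grammarly's actual pipeline), the PySpark and Delta Lake code below moves data through bronze (raw), silver (cleansed), and gold (aggregated) layers; the paths, columns, and cleansing rules are assumptions.

```python
# Hypothetical medallion (bronze/silver/gold) flow with PySpark and Delta Lake.
# Paths, columns, and rules are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-example").getOrCreate()

# Bronze: land raw events as-is, preserving the original payload.
raw = spark.read.json("s3://example-bucket/raw/events/")
raw.write.format("delta").mode("append").save("s3://example-bucket/bronze/events")

# Silver: cleanse and conform (deduplicate, enforce types, drop malformed rows).
bronze = spark.read.format("delta").load("s3://example-bucket/bronze/events")
silver = (
    bronze.dropDuplicates(["event_id"])
    .withColumn("event_time", F.to_timestamp("event_time"))
    .filter(F.col("event_type").isNotNull())
)
silver.write.format("delta").mode("overwrite").save("s3://example-bucket/silver/events")

# Gold: business-level aggregates ready for analytics and reporting.
gold = silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
gold.write.format("delta").mode("overwrite").save("s3://example-bucket/gold/event_counts")
```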