Big Data Content on InfoQ
-
Apache Druid 25.0 Delivers Multi-Stage Query Engine and Kubernetes Task Management
Apache Druid is a high-performance real-time datastore, and its latest release, version 25.0, provides many improvements and enhancements. The main new features are: the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready, and Kubernetes can be used to launch and manage tasks, eliminating the need for middle managers...
-
How Twitter Automated Data Quality Check Process
Twitter engineering recently shared a blog post on how it architected and developed a data quality automation platform. Twitter digests and creates thousands of data sets for different data products and applications. The natural next step is to ensure the quality of that data by adding automation on top of it. In this news post, we explore this architecture in more detail.
-
AWS Announces Clean Rooms for Secure Collaboration with Analytics Data
During the recent re:Invent conference, AWS announced the preview of Clean Rooms for analytics data. The new service provides safe environments where multiple customers can securely share and analyze data with control over how the data is used, reducing the risk of sharing personal data.
-
Uber Reduces Logging Costs by 169x Using Compressed Log Processor (CLP)
Uber recently published how it dramatically reduced its logging costs using Compressed Log Processor (CLP). CLP is a tool capable of losslessly compressing text logs and searching them without decompression. It achieved a 169x compression ratio on Uber's log data, saving storage, memory, and disk/network bandwidth.
-
Uber Freight Near-Real-Time Analytics Architecture
Uber Freight is the Uber platform dedicated to connecting shippers with carriers. Providing reliable service to shippers is crucial for Uber Freight. This is why the Carrier Scorecard was developed, with several metrics including on-time pickup/delivery, tracking automation, and late cancellations.
-
Unraveling Techno-Solutionism: How I Fell out of Love with “Ethical” Machine Learning
At the recent QCon San Francisco conference, Katherine Jarmul gave a talk on unraveling techno-solutionism, in which she explored the inherent bias in AI training datasets, the bias that assumes there will be a technical solution to almost any problem and that those technical solutions will be beneficial for mankind. She posed questions for technologists to consider when building products.
-
Snap Way to Design Ads Ranking Service Using Deep Learning
Snap engineering recently published a blog post on how it designed its ads ranking and targeting service using deep learning. Showing ads to users is the mainstay of social network platform monetization. Snap's ad ranking system is designed to target the right user at the right time. Snap aims to provide an excellent user experience while preserving user privacy and security.
-
Open-Source Constellation K8 Engine Aims to Bring Confidential Computing to Kubernetes
Constellation is a Kubernetes engine that shields Kubernetes clusters from the rest of the cloud infrastructure using confidential computing and confidential VMs. This creates a confidential context that ensures data is always encrypted, both at rest and in memory.
-
Azure Data Explorer Supports Native Ingestion from Amazon S3
Microsoft recently announced the ability to natively ingest data from Amazon S3 into Azure Data Explorer (ADX). The new feature simplifies multi-cloud data analytics deployments, bringing data from Amazon S3 to Azure, without relying on custom ETL pipelines.
-
Microsoft Releases SynapseML 0.1.0 with .NET and Cognitive Services Support
Microsoft announced the first .NET-compatible version of SynapseML, a new machine learning (ML) library for the Apache Spark distributed processing platform. Version 0.1.0 of the SynapseML library adds support for .NET bindings, allowing .NET developers to write ML pipelines in their preferred language.
-
Next Generation of Data Movement and Processing Platform at Netflix
Netflix engineering recently described in a tech blog post how it used data mesh architecture and principles as the basis for its next-generation data movement and processing platform, opening up more business use cases and opportunities. Data mesh is a new paradigm in data management that enables users to easily import and use data without transporting it to a centralized location like a data lake.
-
Uber Open-Sourced Its Highly Scalable and Reliable Shuffle as a Service for Apache Spark
Uber engineering recently open-sourced its highly scalable and reliable shuffle as a service for Apache Spark. Spark is one of the most important tools and platforms in data engineering and analytics. By default, it shuffles data on local machines, which causes challenges as the scale grows very large. Shuffle as a service is a solution developed at Uber for this problem.
-
Google Introduces Zero-ETL Approach to Analytics on Bigtable Data Using BigQuery
Google recently announced the general availability of Bigtable federated queries with BigQuery, allowing customers to query data residing in Bigtable via BigQuery faster. Moreover, queries run without moving or copying the data, and the feature is available in all Google Cloud regions with increased federated query concurrency limits, closing the longstanding gap between operational data and analytics.
-
Amazon Redshift Serverless Generally Available to Automatically Scale Data Warehouse
Amazon recently announced the general availability of Redshift Serverless, an elastic option to scale data warehouse capacity. The new service allows data analysts, developers and data scientists to run and scale analytics without provisioning and managing data warehouse clusters.
-
Shopify’s Practical Guidelines from Running Airflow for ML and Data Workflows at Scale
Shopify engineering shared, in a post on the company's blog, its experience scaling and optimizing Apache Airflow for running ML and data workflows. The post offers practical solutions for the challenges the team faced, such as slow file access, insufficient control over DAGs, irregular levels of traffic, resource contention among workloads, and more.