Big Data Content on InfoQ
-
Azure Data Explorer Supports Native Ingestion from Amazon S3
Microsoft recently announced the ability to natively ingest data from Amazon S3 into Azure Data Explorer (ADX). The new feature simplifies multi-cloud data analytics deployments by bringing data from Amazon S3 to Azure without relying on custom ETL pipelines.
-
Microsoft Releases SynapseML 0.1.0 with .NET and Cognitive Services Support
Microsoft announced the first .NET-compatible version of SynapseML, a new machine learning (ML) library for the Apache Spark distributed processing platform. Version 0.1.0 of the SynapseML library adds support for .NET bindings, allowing .NET developers to write ML pipelines in their preferred language.
-
Next Generation of Data Movement and Processing Platform at Netflix
Netflix engineering recently described in a tech blog post how it applied data mesh architecture and principles to build its next-generation data movement and processing platform, opening up more business use cases and opportunities. Data mesh is a paradigm shift in data management that enables users to easily import and use data without transporting it to a centralized location such as a data lake.
-
Uber Open-Sourced Its Highly Scalable and Reliable Shuffle as a Service for Apache Spark
Uber engineering has recently open-sourced its highly scalable and reliable shuffle as a service for Apache Spark. Spark is one of the most important tools and platforms in data engineering and analytics. By default, it shuffles data on local machines, which causes challenges at very large scale. Shuffle as a service is the solution Uber developed for this problem.
-
Google Introduces Zero-ETL Approach to Analytics on Bigtable Data Using BigQuery
Google recently announced the general availability of Bigtable federated queries with BigQuery, allowing customers to query data residing in Bigtable faster via BigQuery. Moreover, querying requires no moving or copying of the data, and the feature is available in all Google Cloud regions with increased federated query concurrency limits, closing the longstanding gap between operational data and analytics.
-
Amazon Redshift Serverless Generally Available to Automatically Scale Data Warehouse
Amazon recently announced the general availability of Redshift Serverless, an elastic option to scale data warehouse capacity. The new service allows data analysts, developers and data scientists to run and scale analytics without provisioning and managing data warehouse clusters.
-
Shopify’s Practical Guidelines from Running Airflow for ML and Data Workflows at Scale
In a post on the company's blog, Shopify engineering shared its experience scaling and optimizing Apache Airflow for running ML and data workflows. The post offers practical solutions to the challenges the team faced, such as slow file access, insufficient control over DAGs, irregular traffic levels, resource contention among workloads, and more.
-
Fitting Presto to Large-Scale Apache Kafka at Uber
The need for ad-hoc real-time data analysis has been growing at Uber. They run a large Apache Kafka deployment and need to analyse data going through the many workflows it supports. Solutions like stream processing and OLAP datastores were deemed unsuitable. An article was published recently detailing why Uber chose Presto for this purpose and what it had to do to make it performant at scale.
-
Amazon Elastic MapReduce Now Generally Available as a Serverless Offering
AWS recently announced that Amazon Elastic MapReduce (EMR) Serverless is generally available (GA). The offering is a serverless deployment option for customers to run big data analytics applications using open-source frameworks like Apache Spark and Hive without configuring, managing, and scaling clusters or servers.
-
Google Brings Confidential Computing to Latest C2D and N2D Machine Types
A few months after upgrading its general-purpose (N2D) and compute-optimized (C2D) virtual machines to adopt the latest AMD EPYC technology, Google is now making confidential computing available in preview on those machine types.
-
PipelineDP Brings Google’s Differential-Privacy Library to Python
Google and OpenMined have released PipelineDP, a new open-source library that allows researchers and developers to apply differentially private aggregations to large datasets using batch-processing systems.
-
Google Introduces Autoscaling for Cloud Bigtable for Optimizing Costs
Cloud Bigtable is a fully managed, scalable NoSQL database service for large operational and analytical workloads on the Google Cloud Platform (GCP). Recently, Google announced the general availability of Bigtable Autoscaling, which automatically adds or removes capacity in response to changing application demand, helping to optimize costs.
-
Amazon OpenSearch Adds Anomaly Detection for Historical Data
Amazon OpenSearch recently introduced support for anomaly detection on historical data. The machine-learning-based feature helps identify trends, patterns, and seasonality in OpenSearch data.
-
Austrian DPA Ruling against Google Analytics Paves the Way to EU-based Cloud Services
In a recent ruling, the Austrian data regulator declared the use of Google Analytics unlawful under the EU's GDPR. While the ruling is very specifically argued and worded, its implications go well beyond this particular case.
-
Microsoft Open-Sources Distributed Machine Learning Library SynapseML
Microsoft announced the release of SynapseML, an open-source library for creating and managing distributed machine learning (ML) pipelines. SynapseML runs on Apache Spark, provides a language-agnostic API abstraction over several datastores, and integrates with several existing ML technologies, including Open Neural Network Exchange (ONNX).