Data Analysis Content on InfoQ
-
Google Cloud Launches C4 Machine Series: High-Performance Computing and Data Analytics
Google Cloud recently announced the general availability of its new C4 machine series, powered by 5th Gen Intel Xeon Scalable processors (Emerald Rapids). The series offers a range of configurations for demanding workloads such as high-performance computing (HPC), large-scale simulations, and data analytics.
-
Data Solutions Framework: an Open Source Project for Building Data Solutions on AWS
AWS recently released the Data Solutions Framework (DSF), an opinionated open-source framework designed to accelerate the creation of data solutions on AWS. Built using the AWS CDK, the framework exposes abstractions and patterns as building blocks for constructing data solutions and is available in TypeScript (npm) and Python (PyPI).
-
Amazon Q Data Integration in AWS Glue Simplifies Data Transformation on AWS
Recently, AWS announced the preview of a new feature for AWS Glue, enabling customers to use natural language for authoring and troubleshooting data integration jobs. With Amazon Q data integration in AWS Glue, developers can provide a description of their data integration workload, and the service will generate an ETL script.
-
Spotify's Approach to Leverage Recursive Embedding and Clustering to Enhance Data Explainability
One of the main challenges for any online business is getting actionable insight from its data for decision-making. Spotify shares its methodology and experience in solving this problem by clustering diverse data sets with a method that combines dimensionality reduction, recursion, and supervised machine learning.
-
Netflix Creates Incremental Processing Solution Using Maestro and Apache Iceberg
Netflix created a new solution for incremental processing in its data platform. The incremental approach reduces the cost of computing resources and execution time significantly as it avoids processing complete datasets. The company used its Maestro workflow engine and Apache Iceberg to improve data freshness and accuracy and plans to provide managed backfill capabilities.
-
ClickHouse Keeper: Efficient Apache ZooKeeper Alternative Created with C++ and Raft
The ClickHouse project team created an in-house replacement for Apache ZooKeeper because it needed a more efficient implementation that would also address some of ZooKeeper's shortcomings. ClickHouse Keeper is now an essential part of the ClickHouse project and a cornerstone of this open-source analytical database, but it can also be used independently for many distributed coordination use cases.
-
KubeCon NA 2023: Kubernetes Storage Platform to Run Real-Time Analytic Databases
A Kubernetes storage platform provides a portable and flexible foundation for data management that helps developers build their own data solutions. Robert Hodges spoke last week at the KubeCon + CloudNativeCon North America 2023 conference about the techniques his teams developed to build their own data platform.
-
Confluent Announces Apache Flink on Confluent Cloud in Open Preview
Confluent recently announced the open preview of Apache Flink on Confluent Cloud as a fully-managed service for stream processing. The company claims that the managed service will make it easier for companies to filter, join, and enrich data streams with Flink.
-
Running Apache Flink Applications on AWS KDA: Lessons Learnt at Deliveroo
Deliveroo introduced Apache Flink into its technology stack for enriching and merging events consumed from Apache Kafka or Kinesis Streams. The company opted to use the AWS Kinesis Data Analytics (KDA) service to manage Apache Flink clusters on AWS and shared its experience of running Flink applications on KDA.
-
Pfizer Uses Serverless Architecture on AWS to Scale Processing of Digital Biomarkers
Pfizer upgraded the serverless architecture for processing digital biomarker data at scale to make it more flexible and configurable. They created a framework that uses a file processing pipeline built with AWS Step Functions and other serverless services, as well as a custom Python package for data ingestion and processing.
-
AWS Introduces New Clickstream Analytics on AWS Solution for Mobile and Web Applications
AWS recently announced Clickstream Analytics on AWS, an end-to-end solution to collect, ingest, analyze, and visualize clickstream data from organizations’ web and mobile applications.
-
Unified Analytics Platform: Microsoft Fabric
At its recent annual Build conference, Microsoft introduced Microsoft Fabric, a unified analytics platform that brings together all the data and analytics tools that organizations need.
-
AWS Introduces Athena Provisioned Capacity
AWS recently announced Provisioned Capacity, a new feature for Athena that allows users to run SQL queries on fully managed compute capacity for a fixed price, with no long-term commitments.
-
Netflix Built a Scalable Annotation Service Using Cassandra, Elasticsearch and Iceberg
Netflix recently described how it built Marken, a scalable annotation service that uses Cassandra, Elasticsearch and Iceberg. Marken allows storing and querying annotations, or tags, on arbitrary entities. Users define versioned schemas for their annotations, which include out-of-the-box support for temporal and spatial objects.
-
Apache Druid 25.0 Delivers Multi-Stage Query Engine and Kubernetes Task Management
Apache Druid is a high-performance real-time datastore, and its latest release, version 25.0, provides many improvements and enhancements. The main new features are that the multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready, and that Kubernetes can be used to launch and manage tasks, eliminating the need for middle managers...