Modern Big Data Pipelines over Kubernetes

Container management technologies like Kubernetes make it possible to implement modern big data pipelines. Eliran Bivas, senior big data architect at Iguazio, spoke at the recent KubeCon + CloudNativeCon North America 2017 Conference about big data pipelines and how Kubernetes can help develop them.

In the past, big data solutions were mainly based on Hadoop, but the ecosystem has evolved in recent years with new databases, streaming data, and machine learning solutions that require more than the Hadoop deployment model (MapReduce, YARN, and HDFS). These solutions also require a cluster scheduling layer to host diverse workloads such as Kafka, Spark, and TensorFlow, working with data stored in databases like Cassandra and Elasticsearch and in cloud-based storage.

Bivas talked about the different teams typically involved in the software development lifecycle and their primary objectives. Application engineers want agile software development; data engineers care about where the data lives and want the database systems to keep working; and DevOps teams want all systems to run with minimal maintenance and disruption. Thanks to container technologies, organizations can now meet all of these objectives.

He discussed a common framework for creating cloud-native, end-to-end analytics applications. Developers should decouple data services from applications and frameworks to make big data solutions flexible and efficient. Decoupling also helps with the data services themselves, which typically manage different types of data: structured, unstructured, and streaming.

Ideally, these solutions should be based on cloud-native applications and frameworks and use the unified orchestration provided by Kubernetes.
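
To illustrate this decoupling, here is a minimal sketch (not from the talk; all class and method names are hypothetical) in which application code depends on a small data-service interface rather than on a specific store such as Cassandra or Elasticsearch:

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class DataService(ABC):
    """Abstract data service: applications code against this interface,
    not against a concrete database or object store."""

    @abstractmethod
    def write(self, record: dict) -> None: ...

    @abstractmethod
    def query(self, filter_expr: str) -> Iterable[dict]: ...


class InMemoryDataService(DataService):
    """Toy implementation; a real one would wrap a database client."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def write(self, record: dict) -> None:
        self._records.append(record)

    def query(self, filter_expr: str) -> Iterable[dict]:
        # Trivial filter: match a single "key=value" expression.
        key, _, value = filter_expr.partition("=")
        return [r for r in self._records if str(r.get(key)) == value]


# The application depends only on DataService, so the backing store
# can be swapped (e.g., in-memory -> Cassandra) without code changes.
def ingest(service: DataService, events: Iterable[dict]) -> None:
    for event in events:
        service.write(event)
```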

Bivas described a continuous-analytics flow model that places data services in the middle to analyze data coming from operational data stores (relational databases) and external sources (IoT), using containerized big data analytics tools like Spark and TensorFlow.
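
For example, a containerized Spark job in such a flow might consume an IoT stream from Kafka. The PySpark sketch below is illustrative only; the broker address, topic, and event schema are hypothetical, not details from the talk:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("iot-analytics").getOrCreate()

# Schema of the hypothetical IoT events arriving on the Kafka topic.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

# Read the stream from Kafka (requires the spark-sql-kafka connector).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "iot-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Continuous aggregation: average temperature per device.
averages = events.groupBy("device_id").avg("temperature")

query = (
    averages.writeStream.outputMode("complete")
    .format("console")  # a real pipeline would write to a data service
    .start()
)
query.awaitTermination()
```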

Serverless frameworks like Kubeless and OpenFaaS are a good fit for these solutions. Serverless functions are easy to deploy, with no YAML files, Dockerfiles, or builds involved, and they support auto-scaling and event triggers.

Bivas discussed the architecture of Nuclio, a recently open-sourced real-time serverless platform. The architecture uses Kubernetes as an alternative to YARN, combining frameworks like Spark ML, Presto, TensorFlow, and Python with serverless functions and local and cloud-based storage. Nuclio also supports pluggable event sources and data sources.
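
A Nuclio function in Python is written as a simple handler. The sketch below follows the handler signature from Nuclio's documentation, but the payload format, field names, and processing logic are hypothetical, not taken from the talk:

```python
import json


def handler(context, event):
    """Nuclio-style handler: invoked once per event from whatever
    trigger is configured (HTTP, stream, etc.), which is what the
    pluggable event sources enable."""
    record = json.loads(event.body)  # event.body carries the raw payload

    # Hypothetical processing step: flag high temperature readings.
    alert = record.get("temperature", 0) > 90.0
    context.logger.info("device=%s alert=%s" % (record.get("device_id"), alert))

    return json.dumps({"device_id": record.get("device_id"), "alert": alert})
```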

He also talked about an automotive customer's use case: real-time analytics for vehicle maintenance. In the solution, vehicle data is streamed in through web APIs and ingested by microservices. The vehicle data is then enriched in real time with weather and road data to find correlations between weather conditions and the state of vehicle components.
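
A minimal sketch of such an enrichment step might look like the following; the weather lookup, field names, and values are hypothetical stand-ins, not code from the presentation:

```python
from dataclasses import dataclass


@dataclass
class Weather:
    temperature_c: float
    precipitation_mm: float


def lookup_weather(lat: float, lon: float) -> Weather:
    """Placeholder for a call to a real-time weather service."""
    return Weather(temperature_c=-2.0, precipitation_mm=4.5)


def enrich(vehicle_event: dict) -> dict:
    """Join a streamed vehicle event with current weather at its location,
    so downstream analytics can correlate weather with component wear."""
    weather = lookup_weather(vehicle_event["lat"], vehicle_event["lon"])
    return {
        **vehicle_event,
        "temperature_c": weather.temperature_c,
        "precipitation_mm": weather.precipitation_mm,
    }


# Example usage with a synthetic event:
event = {"vehicle_id": "v42", "lat": 40.7, "lon": -74.0, "brake_wear": 0.31}
print(enrich(event))
```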

The presentation included a demo showing the benefits of running big data analytics on a cloud-native architecture. Bivas concluded the session with some best practices: developers should know the tools Kubernetes provides, log application events, collect metrics, and use those metrics to gain insight into application performance.
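
As an illustration of the logging and metrics practices, the sketch below combines standard Python logging with Prometheus instrumentation, assuming the prometheus_client library; the metric names and port are arbitrary choices, not recommendations from the talk:

```python
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Metrics that cluster-side tooling (e.g., Prometheus) can scrape.
EVENTS = Counter("pipeline_events_total", "Events processed")
LATENCY = Histogram("pipeline_latency_seconds", "Per-event processing time")


@LATENCY.time()
def process(event: dict) -> None:
    EVENTS.inc()
    log.info("processed event %s", event["id"])


if __name__ == "__main__":
    start_http_server(8000)  # exposes metrics at :8000/metrics
    for i in range(100):
        process({"id": i})
        time.sleep(0.05)
```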

If you are interested in learning more about the Nuclio framework, check out the GitHub project, code examples, and documentation.
 
