Daniel Whitenack spoke at the recent KubeCon + CloudNativeCon North America 2017 Conference about GPU-based deep learning workflows using TensorFlow and Kubernetes.
He started off by discussing a typical artificial intelligence (AI) workflow, using object detection as an example. The workflow includes steps like data pre-processing, model training, model generation, and finally model inference. All of these stages can be executed in Docker containers.
Model training is typically done with a framework like TensorFlow or Caffe. This stage is also where GPUs come into play: deep learning workflows that train models on image data rely on GPUs to reach acceptable training performance.
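As a rough illustration of this stage (not code from the talk), the sketch below trains a tiny TensorFlow image classifier on a GPU when one is available; the model architecture and the random stand-in data are assumptions.

```python
import tensorflow as tf

# Minimal sketch (not from the talk): train a small image classifier on the
# first GPU TensorFlow discovers, falling back to CPU if none is present.
gpus = tf.config.list_physical_devices("GPU")
device = "/GPU:0" if gpus else "/CPU:0"

with tf.device(device):
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(128, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random tensors stand in for the pre-processed image data produced by the
# earlier pipeline stage.
images = tf.random.uniform((32, 128, 128, 3))
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int32)
model.fit(images, labels, epochs=1)
```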
Model training programs can run on GPU nodes in a Kubernetes cluster, and Kubernetes provides a convenient framework for managing multiple GPU nodes. The workflow works best when it follows the practices below (a pod spec sketch follows the list):
- Get the right pieces of data to the right code (pods)
- Process data on the right nodes
- Trigger the right code at the right time
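As a sketch of how a training workload ends up on a GPU node (the image, pod name, and entry point are hypothetical), Kubernetes lets a pod request a GPU through the `nvidia.com/gpu` resource limit, shown here with the official Python client:

```python
from kubernetes import client, config

def make_training_pod() -> client.V1Pod:
    """Build a pod spec that requests one NVIDIA GPU for model training."""
    container = client.V1Container(
        name="train",
        image="example/tf-train:latest",      # hypothetical training image
        command=["python", "train.py"],       # hypothetical entry point
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}    # schedules the pod onto a GPU node
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="tf-train"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

if __name__ == "__main__":
    config.load_kube_config()  # use local kubeconfig credentials
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=make_training_pod())
```

The same spec could be written as YAML and applied with kubectl; the point is simply that the GPU request is part of the pod's resource limits, so the scheduler places the training code on a GPU node.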
This setup can also be used to track which versions of code and data produced which results, which is useful for debugging, maintenance, and compliance.
Kubernetes provides the foundation for all of this and is great for machine learning projects because of its portability and scalability.
Whitenack then discussed Pachyderm, an open source project that provides a data pipeline and data management layer on top of Kubernetes. These workflows typically involve multi-stage data pre-processing and post-processing jobs, and Pachyderm provides a unified framework for scheduling the multi-stage workflows, managing the data, and offloading workloads to GPUs.
He talked about the capabilities of the Pachyderm framework, which include the following (a pipeline spec sketch follows the list):
- Data versioning: versioned data can be stored in an object store like Amazon S3
- Containers for analysis
- Distributed pipelines or DAGs for data processing
- Data provenance: this helps with compliance and debugging
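To make the pipeline idea concrete, here is a hedged sketch of a Pachyderm pipeline spec written out from Python; the repo name, container image, and command are placeholders, and the exact spec fields can vary across Pachyderm versions.

```python
import json

# Hypothetical pipeline: run a pre-processing container over every datum in
# the versioned "images" repo; outputs land in the pipeline's own output repo.
pipeline_spec = {
    "pipeline": {"name": "preprocess"},
    "transform": {
        "image": "example/preprocess:latest",   # placeholder image
        "cmd": ["python", "/preprocess.py", "/pfs/images", "/pfs/out"],
    },
    "input": {"pfs": {"repo": "images", "glob": "/*"}},   # versioned input repo
}

with open("preprocess.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)

# The spec would then be submitted with pachctl, e.g.:
#   pachctl create pipeline -f preprocess.json
```

Because the input repo is versioned, Pachyderm can record which commit of the data each pipeline run processed, which is what enables the provenance capability mentioned above.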
Whitenack also gave a live demo of the AI workflow using Pachyderm and Kubernetes. The sample application implements an image-to-image translation use case in which satellite images are automatically translated into maps, using TensorFlow for model training and inference.
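A hedged sketch of what the inference step might look like, assuming the trained generator was exported as a TensorFlow SavedModel; the paths, the 256x256 tile size, and the [-1, 1] pixel scaling are assumptions, and the actual demo code lives in the Pachyderm machine learning examples.

```python
import tensorflow as tf

MODEL_DIR = "trained_model/"   # assumed SavedModel export from the training stage
IMAGE_PATH = "satellite.png"   # assumed input satellite tile

# Load the exported generator and prepare one satellite image.
model = tf.saved_model.load(MODEL_DIR)
raw = tf.io.read_file(IMAGE_PATH)
image = tf.image.decode_png(raw, channels=3)
image = tf.image.resize(image, (256, 256))
image = image / 127.5 - 1.0    # scale pixels to [-1, 1]

# Run the generator (assumes the SavedModel is callable on a batched tensor)
# and write the predicted map tile back to disk.
generated = model(tf.expand_dims(image, 0))
output = tf.cast((generated[0] + 1.0) * 127.5, tf.uint8)
tf.io.write_file("map.png", tf.io.encode_png(output))
```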
If you are interested in learning more about the Pachyderm framework, check out the machine learning examples, the Developer Documentation, the Kubernetes GPU documentation, or join the Slack channel.