IBM recently announced Fabric for Deep Learning (FfDL, pronounced "fiddle"), a microservices-based platform layered on Kubernetes for:
- Training of Deep Learning models
- Open Deep Learning APIs
- Common Instrumentation and
- Hosting Deep Learning in multiple clouds
Rather than needlessly exposing the details of the underlying hardware, which may comprise CPUs, GPUs, and so on, it leverages Kubernetes, Helm charts, and microservices to handle the different configurations inherent in a Deep Learning platform.
InfoQ caught up with Ruchir Puri, chief architect of Watson, regarding FfDL.
InfoQ: There is a lot of activity around Deep Learning and Kubernetes platforms. Can you describe the overall synergy between Deep Learning and Kubernetes?
Puri: For distributed Deep Learning, scalability, parallelism, resiliency, and on-demand scheduling and termination of batch jobs are key characteristics needed from an underlying platform. Kubernetes provides all this and more, hence the rise of Deep Learning platforms on Kubernetes such as FfDL, Kubeflow, PaddlePaddle, and others.
Kubernetes' ever-evolving support for NVIDIA GPUs via device plugins is another reason. GPU resources are expensive, so it makes sense for groups of data scientists to share a managed cluster of GPUs. Even for a single Deep Learning practitioner, Kubernetes handles scheduling and job management, freeing them to focus on their tasks.
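As a minimal sketch of that scheduling point (not FfDL's actual control-plane code; the image, command, and job name are placeholders), a platform can ask Kubernetes for a GPU-backed batch job through the official Kubernetes Python client:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

# Container that runs a (hypothetical) training script and requests one GPU
# from the shared pool; assumes the NVIDIA device plugin is installed.
container = client.V1Container(
    name="trainer",
    image="example/dl-trainer:latest",  # placeholder image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

# Batch job: Kubernetes schedules it onto a GPU node and manages its lifecycle.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="dl-training-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```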
The work happening in the Kubernetes community on StatefulSets for persistent storage, support for the Container Storage Interface (CSI), and so on also plays into this strongly, given that many machine learning jobs have to work closely with data.
InfoQ: The ML platforms themselves are installed via Helm charts. What's the value-add that FfDL provides, then?
Puri: Beyond the installation and deployment of FfDL via Helm charts, which will appeal to DevOps folks familiar with the Kubernetes way of working, the FfDL control-plane microservices are deployed as pods, and we leverage Kubernetes to manage this cluster of GPU- and CPU-enabled machines effectively, to restart microservices when they crash, and to report on their health. We also provide support for S3-compatible storage, in addition to supporting a multi-framework approach to distributed deep learning.
AI developers and data scientists get a single platform for deep learning training, with integrated job scheduling, logging, and monitoring dashboards. The dashboards display all the evaluation metrics at every step (accuracy, entropy, weights, biases, and so on) in a framework-agnostic manner.
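As an illustration of that framework-agnostic idea (the log format and field names below are hypothetical, not FfDL's actual schema), a training loop in any engine could emit per-step metrics as JSON lines for a dashboard to consume:

```python
# Illustrative only: write per-step evaluation metrics as JSON lines so a
# framework-agnostic dashboard can parse them, whether the values came from
# TensorFlow, PyTorch, or another engine. Field names here are hypothetical.
import json
import time

def log_metrics(step, **metrics):
    record = {"timestamp": time.time(), "step": step, **metrics}
    with open("training-metrics.log", "a") as f:
        f.write(json.dumps(record) + "\n")

# e.g. inside any framework's training loop:
log_metrics(step=100, accuracy=0.92, cross_entropy=0.31)
```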
For system operators, FfDL provides insulating APIs that allow the services to grow in modular ways and the components to evolve continually, as the rapid innovation in the DL area demands.
For Deep Learning framework innovators, FfDL provides a platform for API collaboration, which can make their frameworks more universally accessible across the rich Deep Learning ecosystem.
For software engineers interested in developing AI componentry, workflows, and applications, FfDL provides an open framework for collaborative development. Accepted APIs and components have the potential to be adopted into the industrial-strength IBM AI Studio, as well as the general open source AI ecosystem.
InfoQ: Can you compare and contrast Kubeflow and FfDL from a developer and user perspective?
Puri: FfDL is the core of IBM Watson Studio's Deep Learning as a Service technology, which we have open sourced and made available on GitHub. It forms a key part of IBM's Center for Open-Source Data and AI Technologies, building on the IBM Spark Technology Center, alongside the Model Asset eXchange (MAX) and the Adversarial Robustness Toolbox (ART), with more on the way, all available at IBM Code for developers. We believe FfDL has capabilities that are complementary to other open source frameworks in this space, such as Google's Kubeflow, Baidu's PaddlePaddle, and others. IBM is a leader in open source community contributions across a wide range of complementary technologies that have the potential to democratize access to AI.
InfoQ: The documentation seems to suggest that object storage is tied to Amazon S3, which would make it cloud-specific. Can you clarify whether there is a dependency on a specific cloud?
Puri: Any S3 API-compatible storage works, including IBM Cloud Object Storage, as the sketch below illustrates. We are also working on making the storage story more generic by adding support for NFS and other backends.
In addition, we are closely monitoring Kubernetes community support for CSI to make the storage story standards-based.
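To illustrate the S3-compatibility point (the endpoint URL and credentials below are placeholders), the same boto3 client code can target AWS S3, IBM Cloud Object Storage, or a local MinIO instance simply by changing the endpoint:

```python
# The S3 wire protocol is the contract; the provider behind it is swappable.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-cloud.com",  # any S3-compatible endpoint
    aws_access_key_id="ACCESS_KEY",               # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# Upload a training data set, and later fetch the trained model artifacts.
s3.upload_file("mnist.tar.gz", "training-data", "mnist.tar.gz")
s3.download_file("training-results", "model.bin", "model.bin")
```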
InfoQ: How will FfDL simplify the daily life of an ML/data scientist or developer, rather than complicate it by adding the additional Kubernetes layer?
Puri: Training deep neural networks, known as Deep Learning, is currently highly complex and computationally intensive. It requires a highly tuned system with the right combination of software, drivers, compute, memory, network, and storage resources. Data scientists and AI developers should be free to do what they do best: focus on data and its refinements, train neural network models (with automation) over these large data sets, and create cutting-edge models.
FfDL offers a stack that abstracts away these concerns so data scientists can execute training jobs with their choice of deep learning framework at scale in the cloud. It has been built to offer resilience, scalability, multi-tenancy, and security without modifying the deep learning frameworks, and with no or minimal changes to model code. FfDL will also insulate the data scientist from the turmoil of rapidly evolving AI infrastructure, at least to a point. They can expect their clusters to evolve and improve, and higher-level features to be added, without having to rewrite their systems for every change.
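A sketch of what "no or minimal changes" can look like in practice: rather than hard-coding cluster paths, a training script reads its input and output locations from environment variables injected by the platform (DATA_DIR and RESULT_DIR are assumed names for illustration):

```python
# The only platform-specific touch points are two environment variables;
# everything else is ordinary framework code. Variable names are assumptions.
import os

data_dir = os.environ.get("DATA_DIR", "./data")        # where training data is mounted
result_dir = os.environ.get("RESULT_DIR", "./results")  # where artifacts should be written

def train(data_path, output_path):
    # ... ordinary TensorFlow/PyTorch training code, unchanged ...
    pass

train(data_dir, os.path.join(result_dir, "model.bin"))
```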
InfoQ: Can you provide more technical details on how support for other ML toolkits can be integrated into FfDL? Can you also talk about community support for FfDL?
Puri: Bring a Docker image of the ML toolkit, and add pointers in the FfDL Lifecycle Manager (LCM) configuration files to tell the platform about its inclusion. If we are adding support for distributed training, additional code would be needed to ensure the distributed architecture of the ML engine can be supported; e.g., TensorFlow prefers a parameter-server approach to distributed learning by default, whereas PyTorch prefers an MPI approach.
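As an illustrative sketch of that engine-specific point (assuming a PyTorch build with MPI support and an MPI launcher such as mpirun; TensorFlow would instead be configured with parameter-server roles), a learner process might initialize its distributed backend like this:

```python
# With the MPI backend, rank and world size come from the MPI launcher's
# environment, so the same script runs unchanged on any number of learners.
import torch.distributed as dist

dist.init_process_group(backend="mpi")  # requires PyTorch built with MPI support

rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"learner {rank} of {world_size} ready")
```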
We expect the community to evolve over time, based on the merit of our offerings. We will reach out to major players as collaboration opportunities arise. In addition, IBM has strong connections with universities, and will work with joint research efforts such as the MIT-IBM Watson AI Lab to use FfDL as a platform for AI engineering, where appropriate. We hope, in general, that open source communities will be pleased to have this addition to the AI ecosystem, and will work with us as it evolves to offer ever greater value.
Additional technical details are available on the FfDL Wiki.