
Serving Deep Networks in Production: Balancing Productivity vs Efficiency Tradeoff

A new project provides an alternative modality for serving deep neural networks. It enables using eager-mode (a.k.a. define-by-run) model code directly in production workloads by embedding CPython interpreters. The goal is to reduce the engineering effort needed to bring models from the research stage to end users and to create a proof-of-concept platform for migrating future numerical Python libraries. The initial library is also available as a prototype feature (torch::deploy) in the PyTorch C++ API (version 1.11).

The two common practices for deploying deep networks for API inference have been direct containerization of the model code with a REST/RPC server (e.g. using Python, CUDA, or ROCm base images) and construction of a static model graph (e.g. using TensorFlow graph mode or TorchScript). Containerization brings agility when carrying a model from development to production. Graph mode (a.k.a. define-and-run), on the other hand, allows optimized deployment within a larger serving infrastructure. Both methods involve tradeoffs in engineering cost, API latency, and resource constraints (e.g. available GPU memory).
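
As a point of reference for the graph-mode route, the sketch below loads and runs a TorchScript model from C++, following the standard TorchScript loading workflow; the model file name and input shape are illustrative.

#include <torch/script.h>

#include <iostream>
#include <vector>

int main() {
    // Load a static TorchScript graph previously exported from Python,
    // e.g. with torch.jit.trace or torch.jit.script (file name is illustrative).
    torch::jit::script::Module module = torch::jit::load("resnet18_traced.pt");
    module.eval();

    // Run the graph directly from C++; no Python interpreter is involved.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 224, 224}));
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output.sizes() << std::endl;
    return 0;
}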

Depending on the application field and the company's infrastructure requirements, the deployment method may be customized to specific needs (e.g. graph mode may be enforced, or containers may be created from compiled models instead). However, the rapidly evolving nature of deep networks calls for relatively flexible development environments in which quick experimentation is possible, so there is a limit to how far the development cost of bringing such models to production can be reduced. The question the paper raises is: can machine learning engineering effort be minimized by enabling eager-mode serving within the C++ ecosystem, without incurring large performance penalties?

The authors' answer to this question is not very surprising: what if incoming requests were handed off to (multiple) CPython workers packaged within the C++ application? With load balancing via a proxy, the models can be used directly without further engineering work. Such a scenario would also allow machine learning engineers to keep the external libraries used while developing the model (e.g. NumPy, Pillow), as the required CPython interpreter is available (this is not yet supported in the current prototype). The approach may seem similar to containerization with base Python images, since both evade Python's global interpreter lock (GIL) by using multiple decoupled interpreters, but the new method also allows execution in C++; in other words, the idea combines the two aforementioned deployment styles in a unique way.

An example of packaging in Python can be seen below:

from torch.package import PackageExporter
import torchvision

model = torchvision.models.resnet.resnet18()

# Bundle the model and its source dependencies into a single archive.
with PackageExporter("my_package.pt") as e:
    # Include the torchvision source code inside the package.
    e.intern("torchvision.**")
    # Mark modules that will be provided by the loading environment.
    e.extern("sys")
    # Serialize the model object itself into the archive.
    e.save_pickle("model", "model.pkl", model)

After the model artifact is saved, it can be loaded through the C++ API for inference:

#include <torch/deploy.h>
#include <torch/script.h>

// Start a pool of embedded CPython interpreter workers.
int n_workers{4};
torch::deploy::InterpreterManager manager(n_workers);

// Load the torch.package archive and unpickle the model; the resulting
// object is replicated across the interpreter workers.
torch::deploy::Package package = manager.loadPackage("path_to_package");
torch::deploy::ReplicatedObj model = package.loadPickle("model", "model.pkl");
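
The snippet above stops at loading the model. Assuming the returned ReplicatedObj can be invoked like a TorchScript module, as in the torch::deploy examples, a forward pass might look like the following sketch; the input shape matches the ResNet-18 model packaged above.

// Hypothetical continuation: run one inference request. The call is
// dispatched to one of the free interpreter workers by the manager.
auto input = torch::ones({1, 3, 224, 224});  // a batch of one 224x224 RGB image
at::Tensor output = model({input}).toTensor();
// For an ImageNet-trained ResNet-18 the output shape is [1, 1000].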

The benchmarks reported in the paper show that such CPython packaging can be a good alternative, especially for serving large models. As the project is still a work in progress, it has several shortcomings. For example, external library support is currently limited to the Python standard library and PyTorch. It also requires copying and loading the shared interpreter library for each interpreter, so the size and number of workers may become a scalability factor. In the future, the contributors plan to package dependencies directly from pip/Conda environments, allowing even easier production deployment.
