Google announced the first major release of TensorFlow Serving at Google I/O 2017 last month. Noah Fiedel detailed some of the new features and shared his insight into the project's vision going forward.
TF Serving 1.0 includes several batching options, a model manager for life-cycle management, simultaneous serving of multiple model versions, support for subtasks, and standardized data-source definitions for pluggable, callable model storage.
Fiedel framed the talk in terms of a sometimes nebulous data-science and engineering world that is still 10 to 20 years behind the state of the art in software engineering, specifically around critical non-functional requirements like portability, reproducibility, and resiliency. Because the reproducibility and portability of models, their configuration, and their metadata are still being standardized, some of the best practices that emerged in software engineering remain a ways out from becoming ubiquitous industry standards for machine learning.
TensorFlow Serving hopes to address some of these challenges and to offer standardization as part of its core programming model and platform going forward.
One of the core challenges Fiedel notes in the TensorFlow community is how to take a model sitting on disk and make it callable by services as part of a reproducible pipeline. TF Serving allows users to natively push out multiple versions of a model to run simultaneously and to revert changes without additional stand-alone CI/CD tooling.
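As an illustration of multiple versions running side by side, the sketch below shows a gRPC client pinning its requests to one specific model version, which is also how a caller can keep hitting a previous version during a rollback. The model name, tensor names, and port are hypothetical, and the snippet assumes the Python stubs from the tensorflow-serving-api package plus the TF 1.x tf.contrib.util.make_tensor_proto helper.

```python
# Minimal sketch of a client pinning a specific model version
# (hypothetical names; assumes tensorflow-serving-api Python stubs).
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"   # hypothetical model name
request.model_spec.version.value = 2   # pin to version 2; omit to use the latest
request.inputs["x"].CopyFrom(
    tf.contrib.util.make_tensor_proto([[1.0, 2.0, 3.0]], dtype=tf.float32))

response = stub.Predict(request, 5.0)  # 5-second deadline
print(response.outputs["y"])           # hypothetical output tensor name
```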
TensorFlow Serving 1.0 will be minor-version-aligned with releases of TensorFlow, and will also supply a maintained Debian package available through apt-get.
TensorFlow Serving has three major components. The first is a set of C++ libraries that support saving and loading TensorFlow models with a standardized save-and-export format. The second is a generic, config-driven core platform that offers a level of interoperability with other ML platforms, though this wasn't covered in detail. The core platform packaging includes executable binaries with best-guess default settings and practices baked in, as well as a hosting service. Lastly, TF Serving offers a commercial option through Cloud ML.
Low-latency request and response times, as well as more optimized compute-time allocation, became a focus for the core platform. TF Serving implements mini-batch processing of requests for fewer round-trip calls and more efficient resource allocation, using separate thread pools for GPU/TPU-bound and CPU-bound tasks.
The libraries are optimized for high performance using the read-copy-update pattern, fast model loading on server startup, and reference-counted pointers that provide fault tolerance by keeping track of where along the graph a model is executing at a given moment.
Model loading and inference serving are non-blocking and handled through separate thread pools. Mini-batching with queues is an important concept for TF Serving: asynchronous requests are mini-batched together and passed into a single TF Session, which coordinates with a Manager to process them, reducing per-request latency. Request handling is implemented using ServableHandle, a pointer tied to a specific client request.
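The real batching logic lives in TF Serving's C++ libraries; the Python sketch below is only a conceptual illustration of the idea, with hypothetical names and parameters, showing asynchronous requests accumulating in a queue and being flushed as one mini-batch into a single model call.

```python
# Conceptual illustration only (hypothetical names, not the TF Serving C++ API):
# requests queue up and are flushed as one mini-batch into a single model call.
import concurrent.futures
import queue
import threading

MAX_BATCH_SIZE = 32        # hypothetical batching parameters
BATCH_TIMEOUT_SECS = 0.005

request_queue = queue.Queue()

def run_model_on_batch(batch):
    # Stand-in for one TF Session.run() over the concatenated inputs.
    return [item * 2 for item in batch]

def batching_loop():
    while True:
        item, future = request_queue.get()      # block for the first request
        batch, futures = [item], [future]
        # Gather more requests until the batch is full or the timeout expires.
        try:
            while len(batch) < MAX_BATCH_SIZE:
                item, future = request_queue.get(timeout=BATCH_TIMEOUT_SECS)
                batch.append(item)
                futures.append(future)
        except queue.Empty:
            pass
        for result, future in zip(run_model_on_batch(batch), futures):
            future.set_result(result)           # hand each caller its result

def submit(request_value):
    # Callers submit asynchronously and wait on the returned future.
    future = concurrent.futures.Future()
    request_queue.put((request_value, future))
    return future

threading.Thread(target=batching_loop, daemon=True).start()
print(submit(21).result())                      # example caller
```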
The SessionBundle model format is deprecated in favor of a new SavedModel format. TF Serving 1.0 introduced the concept of a MetaGraph, which carries information about the processor architecture a trained model can run on, along with model vocabularies, embeddings, training weights, and other parameters. SavedModel objects are composed of MetaGraphs and are designed to be portable, callable artifacts. The SavedModel abstraction can be used for training, serving, and off-line evaluation, and is executable with a new command-line interface. SignatureDef is a component of SavedModel that annotates a graph's inputs and outputs so they can be identified and called for serving.
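To make the SavedModel/SignatureDef relationship concrete, here is a minimal sketch using the TF 1.x Python export API (SavedModelBuilder); the graph, tensor names, and export path are hypothetical.

```python
# Minimal sketch of exporting a SavedModel with a Predict SignatureDef
# (TF 1.x-era API; graph, tensor names, and export path are hypothetical).
import tensorflow as tf

export_dir = "/tmp/my_model/1"   # numbered version directory TF Serving can pick up

with tf.Graph().as_default(), tf.Session() as sess:
    x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
    w = tf.Variable(tf.ones([3, 1]), name="w")
    y = tf.matmul(x, w, name="y")
    sess.run(tf.global_variables_initializer())

    # The SignatureDef names the graph's inputs and outputs for clients.
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={"x": x}, outputs={"y": y})

    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                signature})
    builder.save()
```

The exported directory can then be inspected from the command line with the SavedModel CLI's `show` command.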
APIs are provided for a data Source library and a SourceAdapter for model types. The Source module emits a loader for SavedModels and estimates RAM requirements. The Source API emits metadata to a Manager module, which loads the data for a model as well as the model itself, as sketched below. Fiedel noted the new ServerCore reduced boilerplate lines of code (LOC) from roughly 300 to 10 by encapsulating common code patterns into a config file used for injection.
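The following is a conceptual Python sketch of that Source-to-Manager flow; the class and method names are hypothetical stand-ins for TF Serving's C++ components, not its actual API.

```python
# Conceptual sketch of the Source -> Loader -> Manager flow
# (illustrative only; real components are C++ classes in TF Serving).
class Loader:
    """Knows how to load one SavedModel version and estimate its RAM needs."""
    def __init__(self, path):
        self.path = path

    def estimate_ram_bytes(self):
        return 512 * 1024 * 1024            # hypothetical estimate

    def load(self):
        print("loading SavedModel from", self.path)

class Manager:
    """Decides which emitted versions to load and keeps them servable."""
    def __init__(self, ram_budget_bytes):
        self.ram_budget_bytes = ram_budget_bytes
        self.loaded = {}

    def on_aspired_versions(self, model_name, loaders):
        for version, loader in loaders.items():
            if loader.estimate_ram_bytes() <= self.ram_budget_bytes:
                loader.load()
                self.loaded[(model_name, version)] = loader

class Source:
    """Watches a storage path and emits a Loader per discovered model version."""
    def __init__(self, manager):
        self.manager = manager

    def poll(self, model_name, version_paths):
        loaders = {v: Loader(p) for v, p in version_paths.items()}
        self.manager.on_aspired_versions(model_name, loaders)

# Hypothetical usage: two versions discovered on disk, handed to the Manager.
manager = Manager(ram_budget_bytes=2 * 1024 ** 3)
Source(manager).poll("my_model", {1: "/models/my_model/1", 2: "/models/my_model/2"})
```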
New inference APIs provide several reusable components for common use cases: Predict for prediction configuration and serving, Regress for regression modeling, Classify for classification algorithms, and MultiInference for combining Regress and Classify API usage in a single call.
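As a sketch of the Classify API, a client wraps its features in a tf.train.Example and sends a ClassificationRequest; the model name, feature key, and port are hypothetical, and the snippet again assumes the tensorflow-serving-api Python stubs.

```python
# Sketch of a Classify call (hypothetical model name and feature key;
# assumes the tensorflow-serving-api Python stubs).
import grpc
import tensorflow as tf
from tensorflow_serving.apis import classification_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

example = tf.train.Example(features=tf.train.Features(feature={
    "x": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 2.0, 3.0])),
}))

request = classification_pb2.ClassificationRequest()
request.model_spec.name = "my_classifier"
request.input.example_list.examples.extend([example])

response = stub.Classify(request, 5.0)   # 5-second deadline
for classification in response.result.classifications:
    for cls in classification.classes:
        print(cls.label, cls.score)
```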
Multi-headed inferencing is now supported as well. Multi-headed inferencing is an emergent research topic in machine learning that attempts to address the presence of concurrent identical requests. It can potentially be used to further increase the efficiency of mini-batches by removing requests recognized as erroneous or repetitious. Resilience against sources of erroneous activity that generate large, expensive input volumes can help prevent negative cascading effects on resource consumption and data quality.
Google recommends the static binaries, tuned for customers on GCP who want to take advantage of best-guess default settings for the gRPC endpoints. TF Serving 1.0 also provides Docker images, as well as a Kubernetes tutorial to get started with.