Key Takeaways
- As deep networks become more specialized and resource-hungry, serving them on acceleration hardware within tight budgets is increasingly difficult for startup and scale-up companies.
- Loosely coupled architectures may be preferred as they bring high controllability, easy adaptability, transparent observability, and auto-scalability (cost-effectiveness) when serving deep networks.
- By utilizing managed cloud components such as functions, messaging services, and API gateways, companies of any size can use the same serving infrastructure for both public and internal requests.
- Managed messaging brokers bring easy maintainability without requiring specialized teams.
- Adoption of serverless and loosely coupled components may lead to faster development and shipping of deep learning solutions.
As applications of deep learning widen across industries, the size and specificity of deep networks are also increasing. Large networks demand more resources, and compiling them from eager execution to an optimized computational graph running on CPU or FPGA backends may not be feasible because of their task-specificity (e.g., custom-made functions/layers). Such models may therefore require explicit GPU acceleration and custom configuration at runtime. However, deep networks are typically operated in constrained environments with limited resources: cloud GPUs are expensive, and low-priority (aka spot, preemptible) instances are scarce.
Bringing such networks to production using common machine learning serving frameworks may cause a lot of headaches for machine learning engineers and architects because of the tight temporal coupling between the caller and the API handler. This scenario is especially probable for startup and scale-up companies that are adopting deep learning. From GPU memory management to scaling, they may face several issues while serving deep networks.
In this article, I will emphasize the advantages of an alternative way to deploy deep networks that exploits message-based mediation. This relaxes the tight temporal coupling seen in REST/RPC API serving and yields an asynchronously operating deep learning architecture that will be favored by engineers working in startups and enterprise companies alike.
I will contrast the serving architecture with REST frameworks using four main criteria: controllability, adaptability, observability, and auto-scalability. I will further show how REST endpoints can easily be added to message-oriented systems using modern cloud components. All the ideas discussed are cloud-agnostic and language-agnostic, and they can be carried over to on-prem servers as well.
Challenges in Deep Network Serving
Deep networks are heavily cascaded nonlinear functions that form computational graphs applied to data. The parameters of these graphs are tuned during the training phase to minimize a chosen objective on selected input/output pairs. At inference time, the input data is simply piped through this optimized graph. As this description suggests, the obvious challenge of any deep network is its computational intensity. Knowing this, it may surprise you that the most common way of serving deep networks is REST/RPC API invocation, which is based on tight temporal coupling.
API-based serving does not cause any problems as long as the graph constituting a deep network can be optimized by layer fusion, quantization, or pruning. However, such optimizations are not always guaranteed, especially when moving a research network to production. Most of the ideas coming out of the R&D phase are highly task-specific, and general frameworks created to optimize computational graphs may not work on such networks (e.g., layers with the Swish activation function cannot be compiled by the PyTorch/ONNX JIT).
In this scenario, the tight coupling caused by a REST API is not desirable. Another method to increase inference speed is to compile the graph to run on hardware designed for proper parallelization of the computations performed in the graph (e.g., FPGA, ASIC). But again, specificity issues cannot be handled this way, as custom functions need to be integrated into the FPGA via a hardware description language (e.g., Verilog, VHDL).
Considering that deep networks will continue to grow in size and become more and more specialized by industry, explicit GPU acceleration at inference time is to be expected in the near future. Therefore, replacing the synchronous interface between the caller and the serving function with highly controllable, pull-based inference is advantageous in many ways.
Breaking the Tight Temporal Coupling
Relaxation of temporal coupling can be achieved by adding middleware services to a system. In a way, it is like using an email service to communicate with your neighbor rather than shouting out of the window. By using messaging middleware (e.g., RabbitMQ, Azure Service Bus, AWS SQS, GCP Pub/Sub, Kafka, Apache Pulsar, Redis), the target now has full flexibility over how to process the caller's request (i.e., the neighbor may ignore your email until they finish dinner). This is especially advantageous as it allows high controllability from an engineer's point of view. Consider the case where two deep networks (requiring 3 GB and 6 GB of memory at inference time) are deployed on a GPU with 8 GB of memory.
In a REST-based system, one needs to take precautions beforehand to ensure the workers of the two models do not overuse GPU memory; otherwise, because of direct invocation, some requests will fail. If a queue is used instead, the workers may choose to defer the work until memory becomes available, as sketched below. Since this is handled asynchronously, the caller is not blocked and may continue its own work. This setup is especially well suited to internal requests within a company, which have relatively loose time constraints, but such queuing can also handle client or partner API requests in real time using cloud components (e.g., serverless functions), as described in the next section.
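As an illustration, the following framework-agnostic sketch shows a pull-based worker that re-queues a request when the GPU is busy instead of failing it. The pynvml dependency, the model callable, and the one-second back-off are assumptions made for the example; in a real deployment the queue would be the messaging middleware itself:

import queue
import time

import pynvml  # assumed extra dependency: NVIDIA's NVML bindings

pynvml.nvmlInit()
_GPU = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu_memory_is_free(required_bytes):
    # Ask the driver how much memory is currently free on GPU 0.
    return pynvml.nvmlDeviceGetMemoryInfo(_GPU).free >= required_bytes

def worker_loop(pending: queue.Queue, model, required_bytes):
    # Pull-based consumption: the worker decides when it is ready to serve.
    while True:
        request = pending.get()
        if not gpu_memory_is_free(required_bytes):
            # Not enough GPU memory right now: put the request back and
            # retry later instead of failing; the caller is never blocked.
            pending.put(request)
            time.sleep(1.0)
            continue
        model(request)  # run inference once resources are available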
Another convincing aspect of message-mediated DL serving is its easy adaptability. There is a learning curve for any web framework or library, even for micro-frameworks such as Flask, if one wants to exploit their full potential. In contrast, one does not need to know the internals of messaging middleware; furthermore, all major cloud vendors provide managed messaging services that take maintenance out of the engineers' backlog. This also has many advantages in terms of observability. As messaging is separated from the main deep learning worker by an explicit interface, logs and metrics can be aggregated independently. On the cloud, this may not even be needed, as managed messaging platforms handle logging automatically and provide additional services such as dashboards and alerts. The same queuing mechanism also lends itself natively to auto-scalability.
Building on this observability, queuing brings the freedom to choose how to auto-scale the workers. In the next section, an auto-scalable container deployment of DL models will be shown using KEDA (Kubernetes Event-driven Autoscaling). It is an open source, event-based auto-scaling service that aims to ease automatic K8s pod management. It is currently a Cloud Native Computing Foundation sandbox project and supports around 30 scalers, from Apache Kafka to Azure Service Bus, AWS SQS, and GCP Pub/Sub. KEDA's parameters give the freedom to tune the scaling mechanism to the incoming volume, such as the number of waiting messages, duration, and load size.
An Example Deployment
In this section, a template deployment example will be shown using PyTorch worker containers and Kubernetes on Azure. Data communication will be handled by Azure Service Bus, except for large artifacts such as network weights, input, and possible output images. These should be stored in blob storage and downloaded/uploaded from the containers using the Azure Python SDK for blobs (a minimal sketch follows Fig-1). A high-level overview of the architecture is given in the Fig-1 schematic.
Fig-1: The high-level overview of the proposed architecture. For each block, the corresponding Azure service is given in parentheses. It can handle both external REST API requests using serverless functions and internal requests directly from the queues.
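For the blob transfers mentioned above, a minimal sketch of downloading tuned weights with the Azure Python SDK for blobs could look like the following (the connection string, container, and blob names are placeholders you would supply yourself):

from azure.storage.blob import BlobClient  # azure-storage-blob >= 12

def download_weights(conn_str, container, blob_name, target_path):
    # Fetch tuned network weights from blob storage into the worker's filesystem.
    blob = BlobClient.from_connection_string(
        conn_str, container_name=container, blob_name=blob_name
    )
    with open(target_path, "wb") as f:
        f.write(blob.download_blob().readall())
    return target_path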
We will implement an auto-scalable competing-consumers pattern using an Azure Service Bus queue and KEDA. To enable a request-reply pattern for REST requests, Azure Durable Functions external events can be used. In the example architecture, we assume a durable function is ready and transmits the feedback-event reply URL to the worker via the Service Bus queue; the details of setting up this service are explained in the Azure documentation, and a sketch of how the worker can post its result back is shown below. KEDA will allow us to set scaling rules based on queue length; this way, the number of worker pods in K8s is automatically updated according to the load. By also binding a worker container (or multiple containers in our case) to a GPU, we can auto-scale any managed cluster and add more GPU machines to our system without any hassle. Auto-scaling the cluster itself is handled automatically by K8s to resolve resource constraints (i.e., node pressure due to an insufficient number of GPUs).
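The worker's reply to a REST request can then be as simple as an HTTP POST to the reply URL carried in the message, which raises the external event on the waiting durable orchestration. The payload shape and the requests dependency are assumptions of this sketch:

import requests  # assumed HTTP client dependency of the worker

def send_reply(reply_url, class_ids):
    # Raise the Durable Functions external event by posting the inference
    # result to the reply URL received via the Service Bus message.
    response = requests.post(reply_url, json={"class_ids": class_ids}, timeout=10)
    response.raise_for_status()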
The detailed template describing how to serve a vanilla ResNet classifier can be found in the GitHub repo. In the article, shortened versions of each block will be shown. As the first step, let's create our deep network serving function (network.py). A template class to initialize our inference function can be written as follows; it can be customized according to the task at hand (e.g., segmentation, detection):
import torch


class Infer(object):
    __slots__ = []  # attribute names (e.g., "model") go here

    # Initialize the inference function (e.g., download weights from blob)
    def __init__(self, tuned_weights=None):
        pass

    # Carry out inference
    @torch.no_grad()
    def __call__(self, pil_image):
        pass
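As a concrete illustration, a filled-in version for a vanilla ImageNet classifier might look like the sketch below. It assumes torchvision is available, a GPU is visible, and that tuned_weights (if given) is a local path to a state dict; the version in the repository may differ in details:

import torch
from torchvision import models, transforms


class Infer(object):
    __slots__ = ["model", "preprocess"]

    def __init__(self, tuned_weights=None):
        # Fall back to the pretrained ImageNet weights when no tuned
        # weights are supplied (e.g., downloaded from blob storage).
        self.model = models.resnet50(pretrained=tuned_weights is None)
        if tuned_weights is not None:
            self.model.load_state_dict(torch.load(tuned_weights))
        self.model.eval().cuda()
        self.preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    @torch.no_grad()
    def __call__(self, pil_image):
        # Return the class IDs of the top-5 ImageNet categories.
        batch = self.preprocess(pil_image).unsqueeze(0).cuda()
        return self.model(batch).topk(5).indices.squeeze(0).tolist()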
In the original function, we return the class IDs of the top 5 ImageNet categories. Subsequently, we are ready to write our worker Python function (run.py), where we integrate the model with Azure Service Bus. As shown in the snippet below, the Azure Python SDK for Service Bus allows pretty straightforward management of incoming queue messages. The PEEK_LOCK receive mode lets us explicitly control when to complete or abandon the incoming request:
...
with servicebus_client:
    receiver = servicebus_client.get_queue_receiver(
        queue_name=self.bus_queue_name,
        receive_mode=ServiceBusReceiveMode.PEEK_LOCK,
    )
    with receiver:
        for message in receiver:
            try:
                # Carry out serving using the input data
                data = json.loads(str(message))
                ...
                # Explicitly mark the request as processed
                receiver.complete_message(message)
            except Exception:
                # Release the lock so the message can be retried later
                receiver.abandon_message(message)
At this point, we have our template worker ready; now let's create the container's Dockerfile and push it to our Azure Container Registry. Here, requirements.txt contains the additional pip dependencies of our worker. Running the main process via exec can be seen as a hack to ensure it runs as the PID 1 process. This enables the cluster to restart the pod automatically if any error occurs, without writing an explicit liveness probe in the deployment YAML file. Note that it is still better practice to specify a health check:
FROM pytorch/pytorch:1.8.0-cuda11.1-cudnn8-runtime
COPY requirements.txt /worker/requirements.txt
RUN python -m pip install -r /worker/requirements.txt
COPY . /worker
WORKDIR /worker
CMD exec python run.py
NAME= # SET
VERSION= # SET
ACR= # SET
sudo docker build -t $ACR.azurecr.io/$NAME:$VERSION -f Dockerfile .
sudo docker push $ACR.azurecr.io/$NAME:$VERSION
After creating the K8s cluster, don't forget to enable the node auto-scaling feature (e.g., min 1, max 8) from the portal. As the final preparation step, we need to enable the GPU drivers (in the gpu-resources namespace) in our cluster and deploy the KEDA service (in the keda namespace) via the official YAML files. The KEDA and GPU driver YAML files are already included in the repo for your convenience:
kubectl create ns gpu-resources
kubectl apply -f nvidia-device-plugin-ds.yaml
kubectl apply -f keda-2.3.0.yaml
As the next step, the worker containers can be deployed by the prepared shell script. First, we create the namespace to deploy our service:
kubectl create namespace ml-system
Note that using a shell script instead of plain YAML makes it easy to change the parameters. By running the deployment script (deploy.sh), we are ready to go (don't forget to set the parameters according to your needs):
bash deploy.sh
Since we limit a single GPU per pod, scaling the pods via KEDA will effectively scale the cluster nodes as well. This makes the overall architecture very cost-effective. In some cases, you may even set the minimum number of nodes to zero and cut GPU costs when the workers are idle. However, one has to be very careful with such a configuration and take the nodes' scale-up time into account. The details of the KEDA parameters used in the deployment script can be found in the official documentation.
In the deployment script (deploy.sh), if you look carefully, you will notice that we reach the GPU from a second container (worker-1) by setting the NVIDIA_VISIBLE_DEVICES environment variable to "all". This trick enables us to exploit both cluster scaling and multiple containers in a pod. Without it, K8s would not allow adding more containers per GPU because of the "limit" constraint on worker-0. Engineers should measure the GPU memory usage of their models and add containers according to the GPU card's capacity; a minimal measurement sketch is shown below. Note that details of the Azure-specific blocks in Fig-1 (except the example Service Bus receiver) are omitted for brevity. Azure has extensive documentation for each component, with related Python example implementations.
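One simple way to estimate a model's footprint is to run a dummy forward pass and read PyTorch's peak-allocation counter, as in the sketch below. Keep in mind that the number reported by nvidia-smi will be higher because of the CUDA context and the caching allocator:

import torch

def peak_gpu_memory_mb(model, input_shape=(1, 3, 224, 224)):
    # Reset the peak-memory counter, run one dummy forward pass on the GPU,
    # and report the peak allocation in megabytes.
    model = model.cuda().eval()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(torch.randn(*input_shape, device="cuda"))
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)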
Future Directions
What is happening now in deep learning hardware research may seem very familiar if one reads about how computation evolved in the 20th century. An obvious low-hanging fruit was the miniaturization of transistors to brute-force the number of computational cores that could fit on a chip. In the end, industries depended heavily on VLSI improvements rather than algorithmic developments. Seeing how fast DL-specific hardware is growing, we may expect a very similar 21st century. In the cloud, on the other hand, the low-hanging fruit appears to be serverless, accelerated DL services. Deep learning deployment will be further abstracted, and pay-per-use startups will be common in the near future. It is also fair to project that loosely coupled architectures will ease the minds of engineers working in such startups thanks to the flexibility they bring, hence we may see many new open source projects for loosely coupled architectures.
Conclusion
In this article, four main benefits of message-mediated deep learning serving were described: controllability, adaptability, observability, and auto-scalability (cost-effectiveness). In addition, I presented template code that can be used to deploy the described architecture on the Azure platform. It should be emphasized that the flexibility of such serving may not be practical in some scenarios, such as IoT and embedded-device serving, where local independence of the components outweighs it. However, the idea presented here can be adopted in many different ways; for example, rather than using cloud messaging services, low-level C/C++ message broker libraries can be used to create similar loosely coupled architectures on resource-constrained platforms (e.g., for autonomous driving, IoT).
Want to try the concepts in this article in practice? You can find the accompanying code on Github.
About the Author
Sabri Bolkar is a machine learning applied scientist and engineer who is interested in the complete lifecycle of learning-based systems from R&D to deployment and continuous improvement. He studied computer vision at NTNU and completed his MSc thesis on unsupervised image segmentation at KU Leuven. After a PhD adventure at TU Delft, he is currently tackling the large-scale challenges in applied deep learning faced by the e-commerce industry. You can reach him via his website.