InfoQ Homepage Articles Efficient Resource Management with Small Language Models (SLMs) in Edge Computing

AI, ML & Data Engineering

Efficient Resource Management with Small Language Models (SLMs) in Edge Computing

Nov 11, 2024 11 min read

Suruchi Shah
Staff Software Engineer, AI/ML Platform @LinkedIn

reviewed by

Srini Penchikala
Senior Software Architect

Write for InfoQ

Feed your curiosity. Help 550k+ global
senior developers
each month stay ahead.Get in touch

Key Takeaways

Small Language Models (SLMs) bring AI inference to the edge without overwhelming the resource-constrained devices.
SLMs can be used for learning and adapting to patterns in real-time, reducing the computational burden, and making edge devices smarter.
Techniques like quantization and pruning make the language models faster and lighter.
Google Edge TPU is designed to perform high-efficiency AI inferences directly on edge devices; it's a good case study to explore how pruning and sparsity techniques can optimize resource management.
Future direction of SLMs for resource management include IoT sensor networks, smart home devices, edge gateways in industrial automation, and smart healthcare devices.

In our hyper-connected world, where everything from your fridge to your fitness tracker is vying for a piece of the bandwidth pie, edge computing is the unsung hero keeping it all running smoothly. Think of it as the cool kid on the block, processing data right where it’s generated instead of dragging everything back to the cloud. This means faster decisions, less bandwidth-hogging, and a nice little privacy boost - perfect for everything from smart factories to your smart thermostat.

But here’s the catch: edge devices often operate under strict constraints in terms of processing power, memory, and energy consumption. Enter Small Language Models (SLMs), the efficient sidekick to save the day. These nimble little models can bring AI inference to the edge without overwhelming these resource-constrained devices.

In this article, we’ll dive into how SLMs can work their magic by learning and adapting to patterns in real-time, reducing the computational burden, and making edge devices smarter without asking for much in return.

Challenges in Resource-Constrained Edge Environments

Edge computing devices like IoT sensors and smart gadgets often have limited hardware capabilities:

Limited Processing Power: Many are powered by low-end CPUs or microcontrollers, which struggle to perform computationally heavy tasks.
Restricted Memory: With minimal RAM - storing "large" AI models? Not happening.
Energy Efficiency: Battery-powered IoT devices require efficient energy management to ensure long-lasting operation without frequent recharging or battery replacements.
Network Bandwidth Constraints: Many rely on intermittent or low-bandwidth network connections, making continuous chat with cloud servers inefficient or impractical.

Most AI models are just too big and power-hungry for these devices. That’s where SLMs come in.

How Small Language Models (SLMs) Optimize Resource Efficiency

Lightweight Architecture

SLMs are like the slimmed-down, lean version of massive models like GPT-3 or GPT-4. With fewer parameters (DistilBERT, for example, has 40% less baggage than BERT), they’re small enough to squeeze into memory-constrained devices without breaking a sweat, all while retaining most of their performance magic.

Compression Magic

Techniques like quantization (think reducing weights to lower-precision integers - reduces computational load) and pruning (cutting off the dead weight) make them faster and lighter. The result? Speedy inference times and reduced power drain, even on devices with the computational muscle of a flip phone.

Quantization

In cases where quantization is applied, the memory footprint is dramatically reduced. For instance, a quantized version of Mistral 7B may consume as little as 1.5GB of memory while generating tokens at a rate of 240 tokens per second on powerful hardware like the NVIDIA RTX 6000 (Enterprise Technology News and Analysis). This makes it feasible for edge devices and real-time applications that require low-latency processing.

Note: Studies on LLaMA3 and Mistral show that quantized models can still perform well in NLP and vision tasks, but the precision used for quantization must be carefully selected to avoid performance degradation. For instance, LLaMA3, when quantized to 2-4 bits, shows notable performance gaps in tasks requiring long-context understanding or detailed language modeling [Papers with Code], but it excels in more straightforward tasks like question answering and basic dialogue systems [Hugging Face]. Basically, there is no well-defined decision tree on how to do perfect quantization, it requires experimenting with specific use case data.

Pruning

Pruning works by identifying and removing unnecessary or redundant parameters in a model - essentially trimming neurons or connections that don't significantly contribute to the final output. This reduces the model size without major performance loss. In fact, research has shown that pruning (Neural Magic - Software-Delivered AI) can reduce model sizes by up to 90% while retaining over 95% of the original accuracy in models like BERT (Deepgram).

Pruning methods range from unstructured pruning, which removes individual weights, to structured pruning, which eliminates entire neurons or layers. Structured pruning, in particular, is useful for improving both model efficiency and computational speed, as seen with Google's BERT-Large, where 90% of the network can be pruned with minimal accuracy loss (Neural Magic - Software-Delivered AI).

Pruned models, like their quantized counterparts, offer improved speed and energy efficiency. For example, PruneBERT achieved a 97% reduction in weights while still retaining around 93% of its original accuracy, significantly speeding up inference times (Neural Magic - Software-Delivered AI). Similar to quantization, pruning requires careful tuning to avoid removing essential components of the model, particularly in complex tasks like natural language processing.

Pattern Adapters

Small Language Models (SLMs) are efficient because they can recognize patterns and avoid unnecessary recalculations, much like a smart thermostat learning your routine and adjusting the temperature without constantly checking with the cloud. This approach, known as adaptive inference, reduces computation, saving energy for more critical tasks and extending battery life.

Real-world Evidence on Pattern Adapters

Google Edge TPU: Google's Edge TPU enables AI models to perform essential inferences locally, eliminating the need for frequent cloud communication. By applying pruning and sparsity techniques, Google has demonstrated that models running on the Edge TPU can achieve significant reductions in energy consumption and processing time while maintaining high levels of accuracy (Deepgram). For example, in image recognition tasks, the TPU focuses on key features and skips redundant processing, leading to faster, more energy-efficient performance.
Apple’s Neural Engine: Apple uses adaptive learning models on devices like iPhones to minimize computation and optimize tasks like facial recognition. This approach reduces both power consumption and cloud communication.
Dynamic Neural Networks: Research on dynamic networks shows up to 50% reduction in energy usage through selective activation of model layers based on input complexity. (Source: "Dynamic Neural Networks: A Survey" (2021))
TinyML Benchmarks: The MLPerf Tiny Benchmark highlights how power-aware models can use techniques like pattern reuse and adaptive processing to significantly reduce the energy footprint of AI models on microcontrollers (ar5iv). Models can leverage previously computed results, avoiding recalculation of redundant data and extending battery life on devices such as smart security cameras or wearable health monitors.
IoT Applications: A prime example of pattern adaptation is found in the Nest Thermostat, which learns user behaviors and adjusts temperature settings locally. By minimizing cloud interaction, it optimizes energy use without sacrificing responsiveness. SLMs can also adaptively adjust their learning rate based on the frequency of user interactions, further optimizing their power consumption. This local learning ability makes them ideal for smart home and industrial IoT devices that require constant adaptation to changing environments without the energy cost of continuous cloud access.

Real-world Evidence on Pattern Adapters: Google Edge TPU in Action

The Google Edge TPU is designed to perform high-efficiency AI inferences directly on edge devices, and it's an excellent case study to explore how pruning and sparsity techniques can optimize resource management. Let's take an example of image recognition on an IoT device equipped with Edge TPU.

Technical Implementation: Optimizing Image Recognition on Google Edge TPU

In this example, I will deploy a pruned and quantized model to recognize objects in a smart factory environment. The task is to identify defective parts on an assembly line using a camera feed, ensuring real-time detection without overwhelming the device’s computational resources.

**Prerequisites**: Ensure Python 3.7 or later, TensorFlow 2.x, TensorFlow Model Optimization Toolkit, and Edge TPU API are installed. Instructions can be found on their respective documentation pages.

Step 1: Model Pruning and Quantization

We'll start by using TensorFlow Lite to prune and quantize a pre-trained MobileNetV2 model. MobileNetV2 is well-suited for edge devices due to its lightweight architecture.

import tensorflow as tf
from tensorflow_model_optimization.sparsity.keras import prune_low_magnitude
from tensorflow_model_optimization.sparsity.keras import strip_pruning

# Load the pre-trained MobileNetV2 model
model = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=True)

# Define the pruning schedule
pruning_params = {
    'pruning_schedule': tf.keras.experimental.PruningSchedule.ConstantSparsity(0.50, begin_step=0)
}

# Apply pruning to the model
pruned_model = prune_low_magnitude(model, **pruning_params)

# Compile the pruned model
pruned_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fine-tune the pruned model with data to retain accuracy
pruned_model.fit(train_data, epochs=10)

# Strip the pruning wrappers for deployment
final_model = strip_pruning(pruned_model)

Once pruning is complete, the model size is significantly reduced, allowing it to fit more easily within the memory constraints of an edge device. Now, we proceed to quantization for further optimization.

# Convert the model to TensorFlow Lite format and apply quantization
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

# Save the quantized model
with open("mobilenet_v2_pruned_quantized.tflite", "wb") as f:
    f.write(quantized_model)

Step 2: Deploying the Quantized Model to Edge TPU

After quantization, we can deploy this model to the Google Edge TPU using the Edge TPU runtime. The inference engine efficiently runs the model with lower latency and power consumption.

First, compile the model using the Edge TPU Compiler:

edgetpu_compiler mobilenet_v2_pruned_quantized.tflite

Now, we can run inference using Python and the Edge TPU API:

import numpy as np
from tflite_runtime.interpreter import Interpreter
from pycoral.utils.edgetpu import make_interpreter

# Load the compiled model
interpreter = make_interpreter('mobilenet_v2_pruned_quantized_edgetpu.tflite')
interpreter.allocate_tensors()

# Load an image for testing
def preprocess_image(image_path):
    img = Image.open(image_path).resize((224, 224))
    return np.array(img, dtype=np.float32)

image = preprocess_image('defective_part.jpg')

# Perform inference
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]['index'], [image])
interpreter.invoke()

# Get the output
output = interpreter.get_tensor(output_details[0]['index'])
print("Inference result:", output)

In this case, the quantized and pruned MobileNetV2 runs on the Edge TPU, classifying images efficiently while using minimal power and memory resources. This makes it feasible to deploy similar AI models across multiple devices in a smart factory without requiring constant cloud connectivity or excessive energy consumption.

Energy Savings and Bandwidth Optimization

By deploying such optimized models directly on the edge, the smart factory setup reduces reliance on cloud services, cutting down both latency and bandwidth usage. The device only sends critical alerts to the cloud, such as notifications when a defect is detected, thus conserving bandwidth and lowering operational costs.

Classification Results (Example Output for Defective Part Detection):

Inference result:
[
    {"class": "Defective Part", "confidence": 0.92},
    {"class": "Non-Defective Part", "confidence": 0.05},
    {"class": "Unknown", "confidence": 0.03}
]

Key Metrics:

Pruning rate: 50% sparsity (50% of the weights removed)
Model size reduction: ~60% smaller after pruning and quantization
Latency: Reduced inference time from 150ms to 40ms on the Edge TPU
Energy consumption: Lower by 30% compared to an unoptimized model

Future Direction of SLMs for Resource Management

1. IoT Sensor Networks

SLMs deployed in IoT sensor networks can revolutionize resource usage by predicting activation patterns and managing data transmission more intelligently.

Energy Efficiency: Take soil moisture sensors in a smart farm, for example. Instead of constantly monitoring and sending data, these sensors could learn weather patterns and soil conditions. SLMs enable them to act only when necessary - such as before a predicted dry spell - saving energy and reducing the frequency of data transmission. This leads to more efficient water usage and extended battery life for the sensors.

2. Smart Home Devices

SLMs can make smart home devices truly live up to their "smart" reputation by learning user habits and optimizing operations without draining unnecessary power.

Example: A smart speaker with an embedded SLM could analyze the user’s speech patterns and adjust its wake word detection system accordingly. Instead of always listening at full power, the speaker could scale its resource usage based on the likelihood of hearing a command, conserving energy during times of lower activity. Similarly, thermostats powered by SLMs could predict when you are home, adjusting the temperature preemptively, all while reducing reliance on constant cloud checks.

3. Edge Gateways in Industrial Automation

In industrial environments, edge gateways are critical for processing and aggregating data from various sensors and machines. SLMs can enhance their efficiency by determining which data needs immediate attention and what can be processed later or offloaded to the cloud.

Bandwidth Optimization: Imagine a manufacturing plant with an edge gateway powered by an SLM. The gateway can predict critical events, like equipment failure, by analyzing vibration or temperature data. Only significant insights, such as early signs of a malfunction, are sent to the cloud for further analysis, reducing bandwidth usage and avoiding unnecessary data overload. This allows the plant to operate more efficiently, with faster decision-making at the edge and lower operational costs.

4. Smart Healthcare Devices

SLMs could improve wearable health monitoring devices, making them more resource-efficient while providing accurate data analysis. For example, a smart heart rate monitor embedded with an SLM can learn the user's regular heart rhythm and only transmit data when anomalies, such as arrhythmias, are detected, reducing unnecessary power usage and data transmission.

Energy Efficiency: Instead of constantly streaming data to the cloud, an SLM-powered device could predict potential health events and alert users or healthcare professionals only when necessary. This would extend battery life and minimize bandwidth usage, making the device more practical for long-term, real-time health monitoring.

By incorporating SLMs into these resource-constrained environments, industries ranging from agriculture to manufacturing can enjoy smarter, more efficient devices, leading to significant cost and energy savings.

Conclusion

Small Language Models (SLMs) are game-changers for resource management in edge computing. By using lightweight architectures and adaptive inference, SLMs enable smarter, more efficient devices across industries, from IoT sensor networks to smart homes and industrial automation. They optimize power, bandwidth, and processing without overwhelming resource-constrained devices, offering a scalable solution for real-time, AI-driven intelligence at the edge. As edge computing grows, SLMs will play a key role in making devices smarter and more energy-efficient, driving innovation across various sectors.

About the Author

Suruchi Shah

Suruchi Shah is a Staff Software Engineer with a specialization in machine learning infrastructure and distributed systems. Currently at LinkedIn, she leads the LLM Serving team, overseeing the infrastructure for large-scale inference workloads. Suruchi has an extensive background in building and scaling complex systems, previously driving the development of LinkedIn's Identity-as-a-Service platform and leading significant migrations, such as transitioning LinkedIn’s traffic to a new distributed graph database, Liquid, which resulted in millions in hardware savings. Her technical expertise spans multiple domains, including search infrastructure and graph databases, with previous roles at BloomReach, where she also contributed to innovative solutions in SolrCloud.

Show moreShow less

The InfoQ Newsletter

A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example

We protect your privacy.

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?