Key Takeaways
- Service assurance is a pivotal element of cellular networks to meet the demands of emerging applications and use cases.
- It is critical to define the correct set of key performance indicators in a service assurance framework.
- Private LTE/5G networks typically have more stringent service assurance requirements.
- Orchestration and automation play a crucial role in service assurance.
- Standardization bodies are working toward defining the requirements and procedures for zero-touch network and service management (ZSM).
For cellular networks to meet the needs of emerging applications (IoT, AR/VR, V2X, etc.), service assurance is a critical success factor. Complete visibility into the network and automated recovery from customer-impacting issues are crucial to meeting customers’ demanding expectations and, from a service provider’s point of view, achieving the desired monetization.
This article discusses service assurance in the context of cellular networks, the additional needs that private networks pose, and how an end-to-end service assurance framework can be designed and developed for such networks.
Concept of service assurance
Service assurance in telecommunications is the ability to monitor the performance of a network, identify anomalies, and report them or take corrective action. It is based on the service level agreements (SLAs) agreed with customers/enterprises, the set of metrics or KPIs identified, and complete visibility and observability across the network, services, and applications to demonstrate that the SLAs have been met.
The SLA parameters need to be defined in terms of measurable metrics/KPIs. For instance, a QoS agreement covers latency, round-trip times, packet loss, etc., while QoE covers aspects such as service continuity when moving from one site to another.
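To make the idea of "measurable" SLA parameters concrete, here is a minimal Python sketch. The KPI names, the numeric limits, and the use of a handover success rate as a stand-in for service continuity are all illustrative assumptions, not values taken from any standard or operator SLA.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KpiTarget:
    """One measurable SLA term. Names and limits below are hypothetical."""
    name: str
    limit: float
    higher_is_better: bool = False  # e.g. success rates; latency is lower-is-better

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.limit
        return measured <= self.limit


def evaluate_sla(targets: List[KpiTarget], measurements: Dict[str, float]) -> List[str]:
    """Return the names of KPIs that violate their target.

    A KPI with no measurement is treated as a violation, because the SLA
    cannot be demonstrated for it.
    """
    violations = []
    for target in targets:
        value = measurements.get(target.name)
        if value is None or not target.is_met(value):
            violations.append(target.name)
    return violations


# Hypothetical SLA: three QoS terms and one QoE proxy for service continuity.
SLA = [
    KpiTarget("latency_ms", 20.0),
    KpiTarget("round_trip_ms", 40.0),
    KpiTarget("packet_loss_pct", 0.1),
    KpiTarget("handover_success_pct", 99.0, higher_is_better=True),
]

sample = {
    "latency_ms": 12.5,
    "round_trip_ms": 55.0,
    "packet_loss_pct": 0.05,
    "handover_success_pct": 99.4,
}
violations = evaluate_sla(SLA, sample)
print(violations)  # ['round_trip_ms']
```

Treating a missing measurement as a violation is a design choice in this sketch: it reflects the point that an SLA has to be demonstrated, not just assumed.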
Further, a service assurance framework is needed to measure and monitor these metrics/KPIs and to take corrective/preventive actions. This requires adding hooks/plugins in the network to capture real-time data from various network entities. The end-to-end service assurance framework integrates the monitoring data (faults, alarms, statistics, and other performance measures) and leverages this data to identify anomalies.
If the service assurance framework only exposes the information, it is called open-loop: a discovered anomaly is reported to a central location, and someone else needs to act on it. Most systems today, with the help of automation in the instantiation and provisioning of network entities, allow automatic healing or recovery from these anomalies.
AI models go one step further, predicting usage or failures and planning changes in the network accordingly. If the service assurance framework integrates AI-driven models that learn from the captured/monitored information and take corrective/predictive actions, it is called closed-loop service assurance.
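The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, metric names, and the `scale_out_upf` action are hypothetical, and a fixed threshold check stands in for the AI-driven detection described above.

```python
from typing import Callable, Dict, List


def detect_anomalies(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of metrics above their threshold (stand-in for a learned model)."""
    return sorted(
        name for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )


def open_loop(metrics: Dict[str, float], thresholds: Dict[str, float],
              report: Callable[[str], None]) -> List[str]:
    """Open-loop: only expose the anomaly; someone else has to act on it."""
    anomalies = detect_anomalies(metrics, thresholds)
    for name in anomalies:
        report(f"ALERT {name}={metrics[name]}")
    return anomalies


def closed_loop(metrics: Dict[str, float], thresholds: Dict[str, float],
                remediations: Dict[str, Callable[[], str]],
                report: Callable[[str], None]) -> List[str]:
    """Closed-loop: act automatically, and report only what cannot be handled."""
    actions = []
    for name in detect_anomalies(metrics, thresholds):
        fix = remediations.get(name)
        if fix is None:
            report(f"UNHANDLED {name}={metrics[name]}")
        else:
            actions.append(fix())
    return actions


# Hypothetical measurements, thresholds, and remediation hook.
metrics = {"latency_ms": 35.0, "packet_loss_pct": 0.02}
thresholds = {"latency_ms": 20.0, "packet_loss_pct": 0.1}
alerts: List[str] = []

print(open_loop(metrics, thresholds, alerts.append))    # ['latency_ms']
print(closed_loop(metrics, thresholds,
                  {"latency_ms": lambda: "scale_out_upf"},
                  alerts.append))                       # ['scale_out_upf']
```

In a real deployment, the remediation callables would call into an orchestrator, and the detection step would be where a trained model replaces the static thresholds.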
Service assurance in private networks
While service assurance is important for all kinds of cellular networks, a well-implemented service assurance model is critical for private networks. By adding more observability in the network and more automation techniques, mobile operators have been faring better regarding service degradation and unavoidable outages. But for private networks, the expectation is real-time detection and prediction of service degradation and extremely fast reaction times. The five-nines (99.999%) availability expected of private networks requires the highest degree of automation in the life cycle management of the various network functions, and smart Artificial Intelligence (AI) models to identify or predict customer-impacting issues.
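To see why such an availability target leaves no room for manual recovery, it helps to convert it into a downtime budget. The short Python calculation below assumes the target is read as "five nines" (99.999%) and uses a 365-day year. Both are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Maximum downtime allowed in the period for a given availability target."""
    return period_minutes * (1.0 - availability_pct / 100.0)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} min/year")
```

At 99.999%, the budget is roughly 5.26 minutes per year, so a single outage that waits for a human to notice and react can use up the entire annual allowance.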
Network topology changes also need to be detected automatically, and intelligent SDN-based algorithms will need to be built for this purpose. Apart from these closed-loop techniques, it’s crucial for service providers offering private networks to give their customers complete visibility. Comprehensive reporting portals in central locations must be accessible to customers paying for private networks.
Common metrics of interest
For cellular networks in general, the metrics of interest fall into two broad categories: metrics from the network functions and metrics from the infrastructure hosting them. Examples of metrics from the network functions include RAN UE throughput, downlink latency in the gNB-DU, device handover rate, utilization of the transport/backhaul links, the mean number of PDU sessions per network and per network slice instance, the health of the 5G core network functions, and the load on the user plane functions.
At the infrastructure level, you can add monitoring for compute, storage, and network utilization and scale the infrastructure accordingly. You will need multiple monitoring points to get exhaustive data: radio analysis applications, monitoring agents for the fiber or transport networks, collectors for transaction count and load at the control plane and user plane functions, etc.
Role of 5G network functions
Service assurance will involve managing and monitoring all domains in a 5G network. This requires triggering collection mechanisms at different levels. For instance, some data sources offered in 5G include the following:
- Access and Mobility Management Function (AMF) - capturing UE mobility event notifications and location change notifications
- NG-RAN - capturing QoS Parameter Notifications
- User Plane Function (UPF) - capturing the packet buffering rates
Apart from these network status and performance monitoring functions, other analytics functions take reactive/predictive actions on the collected data. On the RAN side, analytics are generally supported by SON algorithms in legacy RAN and by the RAN Intelligent Controller (RIC) in the emerging OpenRAN solutions. On the core side, the Management Data Analytics Service (MDAS) provides data analytics for the network. It can be deployed at different levels, for example, at the domain level (e.g., RAN, CN, NSSI) or in a centralized manner (e.g., PLMN level).
The Network Data Analytics Function (NWDAF) on the core side is responsible for collecting and analyzing data from UEs, other 5GC network functions, etc., and providing real-time analysis back to the other network and application functions. This network function was introduced in Release 15 of the 3GPP standards and is an evolved version of the RAN Congestion Awareness Function (RCAF) from previous 3GPP releases.
Generally, the NWDAF and service assurance solutions are run in an integrated manner. For closed-loop service assurance, you need both a powerful NWDAF and a containerized service assurance solution that provides network monitoring, optimization, and insights from the RAN to Core.
For example, an integrated NWDAF and service assurance solution can identify the network slice instance and create a slice utilization KPI per network slice instance. Other NF consumers, such as the Policy Control Function (PCF) or the Network Slice Selection Function (NSSF), can then subscribe to real-time or periodic notifications of the KPIs and receive a notification when a KPI exceeds a specified threshold.
The PCF uses these inputs from the NWDAF to adjust traffic policies and assign more resources when necessary, enabling the operator to manage the slices dynamically. This is one example of how the NWDAF helps achieve service assurance in 5G networks.
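The subscribe-and-notify interaction described above can be sketched as a toy in-memory model in Python. It is not the 3GPP service-based interface: the class names, method names, slice identifier, and session counts are hypothetical, chosen only to show a per-slice utilization KPI crossing a subscriber's threshold.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

ThresholdCallback = Callable[[str, float], None]  # (slice_id, utilization_pct)


class SliceAnalytics:
    """Toy stand-in for an NWDAF-style analytics function (hypothetical API)."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Tuple[float, ThresholdCallback]]] = defaultdict(list)

    def subscribe(self, slice_id: str, threshold_pct: float,
                  callback: ThresholdCallback) -> None:
        """Register a consumer to be notified when utilization exceeds the threshold."""
        self._subs[slice_id].append((threshold_pct, callback))

    def report_load(self, slice_id: str, used_sessions: int,
                    capacity_sessions: int) -> float:
        """Compute the slice utilization KPI and notify subscribers above threshold."""
        utilization = 100.0 * used_sessions / capacity_sessions
        for threshold, callback in self._subs[slice_id]:
            if utilization > threshold:
                callback(slice_id, utilization)
        return utilization


class PolicyConsumer:
    """Toy PCF-like consumer that records which slices need more resources."""

    def __init__(self) -> None:
        self.scale_requests: List[str] = []

    def on_threshold(self, slice_id: str, utilization_pct: float) -> None:
        self.scale_requests.append(slice_id)


analytics = SliceAnalytics()
pcf = PolicyConsumer()
analytics.subscribe("slice-embb-1", 80.0, pcf.on_threshold)

analytics.report_load("slice-embb-1", 600, 1000)  # 60% utilization, below threshold
analytics.report_load("slice-embb-1", 900, 1000)  # 90% utilization, consumer notified
print(pcf.scale_requests)  # ['slice-embb-1']
```

Keeping the threshold with the subscription, rather than in the analytics function, lets each consumer (for instance a PCF and an NSSF) react at a different utilization level for the same slice.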
Key players in service assurance - Orchestration and Automation
Providing real-time data and running analytics on it is only one part of the problem. To achieve the desired performance, reliability, and availability for private networks, this information must be integrated with end-to-end service orchestration frameworks (e.g., ETSI OSM, LF ONAP, etc.).
These frameworks offer interfaces to define policies as per customer SLA and accordingly carry out necessary configurations on the network elements. Further, these frameworks have inbuilt monitoring entities capable of retrieving/polling data from the network, identifying performance issues or failures, and taking corrective actions via auto-scaling and auto-healing operations, etc. Some of these frameworks additionally have AI/ML-based models to learn from the reported alerts or performance numbers and accordingly provide updates to the initially defined policies/configurations.
For these corrective/predictive actions to take place in the network in a closed-loop way, the core requirement is a high degree of automation in the instantiation/provisioning of services and network elements. Whenever there is service degradation, or a prediction of it, the system can then bring up an additional set of resources without any manual intervention. This ensures swift reaction to any anomaly in the network and is therefore pivotal to meeting the SLAs for private networks.
Closed-loop automation strategy with ONAP as an example
ONAP, a Linux Foundation project, is an open-source service orchestration platform offering closed control loop automation. It provides the automation needed to respond proactively to network and service conditions without human intervention, and thus plays a pivotal role in service assurance frameworks. The diagram below shows a high-level view of how ONAP offers closed-loop automation.
The following is the expected flow:
- As a first step, a network service is defined based on the resource needs of the application and the SLAs to be monitored. These resource needs are defined in close coordination between the application developer or the VNF provider and the infrastructure planning team.
- The next step is to define the policies in terms of the KPIs that will be monitored, the thresholds for these KPIs, and the actions that will be taken whenever these thresholds are hit.
- Then, within the deployment, at the application level or the infrastructure level, these KPI thresholds are distributed to the monitoring agents.
- The framework to monitor the stats related to these KPIs is already enabled, and the real-time stats are passed back to the policy managers.
- Based on these stats, the policy managers at the orchestrator or controller level take actions in terms of scaling, topology changes, or any other actions as per applications or VNFs.
- This corrective action is reactive, i.e., it is taken when the threshold has already been hit. These orchestrators also have AI/ML-based learning models that can use this data to train their own models, predict such events, and take preventive action accordingly. A simple example is predicting the traffic pattern and scaling the data-handling VNFs out or in accordingly.
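The flow above can be sketched in Python as a policy with thresholds, a reactive decision, and a predictive decision. This is a simplified illustration and not ONAP's policy model: the class and function names, the `upf_load_pct` KPI, and the threshold values are hypothetical, and a naive linear extrapolation stands in for the AI/ML-based prediction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy: one KPI, two thresholds, and instance bounds."""
    kpi: str
    scale_out_above: float
    scale_in_below: float
    min_instances: int = 1
    max_instances: int = 10


def decide(policy: ScalingPolicy, kpi_value: float, instances: int) -> int:
    """Return the desired instance count for the given KPI value."""
    if kpi_value > policy.scale_out_above and instances < policy.max_instances:
        return instances + 1
    if kpi_value < policy.scale_in_below and instances > policy.min_instances:
        return instances - 1
    return instances


def forecast_next(history: List[float]) -> float:
    """Naive linear extrapolation: last value plus the average step in the window.

    A placeholder for a trained model; expects a non-empty history.
    """
    if len(history) < 2:
        return history[-1]
    avg_step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + avg_step


policy = ScalingPolicy(kpi="upf_load_pct", scale_out_above=80.0, scale_in_below=30.0)
history = [52.0, 64.0, 76.0]  # hypothetical load samples from the monitoring agents
current_instances = 2

reactive = decide(policy, history[-1], current_instances)
predictive = decide(policy, forecast_next(history), current_instances)
print(reactive, predictive)  # 2 3
```

With the latest sample at 76%, the reactive path does nothing because the 80% threshold has not been hit yet, while the predictive path forecasts 88% and scales out one step ahead of the breach.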
Most orchestrator solutions offer a similar framework to support closed-loop automation in any cellular network. Operators offering private networks additionally need a higher degree of automation in the configuration/provisioning of network functions and tighter thresholds for the metrics of interest. An ideal solution would require a completely zero-touch network and service management framework.
Zero-touch network and service management - ISG at ETSI
Closed-loop automation frameworks based on advanced AI/ML techniques lead toward a completely zero-touch network. A dedicated ISG at ETSI is responsible for exploring the detailed use cases and requirements for zero-touch network and service management. This ISG was established in 2017 and published the first draft of the relevant specifications in 2019, defining a future-proof, end-to-end operable framework, solutions, and core technologies for autonomous networks. These networks will be self-managed and self-organized, with no human intervention other than the initial business/network policies fed to the Zero-touch network and Service Management (ZSM) frameworks.
In summary, 5G and private networks are evolving rapidly, but the customer experience is a pivotal element of their success. While better speeds, ultra-low latency, and reliability are needed for advanced applications around analytics, digital twins, etc., these indirect benefits alone will not deliver real monetization from these networks. Setting up the proper service assurance infrastructure, with measurable KPIs and frameworks to monitor them and take preventive and corrective actions, is the real key to success.