
Examining the Internals of a Serverless Platform: Moving towards a ‘Zero-Friction’ PaaS

Key Takeaways

  • A common trend in application delivery over the last two decades has been a reduction in the time required to move from development to production
  • Platform-as-a-Service (PaaS) architectures simplify how applications are deployed for developers, and serverless architecture or Functions-as-a-Service (FaaS) is the next step in that direction 
  • Some of the benefits of the new emerging ‘serverless’ platforms include faster time to market, and easier management and operability
  • In a nutshell, a serverless platform requires application developers to think about and write business logic in the form of functions which are invoked when an event is dispatched to the system.
  • Even though serverless platforms have improved since their inception, developers still need to focus on latency planning, caching, configuration management, load balancing, traffic shaping, and operational visibility
  • Developers can leverage tools such as distributed cluster schedulers, containers, service discovery, and software load balancers to build a lambda-like platform
  • The serverless platforms space is ripe for innovation, and it provides interesting new challenges in areas such as data stores, traffic shaping, messaging systems, cluster schedulers and application runtimes
  • The authors are in the process of building a serverless batch platform for NASA Earth science missions through the OpenNEX platform on AWS

A common trend in application delivery over the last two decades has been a reduction in the time required to move from development to production. Moreover, on the operational side of the spectrum, the demand for higher availability, predictable latency and increased throughput of services has significantly increased. Platform-as-a-Service (PaaS) architectures simplify how applications are deployed for developers and promise a ‘zero-touch’ operational experience, thereby making application deployment faster and easier to scale. Serverless architecture or Functions-as-a-Service (FaaS) is the next step in that direction, where the underlying platform alleviates most of the operational concerns of a PaaS, thereby reducing ‘friction’ between development and operation. This article explores how serverless technology can be utilised to make this move from ‘zero-touch’ to ‘zero-friction’ possible.

Traditional PaaS versus Serverless

Traditional PaaS architectures involved building services that were operationally static with respect to their runtimes, load balancing, logging, and configuration strategies, with the Twelve-Factor App being the most popular architectural pattern. PaaS was a great leap forward from deploying snowflake applications, providing consistency around configuration, operational visibility, operability, scaling heuristics, etc.

Serverless architecture or Functions-as-a-Service is the next step in that direction, where the underlying platform alleviates most of the operational concerns of a PaaS. Services like AWS Lambda, Google’s Cloud Functions or Azure Functions allow developers to write just the request-processing logic in a function, while other aspects of the stack, such as middleware and bootstrapping, and operational concerns, such as scaling and sharding, are handled automatically.

Some of the benefits of the new emerging ‘serverless’ platforms are:

  1. Faster time to market - Serverless platforms allow product engineers to innovate at a rapid pace by abstracting away the problems of systems engineering into the underlying platform. Also, common concerns of an internet-facing application, such as identity management and storage, are exposed to the functions as services or handled by the underlying middleware. Product engineers can concentrate on developing the actual business logic of the application.
  2. Easier management and operability - Serverless platforms provide a clear separation between infrastructure services and the applications running on top of the platform. System engineers and SREs can focus on managing and running the underlying platform and core services such as databases and load balancers, while product engineers manage the functions running on top of the platform.

Event-driven Function-as-a-Service (FaaS)

In a nutshell, a serverless platform requires application developers to think about and write business logic in the form of functions which are invoked when an event is dispatched to the system. Event streams are central to serverless architectures, especially in AWS’s Lambda implementation. Any interaction with the platform, such as a user’s request, or any mutation of state, such as updating an object in the data store, generates events, which are streamed into user-defined functions that process the events and handle any domain-specific concerns.

On receiving an HTTP request from a user, the AWS API Gateway generates an event which invokes an instance of a Lambda function that handles HTTP request events, and the API Gateway responds to the request with the response from the Lambda function. Similarly, if a file is written to a data store or object store like DynamoDB or S3, AWS Lambda generates an event which invokes a Lambda function, and that function can then process the data object asynchronously.
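As a concrete illustration, the sketch below shows what such handlers might look like in Python on AWS Lambda. The response shape follows the API Gateway proxy-integration convention and the S3 notification record layout, but the function names and the specific payload fields read here are assumptions about how the event sources are configured.

```python
import json

# Invoked by API Gateway when an HTTP request event arrives.
# With the proxy integration, the handler returns a dict that
# API Gateway translates back into an HTTP response.
def http_handler(event, context):
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Invoked asynchronously when an object is written to S3.
# The S3 notification event carries one or more records describing
# the bucket and key that changed.
def s3_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Process the new object here, e.g. fetch it and transform it.
        print(f"new object written: s3://{bucket}/{key}")
```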

The principal aspects of a serverless platform are:

  1. No need for a server to handle requests - Application developers don’t need to worry about writing servers which accept and respond to user requests. Although threaded servers like Apache HTTPD and Tomcat and evented servers like Netty have been around for a long time, writing high-throughput application servers has always presented its own set of challenges. FaaS, by comparison, does not require users to provide an application server to respond to requests. Events in a serverless platform are handled by ephemeral functions that are processes spawned by the underlying middleware.
  2. RPC-less - Microservices/SOA-based services rely on RPC for fanning out to hundreds of services across the network to respond to user requests. On one side of the spectrum are load balancers like HAProxy that treat data streams as opaque byte streams with limited configurability; on the other are RPC libraries like Finagle and gRPC which provide extreme configurability around load balancing, request processing, etc, but are quite complex and require a lot of experience to use. With FaaS, RPC is handled transparently by the platform when one Lambda function invokes another Lambda function. A lambda function usually uses the SDK provided by the platform provider to invoke another lambda function, or emits an event which is asynchronously fed into another lambda function, as shown in the sketch after this list.
  3. Topology and Supervision Free - Network topology is an important factor to consider while deploying any modern internet-scale application. Clusters have to be spread across failure domains such as racks and switches in data centers (DCs) in order to achieve higher availability. The platform also needs to react when there is an outage of a unit of compute - which could be a node, a rack or an entire DC. Cluster schedulers such as Nomad, Kubernetes and Mesos provide features like cluster supervision and high-availability (HA) environments out of the box and allow for the deployment of highly scalable services. One such example of a highly scalable service is the MapReduce infrastructure at Google, which is built on top of the Borg cluster scheduler (PDF download). Serverless platforms alleviate an application developer’s concerns about placement details and process supervision. The underlying platform invokes a function whenever an event has to be handled, in turn reducing operational complexity.
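The sketch below, referenced from the ‘RPC-less’ point above, shows the two invocation styles in Python using boto3: a synchronous call that waits for the downstream response, and an asynchronous ‘Event’ invocation that simply hands the payload to the platform. The function names are hypothetical.

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    payload = {"order_id": event.get("order_id")}

    # Synchronous invocation: the call blocks until the downstream
    # function returns; routing is handled entirely by the platform.
    response = lambda_client.invoke(
        FunctionName="process-order",          # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps(payload),
    )
    result = json.loads(response["Payload"].read())

    # Asynchronous invocation: the event is queued by the platform and
    # the downstream function runs independently of this one.
    lambda_client.invoke(
        FunctionName="audit-order",            # hypothetical function name
        InvocationType="Event",
        Payload=json.dumps(payload),
    )
    return result
```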

However, as the article “There's Just No Getting around It: You're Building a Distributed System” from ACM Queue points out, there is just no getting around the inherent complexities of distributed systems while building internet scale applications.

Most mission critical services provide the following service level agreements (SLAs) to the end consumers:

  1. Predictable throughput
  2. High availability of core functionalities
  3. Tolerable latency
  4. Reliability of core functionalities

Companies like Netflix, Google, and Facebook have invested significantly in this area during the course of building modern platforms for their consumer-facing services. Each of these companies has a proven track record for quality of service despite running on commodity hardware and networks.

Serverless Operational Concerns

Even though serverless platforms such as AWS Lambda have improved since their inception, we still need to focus on some of the problems we have already solved in existing platforms.

The most relevant ones are:

Latency Planning - Synchronous user requests usually require responding within a predictable time period. If a user request involves invoking lambda functions in a chained manner, it is important that we factor in the average time, and its variance, that the middleware takes to do its work, such as invoking functions and emitting events. This is hard since the underlying platform is opaque to the application developer.

Existing cloud platforms like NetflixOSS solve these kinds of problems by using request-cancellation techniques in their software load balancer Ribbon. An analogous implementation in a serverless platform would entail a function timing out the invocation of another function once it exceeds a deadline.
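A rough equivalent in Python with boto3 is to put an explicit deadline on the downstream invocation via the client’s read timeout and to cap retries, so a slow function fails fast instead of consuming the caller’s whole latency budget. The timeout values and function name below are illustrative assumptions.

```python
import json
import boto3
from botocore.config import Config

# Budget roughly 500 ms for the downstream call and do not retry,
# so a slow dependency cannot blow the caller's latency budget.
# The exact numbers depend on the SLA and are illustrative only.
deadline_config = Config(
    connect_timeout=1,
    read_timeout=0.5,
    retries={"max_attempts": 0},
)
lambda_client = boto3.client("lambda", config=deadline_config)

def handler(event, context):
    try:
        response = lambda_client.invoke(
            FunctionName="pricing-service",   # hypothetical downstream function
            InvocationType="RequestResponse",
            Payload=json.dumps(event),
        )
        return json.loads(response["Payload"].read())
    except Exception:
        # Deadline exceeded or invocation failed: fall back to a default.
        return {"price": None, "degraded": True}
```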

Caching - There are various forms of caching, from external caches like memcached to application in-memory caches and clustered in-process caches such as groupcache. Caching choices become limited based on the guarantees the serverless platform makes with respect to durability of the instance of a function which responds to an event.

For example, on Lambda there is no guarantee that the same instance of a function will respond to subsequent events, so in-process caches such as groupcache might not work well, necessitating the use of external caches like ElastiCache.
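A minimal sketch of the trade-off: module-level state survives only as long as a particular function instance stays warm, so it works as an opportunistic cache at best, while an external cache such as Redis (the protocol ElastiCache speaks) behaves consistently across instances. The hostnames, keys and data-store lookup here are placeholders.

```python
import os
import json
import redis  # external cache client; ElastiCache exposes the Redis protocol

# Module-level dict: persists across invocations only while this particular
# function instance stays warm, so treat it as an opportunistic cache.
_local_cache = {}

# External cache shared by every instance of the function.
# The endpoint is a placeholder for an ElastiCache/Redis host.
_redis = redis.Redis(host=os.environ.get("CACHE_HOST", "localhost"), port=6379)

def get_user_profile(user_id):
    if user_id in _local_cache:
        return _local_cache[user_id]

    cached = _redis.get(f"profile:{user_id}")
    if cached is not None:
        profile = json.loads(cached)
    else:
        profile = load_profile_from_db(user_id)   # hypothetical data-store lookup
        _redis.setex(f"profile:{user_id}", 300, json.dumps(profile))

    _local_cache[user_id] = profile
    return profile

def load_profile_from_db(user_id):
    # Placeholder for the real data-store call.
    return {"id": user_id}
```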

Configuration management - Configuration can be largely divided into two classes: dynamic configuration and static configuration. Static configuration is usually pushed via configuration management systems or baked into applications or machine images, and takes effect when an application process starts. Dynamic configuration consists of runtime parameters which operators override while an application is already running, changing how the application behaves. Most application configuration libraries or management techniques assume a running process, and so they do not work well with serverless platforms.

Application or environment-related configuration has to be stored in external data stores or caches, from which the lambda function retrieves it when invoked.
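One common pattern on AWS, sketched below, is to read configuration from an external store such as SSM Parameter Store at invocation time and memoize it for as long as the instance stays warm. The parameter names are hypothetical.

```python
import boto3

_ssm = boto3.client("ssm")
_config_cache = {}

def get_config(name):
    """Fetch a configuration value from SSM Parameter Store, memoizing it
    for the lifetime of this (warm) function instance."""
    if name not in _config_cache:
        response = _ssm.get_parameter(Name=name, WithDecryption=True)
        _config_cache[name] = response["Parameter"]["Value"]
    return _config_cache[name]

def handler(event, context):
    # Hypothetical parameter names; operators can change dynamic values
    # in the parameter store without redeploying the function.
    feature_enabled = get_config("/myapp/feature-x-enabled") == "true"
    db_endpoint = get_config("/myapp/db-endpoint")
    return {"feature_enabled": feature_enabled, "db": db_endpoint}
```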

Load Balancing - RPC is simply opaque on Lambda architectures since invoking another Lambda function synchronously usually involves using the SDK. There has been a lot of work in RPC libraries like Finagle and Ribbon to improve reliability, latency, and throughput of RPC requests by using techniques for managing outgoing RPC requests such as request cancellation, parallel requests, bulkheads and batching.

The AWS Step Functions service is a step in the right direction here, but the FaaS space still needs to mature in this regard in order to ensure the same degree of reliability in handling synchronous requests which need to fan out to multiple downstream services.

Traffic shaping - It is very common to serve internet-scale applications from more than one geographic region. For example, the Netflix service is served from three different geographic regions within AWS. One of the prerequisites for a multi-region architecture is the ability to divide traffic across the different regions so that users can be routed to a nearby DC and the workload is balanced appropriately across all regions. It is also important that operators retain the ability to shut down a region and steer all the traffic to another region if there is an outage or degradation in a region, which historically have not been uncommon. There are many ways of achieving this, such as dividing the continents using Geo DNS and software load balancers such as Zuul.

In a serverless architecture we lose some flexibility in this space, since user requests are handled by an opaque load balancer such as the AWS API Gateway, and the underlying platform usually chooses a lambda function deployed in the same region in which the request arrives. So when there are outages in caching or data store services, it is hard to steer part of the traffic to other regions. The coupling between a lambda region and its caches and data stores is inherent in the application but not apparent to FaaS providers. With DNS-based traffic shaping it is still possible for serverless environments to move traffic, and this is indeed the most common way of steering traffic to date (versus the use of proxying load balancers).
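With DNS-based traffic shaping, steering is essentially a matter of adjusting record weights. The sketch below uses Route 53 weighted records to drain one region and send traffic to another; the hosted-zone ID, record names, targets, and weights are all placeholders, and a real runbook would also account for DNS TTLs and client caching.

```python
import boto3

route53 = boto3.client("route53")

def set_region_weight(zone_id, record_name, region_id, target, weight):
    """Adjust the weight of one region's weighted CNAME record.
    Setting a region's weight to 0 drains it; raising another region's
    weight steers the traffic there. All identifiers are placeholders."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,                 # e.g. api.example.com
                    "Type": "CNAME",
                    "SetIdentifier": region_id,          # e.g. "us-east-1"
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }]
        },
    )

# Drain us-east-1 and send everything to eu-west-1 during an outage.
set_region_weight("Z123EXAMPLE", "api.example.com", "us-east-1",
                  "gateway-us-east-1.example.com", 0)
set_region_weight("Z123EXAMPLE", "api.example.com", "eu-west-1",
                  "gateway-eu-west-1.example.com", 100)
```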

Operational Visibility - Debugging distributed systems is challenging and usually requires access to a significant amount of relevant metrics to zero in on the root cause. Distributed tracing is a helpful technique for understanding how a request fans out across multiple services, and it aids in debugging systems based on the microservices architecture. Due to the nascency of the serverless architecture we have yet to see mature debugging tools, and platforms need to expose more operational metrics than they do today.

Looking at the Anatomy of a Serverless Platform

Currently, the leading serverless platforms - AWS Lambda, GCP Cloud Functions and Azure Functions - are all hosted services on the public cloud, but similar platforms could be built for a private cloud as well. We can leverage tools such as distributed cluster schedulers, containers, service discovery, and software load balancers to build a lambda-like platform.

There are many ways of building a serverless platform; however, the backbone usually comprises the following services:

  1. Code or function delivery mechanism
  2. Request routing
  3. Cluster scheduler and QOS guarantees
  4. Auto-scaling engine for scale out/in of services
  5. Operational insights

[Figure: A generic serverless platform]

Function Delivery

Functions that are written by application developers need to be deployed on compute clusters before they can be invoked. The functions also need to be wrapped in a server or an event processor that can receive the events and invoke the function. The build process would typically handle the step of making the user functions part of an event processor or a server process. Once the code is built, there are many ways of actually packaging it - OS packages, tarballs, Docker containers or root-fs archives such as the ones used by LXC and LXD. The methodology we choose here would depend on how we actually run the processes on the compute nodes.

The language and framework choices used to wrap Lambda functions in the event processors could influence the runtimes that are available to application developers. Needless to say, the user functions have to be written with a keen understanding of the underlying platform. For example, if the underlying platform is only capable of invoking functions written in Node.js, users can only program in JavaScript targeting a specific VM and using the Node SDK. The underlying runtime could also have a whitelisted set of APIs for enhanced security and compatibility with the platform.
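As a rough illustration of the wrapping described above, the sketch below shows a minimal event processor that loads a user-supplied Python function by name and invokes it for each event pulled from a queue. This is a generic sketch under assumed conventions (the ‘module:function’ spec, the queue-based event feed), not the mechanism any particular provider uses.

```python
import importlib
import json
import traceback

def load_user_function(spec):
    """Resolve a user function from a 'module:function' spec,
    e.g. 'orders.handlers:process_order' (hypothetical)."""
    module_name, func_name = spec.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

def run_event_loop(function_spec, event_queue):
    """Pull events off a queue-like object and hand each one to the user
    function. A real platform would add batching, timeouts, metrics,
    retries and process supervision around this loop."""
    user_function = load_user_function(function_spec)
    while True:
        event = event_queue.get()
        if event is None:          # sentinel used here to stop the loop
            break
        try:
            result = user_function(event)
            print(json.dumps({"status": "ok", "result": result}))
        except Exception:
            # Report failures so the platform can retry or dead-letter the event.
            print(json.dumps({"status": "error", "trace": traceback.format_exc()}))
```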

Request Routing

Request routing is very platform specific. The request router at the edge needs to be aware of the functions registered and needs to be able to dispatch events to the underlying middleware that ultimately invokes the functions. For synchronous requests, the router needs to be able to keep a connection open once a request is made and wait for the middleware to return with the response from the function handling the event.

eBay’s Fabio load balancer can dispatch requests dynamically to services. Fabio was initially built for load balancing services registered with a dynamic service discovery tool like Consul. Nginx with its Lua bindings offers a lot of flexibility as well; it is possible to extend Nginx to discover services from an external service discovery tool, or to transform the incoming request and hand it over to the underlying middleware.

Request routers can also add more redundancy to the platform by aiding in traffic shaping. Traffic shaping is a very important aspect of managing and operating services at the global scale. When a region or data center goes offline the norm is to transparently shift the user traffic from the affected region to another and aspire to not degrade the user experience. Serverless platforms should be built with cross-region/data-center operations in mind.

Cluster Scheduler and Events Controller

The cluster scheduler and the events controller, which we will also refer to as the middleware, are collectively the backbone of a serverless platform. The events controller is the service which receives the events from various sources and invokes the appropriate user-defined functions. It is very important for the events controller to provide at-least-once delivery of events for the consistency of the platform.
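Because delivery is at-least-once, function handlers may see the same event more than once. A common defence, sketched below for DynamoDB, is a conditional write keyed on a unique event ID so that duplicate deliveries become no-ops; the table and attribute names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def handle_event(event):
    """Process an event exactly once from the application's point of view,
    even though the middleware may deliver it more than once."""
    try:
        # Conditional put fails if this event ID has already been recorded.
        dynamodb.put_item(
            TableName="processed-events",                  # hypothetical table
            Item={"event_id": {"S": event["id"]}},
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery, already processed
        raise
    process(event)   # hypothetical domain-specific work

def process(event):
    print(f"processing event {event['id']}")
```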

The cluster scheduler supervises the function handlers that process the events or in some cases pre-scales the function handlers. The events controller and cluster scheduler should work in tandem to ensure that there is appropriate capacity to handle the flow of events at all times. In addition to that, the cluster scheduler has to ensure QOS and high availability of the function handlers so that the functions can process the events within acceptable limits of latency.

Usually serverless platforms are multi-tenant, which means the cluster schedulers have to provide resource isolation (memory, CPU, disk, and network) to the function handlers. Technologies like cgroups, namespaces, and ZFS play an important role here.

Cluster schedulers also provide high availability to the function handlers running on the platform by migrating functions from a failed node, rack, or datacenter to somewhere else on the cluster in the same region.

The telemetry generated by schedulers, such as the latency of starting function handlers, resource contention among functions in a multi-tenant system, and dispatch latencies, is very important and should be exposed to feedback control systems such as auto-scalers and to Site Reliability Engineers.

Autoscaling

A serverless platform should attempt to alleviate the need for capacity planning of compute, storage, and network resources for users. The platform should scale to the extent that it can handle the events being dispatched by the routers within reasonable response latencies.

There are various ways of achieving this; usually the middleware or the cluster scheduler pre-scales the function handlers based on the rate of events generated, and is thus able to handle all the events in a reasonable time. One such predictive autoscaler is Netflix’s Scryer - similar auto-scalers should be built for serverless platforms, which can consume the telemetry from a scheduler and scale the function handlers.
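A very small sketch of the idea, assuming the scheduler exposes the event arrival rate and the average handler duration: applying Little’s law gives the concurrency needed, which an autoscaler can pad with headroom before asking the scheduler for that many handlers. The parameter names and headroom factor are illustrative.

```python
import math

def desired_handler_count(events_per_second, avg_duration_seconds,
                          per_handler_concurrency=1, headroom=1.2):
    """Estimate how many function handlers are needed to keep up with the
    observed event rate. By Little's law, in-flight work = arrival rate x
    average processing time; headroom absorbs bursts and scheduler delay."""
    in_flight = events_per_second * avg_duration_seconds
    handlers = in_flight / per_handler_concurrency
    return max(1, math.ceil(handlers * headroom))

# Example: 800 events/s at 250 ms average handler duration
# -> about 200 concurrent executions, padded to 240 handlers.
print(desired_handler_count(800, 0.25))
```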

Operational Insights

There are two principal groups of people who interact with a serverless platform:

  • Application developers writing the functions
  • SREs operating the platform

It is important that the platform exposes the right set of telemetry to these two groups of people.

Application developers are usually concerned with the following metrics related to their functions:

  • Throughput of events - Number of events generated by the routers and other sinks of data pipelines such as databases and object stores.
  • Time to respond to an event - This metric is usually best expressed as a histogram, since that conveys the right amount of detail for understanding the long-tail P99 latency (see the sketch after this list).
  • Latency at the edge - The latency experienced by the end users of the applications running on the serverless platform.
  • Latency between functions - When synchronous requests are being responded to by a chain of functions, application developers should understand the latency between invoking the functions in the chain. That latency is usually incurred by the middleware or the request router in the underlying platform, but it helps application developers think about the maximum number of functions an event could invoke.
  • Distributed Tracing - The underlying platform should be able to trace the events flowing into the system as they invoke various function handlers.
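As referenced in the ‘time to respond to an event’ item above, a simple way to capture that detail is to publish per-invocation latencies as metrics from which the platform or a monitoring system can derive percentiles. The sketch below uses CloudWatch; the namespace, metric, and dimension names are hypothetical.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    start = time.monotonic()
    result = do_work(event)                     # hypothetical business logic
    elapsed_ms = (time.monotonic() - start) * 1000.0

    # Publish the per-event latency; the monitoring backend can then render
    # percentile statistics (p50/p99) over these samples.
    cloudwatch.put_metric_data(
        Namespace="MyServerlessApp",
        MetricData=[{
            "MetricName": "EventResponseTime",
            "Unit": "Milliseconds",
            "Value": elapsed_ms,
            "Dimensions": [{"Name": "Function", "Value": "my-function"}],
        }],
    )
    return result

def do_work(event):
    return {"ok": True}
```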

SREs benefit from looking at the system from a much higher level. Here are some of the most important metrics for operators:

  • Throughput of events being dispatched into the platform - This metric should be tagged with location constraints such as region, datacenter, etc. It usually indicates the load on a specific cluster and its health. Events should also be tagged with the functions they are invoking, since that often points to a particular tenant generating more load than desired on the platform.
  • Functions - The number of active functions on the cluster, and the rate at which they are being invoked. Startup latencies are also important here, and again they should be tagged with nodes, racks, data-centers, etc.
  • Distribution - Metrics related to function delivery should also be tracked, since often increased latencies of function invocation could be due to increased time of replication of function handler code across the cluster.

Concluding Remarks

Serverless platforms are far more approachable for application engineers and provide abstractions that promise to make application delivery faster than traditional PaaS platforms. The space is also ripe for innovation and we are going to see a lot of new entrants. It provides interesting new challenges in areas such as data stores, traffic shaping, messaging systems, cluster schedulers and application runtimes.

We are in the process of building a serverless batch platform for NASA Earth science missions through the OpenNEX platform on AWS. OpenNEX is an open version of the NASA Earth Exchange (NEX), a collaborative supercomputing platform for engaging and enabling Earth scientists to perform big data analytics. Data from NASA missions are processed with community-vetted algorithms on platforms such as NEX to answer various science questions.

New data products and scientific workflows are made available to the public through platforms such as OpenNEX, and they enable the community to develop better algorithms and/or create new products. Enabling such experimental products has been a challenge because the processing pipelines are stovepiped. The proposed technology provides a platform so scientists can simply use the setup, including existing functions, and complement it with their own input functions packaged via containers to do new science or generate new products.

About the Authors

Diptanu Gon Choudhury is an infrastructure engineer at Facebook. He works on infrastructure that powers large-scale distributed systems, such as service discovery, RPC and distributed tracing. He is one of the lead engineers of the Nomad distributed scheduler project, and prior to that he worked in the platform engineering group at Netflix, where he built an Apache Mesos framework for running tens of thousands of Linux containers on AWS.

Dr. Sangram Ganguly is a senior research scientist at the Bay Area Environmental Research Institute and at the Biospheric Science Branch at NASA Ames Research Center. Dr. Ganguly’s research interests span remote sensing, cloud computing, machine learning, high-performance computing, and advanced signal processing. He has received numerous awards and recognitions for his impactful contributions to Earth sciences, especially in developing a cloud-centric architecture, OpenNEX, for creating workflows and datasets that can be deployed on the cloud.

Andrew Michaelis is a software engineer and has contributed to the NEX, OpenNEX and TOPS projects at NASA Ames Research Center. He has been involved in developing large-scale science processing and data analysis pipelines, with an emphasis on remote sensing. Andrew has also co-authored several relevant peer-reviewed publications and received several outstanding awards.

Ramakrishna Nemani is a Senior Earth scientist with the NASA Advanced Supercomputing Division at Ames Research Center. He leads the development of NEX (NASA Earth eXchange), a collaborative computing platform for Earth system modeling and analysis.

 

 
