BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Articles Azure Data Lake Analytics and U-SQL

Azure Data Lake Analytics and U-SQL

Leia em Português

Key Takeaways

  • Azure Data Lake Analytics, along with Azure Data Lake Storage, is a key component of Microsoft’s Azure Data Lake solution. 
  • Currently, Azure Data Lake Analytics can be used for batch workloads only. For streaming and event processing workloads, alternate big data analytics solutions on Azure like HDInsight or Azure Databricks should be used.
  • Azure Data Lake Analytics introduces a new big data query and processing language called U-SQL.
  • U-SQL combines the concepts and constructs both of SQL and C#; the power of U-SQL comes from the simplicity and declarative nature of SQL with the programmatic power of C# including rich types and expressions.
  • U-SQL operates on unstructured data stored in files and provides a schematized view on top of it. It also provides a general metadata catalog system very similar to relational databases for structured data.
     

Even though big data and Hadoop technologies are more than a decade old now, big data and big data analytics are more relevant than ever. While the initial version of Hadoop was only able to handle batch workloads, now Hadoop ecosystem has tools for other use cases like structured data, streaming data, event processing, machine learning workloads and graph processing.

While Hadoop ecosystem has a bunch of tools like Hive, Impala, Pig, Storm, and Mahout to provide the complete set of features, newer data analytics framework like Spark have an integrated approach to handle different types of workloads.

Azure Data Lake Analytics, or ADLA, is one of the newer big data analytics engines. ADLA is Microsoft’s fully managed, on-demand analytics service on Azure cloud. Together with Azure Data Lake Storage and HDInsight, Azure Data Lake Analytics forms the complete cloud hosted data lake and analytics offering from Microsoft. Azure Data Lake Analytics introduces a new big data query and processing language called U-SQL. This article provides an overview of U-SQL language and how to use it in applications.

Azure Data Lake

Azure Data Lake is Microsoft’s data lake offering on Azure public cloud and is comprised of multiple services including data storage, processing, analytics and other complementary services like NoSQL store, relational database, data warehouse and ETL tools.

Storage Services

  • Azure Data Lake Storage or ADLS - Azure Data Lake Storage is a scalable cloud storage purposely built for analytics, based on open HDFS standard.
  • Azure Blob Storage – General purpose, managed object storage for Azure.

Analytics & Processing Services

  • Azure Data Lake Analytics or ADLA– Fully managed, on-demand analytics service on Azure cloud. Supports new U-SQL big data processing language apart from .NET, R and Python.
  • HDInsight– HDInsight provides managed Hadoop clusters running on Azure and is based on Hortonworks Data Platform (HDP) Hadoop distro. Supports Hadoop ecosystem tools including Spark, Hive, Map Reduce, HBase, Storm, and Kafka.
  • Azure Databricks– Managed serverless analytics service based on Azure Spark. Supports a Jupyter/ iPython/Zeppelin like Notebook experience, along with other communication features, and supports Scala, Python, R and SQL.

Complementary Services

  • Cosmos DB – The managed, serverless, multi modal NoSQL database service on Azure.
  • Azure SQL Database – A managed, relational database as a service/DBaaS on Azure.
  • Azure SQL Datawarehouse – Cloud-based Enterprise Data Warehouse (EDW) solution running on Azure. It uses familiar distributed systems and data warehousing concepts like Massively Parallel Processing (MPP), columnar storage, compression etc. to ensure fast performance for complex queries.
  • Azure Analysis Service – A fully managed analytics engine on Azure; helps to build semantic models on the cloud. It’s built on familiar SQL Server Analysis Server which is an on-premise analytics engine based on SQL Server. As of now, Azure Analysis Service only supports Tabular Models and does not support Multidimensional Models (remember cubes?).
  • Azure Data Factory – A cloud-based ETL and data integration service. It’s serverless and provides out-of-the-box connectors to 50+ cloud or on-premise systems/services like Azure Blob Storage, Cosmos DB, Azure SQL Database, on prem SQL Server/MySQL/PostgreSQL and even 3rd party services like SFDC, Dropbox etc. It can move data between cloud services, from on-premise systems to cloud or vice versa.  

Figure 1 below shows these various cloud offerings from Microsoft on Azure cloud.

[Click on the image to enlarge it]

Figure 1: Services in Azure Data Lake Offering

The big data and data lake-based application architecture on Azure cloud platform is shown below in Figure 2.

[Click on the image to enlarge it]


Figure 2: Typical big data/data lake/ETL/analytics architecture on Azure

U-SQL Introduction

U-SQL is the big data query and processing language for Azure Data Lake Analytics. It’s a new language created by Microsoft especially for Azure Data Lake Analytics. U-SQL combines SQL-like declarative language with the programmatic power of C#, including C# rich types and expressions. U-SQL provides the familiar big data processing concepts such as "schema on read", "lazy evaluation", custom processors and reducers. Data engineers who have previously used languages like Pig, Hive and Spark would find similarity with those. Developers with C# and SQL knowledge would find U-SQL easy to learn and start with.

 
Figure 3: How U-SQL relates to C# and SQL

Though U-SQL uses many concepts and keywords from SQL language, it’s not ANSI SQL compliant. It adds unstructured file handling capabilities using keywords like EXTRACT and OUTPUT.
Currently, ADLA and U-SQL can be used for batch processing only. It doesn’t provide stream analytics or event processing capability.

U-SQL Concepts and Scripts

  • U-SQL query and processing logic is written in files with ".usql" extension called U-SQL scripts. Visual Studio IDE or Azure portal could be used for authoring these scripts. A U-SQL project in Visual Studio contains multiple scripts, code behind files and related reference assemblies.

Figure 4 below shows a screenshot of a U-SQL project in Visual Studio IDE.

 
Figure 4: A U-SQL project in Visual Studio

  • U-SQL scripts follow the familiar Extract/Retrieve, Transform and Load/Output pattern (ETL) used by other big data languages like Pig or Spark. It can extract data from text files (both unstructured text files and semi structured files like JSON or XML) and tables.
  • U-SQL imposes a schema while retrieving unstructured data from files – this helps in performing SQL-like operations of retrieved data.
  • Rowset is the primary data structure of U-SQL. It’s used across for extracting data from input file/table, and performing transformation, as well as for writing to output destination. Rowsets are unordered which helps Azure Data Analytics Engine to parallelize the processing using multiple processing nodes.
  • U-SQL scripts can use types, operators and expressions from C#.
  • U-SQL scripts use SQL constructs like SELECT, WHERE, JOIN and other data definition (DDL) and data manipulation language (DML). All keywords must be written in upper case only.
  • Supports control flow constructs like IF ELSE, but do not support While or For loop.

[Click on the image to enlarge it]


Figure 5: Data flow in U-SQL script

What is required for U-SQL local development

Microsoft provides an emulator-like setup for trying U-SQL and Azure Data Lake on local machine or laptop. For this, three components are required:

  1. Visual Studio 2017 or 2019
  2. Azure SDK (version 2.7.1 or higher) which comes up with the client side SDKs to interact with Azure cloud services and required for storage, compute etc.
  3. Azure Data Lake and Stream Analytics Tools for Visual Studio (version 2.4), which is a plugin for local U-SQL and Azure Data Lake development. Once you install this, relevant Azure Data Lake Analytics (and other) project templates would be added in Visual Studio as shown below. Select U-SQL Project to start.

[Click on the image to enlarge it]


Figure 6: New project template screenshot

First U-SQL script

For the first U-SQL script, we would use a dataset that constitutes of the rating of restaurants in Bangalore, India. The raw data are in CSV files and have the following columns:

  • rest_id - Unique id of the restaurant
  • name - name of the restaurant
  • address - address of the restaurant
  • online_order - whether online ordering is available in the restaurant or not
  • book_table - whether table booking options are available or not
  • rate - overall rating of the restaurant out of 5
  • votes - total number of ratings for the restaurant
  • phone - phone number of the restaurant
  • location - neighborhood in which the restaurant is located
  • rest_type - type of restaurant (e.g. Casual Dining, Quick Bites, Delivery, Bakery, Dessert Parlor etc.)
  • favorite_dish_id - id of the most favorite dish of the restaurant

The below table shows the sample data.

[Click on the image to enlarge it]

Figure 7: Restaurant ratings table with sample data

The below script reads restaurant ratings data from a CSV file and writes the same data to a TSV file. It doesn’t use a transformation step yet.

// Script - RestaurantScript.usql
// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings =
    EXTRACT rest_id int,
            name string,
            address string,
            online_order bool,
            book_order bool,
            rate double,
            votes int,
            phone string,
            location string,
            rest_type string,
            favorite_dish_id int
    FROM "/Samples/Data/restaurants_ratings.csv"
    USING Extractors.Csv();

// No Transformation – extracted data is loaded as is to output file (TSV)
OUTPUT @restaurant_ratings
TO "/output/restaurants_out.tsv"
USING Outputters.Tsv();

The scripts would write the entire restaurant rowsets to output file in a tab separated format.

Note that C# datatypes are used here (e.g. string and not char/varchar as typically used in SQL). Not only can we use the datatypes of C#, but expressions and all the goodness of an expressive programming language.

U-SQL script with Transform Step

// Script - RestaurantScript.usql
// Variables for input and output file name and paths
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings =
    EXTRACT rest_id int,
            name string,
            address string,
            online_order bool,
            book_order bool,
            rate double,
            votes int,
            phone string,
            location string,
            rest_type string,
            favorite_dish_id int
    FROM @inputFile
    USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain headers

// Transformation Step: Columns are renamed and no of rows are filtered
@bestOnlineRestaurants =
    SELECT name.ToUpper() AS Name,    // Converting the names to uppercase
           rate AS Rating,
           online_order AS OnlineOrder,
           phone AS Phone,
           location AS Location,
           rest_type AS Category,
           favorite_dish_id AS FavoriteDishId
    FROM @restaurants_rating
    WHERE rate > 4 && online_order == true;

// Load transformed data to output file
OUTPUT @bestOnlineRestaurants
TO @outputFile
USING Outputters.Tsv(outputHeader:true); // Write column names/headers to output file

Extend U-SQL expression using custom code

U-SQL supports custom expressions written in C# code. The C# code resides in code behind files. Note in the below diagram each .usql file has an associated .usql.cs file where custom C# code resides.
 
Figure 8: U-SQL project with multiple script and code behind files

// Code behind C# file - RestaurantScript.usql.cs
namespace UsqlApp1
{
    public static class Helpers
    {
        public static string FormatRestaurantName(string name, string location, string restaurantType)
        {
            return name + " (" + restaurantType + ") - " + location;
   // Note that U-SQL does not yet support new C# 7.0 string interpolation
               // return $"{name} ( {restaurantType} ) - {location}";
        }
    }
}

// Script - RestaurantScript.usql
// Variables for input and output file name and paths
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings =
    EXTRACT rest_id int,
            name string,
            address string,
            online_order bool,
            book_order bool,
            rate double,
            votes int,
            phone string,
            location string,
            rest_type string,
            favorite_dish_id int
    FROM @inputFile
    USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain headers

// Transformation Step: Columns are renamed and no of rows are filtered
@bestOnlineRestaurants =
    SELECT USQLApp1.Helpers.FormatRestaurantName(name, location, rest_type) AS Name,
           rate AS Rating,
           online_order AS OnlineOrder,
           phone AS Phone,
           favorite_dish AS FavoriteDish
    FROM @restaurant_ratings
    WHERE rate > 4 && online_order == true;

// Load transformed data to output file
OUTPUT @bestOnlineRestaurants
TO @outputFile
USING Outputters.Tsv(outputHeader:true); // Write column names/headers to output file

U-SQL script performing joins

U-SQL supports joins between two different datasets. It provides Inner Join, Outer Join, Cross Join, etc.
In the following code snippet, we perform inner join between a restaurants dataset and dish ingredients dataset.

// Script - RestaurantScript.usql
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings = // Code not shown for brevity. Exact same code as above example

// Transformation Step: Columns are renamed and no of rows are filtered
@bestOnlineRestaurants = // Code not shown for brevity. Exact same code as above example

Now, we would need data about dishes and their ingredients. Though this data typically would be present in an external source, we would use an in-memory rowset here.

// Declare an in-memory rowset for dish ingredients containing dish id, name of dish and // ingredients.

@dish_ingredients =
    SELECT
        * FROM
              (VALUES
               (1, "Biryani", "Rice, Indian spices, Vegetables, Meat, Egg, Yoghurt, Dried Fruits"),
               (2, "Masala Dosa", "rice, husked black gram, mustard seeds, fenugreek seeds, salt, vegetable oil, potatoes, onion, green chillies, curry leaves, turmeric"),
               (3, "Cake", " sugar, butter, egg, cocoa, creme, salt")
              ) AS D(DishId, Dish, Ingredients);
// Perform an Inner Join between @bestOnlineRestaurants and @dish_ingredients rowset
@rs_innerJn = SELECT
        r.Name,
        r.Rating,
        i.Dish,
        i.Ingredients
FROM @bestOnlineRestaurants AS r
INNER JOIN @dish_ingredients AS i
ON r.FavoriteDishId == i.DishId;

// Write to output file
OUTPUT @rs_innerJn
TO @outputFile
USING Outputters.Tsv(outputHeader:true);

This returns the restaurants with higher ratings, along with the ingredient details of its favorite dish, which is retrieved by joining the restaurant details rowset with dish ingredient rowset through inner join.

[Click on the image to enlarge it]

Figure 9: U-SQL project with multiple script and code behind files

U-SQL script using built-in functions

U-SQL provides a host of built-in functions including aggregate functions, analytical functions, ranking functions etc. Below are few samples.

Type of Function     Example
Aggregate Functions AVG, SUM, COUNT, STDEV (Standard Deviation), MIN, MAX etc.
Analytical Functions FIRST_VALUE, LAST_VALUE, LAG, LEAD, PERCENTILE_CONT etc.
Ranking Functions RANK, DENSE_RANK, NTILE, ROW_NUMBER etc.

In the below script, we are using built-in aggregate functions like MIN, MAX, AVG and STDEV on for restaurant.

// Declare variables for input and output files
DECLARE @inputFile = "/Samples/Data/restaurants_raw_data.csv";
DECLARE @outputFile = "/output/restaurants_aggr.csv";

@restaurant_ratings =
    EXTRACT rest_id int,
            name string,
            address string,
            online_order bool,
            book_order bool,
            rate double,
            votes int,
            phone string,
            location string,
            rest_type string,
            favorite_dish_id int
    FROM @inputFile
    USING Extractors.Csv(skipFirstNRows:1);

@output =
    SELECT
           rest_type AS RestaurantType,
           MIN(rate) AS MinRating,
           MAX(rate) AS MaxRating,
           AVG(rate) AS AvgRating,
           STDEV(rate) AS StdDevRating
    FROM @restaurants_ratings
    GROUP BY rest_type;

// Write to output file
OUTPUT @output
TO @outputFile
USING Outputters.Csv(outputHeader:true);

U-SQL catalog

So far, we have focused on unstructured and semi-structured data being read from files and written to files. While one of U-SQL’s strengths is to operate on unstructured data stored in files and provide a schematized view on top of unstructured data, it can manage structured data. It provides a general metadata catalog system like Hive. Below is a list of primary objects supported by U-SQL:

  • Database: U-SQL supports databases similar to other big data systems like Hive.
  • Database Schema: Database schemas group related objects present under a Database, exactly like relational databases.
  • Tables and Indexes: Tables are containers to hold structured data. Tables can contain columns of different data types. Table data is stored in files. Tables provide additional advantages above just schematized views over unstructured files like indexing, partitioning table data into multiple buckets with each bucket backed up by a file.
  • Views: U-SQL views are of two types – (i) views that are based on a U-SQL table and (ii) views that point to a file and use EXTRACT to get the data.
  • Functions: Supports both scalar and table valued functions.
  • Procedures:  Procedures are similar to functions but they don’t return any value.
  • Assemblies: U-SQL supports storing .NET assemblies which extends U-SQL scripts with custom expression.

Now, let’s say in our restaurant rating example, we would like to further analyze restaurants with low ratings. To do so, we would like to move all the restaurants with less than a four rating to a U-SQL table for further analysis.

U-SQL database, tables and indexes

In the below example. we will create a U-SQL database which will be created within the database with a schema and index key. We are not creating a schema specifically here, so the table will be created under the default schema ‘dbo’ (Remember SQL Server?) inside the database.

The below code example shows how to create this table.

// Script - RestaurantScript.usql
DECLARE @inputFile = "/Samples/Data/restaurants_ratings.csv";
DECLARE @outputFile = "/output/restaurants_out.tsv";

// Data is extracted from input file (CSV) and stored in employees rowset variable
@restaurant_ratings = // Code not shown for brevity. Exact same code as above example

// Transformation Step: Filter only those restaurants with rating less than 4
@lowRatedRestaurants =
    SELECT rest_id AS RestaurantId,
           name AS Name,
           rate AS Rating,
           online_order AS OnlineOrder,
           phone AS Phone,
           location AS Location,
           rest_type AS Category,
           favorite_dish_id AS FavoriteDishId
    FROM @restaurants_ratings
    WHERE rate < 4;

// Insert Low rated restaurant details to U-SQL Catalog
// Create the database if it does not exist already
CREATE DATABASE IF NOT EXISTS RestaurantsDW;
USE RestaurantsDW;

// Drop the table if it exists
DROP TABLE IF EXISTS dbo.LowRatedRestaurants;

// Create the table by specifying the column schema and index
CREATE TABLE dbo.LowRatedRestaurants(
    RestaurantId int,
    Name string,
    INDEX idx
        CLUSTERED (Name DESC)
        DISTRIBUTED BY HASH(Name),
    Rating double,
    OnlineOrder bool,
    Phone string,
    Location string,
    Category string,
    FavoriteDishId int
);

// Insert the rowset data to the U-SQL table created just before
INSERT INTO dbo.LowRatedRestaurants
SELECT * FROM @lowRatedRestaurants;

U-SQL views

U-SQL views are similar to database views– they do not physically store the data and provide a view over data stored in table or files. Views could be based on table or based on an extraction over files.

The example script below shows how to create a view that’s based on an extraction.

USE DATABASE RestaurantsDW;
// Delete the View if it already exists
DROP VIEW IF EXISTS RestaurantsView;

// Create the View based on an extraction
CREATE VIEW RestaurantsView AS  
    EXTRACT rest_id int,
            name string,
            address string,
            online_order bool,
            book_order bool,
            rate double,
            votes int,
            phone string,
            location string,
            rest_type string,
            favorite_dish_id int
    FROM "/Samples/Data/restaurants_raw_data.csv"
    USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain headers

To run the view the following code is used:

@result = SELECT * FROM RestaurantsDW.dbo.RestaurantsView;
OUTPUT @result
TO "/output/Restaurants_View.csv"
USING Outputters.Csv();

U-SQL table valued functions (TVF)

U-SQL supports both scalar function and table valued function (TVF). Functions take zero to many arguments and return either a single scalar value or a table, which is a dataset comprised of columns and rows.

The below code snippet shows first how to create a TVF and then how to invoke it. It takes a single parameter and returns a table.

CREATE DATABASE IF NOT EXISTS RestaurantsDW;
USE DATABASE RestaurantsDW;

DROP FUNCTION IF EXISTS tvf_SearchRestaurants;

// Create the Table Valued Function that accepts Restaurant type as string
// and returns a table that contains matched restaurant details.
CREATE FUNCTION tvf_SearchRestaurants(@RestaurantType string)
RETURNS @searchRestaurants TABLE(rest_id int, name string, address string,
                                    online_order bool, book_order bool, rate double,
                                    votes int, phone string, location string,
                                    rest_type string, favorite_dish_id int)
    AS
BEGIN
    @allRestaurants =
        EXTRACT rest_id int,
            name string,
            address string,
            online_order bool,
            book_order bool,
            rate double,
            votes int,
            phone string,
            location string,
            rest_type string,
            favorite_dish_id int
        FROM "/Samples/Data/restaurants_raw_data.csv"
        USING Extractors.Csv(skipFirstNRows:1); // Skip first row which contain headers

    @searchRestaurants =
        SELECT *
        FROM @allRestaurants
        WHERE rest_type == @RestaurantType;

RETURN;
END;

Now let’s invoke the table valued function we just created and pass ‘Bakery’ as parameter– it would return all the restaurants which are of type Bakery.

OUTPUT RestaurantsDW.dbo.tvf_SearchRestaurants("Bakery")
TO "/output/BakeryRestaurants.csv"
USING Outputters.Csv();

Case Study

The following case study highlights using Azure Data Lake Analytics and U-SQL language in a multiyear, large, strategic digital transformation program. The customer, a large insurance major, over the year acquired multiple insurance companies and brokers, and as a result used multiple customer engagement systems for interacting with customers over email, text/SMS, web/mobile chat and calls (both inbound and outbound). Because of the fractured approach, it became very difficult for the customer to analyze customer interaction data.

While the customer embarked on a journey to build an omni channel platform and an integrated contact center for customer service over various channels (email, text, chat bot, contact center voice calls), their immediate tactical choice was to analyze data from various sources for email, text/SMS, chat and call logs.

An Azure Data Lake-based solution was developed to answer the immediate need of analyzing data from different systems, in different formats. Data from various source systems were moved to Azure Data Lake Store and were then analyzed using Azure Data Lake Analysis and U-SQL.

  1. Ingest – In the ingest phase, unstructured and structured data from two different sources (Email/Text/Chat data as well as Call Log) are moved to Azure using Azure Data Factory ETL service.
  2. Store – Raw data is stored on Azure Data Lake Storage/ADLS as flat files.
  3. Analyze – Various types of analysis including filtering, joins, aggregation, windowing etc. are performed in U-SQL.
  4. Model and Serve – Analyzed data is stored in structured tables for later consumption from Power BI/custom reports by user.

[Click on the image to enlarge it]


Figure 10: Azure Data Analytics Pipeline

Conclusions

Azure Data Lake Storage and Analytics have emerged as a strong option for performing big data and analytics workloads in parallel with Azure HDInsight and Azure Databricks. Though it’s still in its early days and lacks streaming and event processing capabilities, its power lies in the new U-SQL language which combines the simplicity and ubiquity of SQL with Mirosoft’s flagship, the powerful C# language. Also, Microsoft’s development tools like Visual Studio and local dev/test capability make it a powerful competitor in big data & analytics space.

About the Author

Aniruddha Chakrabarti has 19 years of experience spread across strategy, consulting, product development and IT Services. He has experience across functions including solution architecture, presales, technology architecture, delivery leadership and program management. As AVP of digital in Mphasis, Chakrabarti is responsible for presales, solution, RFP/RFI and Technology Architecture of large digital deals and programs. Prior to joining Mphasis he has played various leadership and architecture-focused roles at Accenture, Microsoft, Target, Misys and Cognizant. His focus areas include cloud, big data & analytics, AI/ML, NLP, IoT, distributed systems, microservices and DevOps.

 

Rate this Article

Adoption
Style

BT