
The Data Science Mindset: Six Principles to Build Healthy Data-Driven Organizations


Key Takeaways

  • Most organizations struggle to unlock data science to optimize their operational processes and to get data scientists, analysts, and business teams speaking the same language: different teams and the data science process itself are often sources of friction.
  • The Healthy Data Science Organization Framework is a portfolio of methodologies, technologies, and resources that will help your organization become more data-driven, from business understanding, through data generation and acquisition and modeling, to model deployment and management.
  • In order to successfully translate the vision and business goals into actionable results, it's important to establish clear performance metrics.
  • Organizations need to think more organically about the end-to-end data flow and architecture that will support their data science solutions.
  • Based on Azure Machine Learning service, the team built a workforce placement recommendation solution that recommends optimal staff composition and individual staff with the right experience and expertise for new projects.

In the last few years, data from a myriad of different sources has become more available and consumable, and organizations have started looking for ways to use the latest data analytics techniques to address their business needs and pursue new opportunities. Not only has data become more available and accessible, there has also been an explosion of tools and applications that enable teams to build sophisticated data analytics solutions. For all these reasons, organizations are increasingly forming teams around the function of Data Science.

Data Science is a field that combines mathematics, programming, and visualization techniques and applies scientific methods to specific business domains or problems, like predicting future customer behavior, planning air traffic routes, or recognizing speech patterns. But what does it really mean to be a data-driven organization? 

In this article, both business and technical leaders will learn methods to assess whether their organization is data-driven and benchmark its data science maturity. Moreover, through real-world and applied use cases, they will learn how to use the Healthy Data Science Organization Framework to nurture a healthy data science mindset within the organization.  This framework has been created based on my experience as a data scientist working on end-to-end data science and machine learning solutions with external customers from a wide range of industries including energy, oil and gas, retail, aerospace, healthcare, and professional services. The framework provides a lifecycle to structure the development of your data science projects. The lifecycle outlines the steps, from start to finish, that projects usually follow when they are executed.

Understanding the Healthy Data Science Organization Framework

Being a data-driven organization implies embedding data science teams that fully engage with the business and adapting the operational backbone of the company (techniques, processes, infrastructure, and culture). The Healthy Data Science Organization Framework is a portfolio of methodologies, technologies, and resources that, if correctly used, will help your organization become more data-driven, from business understanding, through data generation and acquisition and modeling, to model deployment and management. This framework, as shown below in Figure 1, includes six key principles: 

  1. Understand the Business and Decision-Making Process 
  2. Establish Performance Metrics 
  3. Architect the End-to-End Solution 
  4. Build Your Toolbox of Data Science Tricks 
  5. Unify Your Organization’s Data Science Vision 
  6. Keep Humans in the Loop 

Figure 1. Healthy Data Science Organization Framework

Given the rapid evolution of this field, organizations typically need guidance on how to apply the latest data science techniques to address their business needs or to pursue new opportunities.

Principle 1: Understand the Business and Decision-Making Process

For most organizations, lack of data is not a problem. In fact, it’s the opposite: there is often too much information available to make a clear decision. With so much data to sort through, organizations need a well-defined strategy to clarify the following business aspects:

  • How can data science help organizations transform business, better manage costs, and drive greater operational excellence?
  • Do organizations have a well-defined and clearly articulated purpose and vision for what they are looking to accomplish?
  • How can organizations get support of C-level executives and stakeholders to take that data-driven vision and drive it through the different parts of a business?

In short, companies need to have a clear understanding of their business decision-making process and a better data science strategy to support that process. With the right data science mindset, what was once an overwhelming volume of disparate information becomes a simple and clear decision point. Driving transformation requires that companies have a well-defined and clearly articulated purpose and vision for what they are looking to accomplish. It often requires the support of a C-level executive to take that vision and drive it through the different parts of a business.

Organizations must begin with the right questions. Questions should be measurable, clear, and concise, and directly correlated to the core business. In this stage, it is important to design questions to either qualify or disqualify potential solutions to a specific business problem or opportunity. For example, start with a clearly defined problem: a retail company is experiencing rising costs and is no longer able to offer competitive prices to its customers. One of many questions to solve this business problem might be: can the company streamline its operations without compromising quality?

There are two main tasks that organizations need to address to answer those types of questions:

  • Define business goals: The data science team needs to work with business experts and other stakeholders to understand and identify the business problems.
  • Formulate the right questions: Companies need to formulate tangible questions that define the business goals that the data science teams can target.

Last year, the Azure Machine Learning team developed a recommendation-based staff allocation solution for a professional services company. By making use of Azure Machine Learning service, we built and deployed a workforce placement recommendation solution that recommends optimal staff composition and individual staff with the right experience and expertise for new projects. The final business goal of our solution was to improve our customer’s profit.

Project staffing is done manually by project managers and is based on staff availability and prior knowledge of individuals' past performance. This process is time-consuming, and the results are often sub-optimal. It can be done much more effectively by taking advantage of historical data and advanced machine learning techniques.

In order to translate this business problem into tangible solutions and results, we helped the customer to formulate the right questions, such as:

  1. How can we predict staff composition for a new project? For example, one senior program manager, one principal data scientist and two accounting assistants.
  2. How can we compute Staff Fitness Score for a new project? We defined our Staff Fitness Score as a metric to measure the fitness of staff with a project.

The goal of our machine learning solution was to suggest the most appropriate employee for a new project, based on the employee's availability, geography, project type experience, industry experience, and hourly contribution margin generated on previous projects. Azure, with its myriad of cloud-based tools, can help organizations build successful workforce analytics solutions that provide the basis for specific action plans and workforce investments: with the Azure cloud, it becomes much easier to gain unparalleled productivity with end-to-end development and management tools to monitor, manage, and protect cloud resources. Moreover, Azure Machine Learning service provides a cloud-based environment that organizations can use to prepare data and train, test, deploy, manage, and track machine learning models. Azure Machine Learning service also includes features that automate model generation and tuning to help you create models with ease, efficiency, and accuracy. These solutions can address gaps or inefficiencies in an organization's staff allocation that need to be overcome to drive better business outcomes. Organizations can gain a competitive edge by using workforce analytics to focus on optimizing the use of their human capital. In the next few paragraphs, we will see how we built this solution for our customer.

Principle 2: Establish Performance Metrics 

In order to successfully translate this vision and these business goals into actionable results, the next step is to establish clear performance metrics. In this second step, organizations need to focus on two analytical aspects that are also crucial for defining the data solution pipeline (Figure 2):

  • What is the best analytical approach to tackle that business problem and draw accurate conclusions?
  • How can that vision be translated into actionable results able to improve a business?

Figure 2. Data solution pipeline

This step breaks down into three sub-steps:

  1. Decide what to measure

Let’s take Predictive Maintenance, a technique used to predict when an in-service machine will fail, allowing its maintenance to be planned well in advance. As it turns out, this is a very broad area with a variety of end goals, such as predicting root causes of failure, predicting which parts will need replacement and when, and providing maintenance recommendations after a failure happens.

Many companies are attempting predictive maintenance and have piles of data available from all sorts of sensors and systems. But, too often, customers do not have enough data about their failure history, and that makes it very difficult to do predictive maintenance: after all, models need to be trained on such failure history data in order to predict future failure incidents. So, while it’s important to lay out the vision, purpose, and scope of any analytics project, it is critical that you start off by gathering the right data. The relevant data sources for predictive maintenance include, but are not limited to: failure history, maintenance/repair history, machine operating conditions, and equipment metadata. Let’s consider a wheel failure use case: the training data should contain features related to the wheel operations. If the problem is to predict the failure of the traction system, the training data has to encompass all the different components of the traction system. The first case targets a specific component, whereas the second case targets the failure of a larger subsystem. The general recommendation is to design prediction systems for specific components rather than larger subsystems.

Given the above data sources, the two main data types observed in the predictive maintenance domain are: 1) temporal data (such as operational telemetry, machine conditions, work order types, priority codes that will have timestamps at the time of recording. Failure, maintenance/repair, and usage history will also have timestamps associated with each event); and 2) static data (machine features and operator features, in general, are static since they describe the technical specifications of machines or operator attributes. If these features could change over time, they should also have timestamps associated with them). Predictor and target variables should be preprocessed/transformed into numerical, categorical, and other data types depending on the algorithm being used.
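To make this preparation step concrete, below is a minimal pandas sketch (not from the engagement described in this article; the file and column names are hypothetical) that joins time-stamped telemetry with static machine metadata and labels each reading that falls within seven days before a recorded failure:

    import pandas as pd

    # Hypothetical inputs, mirroring the data sources listed above:
    # telemetry.csv - time-stamped operational readings per machine
    # machines.csv  - static technical specifications per machine
    # failures.csv  - time-stamped failure events per machine
    telemetry = pd.read_csv("telemetry.csv", parse_dates=["timestamp"])
    machines = pd.read_csv("machines.csv")
    failures = pd.read_csv("failures.csv", parse_dates=["failure_time"])

    # Combine temporal and static data into a single feature table.
    features = telemetry.merge(machines, on="machine_id", how="left")

    # Label each reading whose machine fails within the next 7 days
    # (one common way to frame "will this machine fail soon?").
    failure_events = (failures.rename(columns={"failure_time": "timestamp"})
                              .assign(failure=1)
                              .sort_values("timestamp"))
    features = pd.merge_asof(
        features.sort_values("timestamp"),
        failure_events[["machine_id", "timestamp", "failure"]],
        on="timestamp",
        by="machine_id",
        direction="forward",               # next failure at or after this reading
        tolerance=pd.Timedelta(days=7),
    )
    features["will_fail_within_7d"] = features["failure"].fillna(0).astype(int)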

  2. Decide how to measure it

Thinking about how organizations measure their data is just as important, especially before the data collection and ingestion phase. Key questions to ask for this sub-step include:

  • What is the time frame?
  • What is the unit of measure?
  • What factors should be included?

A central objective of this step is to identify the key business variables that the analysis needs to predict. We refer to these variables as the model targets, and we use the metrics associated with them to determine the success of the project. Two examples of such targets are sales forecasts or the probability of an order being fraudulent.

  3. Define the success metrics

After the key business variables have been identified, it is important to translate your business problem into a data science question and define the metrics that will determine your project's success. Organizations typically use data science or machine learning to answer five types of questions:

  • How much or how many? (regression)
  • Which category? (classification)
  • Which group? (clustering)
  • Is this weird? (anomaly detection)
  • Which option should be taken? (recommendation)

Determine which of these questions your company is asking and how answering it achieves business goals and enables measurement of the results. At this point it is important to revisit the project goals by asking and refining sharp questions that are relevant, specific, and unambiguous. For example, if a company wants to predict customer churn, it will need a prediction accuracy of “x” percent by the end of a three-month project; with those predictions, the company can offer customer promotions to reduce churn.

In the case of our professional services company, we decided to tackle the first business question (How can we predict staff composition, e.g. one senior accountant and two accounting assistants, for a new project?). For this customer engagement, we used five years of daily historical project data at the individual level. We removed any data that had a negative contribution margin or a negative total number of hours. We first randomly sampled 1,000 projects from the testing dataset to speed up parameter tuning. After identifying the optimal parameter combination, we ran the same data preparation on all the projects in the testing dataset.

Below (Figure 3) is a representation of the type of data and solution flow that we built for this engagement:

Figure 3. Representation of the type of data and solution flow

We used a similarity-based method: the k-nearest neighbors (KNN) algorithm. KNN is a simple, easy-to-implement supervised machine learning algorithm. The KNN algorithm assumes that similar things exist in close proximity, finds the most similar data points in the training data, and makes an educated guess based on their classifications. Although very simple to understand and implement, this method has seen wide application in many domains, such as recommendation systems, semantic searching, and anomaly detection.

In this first step, we used KNN to predict the staff composition, i.e. numbers of each staff classification/title, of a new project using historical project data. We found historical projects similar to the new project based on different project properties, such as Project Type, Total Billing, Industry, Client, Revenue Range etc. We assigned different weights to each project property based on business rules and standards. We also removed any data that had negative contribution margin (profit). For each staff classification, staff count is predicted by computing a weighted sum of similar historical projects’ staff counts of the corresponding staff classification. The final weights are normalized so that the sum of all weights is 1. Before calculating the weighted sum, we removed 10% outliers with high values and 10% outliers with low values.
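As a rough illustration of this weighted nearest-neighbor logic (the property names, weights, and choice of k are made up for the example, and the outlier removal is approximated by clipping the top and bottom 10% of values), the core computation might look like this:

    import pandas as pd

    # Illustrative weights per project property; in the engagement these came
    # from business rules and standards.
    PROPERTY_WEIGHTS = {"project_type": 0.4, "industry": 0.3,
                        "revenue_range": 0.2, "client": 0.1}

    def project_similarity(new_project, historical):
        """Weighted similarity: each matching property contributes its weight."""
        sim = pd.Series(0.0, index=historical.index)
        for prop, weight in PROPERTY_WEIGHTS.items():
            sim += weight * (historical[prop] == new_project[prop]).astype(float)
        return sim

    def predict_staff_composition(new_project, historical, staff_counts, k=20):
        """Staff count per classification as a weighted sum over the k most similar projects."""
        neighbors = project_similarity(new_project, historical).nlargest(k)
        weights = neighbors / neighbors.sum()        # normalize so weights sum to 1

        counts = staff_counts.loc[neighbors.index]   # rows: projects, columns: classifications
        # Approximate the outlier handling by clipping the top/bottom 10% per classification.
        counts = counts.clip(lower=counts.quantile(0.1), upper=counts.quantile(0.9), axis=1)
        return counts.mul(weights, axis=0).sum().round()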

For the second business question (How can we compute the Staff Fitness Score for a new project?), we decided to use a custom content-based filtering method: specifically, we implemented a content-based algorithm to predict how well a staff member's experience matches project needs. In a content-based filtering system, a user profile is usually computed based on the user’s historical ratings on items. This user profile describes the user’s taste and preference. To predict a staff member's fitness for a new project, we created two staff profile vectors for each staff member using historical data: one vector is based on the number of hours and describes the staff member's experience and expertise for different types of projects; the other vector is based on contribution margin per hour (CMH) and describes the staff member's profitability for different types of projects. The Staff Fitness Scores for a new project are computed by taking the inner products between these two staff profile vectors and a binary vector that describes the important properties of the project.
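Below is a minimal sketch of that scoring idea. The article does not specify how the two inner products are combined, so the simple weighted blend (alpha) is an assumption for illustration:

    import numpy as np

    def staff_fitness_score(hours_profile, cmh_profile, project_vector, alpha=0.5):
        """Score how well one staff member fits a new project.

        hours_profile  - experience per project property, built from historical hours
        cmh_profile    - profitability per project property (contribution margin per hour)
        project_vector - binary vector marking the properties of the new project
        alpha          - illustrative blend between the two inner products
        """
        experience = hours_profile @ project_vector       # inner product
        profitability = cmh_profile @ project_vector      # inner product
        return alpha * experience + (1 - alpha) * profitability

    # Example with three project properties (e.g. industry, project type, revenue range).
    hours = np.array([120.0, 40.0, 300.0])
    cmh = np.array([55.0, 20.0, 80.0])
    new_project = np.array([1, 0, 1])
    print(staff_fitness_score(hours, cmh, new_project))    # higher means a better fit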

We implemented these machine learning steps using the Azure Machine Learning service. Using the main Python SDK and the Data Prep SDK for Azure Machine Learning, we built and trained our machine learning models in an Azure Machine Learning service workspace. The workspace is the top-level resource for the service and provides a centralized place to work with all the artifacts created for this project.

In order to create a workspace, we defined the following configurations:

  • Workspace name: Enter a unique name that identifies your workspace. Names must be unique across the resource group. Use a name that's easy to recall and differentiate from workspaces created by others.
  • Subscription: Select the Azure subscription that you want to use.
  • Resource group: Use an existing resource group in your subscription, or enter a name to create a new resource group. A resource group is a container that holds related resources for an Azure solution.
  • Location: Select the location closest to your users and the data resources. This location is where the workspace is created.
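As a rough sketch, a workspace with these settings can be created from the Azure Machine Learning Python SDK (azureml-core); the names and IDs below are placeholders:

    from azureml.core import Workspace

    # Placeholder values: substitute your own subscription, resource group, and region.
    ws = Workspace.create(
        name="staffing-recommendation-ws",      # unique within the resource group
        subscription_id="<your-subscription-id>",
        resource_group="staffing-rg",
        create_resource_group=True,             # set to False to reuse an existing group
        location="eastus",                      # region closest to your users and data
    )

    # Persist the configuration so later scripts can reconnect with Workspace.from_config().
    ws.write_config(path=".azureml")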

When we created the workspace, the following Azure resources were added automatically:

  • Azure Container Registry: registers the Docker containers used during training and deployment
  • Azure Storage account: the default datastore for the workspace
  • Azure Application Insights: stores monitoring information about the deployed models
  • Azure Key Vault: stores secrets used by compute targets and other sensitive information needed by the workspace

The workspace keeps a list of compute targets that you can use to train your model. It also keeps a history of the training runs, including logs, metrics, output, and a snapshot of your scripts. We used this information to determine which training run produced the best model.

Afterward, we registered our models with the workspace and used the registered models and scoring scripts to create an image for deployment (more details about the end-to-end architecture built for this use case are discussed below). Below is a representation of the workspace concept and machine learning flow (Figure 4):

Figure 4. Workspace concept and machine learning flow
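For reference, a sketch of the model registration and image configuration described above might look like the following with the SDK; the model path, names, conda file, and score.py entry script are placeholders:

    from azureml.core import Workspace
    from azureml.core.environment import Environment
    from azureml.core.model import InferenceConfig, Model

    ws = Workspace.from_config()

    # Register the trained model file with the workspace (placeholder path and name).
    model = Model.register(
        workspace=ws,
        model_path="outputs/staffing_knn.pkl",
        model_name="staffing-recommendation",
        tags={"use_case": "workforce-placement"},
    )

    # Pair the registered model with a scoring script and environment; this is what
    # gets baked into the image used for deployment.
    env = Environment.from_conda_specification("staffing-env", "conda_dependencies.yml")
    inference_config = InferenceConfig(entry_script="score.py", environment=env)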

Principle 3: Architect the End-to-End Solution 

In the era of Big Data, there is a growing trend of accumulation and analysis of data, often unstructured, coming from applications, web environments and a wide variety of devices. In this third step, organizations need to think more organically about the end-to-end data flow and architecture that will support their data science solutions, and ask themselves the following questions:

  • Do they really need this volume of data?
  • How do they ensure its integrity and reliability?
  • How should they store, treat, and manipulate this data to answer their questions?
  • And most importantly, how will they integrate this data science solution in their own business and operations in order to successfully consume it over time?

Data architecture is the process of planning the collection of data, including the definition of the information to be collected, the standards and norms that will be used for its structuring and the tools used in the extraction, storage and processing of such data.

This stage is fundamental for any project that performs data analysis, as it is what guarantees the availability and integrity of the information that will be explored in the future. To do this, you need to understand how the data will be stored, processed and used, and which analyses will be expected for the project. It can be said that at this point there is an intersection of the technical and strategic visions of the project, as the purpose of this planning task is to keep the data extraction and manipulation processes aligned with the objectives of the business.

After having defined the business objectives (Principle 1) and translated them into tangible metrics (Principle 2), it is now necessary to select the right tools that will allow an organization to actually build an end-to-end data science solution. Factors such as volume, variety of data and the speed with which they are generated and processed will help companies to identify which types of technology they should use. Among the various existing categories, it is important to consider:

  • Data collection tools, such as Azure Stream Analytics and Azure Data Factory: These help with the extraction and organization of raw data.
  • Storage tools, such as Azure Cosmos DB and Azure Storage: These tools store data in either structured or unstructured form, and can aggregate information from several platforms in an integrated manner.
  • Data processing and analysis tools, such as Azure Time Series Insights and Azure Machine Learning service Data Prep: With these, the stored and processed data is used to create a visualization logic that enables the development of analyses, studies, and reports supporting operational and strategic decision-making.
  • Model operationalization tools, such as Azure Machine Learning service and Machine Learning Server: After a company has a set of models that perform well, they can operationalize them for other applications to consume. Depending on the business requirements, predictions are made either in real time or on a batch basis. To deploy models, companies need to expose them with an open API interface. The interface enables the model to be easily consumed from various applications (see the sketch after this list), such as:
    • Online websites
    • Spreadsheets
    • Dashboards
    • Line-of-business (LoB) applications
    • Back-end applications
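As a sketch of what consuming such an endpoint looks like from one of these applications, a client can simply POST a JSON payload to the scoring URI; the URI, key, and payload fields below are placeholders:

    import json
    import requests

    scoring_uri = "https://<your-endpoint>/score"          # placeholder endpoint
    payload = {"data": [{"project_type": "audit", "industry": "retail"}]}

    response = requests.post(
        scoring_uri,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer <service-key>"},  # if the endpoint requires auth
    )
    print(response.json())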

The tools can vary according to the needs of the business but should ideally offer the possibility of integration between them to allow the data to be used in any of the chosen platforms without needing manual treatments. This end-to-end architecture (Figure 5) will also offer some key advantages and values to companies, such as:

Figure 5. End-to-end architecture example

  • Accelerated Deployment & Reduced Risk: An integrated end-to-end architecture can drastically minimize cost and effort required to piece together an end-to-end solution, and further enables accelerated time to deploy use cases
  • Modularity: Allows companies to start at any part of the end-to-end architecture with the assurance that the key components will integrate and fit together
  • Flexibility: Runs anywhere including multi-cloud or hybrid-cloud environments
  • End-to-End Analytics & Machine Learning: Enables end-to-end analytics from edge-to-cloud, with the ability to push machine learning models back out to the edge for real-time decision making
  • End-to-End Data Security & Compliance: Pre-integrated security and manageability across the architecture including access, authorization, and authentication
  • Enabling Open Source Innovation: Built off of open-source projects and a vibrant community innovation model that ensures open standards

In the case of our professional services company, the solution architecture consists of the following components (Figure 6):

Figure 6. End-to-end architecture developed by Microsoft Azure ML team

  1. Data scientists train a model using Azure Machine Learning and an HDInsight cluster. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises, a cloud service that makes it easy, fast, and cost-effective to process massive amounts of data. The model is containerized and put into an Azure Container Registry. Azure Container Registry allows you to build, store, and manage images for all types of container deployments. For this specific customer engagement, we created an Azure Container Registry instance using the Azure CLI, then used Docker commands to push a container image into the registry, and finally pulled and ran the image from the registry. The Azure CLI is a command-line tool that provides a great experience for managing Azure resources; it is designed to make scripting easy, to query data, to support long-running operations, and more.
  2. The model is deployed via an offline installer to a Kubernetes cluster on Azure Stack. Azure Kubernetes Service (AKS) simplifies the management of Kubernetes by enabling easy provisioning of clusters through tools like Azure CLI and by streamlining cluster maintenance with automated upgrades and scaling. Additionally, the ability to create GPU clusters allows AKS to be used for high-performing serving, and auto-scaling of machine learning models.
  3. End users provide data that is scored against the model. The process of applying a predictive model to a set of data is referred to as scoring the data. Once a model has been built, the model specifications can be saved in a file that contains all of the information necessary to reconstruct the model. You can then use that model file to generate predictive scores on other datasets.
  4. Insights and anomalies from scoring are placed into storage for later upload. Azure Blob storage is used to store all project data. Azure Machine Learning Service integrates with Blob storage so that users do not have to manually move data across compute platforms and Blob storage. Blob storage is also very cost-effective for the performance that this workload requires.
  5. Globally-relevant and compliant insights are available in the global app. Azure App Service is a service for hosting web applications, REST APIs, and mobile back ends. App Service not only adds the power of Microsoft Azure to your application, such as security, load balancing, autoscaling, and automated management, but also lets you take advantage of its DevOps capabilities, such as continuous deployment from Azure DevOps, GitHub, Docker Hub, and other sources, as well as package management, staging environments, custom domains, and SSL certificates.
  6. Finally, data from edge scoring is used to improve the model.
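A simplified sketch of steps 1-3 with the Azure Machine Learning SDK is shown below; it deploys the registered model to an AKS cluster and scores a sample request (the engagement above used an offline installer to Azure Stack, and the compute, model, and field names here are placeholders):

    import json

    from azureml.core import Workspace
    from azureml.core.compute import AksCompute
    from azureml.core.environment import Environment
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AksWebservice

    ws = Workspace.from_config()

    # Scoring script and environment for the container image (placeholders).
    env = Environment.from_conda_specification("staffing-env", "conda_dependencies.yml")
    inference_config = InferenceConfig(entry_script="score.py", environment=env)

    # Attach to an AKS cluster already registered in the workspace (placeholder name).
    aks_target = AksCompute(ws, "staffing-aks")

    # Deploy the registered model behind a REST endpoint on the cluster.
    deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)
    service = Model.deploy(
        workspace=ws,
        name="staffing-recommendation-svc",
        models=[Model(ws, "staffing-recommendation")],
        inference_config=inference_config,
        deployment_config=deployment_config,
        deployment_target=aks_target,
    )
    service.wait_for_deployment(show_output=True)

    # Step 3: end users submit new project data to be scored against the model.
    sample = {"data": [{"project_type": "audit", "industry": "retail"}]}
    print(service.run(json.dumps(sample)))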

Principle 4: Build Your Toolbox of Data Science Tricks 

When working on the recommendation-based staff allocation solution for our professional services company, we immediately realized that they were limited in time and didn’t have an infinite amount of computing resources. How can organizations organize their work so that they can maintain maximum productivity?

We worked closely with our customer’s data science team and helped them develop a portfolio of different tricks to optimize their work and accelerate production time, for example:

  • Train on a subset much smaller than the whole data set first: Once data science teams have a clear understanding of what they need to achieve in terms of features, loss function, metrics, and values of hyperparameters, they can then scale things up.
  • Reuse knowledge gained from previous projects: Many data science problems are similar to one another. Reusing the best values of hyperparameters or feature extractors from similar problems other data scientists solved in the past will save organizations a lot of time.
  • Set up automated alerts that inform data science teams when a specific experiment is over: This will save data science teams time in case something goes wrong with the experiment.
  • Use Jupyter notebooks for quick prototyping: Data scientists can rewrite their code into Python packages/classes once they are satisfied with the result.
  • Keep your experiment code in a version control system, such as GitHub.
  • Use pre-configured environments in the cloud for data science development: These are virtual machine images (such as Windows Virtual Machines and the Azure Data Science Virtual Machine), pre-installed, configured, and tested with several popular tools that are commonly used for data analytics and machine learning training.
  • Have a list of things to do while experiments are running: data collection, cleaning, and annotation; reading up on new data science topics; experimenting with a new algorithm or framework. All those activities will contribute to the success of your future projects. Some suggested data science websites are Data Science Central, KDnuggets, and Revolution Analytics.

Principle 5: Unify Your Organization’s Data Science Vision 

Right from the first day of a data science process, data science teams should interact with business partners. In practice, data scientists and business partners often discuss the solution only infrequently: business partners want to stay away from the technical details, and data scientists want to stay away from the business details. However, it is essential to maintain constant interaction so that implementation of the model is understood in parallel with its development. Most organizations struggle to unlock data science to optimize their operational processes and to get data scientists, analysts, and business teams speaking the same language: different teams and the data science process are often a source of friction. Resolving that friction is what defines the new data science iron triangle, which is based on a harmonious orchestration of data science, IT operations, and business operations.

In order to accomplish this task with our customer, we implemented the following steps:

  • Request the support of a C-level executive to take that vision and drive it through the different parts of the business: Where there is a clear purpose, vision and sponsorship, the taste of initial success or early wins spurs further experimentation and exploration, often resulting in a domino effect of positive change.
  • Build a culture of experimentation: Even where there is a clearly articulated purpose, that alone often doesn’t lead to successful business transformation. An important obstacle in many organizations is the fact that employees just aren’t empowered enough to bring about change. Having an empowered workforce helps engage your employees and gets them actively involved in contributing towards a common goal.

  • Involve everyone in the conversation: Building consensus will build performance muscle. If data scientists work in silos without involving others, the organization will lack shared vision, values, and common purpose. It is the organization’s shared vision and common purpose across multiple teams that provide synergistic lift.

Principle 6: Keep Humans in the Loop 

Becoming a data-driven company is more about a cultural shift than numbers: for this reason, it is important to have humans evaluate the results from any data science solution. Human-data science teaming will result in better outcomes than either alone would provide.

For instance, in the case of our customer, the combination of data science and human experience helped them build, deploy, and maintain a workforce placement recommendation solution that recommends optimal staff composition and individual staff members with the right experience and expertise for new projects, which often led to monetary gains. After we deployed the solution, our customer decided to conduct a pilot with a few project teams. They also created a v-Team of data scientists and business experts whose purpose was to work in parallel with the machine learning solution and compare the results, in terms of project completion time, revenue generated, and employee and customer satisfaction, from these pilot teams before and after using the Azure Machine Learning solution. This offline evaluation conducted by a team of data and business experts was very beneficial for the project for two main reasons:

  1. It confirmed that the machine learning solution was able to improve the contribution margin of each project by roughly 4-5%;
  2. The v-Team was able to test the solution and create a solid mechanism of immediate feedback that allowed them to constantly monitor the results and improve the final solution.

After this pilot, the customer successfully integrated our solution within their internal project management system.

There are a few guidelines that companies should keep in mind when starting this data-driven cultural shift:

  • Working side by side: Leading companies increasingly recognize that these technologies are most effective when they complement humans, not replace them. Understanding the unique capabilities that data science and humans bring to different types of work and tasks will be critical as the focus moves from automation to the redesign of work.
  • Recognizing the human touch: It is important to remember that even jobs with greater levels of computerization have to maintain a service-oriented aspect and be interpretive in order to be successful for the company — these roles, like data scientists and developers, still need essential human skills of creativity, empathy, communication, and complex problem-solving.
  • Investing in workforce development: A renewed, imaginative focus on workforce development, learning, and career models will also be important. Perhaps most critical of all will be the need to create meaningful work—work that, notwithstanding their new collaboration with intelligent machines, human beings will be eager to embrace.

The human component will be especially important in use-cases where data science would need additional, currently prohibitively expensive architectures, such as vast knowledge graphs, to provide context and supplant human experience in each domain.

Conclusion

By applying these six principles of the Healthy Data Science Organization Framework to the data analysis process, organizations can make better decisions for their business, because their choices will be backed by data that has been robustly collected and analyzed.

Our customer was able to implement a successful data science solution that recommends optimal staff composition and individual staff members with the right experience and expertise for new projects. By aligning staff experience with project needs, we helped project managers perform better and faster staff allocation.

With practice, data science processes will get faster and more accurate – meaning that organizations will make better, more informed decisions to run operations most effectively.


About the Author

Francesca Lazzeri, PhD (Twitter: @frlazzeri) is a Senior Machine Learning Scientist at Microsoft on the Cloud Advocacy team and an expert in big data technology innovations and the applications of machine learning-based solutions to real-world problems. She is the author of the book “Time Series Forecasting: A Machine Learning Approach” (O’Reilly Media, 2019), and she periodically teaches applied analytics and machine learning classes at universities in the USA and Europe. Before joining Microsoft, she was a Research Fellow in Business Economics at Harvard Business School, where she performed statistical and econometric analysis within the Technology and Operations Management Unit. She is also a Data Science mentor for PhD and postdoc students at the Massachusetts Institute of Technology, and a keynote and featured speaker at academic and industry conferences, where she shares her knowledge and passion for AI, machine learning, and coding.
