Data science is fast becoming a critical skill for developers and managers across industries, and it looks like a lot of fun as well. But it’s pretty complicated - there are a lot of engineering and analytical options to navigate, and it’s hard to know if you’re doing it right or where the bear traps lie. In this series we explore ways in to making sense of data science - understanding where it’s needed and where it’s not, and how to make it an asset for you, from people who’ve been there and done it.
This InfoQ article is part of the series "Getting A Handle On Data Science". You can subscribe to receive notifications via RSS.
Organisations are increasingly adopting Data Science and advanced analytics, which influence their decision making, products and services to a growing extent. That regularly raises the question of what the best set of tools for Data Science is. On the surface, this subject appears to be about technology comparisons. You could end up reviewing a lengthy list of pros and cons about R, Spark ML, and related technologies like Jupyter or Zeppelin. In fact, we could write a whole series of technology comparisons. However, for the organisation, this is first and foremost a question of which capabilities will support its future business goals. Focusing on those makes the technology choices easier and reduces the risk of wasting time and effort.
How can we arrive at a framework for having this conversation about selecting technologies in a pragmatic and productive manner? In this article, we explore a suitable framework with a real-world example. A typical starting point for organisations is a paralysing number of silos and a plethora of adopted technologies. You don’t want to add more technologies and silos merely because stakeholders ask for them. New technologies and infrastructure should displace existing technologies and break down and replace silos. But this is not trivial in an environment where traditional analytics and business intelligence vendors claim to have the answer to the new challenges, and a flood of new technologies, many of them open source, adds further choices. The latter often claim to replace the traditional tools and bring capabilities beyond their reach. The incumbents counter that they offer better enterprise qualities like security and support.
The customer in this real-world example approached my employer over a year ago with a tremendous challenge that consisted of immediate and long-term strategic requirements. This FTSE100 company was at a transformational moment in its life. It was changing significantly as an organisation and needed to reinvent parts of its platform because of past fragmentation and dependencies that were not maintainable and did not deliver business value. The urgent request to us was to address immediate business needs for advanced reporting and basic analytics on a new platform, blending in historical data in a fully transparent fashion, under a tight deadline. The existing data warehouse, based on appliance technology, was costly and limiting. New reports and advanced analytics were prohibitively slow or impossible to execute without investing large sums, and even that would not have added future-proof analytics capabilities.
The cost and limitations were a grave concern. The customer recognised that, in the long term, the value derived from its core business activities will inevitably shrink as the market becomes increasingly competitive and disruptive technology changes appear on the horizon. The leaders in the organisation realised that, immediately after addressing the urgent requirement, novel capabilities would be needed to prepare the business for the future.
We worked with the key stakeholders and developed a plan to bring the main datasets together in a central place, with full flexibility to process and analyse them in the future for the next evolution of the business. It is noteworthy that the core data warehouse was not abandoned but merely reduced to its originally intended role. Still, numerous legacy systems that mostly held data and could only be interrogated with difficulty were put on a path to being phased out. At the same time, great consideration was given to ensuring that the data flows correctly through the different platforms to provide governance and security. The plan deferred the detailed decisions on advanced analytics and Data Science technologies. This was possible because the new platform could adopt most of the relevant ones as and when needed. The benefit for the customer with this approach was significant. The details of the future business requirements were still evolving while an immediate business need required action. Breaking the decision making and implementation into a staged approach, without blocking future avenues for innovation on the platform side, offered the best of both worlds.
The first lesson here is to avoid doubling down on technology that cannot keep up with changes to requirements. In addition, it is important not to do a one-to-one matching of technologies, i.e. do not merely replace one technology with a similar, newer one which only offers minor benefits. Think of the basket of technologies as a cost to the business and a collection of capabilities in return. We want to reduce cost with fewer and cheaper technologies or achieve more of the capabilities that the business needs. Ideally, we deliver both. In our example, a combination of phasing out legacy systems and reducing the footprint of the data warehouse brings savings that can be used for the new analytics platform, which in turn replaces some of the capabilities and adds relevant new ones.
With this in mind, we can focus on what we are trying to achieve. Today's enterprises have the same challenges as yesterday's. They have to reduce cost, improve profitability, evolve to stay compliant, and may also need to redefine their core business in a changing world driven, for example, by automation and the commoditisation of services. What has changed in the last few years is that data, and the productive use of it, are becoming the pivotal opportunity to answer these challenges.
The problem is that most organisations do not know what shape these answers, or even the questions, take. Some obvious short-term opportunities are usually sitting in the drawers of various business areas, and they assume no more than a slight improvement over the status quo. Most stakeholders have become so used to their limitations that they need encouragement to dream a bit. When asked what they would like to achieve, they think within the constraints of their organisation's existing capabilities or ask for the moon to address the unknown requirements of the future.
Hence long-term fundamental requirements, sometimes involving the core business reinventing itself, are often hard or even impossible to obtain. The second lesson, therefore, was not to aim for the moon but to stay flexible to emerging requirements instead of trying to predict the future. In our example, we demonstrated that we left room for iteratively expanding the platform well into the future without constraining the options or having to do rework. We did this by planning multiple buildout steps that showcased the option to add a rich set of capabilities, like stream processing or key-value stores to name only two of many, at the appropriate time.
There is a risk here, though, if we become wholly technology-driven and replace the deferred inward reflection and requirements gathering with the expectation of adding every capability and technology. We could end up adopting technologies without a business purpose or value, creating enormous costs and complexity, or, worse, failing completely. The buzz around Big Data and Data Science has led stakeholders in this situation to fall into the hype trap. They believe that technology adoption can address shortcomings in business goals, capabilities and requirements. It is crucial for stakeholders to ask the right questions of Big Data and Data Science to avoid confusion and disappointment. These questions start from specific strategic business goals and requirements, which are a prerequisite. While the strategic goals have to be clear from the beginning, the requirements can be derived iteratively over time, as in our example.
With the right Big Data strategy, an organisation can evaluate its current capabilities around data storage, processing and analytics, identify the ones it needs, and adopt new ones. In fact, this agility is the underpinning of the modern data-driven organisation and allows it to operate in a rapidly evolving technology landscape. Data Science leverages the capabilities delivered by a flexible organisation evaluating and adopting these technologies. Data Science then also provides deep insight and adaptive solutions to the growing challenges that come from the technology side, i.e. more, faster and more diverse data, while at the same time the expectation that data will drive products, services, insight and decisions grows seemingly without limit.
In our example, the traditional data warehouse approach was already challenged by the first task alone, let alone flexible enough to address unknown future needs. The solution was also not trivial, since this particular business operates in the finance industry with sensitive data and is highly regulated. But the business needs to generate deep insight, which requires access for many data scientists and business users. That tension exists in most organisations: make all the data available to all possible consumers and at the same time stay safe and ensure no data can be abused or leaked.
With government, healthcare and finance customers, these challenges also have to pass the front-page test, i.e. data security issues, perceived or real, could lead to disastrous news coverage. So security is a matter of both actual protection and perception. Interestingly, this is one reason why many customers are hesitant about the cloud, where perception and reality are increasingly diverging as security options improve. Compliance may be an issue for some companies, e.g. where the data may be stored, and cloud providers are increasingly offering footprints within regulatory zones to address these needs.
For our customer, who chose an on-premises solution, we outlined the key capabilities needed for the immediate use cases and designed a platform that can be extended flexibly into the future. The first goal was to build a platform with Hadoop and its ecosystem at the core to ingest legacy and new data, secure it with masking and encryption, and then report on it. The analytics tools required were basic and the decisions straightforward: mostly SQL interfaces into the Hadoop ecosystem for legacy tools, centred on Apache Hive. Hive was the first choice since it is an integral part of the distribution chosen, is stable, has rich SQL coverage, is accessible via standard connections from legacy systems, and integrates tightly with the distribution’s security model. Moreover, the performance profile needed in the first phase was mostly about processing small and large batches of data for reporting and analysis.
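To make the reporting path concrete, here is a minimal sketch of how a script or legacy tool might run a reporting query against Hive through its standard HiveServer2 interface, using the PyHive client in Python. The host, database, table and column names are illustrative assumptions, not details of the customer’s platform.

```python
# Minimal sketch: running a batch reporting query against Hive via HiveServer2.
# Requires the PyHive package; connection details and schema are hypothetical.
from pyhive import hive

conn = hive.Connection(
    host="hive-server.example.internal",  # hypothetical HiveServer2 host
    port=10000,
    username="report_user",
    database="warehouse",
)
cursor = conn.cursor()

# A typical aggregation over a large batch of historical and new data.
cursor.execute(
    "SELECT region, SUM(amount) AS total_amount "
    "FROM transactions "
    "WHERE txn_date >= '2017-01-01' "
    "GROUP BY region"
)

for region, total_amount in cursor.fetchall():
    print(region, total_amount)

cursor.close()
conn.close()
```

Because Hive exposes standard JDBC/ODBC-style connectivity, the same kind of query can also be issued directly from existing reporting tools, which is what made it attractive for the legacy integration in the first phase.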
The core platform build and integration, as well as the necessary PCI compliance, were the key challenges at this stage. Because of the tight deadline, the work had to commence immediately, and all stakeholders were happy to 'fail fast' with proof-of-concept implementations of key elements of the platform to find organisational blockers and technology limitations swiftly. Naturally, failing fast is only beneficial if the issues discovered are then addressed. Therefore, the work was accompanied by workshops whenever we reached a milestone or failed, i.e. learned something, to bring in new business and technology stakeholders to resolve issues or plan the next steps.
This approach is effective with senior leadership support, though it is also tough at times. Existing processes and technologies, as well as established vendors, may need to be evaluated as part of the solution. That at times leads to difficult conversations with vendors and business stakeholders about managing failure honestly, whether it is the organisation that falls short or the vendors and partners. Senior stakeholders need the strength to take a strategic view and weather some issues, since being at the forefront of data-driven development also means being one of the few to find out what doesn’t work. That is only possible with a constructive, collaborative approach. Here, the combination of workshops to engage stakeholders and listen to their needs and processes, together with the ability to iterate in proof-of-concept environments to establish viable and unviable paths, was essential and allowed us to progress the work quickly.
A good example of failing fast was the choice of tooling to encrypt and mask sensitive data. A prominent market player offered their solution and was adamant that their established, relevant finance use cases made them the first choice for evaluation. It turned out, though, that the market had moved away from them: new capabilities of the Hadoop ecosystem, like transparent data encryption combined with a true multi-tenancy paradigm, were too big a change for their product and security approach to adapt to. The ability to fail quickly and bring in other vendors within the proof-of-concept environment meant that the delay incurred was manageable, and the work progressed with another provider after a further round of evaluations.
As the completion of the first phase of work approached, demand increased across the organisation for access to the platform and the data, and for the addition of tools to support the Data Scientists and advanced Business Analysts in the organisation. The desires ranged from exploratory analytics and advanced near real-time reporting to smart applications and products. These needs all demand numerous capabilities and tools. Additionally, many Data Scientists have different tool preferences, often including R, Python (scikit-learn), Spark ML (with Python, Scala or Java), various commercial solutions, and notebooks like Jupyter or Zeppelin. Together, the many requirements and preferences, which are often still unclear and preliminary, have to be matched with the tools that can satisfy them. We also need to keep in mind the often overlooked aspects of governance, security, business continuity, and software and dataset development lifecycles, as well as cost, complexity and risk. In short, can the organisation become one that can continuously innovate in a timely and profitable fashion with minimal risk, or will it drown in technologies?
Too much innovative flexibility and wild adoption of technologies will introduce risk and paralyse the organisation. Data may be leaked or diminish in quality from lack of governance and inadequate security. Resources may become an issue when too many technologies need to be supported and integration becomes unmanageable. On the other hand, tight, minimalistic technology choices made with only security in mind will stifle innovation in the organisation: talent will leave, capabilities will be lacking, and the organisation will find itself unable to react to new opportunities and risks. And the orthogonal idea of a lengthy waterfall process to plan the perfect solution has little merit in a situation of innovation where the requirements can’t be gathered and technology capabilities continue to change.
When we visualise the organisation, correctly, as an entity that has a limited pool of resources and aims to get the maximum relevant capabilities from them, an agile-like approach becomes the best alternative. The framework to develop these capabilities is similar to the workshops we used to evaluate technology choices and resolve issues along the path of the core platform development and build. We can bring together the various Data Science and Analytics stakeholders from the relevant business units to discuss the situation. What are the well-understood use cases, their priority and impact for the organisation, and the capabilities needed to implement them? What are the less well understood, innovative future ideas and the capabilities they would potentially require? The question of technologies comes second. What are the technology preferences and existing skills in their teams? What are the development lifecycle requirements of the various demands, and which organisational standards have to be satisfied? Ideally, the latter questions are supported by stakeholders from units like security, infrastructure and operations, and software development.
Our customer was advanced and already had significant independence thanks to some important senior leaders who are Big Data and Analytics experts. However, they did appreciate the external support, independent guidance and evaluation from experts who were also part of the platform development and future analytics work. For consultants, this is the dream outcome: customers accept you as an independent authority and trusted advisor. Together we ran a workshop in preparation for the Data Science work. We collected the information, and already during the workshop we were able to prioritise pieces of work across business units and exclude ill-fitting technology candidates.
The benefit of the exercise was immediate. All the stakeholders now knew of each other and of each other’s desires and preferences, which in itself is valuable. Moreover, we were able to identify a significant piece of work, a decision service based on streaming near real-time data, which could serve all parties, i.e. everyone had a use case that required such a service. We were able to avoid parallel development, pool effort, and prioritise it as a pilot project. In an unmanaged situation, we could have ended up with multiple variations of the service using different technologies, developed by multiple business units. This way we were able to consolidate effort and tool choices.
Our next step was to select the first set of technologies to be added to the platform for Data Science work: in particular Spark with Spark ML, Java, Python, and Kafka for streaming data. These bring in the capabilities necessary for the use case at hand and will also cover some of the future and secondary use cases. The choice was made after the workshop discussion had shortlisted candidates, with operational and organisational aspects added to the evaluation. For example, we needed to determine which technologies are market leaders with broad support, maturity, and hirable skills. The latter was an influencing factor in choosing Java over Scala at this stage.
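As an illustration of how these pieces fit together, below is a minimal sketch of a streaming decision job that reads events from Kafka with Spark Structured Streaming and attaches a placeholder decision to each event. It assumes PySpark with the spark-sql-kafka connector package available; the broker address, topic, payload schema and decision rule are hypothetical and merely stand in for the customer’s actual service.

```python
# Minimal sketch of a near real-time decision service: consume events from Kafka
# with Spark Structured Streaming and attach a decision to each event.
# Broker, topic, schema and the rule itself are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (
    SparkSession.builder
    .appName("decision-service-sketch")
    .getOrCreate()
)

# Assumed JSON payload on the topic: {"account_id": "...", "amount": 123.45}
schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "transactions")                 # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Placeholder rule; a real service would score events with a trained Spark ML model.
decisions = events.withColumn(
    "decision",
    F.when(F.col("amount") > 1000, F.lit("review")).otherwise(F.lit("approve")),
)

# Write to the console for the sketch; production would publish the decisions back
# to Kafka or to a serving store consumed by downstream applications.
query = (
    decisions.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

Even a placeholder like this helps surface early the non-functional questions, such as security, reliability and performance, that the subsequent proof-of-concept work has to answer.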
An important aspect is not to dismiss any options out of hand but to engage stakeholders in constructive discussions. Even when options seem unlikely, we can deprioritise them with the above framework.
We are now about to start developing the service. The future benefit is that it brings a range of technologies and their capabilities to the organisation. They are immediately evaluated in business-critical projects for their non-functional qualities, e.g. security, reliability and performance. Furthermore, once proven, they are likely to be adopted by the business stakeholders because they work and are available. That reduces the demand for overlapping alternative options. With a good selection and demonstrated success, the desire to continuously adopt more technologies wanes, and a fondness for the proven and available solutions at hand becomes widespread.
The plan is to continue with the framework and solicit feedback from stakeholders and users for the evaluation and further technology adoption where existing capabilities are lacking. Subsequent workshops will then move naturally from broad adoption discussions into maintenance conversations, and eventually towards ones where we discuss phasing out technologies as the market continues to develop.
About the Author
Dr Christian Prokopp is the International Engineering Practice Lead at Think Big, a Teradata company, working with systems, software and data engineers as well as architects to build a global community of best practices for Big Data engineering and consulting. Christian worked in the past as a software engineer, data scientist, architect, consultant and manager. He has also written and publicly spoken about Big Data. He was part of two successful exits including one with Google. In his spare time, Christian competes with his wife and daughter on who will visit all the world’s countries first.