Michael Berthold on End-to-End Data Science Using KNIME Software

Michael Berthold, CEO and co-founder of the open source data analytics platform KNIME, gave the keynote presentation at the KNIME Fall Summit 2019 conference. He spoke about the end-to-end data science cycle, which is divided into two categories: Create and Productionize. The Create category includes the "Gather & Wrangle" and "Model & Visualize" phases, whereas the Productionize category consists of the "Deploy & Manage" and "Consume & Optimize" phases.

In the Gather & Wrangle phase, KNIME software provides connectors, transformation nodes, and big data extensions for integrating technologies like Amazon Redshift, H2, Hive, Impala, and Apache Spark into your data science solutions. KNIME now also supports extended cloud file system connectivity.

The Model & Visualize phase includes integrations of libraries and tools for data pipeline management. KNIME supports application-specific libraries for use cases in areas such as life science, text processing, and image and time-series analysis. There are also specialized implementations of data science frameworks like H2O and XGBoost. Users can also write custom code, if they need to, in languages like R and Python.

The Deploy & Manage phase focuses on deploying data science models to end users and servers, in the form of analytics applications and web services. Users can also use the KNIME Model Process Factory to retrain models on demand. The factory supports automated model initialization as well as monitoring and alerting capabilities.
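The idea of combining monitoring, alerting, and on-demand retraining can be illustrated with a small sketch. This is not the KNIME Model Process Factory or any KNIME API; the class `ModelMonitor`, its fields, and the threshold logic are hypothetical names chosen here only to show one way such a loop could be structured.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ModelMonitor:
    """Hypothetical monitor (not a KNIME API): retrains when quality drops."""

    train: Callable[[], float]  # retrains the model and returns its new score
    threshold: float            # minimum acceptable quality score
    alerts: List[str] = field(default_factory=list)
    retrain_count: int = 0

    def check(self, score: float) -> float:
        """Return the score to rely on after monitoring.

        If the observed score is acceptable, it is returned unchanged.
        Otherwise an alert is recorded, the model is retrained, and the
        score of the retrained model is returned.
        """
        if score >= self.threshold:
            return score
        self.alerts.append(
            f"score {score:.2f} below threshold {self.threshold:.2f}"
        )
        self.retrain_count += 1
        return self.train()
```

In this sketch, a caller would construct the monitor with its own training function, for example `ModelMonitor(train=my_training_job, threshold=0.8)`, and call `check` whenever a new quality measurement arrives.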

KNIME also provides production support and governance for data science applications and services, covering versioning, backwards compatibility, compliance, and best practices.

Finally, the Consume & Optimize phase supports the consumption of ML workflows through direct deployment, either as a workflow as a service or as an analytics application.

The first generation of automation is a black box that typically starts with the ingestion of pre-processed data and ends with the release of a model. Usually, there are no interaction points during that process, and it is not transparent how the data is further processed or how the model is built.

Berthold presented what he calls the "second generation of automation", which allows the data scientist to choose between three levels of abstraction: entirely custom, guided automation, and complete automation. The process owner strategically decides which sections of the data science life cycle should be automated, and to what degree.

Custom data science is important for all areas in which a competitive advantage is expected. Guided automation applies where both the data science expert and the domain expert contribute. Full automation makes sense where good standard solutions are available and an organization wouldn't expect any competitive advantage.

This approach to mixing and matching levels of automation is supported by KNIME.
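As a rough illustration of mixing and matching, the choice of automation level per phase can be written down as a simple mapping. The sketch below is not KNIME code: the phase names and the three levels come from the talk as summarized here, while the `Automation` enum, the `plan` dictionary, its particular assignments, and the helper functions are assumptions made up for the example.

```python
from enum import Enum


class Automation(Enum):
    """The three levels of abstraction named in the talk."""

    CUSTOM = "entirely custom"
    GUIDED = "guided automation"
    FULL = "complete automation"


# Hypothetical plan: phase names follow the article, but which level is
# assigned to which phase is purely illustrative.
plan = {
    "Gather & Wrangle": Automation.FULL,
    "Model & Visualize": Automation.CUSTOM,
    "Deploy & Manage": Automation.GUIDED,
    "Consume & Optimize": Automation.FULL,
}


def phases_at(level, plan):
    """Return the phases assigned to the given level, in plan order."""
    return [phase for phase, chosen in plan.items() if chosen is level]


def describe(plan):
    """Return one human-readable line per phase."""
    return [f"{phase}: {level.value}" for phase, level in plan.items()]
```

A process owner could then ask, for instance, `phases_at(Automation.CUSTOM, plan)` to see where in-house expertise is being concentrated.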

Berthold concluded the presentation with a discussion of data science abstraction, in which components provide automated or guided interaction for ML pipeline phases such as feature selection and engineering, interpretation, model management, and, most importantly, model deployment to production.

Data science adoption should be determined based on the business requirements and the needs of the users. This adoption can fall into one or more of the following four categories:

Standard problems without business impact, where simple data science tools can do the job.
Applications that require no in-house domain or data science expertise and can be solved using simple ML models.
Productionized data science, which is critical for value creation and requires the continuous incorporation of new technologies and user feedback.
Data science innovation deeply embedded in the business, which involves exploring the latest ML/AI trends and injecting the corporation's own R&D into data science.

KNIME also has a new integration with Amazon's Cognitive Services. If you would like to learn more about the tools, check out KNIME Analytics Platform; KNIME Server, a commercial product that includes KNIME Web Portal and Data Science as a Service; and KNIME Hub.
