Data science is about the design and development of solutions to extract insights from data (structured and unstructured) using machine learning and predictive analytics techniques and tools. Data Science as a discipline and Data Scientist as a role have received a lot of attention in recent years as a way to solve real-world problems, with solutions ranging from fraud detection to recommendation engines.
Christine Doig, Senior Data Scientist at Continuum Analytics, spoke at this year’s OSCON Conference about data science as a team discipline and how to navigate the data science Python ecosystem.
She talked about how to transition from data to models to applications. Christine also discussed the different roles and skillsets needed for the data science discipline: Statistician, Computational Scientist, and Developer.
She elaborated on the deliverables of these different roles.
- Statistician: Insights, predictions, visualizations
- Computational Scientist: Algorithms, libraries, performance
- Developer: Software, applications, containers
Data science teams face challenges in different areas like collaboration, big data, deployment and sharing insights.
- Collaboration: Get diverse data teams (languages, tools, data models, deliverables) to collaborate effectively
- Big Data: Move Data Scientists (Stats / Analyst) to use Big Data infrastructure
- Deployment: Deploy predictive models into production applications
- Sharing insights: Share insights with decision makers
She also spoke about Continuum Analytics' contributions to the Python ecosystem with frameworks like Bokeh, Datashader, Dask and Blaze.
InfoQ spoke with Christine about Data Science as a team discipline and the challenges Data Science teams need to address to be more effective in their organizations' Big Data and Data Science initiatives.
InfoQ: Can you define Data Science?
Doig: According to Wikipedia, Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms. I like to think of it as the glue that is bringing together different fields and lines of thought, to commonly solve problems around data and transform information to knowledge and actionable insights.
InfoQ: Can you discuss the Data Science Venn Diagram you talked about in your presentation?
Doig: I like to refactor the traditional Data Science diagram into a field-oriented one, to include some areas that are traditionally left out of the conversation, such as data visualization or traditional analytics and business intelligence. Even if Data Science is a new term and field, there are many business people, researchers, and scientists who have been working on hard data problems for a long time, and Data Science is leveraging all their work.
Data Science is much more than just Machine Learning. It requires knowledge in other very different areas as well:
- Visualization and Storytelling: How do you visualize and present data to different audiences?
- Business Intelligence, ETL and databases: How do you store, extract and transform data? What does your business actually need to drive value out of that information?
- Machine Learning, Statistics and Artificial Intelligence
- CS / Programming: How do we implement custom processes and algorithms that satisfy the data needs of our business?
- Scientific computing and High performance computing: How do we leverage the work that scientists have been doing for a long time with high performance computing and scientific libraries?
- Big Data: How do we scale those processes and analyses to the ever growing amounts of data?
InfoQ: Can you talk about Data Science responsibilities as an individual (Data Scientist) as well as a team (Data Science Team)?
Doig: Because the need for data scientists arose before there was any formal education in Data Science, people who currently hold the title of Data Scientist come from different backgrounds: statistics, math, computer science, sciences, operations research, business, artificial intelligence. Therefore, when building a data science team, companies should focus on finding people who complement each other well, both in terms of skills and knowledge, instead of trying to find the so-called rare unicorn “data scientist”. There’s so much depth in each of those fields that it would be impossible for one single person to master them all effectively. Data Science is a team sport, and as such, different players have different roles. They all should have a general understanding of the game, but play to their strengths.
InfoQ: What are the challenges that Data Science teams need to address to be more effective in their initiatives?
Doig: I believe Data Science teams face several challenges:
- Getting teams to collaborate effectively: Because of their different backgrounds, data scientists have their own preferred tools and languages. They are mostly used to working on their own laptop or workstation and setting up their own environment. Therefore, sharing analysis and collaborating is not always trivial; it involves tasks such as setting up an environment, managing package versions, and running cloud services. It's important to make sure that you have all those skills in your team and that data scientists can rely upon each other.
- Leveraging Big Data infrastructure: Large companies have been adopting Big Data infrastructures, but it is still taking some time for data scientists to use them effectively, mainly because not all the algorithms they use exist in those ecosystems and it’s not a workflow they are accustomed to.
- Deploy predictive models into production applications: Data teams have struggled to deploy their predictive analytics models into production.
- Share insights with decision makers: Helping the business understand the value of data teams within the organization can be challenging. How can they translate complex mathematical models into a return on investment mindset? How can the insights from those models be put into action at a company? More storytelling and data narratives can help.
InfoQ: What does an ideal Data Science team workflow look like?
Doig: I believe a Data Science team should ideally:
- Share insights with managers and decision makers, both to make sure the team is adding value to the business and to make sure those insights are transformed into actions, and that the effects of those actions are tracked and understood.
- Effectively use Big Data technologies. Companies should make sure they are not only buying the infrastructure for distributed environments, but also that their data scientists have the right tools to make it easy for them to switch from a local environment to a distributed one, without much overhead.
- Collaborate and share their analysis within the team, making sure processes exist for data science collaboration and internal publishing. Close collaboration with the engineering teams is necessary.
InfoQ: What tools are available to data scientists and engineers for data analytics?
Doig: There are many different programming languages used in data science and engineering: Python, R, Scala, Java, Julia, and Stan. They all have an ecosystem of packages that fulfill different data science needs.
InfoQ: Can you discuss the Python ecosystem and tools used in data science projects?
Doig: The Python data science ecosystem has grown tremendously in the past 2-4 years. It has a great community and mature open source libraries for scientific and array computing that are the base for algorithm development. Many users are moving to Python from commercial tools like SAS or Matlab, or from R, because it is a general-purpose language with extensive libraries.
InfoQ: What are the upcoming trends in the data science space, especially when it is applied as a team discipline?
Doig: We see a great interest in deep learning, better visualization libraries for plotting large datasets, and making big data technologies more accessible to data scientists.
About the Interviewee
Christine Doig is a data scientist at Continuum Analytics. Christine loves Python and sharing her open source findings with others. She has taught tutorials and presented many talks on data science and Python libraries like conda, Blaze, Bokeh, and scikit-learn at EuroPython, PyTexas, PyGotham, PyCon Spain, PyData (Dallas, Berlin), SciPy, and local meetup groups. In her free time, Christine loves to travel and tweet.