Building a Data Science Capability with Stephanie Yee, Matei Zaharia, Sid Anand and Soups Ranjan

In this podcast, recorded live at QCon.ai, Principal Technical Advisor & QCon chair Wes Reisz and InfoQ editor-in-chief Charles Humble chair a panel discussion with Stephanie Yee, data scientist at Stitch Fix; Matei Zaharia, professor of computer science at Stanford and chief scientist at Databricks; Sid Anand, chief data engineer at PayPal; and Soups Ranjan, director of data science at Coinbase.

Key Takeaways

  • Before you start putting a data science team together, make sure you have a business goal or question that you want to answer. If you have a specific question, such as increasing lift on a metric or understanding customer usage patterns, you know where you can get the data from, and you can then figure out how to organise that data.
  • Make sure you have the right culture for the team: find people who are excited about solving the business problems and genuinely interested in them. Also look at the environment you are going to provide.
  • Your first hire shouldn’t be a data scientist (or quant). You need support to productionise the models, and if you don’t have a colleague who can help productionise them, don’t hire the quant first.
  • Given the scarcity of talent, it is worth remembering that data scientists come from a variety of backgrounds: some have computer science degrees, while others may be astrophysicists or neuroscientists who approach problems in different ways.
  • There are two common ways to structure a data science team: one is a vertical team that does everything; the other, more common in large companies, is to have a separate data science team and an infrastructure team.

What is the most important skillset in AI/ML or Data Engineering?

  • 02:05 [SY] I’m director of client algorithms at Stitch Fix, and I oversee client modelling for areas like marketing and forecasting.
  • 02:15 [SY] One of the things we’re working on is how you take a scientific approach to performance marketing.
  • 02:30 [SY] Problem framing is one of the most important skills for a data scientist; you want to make sure you’re answering the right question.
  • 02:45 [SY] Sometimes when working with clients, you can have a really good question but the actual problem solution might lie in asking a different question.
  • 02:50 [MZ] Both at Stanford and at Databricks, I’m working on platforms and infrastructure.
  • 03:05 [MZ] I’m part of a group called DAWN, which comprises four faculty members.
  • 03:15 [MZ] Our thesis is that there are lots of platforms for machine learning training, but just learning a model is a small part of the overall work.
  • 03:25 [MZ] So we look at the other things, like data cleaning, how can we reduce the cost of labelling, how to use machine learning in more domains.
  • 03:40 [MZ] An important skill for a data scientist is to be able to talk to different stakeholders, and understand both the business and the technical users.
  • 04:10 [SA] In my current role, I oversee multiple data engineering projects, which requires a highly available data infrastructure.
  • 04:40 [SA] Risk analysis needs a robust analytics practice, and the two need to be tied together.
  • 04:45 [SA] So, how do you take data that has been written at a high rate to a transactional store, and then ship it off-line for transformation, enrichment and analysis?
  • 05:05 [SA] All data that lands in our databases (7 NoSQL and 5 RDBMS databases) - how do we capture it and ship it to a Kafka stream or an off-line store where any number of analysts can work with it? (See the Kafka sketch after this list.)
  • 05:35 [SA] I look for data scientists with breadth and depth - it may take someone 15 years to have a full portfolio of talent.
  • 05:55 [SA] So you need to know how a database works, how streaming systems work, how graph search works - a mix of that talent.
  • 06:10 [SR] We don’t officially label people as data scientists (even though my title says that) because ‘data science’ can mean different things to different people.
  • 06:30 [SR] I ask people to describe themselves as a histogram - on the x-axis they have SQL, stats, ETL, streaming, and machine learning models. (See the histogram sketch after this list.)
  • 06:50 [SR] Very naturally, you find out what sort of persona they are.
  • 07:00 [SR] I then ask where they want to be a couple of years from now - so where they should focus the next couple of years.
  • 07:10 [SR] So I see a machine learning engineer as a software engineer who has picked up machine learning.
  • 07:20 [SR] From the histogram, if someone excels at SQL then I see them as a data analyst.
  • 07:25 [SR] If someone knows both SQL and stats, then I see them as a quant.
  • 07:30 [SR] Someone who does ML is a machine learning engineer.
  • 07:35 [SR] Someone who does ETL and enjoys bringing data into warehouses is a data engineer.
  • 07:40 [SR] Someone who enjoys building pipelines and event streaming is an infrastructure engineer.
  • 07:55 [SR] We have people from each of those personas; each of them is tasked with different things.
  • 08:05 [SR] For instance, data analysts and quants are tasked with different business units and products.
  • 08:15 [SR] Data engineers and machine learning engineers are tasked with laying the foundation for our next generation architecture.
  • 08:25 [SR] Machine learning engineers are also matrixed into different products, like risk or user campaigns.
  • 08:40 [SA] Every company I’ve been at has fuzzy lines.
  • 08:40 [SA] At the startup I was at, data scientists did mostly engineering work.
  • 08:55 [SA] At LinkedIn, the data scientists were mostly statistics people who didn’t want to touch code.
  • 09:00 [SA] Other places, there were very clear separations between data engineers and others.
  • 09:20 [SA] I think each division runs it slightly differently; in my department, there are end-to-end data engineers.
  • 09:40 [SA] In a small company, it’s hard to hire specialists and so you have to pivot.
  • 09:55 [MZ] At Databricks, we see hundreds of different data science teams from different companies organised in different ways.
  • 10:00 [MZ] There isn’t a single structure that’s perfect for everything.
  • 10:05 [MZ] I’ve seen two broad types of organisations; one is a vertical team that does everything.
  • 10:25 [MZ] This requires either people who learn multiple skills, or a cohesive set of people with complementary skills.
  • 10:40 [MZ] The other approach I’ve seen - especially in large companies - is to have a separate data science team and an infrastructure team.
  • 10:55 [MZ] You can have specialists in these teams, but then they become a bottleneck.
  • 11:00 [MZ] There are also downsides, since they no longer have business-specific knowledge.
  • 11:15 [SY] Our organisation is called the analytics and algorithms organisation.
  • 11:20 [SY] You can split it into three different groups: the data science team, the data platforms team, and the analytics engineering team.
  • 11:40 [SY] Within the data science team, we organise functionally, trying to follow the structure of our customers.
  • 11:45 [SY] We have a styling algorithms team, a merchandising algorithms team, and a client algorithms team.
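
Sid’s description of capturing database changes and shipping them to a Kafka stream suggests a pattern like the sketch below. This is a minimal illustration only: it assumes the kafka-python client and a hypothetical poll_change_events() change-data-capture source; the actual PayPal pipeline is not described in the podcast.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def poll_change_events():
    # Placeholder for a change-data-capture source, e.g. tailing a
    # database commit log; yields one dict per row-level change.
    yield {"table": "payments", "op": "INSERT", "row": {"id": 1, "amount": 9.99}}

for event in poll_change_events():
    # Key by table name so downstream consumers can partition per dataset.
    producer.send("db-changes", key=event["table"].encode("utf-8"), value=event)

producer.flush()
```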
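
Soups’ histogram rubric can be pictured as a simple mapping from self-rated skills to personas. The skills, ratings and thresholds below are invented for illustration and are not Coinbase’s actual rubric; the blog post linked in the Resources section has his own description.

```python
SKILLS = ["sql", "stats", "etl", "streaming", "ml"]

def suggest_persona(ratings: dict) -> str:
    """Map a self-rated skill histogram (0-5 per skill) to a persona."""
    top = max(SKILLS, key=lambda s: ratings.get(s, 0))
    # Someone strong in both SQL and stats reads as a quant.
    if top == "sql" and ratings.get("stats", 0) >= 4:
        return "quant"
    return {
        "sql": "data analyst",
        "stats": "quant",
        "etl": "data engineer",
        "streaming": "infrastructure engineer",
        "ml": "machine learning engineer",
    }[top]

print(suggest_persona({"sql": 5, "stats": 2, "etl": 1, "streaming": 1, "ml": 0}))
# -> data analyst
```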

What makes a good data science professional?

  • 12:10 [SY] I was at a Java shop in a prior job, and we were looking for unicorns that knew Java and had a PhD in statistics.
  • 12:20 [SY] I’m sure that these people exist, but it’s hard to find them.
  • 12:25 [SY] The way I look at it is that everyone has their own super-power, and every context is going to need its own super-power.
  • 12:30 [SY] Some people have computer science backgrounds; some are astrophysicists or neuroscientists who approach problems in different ways.
  • 12:50 [MZ] It will take a while before you have people who have a lot of the skills - they learn them over years.
  • 13:00 [MZ] A lot of work is figuring out what backgrounds people have and what infrastructure you need to create an efficient team.
  • 13:15 [SA] I remember at LinkedIn we had a relevance team and a search infra team.
  • 13:25 [SA] The separation of the two orgs was a file - Conway’s law: the purpose of one team was to create a file, and the purpose of the other team was to consume that file.
  • 13:40 [MZ] I have found the histogram works really well; quants usually have advanced degrees in a STEM field, and have picked up stats and some software engineering.
  • 13:55 [MZ] We also see software engineers who have picked up machine learning, and they know not only how to create a model but also how to deploy it to production.
  • 14:10 [MZ] Data engineering and machine learning folks are back-end developers who know what it means to move data and build micro-services.

Can models be standardised and accessed with a lightweight API?

  • 14:45 [MZ] The short answer is no, there isn’t a standard tool.
  • 14:50 [MZ] There are many different attempts: one extreme is that a model is a docker container with a REST API - good for a few predictions at a time, but not good for a Spark query.
  • 15:15 [MZ] There are formats like ONNX, which serves as a model interchange format for deep networks.
  • 15:25 [MZ] It would be great if there was a way of trying to standardise on this, but people have been trying to do this and it hasn’t happened yet.
  • 15:35 [SA] Sometimes the interface is a NoSQL key-value store, and the model is looking up a particular key.
  • 15:40 [SA] One way of packaging is to have the scoring and the model in the same container or AMI, and that is then co-versioned and shipped.
  • 15:50 [SA] If there’s a problem, and you don’t know if it’s a problem in the scorer or the data, then you can just roll them both back.
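
To make the packaging options concrete, here is a minimal sketch of the "model in a container behind a REST API" extreme Matei mentions, with the scorer and model co-versioned so they roll back together, as Sid suggests. Flask, the pickled artifact and the version string are assumptions for illustration, not any panelist’s production setup.

```python
import pickle

from flask import Flask, jsonify, request

MODEL_VERSION = "1.4.2"             # scorer and model are co-versioned
with open("model.pkl", "rb") as f:  # artifact baked into the same image
    model = pickle.load(f)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Good for a few predictions at a time; not a fit for a Spark query.
    features = request.get_json()["features"]
    score = model.predict([features])[0]
    return jsonify({"score": float(score), "model_version": MODEL_VERSION})

if __name__ == "__main__":
    app.run(port=8080)  # one container = one rollback unit
```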

How do your companies cope with a lack of talent in the machine learning market?

  • 16:25 [SR] There are lots of bootcamps out there which we can use to find talent, and one thing that’s worked for us at Coinbase is to provide access to all the data.
  • 16:45 [SR] We’ve seen lots of good results - when people are internally motivated, they can get involved.
  • 16:55 [SR] We’ve had people who’ve transferred from risk analyst to software engineer, data analyst or data scientist roles.
  • 17:30 [SR] If you bring someone in, they need to evaluate the domain.
  • 17:40 [SY] One thing Stitch Fix did well in the beginning was to invest in creating an environment which is really friendly for data scientists.
  • 18:00 [SY] It meant that we could encourage people and say it was fun to play around.
  • 18:05 [SY] There is an up-front investment component of tech branding; on some level, we had to do this.

Stitch Fix has about 90 data scientists and fewer engineers. Can you talk a bit about the philosophy behind this?

  • 18:30 [SY] When Katrina started the company, she believed we could use machine learning and data science to change the way companies work.
  • 18:45 [SY] Our chief algorithms officer was one of the first ten employees at the company.

Where do you start to build a data science capability?

  • 19:20 [SY] I was talking to the CTO of a company who was thinking of this a while back.
  • 19:30 [SY] The first thing is you need to ask yourself: what do you want this capability to do? What’s the goal?
  • 19:40 [SY] It’s very easy to say that you’re going to hire data scientists and do things, but that’s not often successful if there isn’t a mandate.
  • 19:50 [MZ] I’ve seen different organisations at different stages; what I’ve found is that the first thing that has to be done right is the pipeline that collects the data itself.
  • 20:05 [MZ] If you don’t do that, you can’t really learn anything from the data afterwards.
  • 20:15 [MZ] If you do that, very simple questions like observing change might give you a lot of insights.
  • 20:25 [MZ] So we talk about what data to collect, what they need, and how to reliably collect and compare it.
  • 20:30 [MZ] You don’t need any machine learning to get value from this.
  • 21:10 [MZ] So you have a business goal or question that you want to answer - and it’s important to pick one.
  • 21:20 [MZ] If you have a data science team and you just check in six months later to see if they’ve found anything, it’s not going to work.
  • 21:30 [MZ] It works much better if you have a specific question - like increasing lift on this metric by this amount, or understanding customers’ usage so that the customer success team can talk to them.
  • 21:50 [MZ] For those problems, you know where you can get the data from, and you can then figure out how to organise that data.
  • 22:00 [MZ] You can get much of the value from very simple analysis once you have the data.
  • 22:05 [MZ] When you need to go further to get the last 10%, you can look at more complex solutions - but it’s much better to do something simple to capture value immediately and iterate. (See the sketch after this list.)
  • 22:10 [SA] When I joined LinkedIn all of the relevance models were hand-built.
  • 22:20 [SA] I always find it funny when a data scientist or machine learning practitioner at a conference says they cannot reveal the features they’re using.
  • 22:35 [SA] The features are always specific data that that company collects.
  • 22:50 [SA] The set of 20 features that LinkedIn has is unique to them and to no other company.
  • 22:55 [SA] Netflix has different features for its data.
  • 23:10 [SA] From that data they do feature engineering.
  • 23:20 [SR] Every company could benefit from a data science capability; if you have any manual process (like insurance) you can extract the data and use machine learning to do credit risk assessment.
  • 23:45 [SR] If you are an internet company that is trying to grow its user base, then you can try to optimise that growth or minimise the cost of user acquisition.
  • 23:55 [SR] The key is to have an executive team that buys into the idea that data intelligence can improve the top or bottom line.
  • 24:05 [SR] Building a solid foundation of bringing the data into a data warehouse where people can play with it is important.
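
As an illustration of Matei’s "simple analysis first" point, the sketch below answers a specific business question with a plain groupby before any machine learning is involved. The events.csv file and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical event log with one row per user signup.
events = pd.read_csv("events.csv")  # columns: user_id, signup_week, purchased

# Specific question: how does conversion differ by signup cohort?
conversion_by_cohort = (
    events.groupby("signup_week")["purchased"]
          .mean()
          .sort_index()
)
print(conversion_by_cohort)  # a clear trend here may answer the question with no ML
```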

What is the state of the art for using machine learning to support the infrastructure platforms?

  • 24:25 [SR] One of the key things is anomaly detection - if you have a distributed system, you want to know where things went wrong.
  • 24:45 [SR] In such cases, it’s important to gather all of the data from the logs, CPU usage and so on, and put it centrally to perform anomaly detection.
  • 25:00 [MZ] Peter Bailis is working on MacroBase, which takes two tables of data and a metric, computes the difference, and shows which groups contribute to the change. (See the sketch after this list.)
  • 25:45 [MZ] When companies plug this into their metrics infrastructure they immediately find anomalous groups that are a problem.
  • 25:55 [MZ] With infra, you collect metrics and have data - but no-one looks at them unless there’s a problem.
  • 26:10 [MZ] You can’t pro-actively catch these, but using these summaries you can see what the top few problems are.
  • 26:25 [SA] We had a talk at QCon London on straggler detection [https://www.infoq.com/presentations/cloud-dataflow-processing] - that was pushing the envelope by splitting the job on the fly without requiring any prior knowledge.
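
A rough, hypothetical sketch of the kind of "diff two tables on a metric" analysis Matei attributes to MacroBase: rank groups by how much they contribute to the change in a metric between two snapshots. The column names are placeholders, and MacroBase itself uses considerably more sophisticated statistics.

```python
import pandas as pd

def top_contributors(before: pd.DataFrame, after: pd.DataFrame,
                     group_col: str, metric_col: str, k: int = 5) -> pd.Series:
    """Rank groups by their contribution to the shift in a metric's total."""
    b = before.groupby(group_col)[metric_col].sum()
    a = after.groupby(group_col)[metric_col].sum()
    # Largest absolute change first, keeping the sign of the change.
    delta = a.sub(b, fill_value=0).sort_values(key=abs, ascending=False)
    return delta.head(k)

# Usage: top_contributors(last_week, this_week, "device_type", "error_count")
```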

Given how many data scientists have PhDs - do their managers need to have PhDs or specific skills?

  • 27:05 [SA] You have to be able to deal with a bunch of prima donnas.
  • 27:20 [MZ] You need to have the right culture for the team - and find people who are excited about solving the business problems and genuinely interested in them.
  • 27:30 [MZ] If the manager is a tech lead, they should understand it - but you can have great managers who understand the business and how to help people in their careers.
  • 27:55 [SY] You don’t have to have a PhD to manage data scientists, but having exposure to the tools in the toolchest is useful.
  • 28:15 [SR] You can have a substitute for it, like deep expertise, which is usually applied experience in this industry.
  • 28:25 [SR] One thing I want to say is that after you have done a PhD you realise that you don’t know anything.
  • 28:40 [SR] So a strong manager understands that aspect of knowledge, and is able to distinguish between the prima donnas and the ones that are humble.
  • 29:00 [SR] A manager who is skilled at conflict resolution and "disagree but commit" - those attributes are more core to being a good manager than having a PhD.

Are there different skills that you look for in a manager of a data science team?

  • 29:30 [SR] A front-end manager may not translate into a data science manager - there has to be some domain expertise.
  • 29:50 [SR] To understand the work at a high level, know what the team is working on, and have a vision that the team buys into - those skills have to be domain-specific.

AI and data science is moving fast - how do you keep up?

  • 30:20 [SR] By coming to events like this - the hallway track is great for talking to others, and the talks are useful too.
  • 30:40 [SY] Building time for it into the day is important - a lot of it is encouraging people to do that.
  • 30:55 [SY] If you’re going to hire people to do a role they’re interested in, they will quite often do that on their own time as well.
  • 31:00 [MZ] Having a way to find out whether a good idea is going to work out is valuable - comparing older models to newer ones will show you the benefits.
  • 31:20 [MZ] If that’s too onerous, then no-one will do it.

How should data science teams be structured?

  • 31:40 [SA] There are two ways of doing it: vertical by domain, where the team works with the product manager or the business, and works together as a scrum.
  • 32:10 [SA] The problem with that approach is that it’s a bit Wild West - everyone is using HDFS as their scratch space, and there’s no organisation.
  • 32:20 [SA] It becomes hard for data generated by one team to be shared with other teams.
  • 32:25 [SA] Then what happens is they say it doesn’t work, and they need a central team.
  • 32:30 [SA] The central team can then establish the provenance of this data, but that team then becomes the bottleneck.
  • 32:40 [SA] I came into LinkedIn at a similar transition - we had an HDFS share called /data.
  • 32:50 [SA] Then there was another, called /dataderived, or /database.
  • 33:05 [SA] No-one knew who owned which data set or lineage - so you needed some kind of ownership of the data.
  • 33:25 [SR] I think the best functioning model I’ve seen has data mining teams, machine learning engineers, back-end engineers and data engineers working together across a functional domain.
  • 33:45 [SR] We have done the same thing at CoinBase - everyone on the risk side of the org works together.
  • 34:00 [SA] When jobs are associated with organisations, and the organisation restructures, then jobs can be left without an owner.
  • 34:20 [SA] I don’t think this is a solved problem.
  • 34:35 [MZ] It’s very important to document the steps of how a data set is built, and you need to have a plan for how to keep producing it.

What do you think that data scientists could learn from software engineers?

  • 35:05 [SA] There’s no well-defined process around data science yet, nor around machine learning.
  • 35:20 [SA] We had a question earlier about how to package models - there are some basic things we do in software engineering, like version control, that apply.
  • 35:40 [SA] There’s also the question of how you monitor a machine learning system; it’s not just up or down - it’s a more complicated problem. (See the sketch after this list.)
  • 36:00 [SA] It used to be difficult to build a web application at the start of the web - but today we have frameworks and standard packaging steps.
  • 36:15 [SA] So it’s ten or a hundred times easier than it was, and so companies are building applications which are ten or a hundred times more complicated.
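
In the spirit of Sid’s point, here is a minimal, assumed sketch of two such hygiene practices: fingerprinting the deployed model artifact so it can be traced back to version control, and monitoring the live score distribution rather than just process health. The thresholds and file layout are invented for illustration.

```python
import hashlib
import json
import statistics

def model_fingerprint(path: str) -> str:
    """Content-hash the artifact so the exact model in production is identifiable."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

def check_score_drift(scores, expected_mean: float, tolerance: float = 0.1) -> bool:
    """Alert when the live score distribution drifts from the offline baseline."""
    mean = statistics.mean(scores)
    drifted = abs(mean - expected_mean) > tolerance
    if drifted:
        # Monitoring is more than up/down: emit a structured alert.
        print(json.dumps({"alert": "score_drift", "live_mean": mean}))
    return drifted
```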

What are the top software or patterns that you would recommend?

  • 36:45 [MZ] Apache Spark might be good to look at - I think it can handle most data workloads.
  • 36:55 [SA] Machine learning services in the public clouds are great today.
  • 37:15 [SR] If you want to build a tech stack in-house, then Python- or Java-based languages are useful. (See the sketch after this list.)
  • 37:35 [SR] I’ve found it difficult to productionise R - it’s quick to prototype but not easy to productionise.
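
Since Matei points to Apache Spark and Soups recommends a Python-based stack, a starter analysis might look like the PySpark sketch below. The events.parquet path and column names are placeholders, not anything discussed on the panel.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("starter-analysis").getOrCreate()

# Hypothetical event data with event_date and user_id columns.
events = spark.read.parquet("events.parquet")

daily_active = (
    events.groupBy("event_date")
          .agg(F.countDistinct("user_id").alias("active_users"))
          .orderBy("event_date")
)
daily_active.show()
```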

How do you balance the cognitive diversity of people with commonality of a team?

  • 38:10 [SA] One idea is that if you have a sprint-organised flow you can have a demo at the end of the sprint.
  • 38:15 [SA] You can send a message that all of the disciplines on the team are important.

What mistakes do people make that you can help people avoid?

  • 38:50 [SR] If you’re starting a data science team from scratch, your first hire shouldn’t be a data scientist (or quant).
  • 39:00 [SR] You need support to productionise the models - and if you don’t have a colleague to help productionise it then don’t hire the quant first.
  • 39:15 [SR] You should hire a data analyst if you just want insights, or a data engineer to build a data warehouse.
  • 39:25 [SR] If you’re just starting, then you probably don’t need machine learning - you just need a data analyst.
  • 39:30 [SR] You can initially get away with a rules-based system, and only later hire a machine learning engineer. (See the sketch after this list.)
  • 39:35 [MZ] One of the main pitfalls I’ve seen is not defining a metric for success that you can keep monitoring.
  • 39:50 [MZ] The other is performing some data collection and manual analysis, and then later being unable to reproduce the result.
  • 40:00 [MZ] You need to think about how you’re going to run this again, and how you are going to measure the success metric.
  • 40:10 [MZ] I’ve seen data scientists who start by defining those metrics.
  • 40:15 [SY] If you’re starting out and convincing colleagues that data science is a good idea, you want to find a nice problem that it will very clearly solve.
  • 40:40 [SY] There are some hairy problems that don’t need to be in production.
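
A toy version of the rules-based starting point Soups describes, before any machine learning engineer is hired. The rules and thresholds are invented for illustration; real fraud or risk rules would be domain-tuned.

```python
def flag_transaction(tx: dict) -> list:
    """Return the list of hand-written risk rules a transaction trips."""
    reasons = []
    if tx["amount"] > 10_000:
        reasons.append("large_amount")
    if tx["account_age_days"] < 7:
        reasons.append("new_account")
    if tx["country"] != tx["card_country"]:
        reasons.append("country_mismatch")
    return reasons

print(flag_transaction({"amount": 15_000, "account_age_days": 3,
                        "country": "US", "card_country": "GB"}))
# -> ['large_amount', 'new_account', 'country_mismatch']
```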

How can an executive protect an org from human error of a data science team?

  • 41:30 [SY] There are things you can put in place - if you’re trying to solve certain problems that are important, you can put a constraint on the data science team.
  • 41:50 [SY] For example, using deep nets might not give you what you want, as it will be difficult to explain the results afterwards.
  • 42:00 [MZ] You often need to measure it against a very simple approach, and the metric that you’re using to measure it needs to be well defined.
  • 42:15 [MZ] For example, comparing with a linear model can give you a baseline. (See the sketch after this list.)
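
A minimal sketch of that baseline comparison: fix the metric up front (AUC here, as an assumption) and require any more complex model to beat a simple linear one. The synthetic data stands in for a real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The well-defined metric (AUC) and the simple linear baseline.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"baseline AUC: {baseline_auc:.3f}")  # any complex model must beat this
```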

Resources

Blog post: "How we think about data at Coinbase" - Soups Ranjan provides more details on his histogram hiring rubric.
Tools: Apache Kafka, Apache Spark, MacroBase
More InfoQ content on: Stitch Fix, LinkedIn


More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
