Getting the Data Needed for Data Science

Data science is about the data that you need; deciding which data to collect, create, or keep is fundamental, argues Lukas Vermeer, an experienced data science professional and Product Owner for Experimentation at Booking.com. True innovation starts with asking big questions; only then does it become apparent which data is needed to find the answers you seek. Vermeer spoke about Data Science versus Data Alchemy at the GOTO Amsterdam 2016 conference.

Christine Doig, Senior Data Scientist at Continuum Analytics, defined data science in Data Science as a Team Discipline as:

I like to think of [data science] as the glue that is bringing together different fields and lines of thought, to commonly solve problems around data and transform information to knowledge and actionable insights.

Ed Jones explains in the InfoQ article The Role of a Data Scientist in 2016 why big data and data science matter:

The age of big data is upon us, and it’s here to stay. With more data being collected than ever before, extracting value from this data is only going to become more intricate and demanding as time goes on. The logic behind the big data economy is shaping our personal lives in ways that we probably can’t even conceive or predict; every electronic move that we make produces a statistic and insight into our life.

"We want to check if people like the changes that we make to our website," said Vermeer. Booking.com uses experimentation and other forms of data collection to continuously improve its website and create a better customer experience.

Vermeer stated that "You can have a lot of data, but that’s not useful if you don’t know what to do with it." More information doesn’t always lead to better decisions. Data science is about the data that you need, which is often different from the data that you have. Science is limited by data, and data is limited by engineering, said Vermeer. You have to think about how to create the data that you need to make progress.

In his talk Vermeer used examples from the history of astronomy to show how data can be limited by engineering. Ptolemy could not observe the Coriolis effect or stellar parallax, because both effects are very weak and he did not have sufficiently accurate measuring equipment. This absence of evidence, among other things, led him to conclude that the earth was not moving. For Ptolemy, data on both effects was clearly limited by the engineering of the time. This is easier to see in hindsight, but no less true today.

Vermeer argued that models don’t necessarily have to be true, but they can be useful if they help you to predict the future. Multiple models can often explain the data that you already have equally well, so that data alone cannot tell you which model is correct. Determining which one is closer to the truth requires that you collect new data.
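The point about competing models can be illustrated with a small sketch. It is not taken from the talk: the two models, the observations, and the helper name squared_error are all hypothetical, chosen so that both models fit the existing data perfectly and only a new observation separates them.

```python
def model_a(x):
    """Hypothetical model A: a straight line."""
    return x


def model_b(x):
    """Hypothetical model B: agrees with A at x = 0, 1, 2 but diverges elsewhere."""
    return x + x * (x - 1) * (x - 2)


def squared_error(model, data):
    """Sum of squared differences between predictions and observations."""
    return sum((model(x) - y) ** 2 for x, y in data)


# Made-up observations that are already available.
existing_data = [(0, 0), (1, 1), (2, 2)]

# A made-up new observation, collected specifically to tell the models apart.
new_data = [(3, 3.1)]

for name, model in [("A", model_a), ("B", model_b)]:
    print(name, squared_error(model, existing_data), squared_error(model, new_data))
```

On the existing data both models have zero error, so there is no basis for preferring one. Only the new observation at x = 3, where the models make different predictions, shows that model A is closer to the (invented) measurements.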

Vermeer mentioned Kaggle.com, a community of data scientists where you can learn by solving complex data science problems and meet with other data scientists.

You can do sentiment analysis by parsing reviews from customers and looking for keywords: words that indicate whether people like or dislike a hotel. But instead you can provide two boxes in the review form, one where people can state what they like and one for what they don’t like. This approach solves the sentiment analysis challenge at data collection time, said Vermeer.
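The contrast between the two approaches can be sketched as follows. This is an illustration only, not Booking.com's implementation: the keyword lists, function names, and sample review are assumptions made up for the example.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
POSITIVE_WORDS = {"clean", "friendly", "great", "comfortable"}
NEGATIVE_WORDS = {"dirty", "noisy", "rude", "broken"}


def keyword_sentiment(review_text):
    """Guess sentiment from free text by counting keyword hits."""
    words = re.findall(r"[a-z']+", review_text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"


def structured_sentiment(liked, disliked):
    """With two separate boxes, the reviewer supplies the label directly."""
    return {"positive": liked.strip(), "negative": disliked.strip()}


# Naive keyword matching misreads a negated complaint as praise.
print(keyword_sentiment("The room was not clean"))

# The same remark entered in the "what I didn't like" box needs no guessing.
print(structured_sentiment("", "The room was not clean"))
```

The keyword approach has to infer sentiment after the fact and is easily fooled by negation, whereas the two-box form records the reviewer's intent at the moment the data is created.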

Vermeer suggested thinking about the data that you can create. Where this data overlaps with data that you already have, you can choose between keeping that data and recreating it when it is needed. Cost and risk, for instance the risk of leaking personally identifiable information (PII), are the main factors in that choice. The cost of keeping data can be significant. There may be other considerations, depending on the data at hand.

There will also be data that you need but cannot get. As a solution, you can use proxy data: data that is available and sufficiently related to the data you need that it can serve as a stand-in.

Vermeer gave an example from a Booking.com mailing campaign that used personalization to promote destinations to travelers. Some customers found the way the email was phrased creepy, because it gave them the idea that a human being had personally analyzed their past purchases to come up with suggestions. The suggestions were actually based on a machine learning model, not human judgement. For the next campaign the text was rephrased, which doubled the impact without any changes to the predictive model.

For data science to be science and not alchemy, deciding which data to collect, and how, is a fundamental step, said Vermeer.

"Can you afford to be wrong?" "Can you afford not to know?" These are the questions that Vermeer asked the audience at the end of his talk. He quoted Voltaire: "Judge a man by his questions rather than by his answers." If people ask questions and make me think about things I had not considered before myself, that’s good, said Vermeer.


This content is in the Culture & Methods topic
