Data science is fast becoming a critical skill for developers and managers across industries, and it looks like a lot of fun as well. But it’s pretty complicated - there are a lot of engineering and analytical options to navigate, and it’s hard to know if you’re doing it right or where the bear traps lie. In this series we explore ways of making sense of data science - understanding where it’s needed and where it’s not, and how to make it an asset for you, from people who’ve been there and done it.
This InfoQ article is part of the series "Getting A Handle on Data Science" . You can subscribe to receive notifications via RSS.
As a data scientist working primarily in the Clojure programming language, I'm often asked why I don't use one of the more popular alternatives. In this article I'd like to set out my reasons for adopting Clojure.
Although Clojure lacks the extensive toolbox and analytic community of the most popular data science languages, R and Python, it provides a powerful environment for developing statistical thinking and for practicing effective data science.
If you're already working as a data scientist, I hope to show what learning Clojure will bring to your toolchain. But if you're a programmer looking to learn the principles of data science for the first time, I hope that you'll consider Clojure as a language with which to explore this exciting analytical world.
Highway to the Danger Zone
Like so many others, my own journey to data science came via software engineering. In 2015, the State of Data Science Survey found that the most popular major for data scientists on LinkedIn was computer science, and that most had added their skill within the previous four years. So, when Packt Publishing invited me to write a book introducing data science for Clojure programmers, I gladly agreed. In addition to appreciating the need for such a book, I saw it as an opportunity to consolidate much of what I'd learned during the four years prior as chief technologist of my own data analytics company.
If you're thinking of developing your skills in data science, you've probably already considered Python or R. Python is an especially popular choice for those coming from a programming background since it's a good general-purpose scripting language which also provides access to excellent statistical and machine learning libraries. When I first started out in data science I used Python and scikit-learn to tackle a clustering project. I had some data gathered from social media on users' interests and I was trying to determine if there were cohorts of users within the whole. I chose spectral clustering because it could identify non-globular clusters (so must be better, I reasoned), and the first results were promising. My confidence quickly evaporated when I re-ran the clustering and got different results. Behind the scenes, the library made use of k-means, a stochastic algorithm, and I was finding a new local minimum. Frustrated by the apparent lack of objective truth, but needing to show some results, I simply selected the result that looked most like what I expected and moved on.
I was wielding machine learning without a licence and using a black box I didn't really understand to justify the outcome I hoped for. I didn't know then about cluster evaluation metrics and methods for quantifying and controlling variance; let alone cross-validation, high bias or the curse of dimensionality. I was sitting squarely in the segment Drew Conway famously termed the "danger zone".
Back in 1711 Alexander Pope observed in his first major poem An Essay on Criticism:
A little learning is a dangerous thing
His advice was directed at student literary critics, but it could readily apply to apprentice data scientists as well.
How do we get people to understand data science?
Advice abounds on the prerequisites for being a successful data scientist: lists which often contain intimidating-sounding mathematical techniques such as multivariable calculus, linear algebra or probability theory. Whilst such suggestions are well intentioned, they're not useful advice for most aspiring data scientists. Advanced practitioners making retrospective suggestions forget how overwhelming starting out can be.
Bret Victor's excellent article Learnable Programming emphasises the importance of following flow when learning programming. Since it's also a highly technical skill, the guiding principles he presents can be easily paraphrased for data science:
- Data science is a way of thinking, not a rote skill. Learning about clustering algorithms is not learning to do data science any more than learning about pencils is learning to draw.
- People understand what they can see. If a data scientist cannot see what an algorithm is doing, she can't understand it.
I think the best way to get started with data science, as with programming, is by exploring a problem domain interactively with real-time feedback. This could involve writing little programs which make use of libraries that provide algorithms, or by implementing the algorithms yourself from tutorials and books.
The REPL is a tool for experimenting with code and which allows you to interact with a running program and quickly try out ideas. It does this by presenting you with a prompt where you can enter code. It then reads your input, evaluates it, prints the result, and loops, presenting you with a prompt again. A REPL environment falls short of Bret Victor's ideal programming environment (by requiring the programmer to write a full expression at a time), but a well set up REPL environment comes a lot closer than a workflow which treats writing a script, perhaps compiling it, and then running it, as discrete operations.
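For example, a hypothetical Clojure REPL session might look like this (the values are purely illustrative, and the exact prompt depends on your environment):

user=> (def xs [1.2 0.8 1.5 2.0])
#'user/xs
user=> (/ (reduce + xs) (count xs))
1.375
user=> (apply max xs)
2.0

Each expression is read, evaluated, and its result printed immediately, which makes it easy to poke at data and build up an analysis one small step at a time.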
Python's interactive interpreter and the R prompt also permit exploration of code expression-by-expression. But REPL-driven development is embedded in Clojure culture, and popular Clojure development environments such as Emacs and Cursive give the REPL prominent status. Fantastic tools such as Cider make interactive REPL development, debugging, and refactoring a real pleasure.
Follow the flow
The fundamental way of thinking which a programmer must develop as they learn data science is characterised by Richard Hamming's assertion:
In science, if you know what you are doing, you should not be doing it. In engineering, if you do not know what you are doing, you should not be doing it.
This is only a slight oversimplification. Doing data science will, by definition, involve open-ended exploration. For me, skills to manage and contain this uncertainty were developed when I began to understand descriptive and inferential statistics, as well as how the common distributions relate to each other. Along with data cleansing, they're the majority of what I bring to the data science projects I'm involved in, and I devoted the first two chapters of Clojure for Data Science simply to building a strong foundation in them. Unfortunately, they're often overlooked by programmers because they can't be understood by looking at code alone, and there are few resources introducing them to a programming audience.
The Incanter code examples at data-sorcery.org (particularly the probability and statistics sections) provide useful starting points. Incanter itself provides Clojure's most complete set of statistical computing libraries and contains implementations of common statistical functions, common distributions, and charting capabilities. Following along with the examples in the REPL and seeing the output at each stage is an excellent way to learn.
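As a minimal sketch of what that looks like (assuming Incanter is on your classpath), you might sample from a normal distribution and inspect the result directly from the REPL:

(require '[incanter.core :as i]
         '[incanter.stats :as stats]
         '[incanter.charts :as charts])

;; draw 1,000 samples from a normal distribution with mean 10 and sd 2
(def xs (stats/sample-normal 1000 :mean 10 :sd 2))

;; summary statistics - expect values close to 10 and 2
(stats/mean xs)
(stats/sd xs)

;; a histogram opens in its own window when viewed from the REPL
(i/view (charts/histogram xs :nbins 30))

Re-running the sampling and watching the summary statistics wobble around their true values is itself a small lesson in sampling error.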
To consolidate this new knowledge, I read the fantastic free book Think Stats. Adapting the book's many exercises and solving them in Clojure helped to build my Clojure and statistics knowledge simultaneously. A web search for 'think stats in Clojure' will throw up many examples from others who've done the same thing.
The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'
-- Isaac Asimov
Incidentally, some of the most counter-intuitive and interesting things I learned researching the book didn't involve any code at all. A prime example is Simpson's Paradox, in which the same trend appears in different groups of data but disappears or even reverses when these groups are combined. The Wikipedia article helpfully includes visualisations that neatly illustrate this apparent contradiction and makes clear its causes.
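The kidney-stone treatment data from that Wikipedia article makes a nice, self-contained exercise. A quick sketch using the published success counts shows the reversal directly:

;; treatment A wins within each group of stone sizes,
;; yet treatment B has the higher success rate overall
(def outcomes
  {:a {:small [81 87]   :large [192 263]}
   :b {:small [234 270] :large [55 80]}})

(defn rate [[successes total]] (double (/ successes total)))

(rate (get-in outcomes [:a :small])) ; => ~0.93, vs ~0.87 for B
(rate (get-in outcomes [:a :large])) ; => ~0.73, vs ~0.69 for B

(defn overall [treatment]
  (rate (reduce (fn [[s t] [s' t']] [(+ s s') (+ t t')])
                [0 0]
                (vals (get outcomes treatment)))))

(overall :a) ; => ~0.78
(overall :b) ; => ~0.83 - the trend reverses when the groups are combined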
Visualisation
Visualisations provide useful insights into data that would be lost if you looked only at summary statistics and Incanter provides support for the most common charts and graphs. Unfortunately though, it's hard to overlook the fact that Incanter's charts are not particularly attractive. Additionally, when producing visuals from within a REPL, Incanter's charts will open up a separate window in the background.
The notebook format (which freely mixes interactive code and charts in a browser window) has become a popular way to do data science. Jupyter is the most popular framework for producing notebooks and, with the clojupyter kernel, Jupyter notebooks can run Clojure code too. For tighter integration between Clojure and charting, Gorilla REPL provides an alternative notebook stack written entirely in Clojure. Although it ships with an attractive set of default charts, it also implements its renderer in an extensible way: for example, the gg4clj library extends Gorilla REPL to display visualisations using R's formidable ggplot2 charting library. A modified version of Gorilla REPL is used by Huri, a library providing "tools for the lazy data scientist".
The most flexible charting library is one that allows access to low-level drawing APIs. thi.ng/geom-viz provides this level of abstraction. It contains utility functions for drawing axes, mapping between chart and data scales, and functions for quickly rendering data as common chart types, but it otherwise offers a blank canvas and a set of primitives for rendering charts in the SVG graphics format. Since SVG can be rendered natively in web browsers, thi.ng/geom-viz enables the creation of sophisticated custom web-based visualisations. Producing visualisations using the same language as the core analysis can mitigate the sorts of issues that arise when the visualisation is done in a drawing program by someone with no understanding of the data.
Interactivity
Like many Clojure libraries, thi.ng/geom-viz doesn't depend on any Java host features and can be compiled to both JVM bytecode and JavaScript. Even low-level Clojure matrix manipulation libraries such as core.matrix can readily be compiled into JavaScript. This means that it's not only the visual output which can be rendered in the browser: almost any part of the Clojure code (with some exceptions relating to file access and Java-specific features such as threads) can be executed in the browser too. Thanks to JavaScript interoperability, it's possible to make use of native JavaScript libraries such as jStat and D3.js for your visualisations too.
One of the most helpful ways in which I built my understanding of inferential statistics was through interactive visualisation. In chapter 2, for example, I showed how to build a browser application which took repeated samples from several normal distributions and calculated the sample mean and the 95% confidence interval of the population mean for each of them. The relationship between sample size and the width of the confidence interval was easy to see. And, since a 95% confidence interval represents the bounds within which the population mean is expected to lie, repeated sampling demonstrated empirically that the population mean fell outside those bounds around 5% of the time. Building simple tools like this can foster a deep understanding that might otherwise take months or years to acquire.
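The full browser application is beyond a short example, but the underlying simulation can be sketched in a few lines of plain Clojure (a sketch, not the book's code; the 1.96 multiplier assumes a normal sampling distribution of the mean):

(defn sample-normal [n mu sigma]
  (let [rng (java.util.Random.)]
    (repeatedly n #(+ mu (* sigma (.nextGaussian rng))))))

(defn confidence-interval [xs]
  (let [n    (count xs)
        mean (/ (reduce + xs) n)
        var  (/ (reduce + (map #(Math/pow (- % mean) 2) xs)) (dec n))
        err  (* 1.96 (Math/sqrt (/ var n)))]
    [(- mean err) (+ mean err)]))

;; how often does the true mean (0 here) fall outside the 95% interval?
(let [trials  1000
      outside (->> (repeatedly trials #(confidence-interval (sample-normal 100 0 1)))
                   (remove (fn [[lo hi]] (<= lo 0 hi)))
                   count)]
  (double (/ outside trials)))
;; => approximately 0.05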
Access to Facebook's React library for user interfaces is provided by both the Reagent and Om libraries. React Native opens the possibility to use the same Clojure code to create dedicated mobile applications.
Data orientation
The reasons I recommend Clojure for data science are the same reasons I think Clojure is an excellent language for any programmer to learn.
I'm confused by all these lines people draw between programming and data science. It's about encoding and representing information.
-- Carin Meier
Michael Nygard described Clojure as a data-oriented language in his article The New Normal: Data Leverage. In his words:
For example, in Clojure, you deal with data as data. You don’t try to force the data into an object-oriented semantic. Clojure has functions that work on any kind of sequential data, any kind of associative data and any kind of set data. Instead of writing a class that has ten or fifteen methods, you get to use hundreds of functions. (Over 640 in clojure.core alone.)
So, it doesn’t matter if you’re using a map to represent a customer or an order or an order line item, you use the same functions to operate on all of them. The functions are more general, so they are more valuable because you can apply them more broadly.
The flexibility to model items in my domain simply as collections of associative or sequential data structures (or hierarchies thereof) provides ready access to data without prematurely flattening it into tables. It remains simple to extract numeric or categorical data from such structures to pass into the statistical functions of Incanter when required. The Specter library aims to augment Clojure's primitive data-manipulating functions specifically for extracting values from deeply nested data structures. As Simon Belak, author of Huri, succinctly put it: "data frames considered harmful".
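To make that concrete, here is a small, hypothetical example: orders modelled as plain nested maps and vectors, queried with nothing but core sequence functions:

(def orders
  [{:customer "Alice" :items [{:sku "A1" :price 10.0}
                              {:sku "B2" :price 4.5}]}
   {:customer "Bob"   :items [{:sku "A1" :price 10.0}]}])

;; total revenue, using only general-purpose functions from clojure.core
(->> orders
     (mapcat :items)
     (map :price)
     (reduce +))
;; => 24.5

;; revenue per customer - the same functions, no classes or data frames needed
(into {} (map (juxt :customer #(reduce + (map :price (:items %)))) orders))
;; => {"Alice" 14.5, "Bob" 10.0}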
Recently, Clojure 1.9 has been released with clojure.spec, a feature of the language which augments this flexible, data-oriented approach to modelling with validation, error reporting, and generative testing. Spec provides a means to encode the expected schema of data so that errors and faulty assumptions are caught early. Also, and perhaps more importantly for a data scientist, specs can serve as documentation:
The best and most useful specs (and interfaces) are related to purely information aspects. Only information specs work over wires and across systems. We will always prioritise, and where there is a conflict, prefer, the information approach.
Spec aims explicitly to make the code we write more comprehensible to our colleagues and also to our future selves.
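A minimal, hypothetical spec for order data like the example above might look like this (the keyword names are mine, not from any particular codebase):

(require '[clojure.spec.alpha :as s])

(s/def ::sku string?)
(s/def ::price (s/and number? pos?))
(s/def ::item (s/keys :req-un [::sku ::price]))
(s/def ::customer string?)
(s/def ::items (s/coll-of ::item :min-count 1))
(s/def ::order (s/keys :req-un [::customer ::items]))

(s/valid? ::order {:customer "Alice"
                   :items [{:sku "A1" :price 10.0}]})
;; => true

;; explain reports which part of the data violated which spec
(s/explain ::order {:customer "Alice" :items []})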
Ladders of abstraction
Even where Incanter already provided a core statistical function, I found it useful when learning to implement my own version. This served two purposes: first, writing my own version forced me to understand the function I was studying. The second was more surprising: the Clojure implementation became a way for me to recall how the function behaved. It was something I could use to remind myself of what a function did much more effectively than by referring to the original mathematical notation.
Resources such as statistical formulas for programmers are useful only if you understand how to transform mathematical notation into code. Although it can appear obscure and off-putting, there are really only a handful of symbols that occur frequently in formulae. For example, Σ is pronounced sigma and means sum. When you see it in mathematical notation it means that a sequence is being added up, which maps directly to the Clojure code:
(reduce + xs)
where xs is the sequence to be summed. In my experience, Clojure's syntax aids the process of transcribing statistical approaches I encounter elsewhere. Zachary Tellman puts it like this in his book Elements of Clojure:
This is possibly Clojure’s most important property: the syntax expresses the code’s semantic layers. An experienced reader of Clojure can skip over most of the code and have a lossless understanding of its high-level intent.
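As a further illustration, the formula for the sample standard deviation, s = sqrt(Σ(xᵢ − x̄)² / (n − 1)), can be transcribed almost symbol for symbol (a minimal sketch, not Incanter's implementation):

(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn variance [xs]
  (let [x-bar  (mean xs)
        square (fn [x] (* x x))]
    (/ (reduce + (map #(square (- % x-bar)) xs))
       (dec (count xs)))))

(defn standard-deviation [xs]
  (Math/sqrt (variance xs)))

(standard-deviation [2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0])
;; => approximately 2.14 (the population formula, dividing by n, would give 2.0)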
Assembling more sophisticated algorithms out of simple building blocks became a key feature of how I conveyed an intuition for machine learning algorithms in Clojure for Data Science. For example, I could start off writing small pure functions for calculating information, entropy, weighted entropy, and information gain, each in terms of the one before. Information gain quantifies how successful a partitioning of items is at grouping similar items together and keeping different items apart. Applying this partitioning logic recursively yielded a simple tree classifier. Clojure's succinctness allowed me to express the essence of each approach without burying it in boilerplate.
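As a rough sketch of what I mean (not the book's exact code), entropy and information gain can be built up like so:

;; entropy of a sequence of labels, in bits
(defn entropy [xs]
  (let [n (count xs)]
    (->> (vals (frequencies xs))
         (map (fn [c]
                (let [p (/ c n)]
                  (- (* p (/ (Math/log p) (Math/log 2)))))))
         (reduce +))))

;; entropy of several groups, weighted by group size
(defn weighted-entropy [groups]
  (let [total (reduce + (map count groups))]
    (->> groups
         (map (fn [group] (* (/ (count group) total) (entropy group))))
         (reduce +))))

;; how much a particular split reduces entropy
(defn information-gain [parent groups]
  (- (entropy parent) (weighted-entropy groups)))

(information-gain [:a :a :b :b] [[:a :a] [:b :b]])
;; => 1.0 - perfectly separating two evenly mixed classes gains one full bit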
Bret Victor has written about how a deep understanding can be developed by moving up and down the Ladder of Abstraction:
Likewise, the most powerful way to gain insight into a system is by moving between levels of abstraction. Many designers do this instinctively. But it's easy to get stuck on the ground, experiencing concrete systems with no higher-level view. It's also easy to get stuck in the clouds, working entirely with abstract equations or aggregate statistics.
I found Clojure allowed me to easily move between low-level functions and high-level behaviours. Building my own versions of statistics functions gave me a much deeper understanding of how things worked and how they fit together. It also threw up some surprises that I might otherwise have glossed over, such as the fact that there is no universal agreement on selecting quartile values.
The JVM as a platform
No practicing data scientist has the time to routinely write their own algorithms from scratch. Even if they did, doing data science is about more than just algorithms; there's likely to be a component of data engineering too: getting your data out of a source system, scrubbing it, and perhaps capturing or presenting data to users of a live website too.
A fantastic aspect of using Clojure for data science is having access to the wealth of Java libraries available through Clojure's Java interop capabilities. For example, the parallel statistical computation library ParallelColt powers much of Incanter's number crunching. Apache Commons Math also contains an enormous amount of functionality for flexible numeric analysis and optimisation, as well as genetic programming and machine learning with clustering. The documentation is excellent for describing how particular algorithms work. Likewise for neural networks, the Deeplearning4j library provides concrete implementations of many algorithms and its documentation is a veritable trove of useful tutorials on the subject.
Clojure wrappers already exist for some Java machine learning libraries. For example, clj-ml provides Clojure interfaces for running linear regression as well as classification with logistic regression, naive Bayes, decision trees, and many other algorithms. It's built on top of Weka, an open source machine learning project written in Java. It's hard to beat Python's support for natural language processing, but the Clojure library clojure-opennlp wraps Java's Open NLP library.
If you really do have big data, Parkour and Cascalog are Clojure libraries which take very different approaches to wrapping Hadoop's MapReduce framework. Cascalog was the reason that I adopted Clojure for data analysis in the first place: I was attracted by its terse syntax for expressing MapReduce jobs as solutions to logic programs. Parkour is now the Hadoop library I use most often, and it provides idiomatic Clojure wrappers around Hadoop's native capabilities. For example, it permits mappers and reducers to access Hadoop's MapReduce context object at runtime, and to use the distributed cache to avoid costly joins for in-memory lookup data. It also allows custom schemas to be defined and serialised with Apache Avro using a pleasant, Clojure-friendly DSL.
Interop also permits access to distributed machine learning libraries such as Mahout and Spark's MLlib and GraphX which are written in Scala. They're somewhat more convoluted to call from Clojure code since the bytecode emitted by Scala is more complicated than the equivalent Java. The from-scala library makes this easier, but it's still something of a dark art. I hope comprehensive wrappers emerge for them in the way that they have for Spark: Clojure has not one, but two libraries for interacting with Spark, Sparkling and Flambo.
Recently, the Cortex library has been developed in collaboration with the author of core.matrix. It provides neural networks, regression and feature learning implemented in pure Clojure. At the time of writing it's newly-released, but it already promises a coherent set of machine learning abstractions freed from Java's legacy, combined with fast GPU-based number crunching. Onyx is a relatively new Clojure framework for distributed computation with an API based on immutable data structures.
The area of distributed machine learning addressed by Mahout, MLlib and GraphX is where I hope the Clojure ecosystem will develop most over the next few years. Libraries such as Cascalog demonstrated the power that could come from applying Clojure's functional principles and terse syntax as a DSL for a large and complicated data processing framework. I would love to see this successful approach continue into the world of distributed machine learning too, and am following the development of Cortex and Onyx with interest.
Functional programming
Whilst Clojure provides access to tools for doing machine learning at significant scale, I'm usually asked to work on much smaller datasets. A colleague and good friend of mine, Bruce Durling, coined the term awkward-sized data for those volumes of data which are too large to be processed quickly on a desktop computer, but too small to warrant a distributed framework such as Hadoop.
It is about hundreds of thousands of data points and not millions, it is about data sets that fit into the memory on a reasonably specified laptop.
This is where the majority of my work is done. Clojure provides an approach to solving this problem with Reducers. Reducers provide implementations of reducible collections (collections that can reduce themselves in parallel across several processor cores) together with means of manipulating reducing functions. What this means in practice is that you build up a sequence of computations which are then calculated simultaneously across your data. To address the problem of getting data into a reducible collection in the first place, Iota provides support for efficiently creating reducible collections from large files.
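A minimal reducers pipeline looks like this (iota's file-reading API is elided here; any foldable collection, such as a vector, will do):

(require '[clojure.core.reducers :as r])

(def xs (vec (range 10000000)))

;; r/map and r/filter describe the computation without materialising
;; intermediate sequences; r/fold then executes it in parallel via fork/join
(r/fold + (->> xs
               (r/map inc)
               (r/filter even?)))

With Iota, a reducible view over a large file could take the place of xs, keeping the same pipeline while avoiding loading everything eagerly.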
More recently Clojure has introduced Transducers, which separate the implementation of reducing functions from the collections to which they are applied. I gave a talk at London's Clojure eXchange on Expressive Parallel Analytics with transducers. kixi.stats is the result of that talk: a library of statistical reducing functions. Each of the functions in kixi.stats calculates its output in a single pass over the input data. Such algorithms are typically called streaming algorithms and are celebrated for their ability to produce approximate solutions in linear time.
Since reducing functions are pure functions, they can be combined together using functional composition. In this way, multiple statistics (such as the mean, standard deviation, skewness, etc) can all be calculated in a single pass over the data sequence. Libraries such as xforms and redux provide reducing function combinators and will work with any reducing function, including the ones you write yourself.
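As a sketch of how this looks in practice (assuming kixi.stats' reducing functions and redux's juxt combinator; values in the comments are approximate):

(require '[kixi.stats.core :as kixi]
         '[redux.core :as redux])

(def xs [1.0 2.0 3.0 4.0 5.0])

;; a single statistic, computed in one pass with transduce
(transduce identity kixi/mean xs)
;; => 3.0

;; several statistics in the same single pass, by composing reducing functions
(transduce identity
           (redux/juxt kixi/mean kixi/standard-deviation kixi/skewness)
           xs)
;; => [3.0 ~1.58 0.0]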
The implementation of kixi.stats was heavily influenced by Tesser, a Clojure library which also provides implementations of streaming algorithms in such a way as they can either be run locally or on Hadoop.
Code as data
One of Clojure's most useful properties isn't a property of Clojure specifically, but of the Lisp language family of which it is a member. Lisp elicits high praise as the ultimate high-level language or language of the Gods. It's the programmable programming language that was discovered rather than invented. And it inspires tributes from great computer scientists:
If you give someone Fortran, he has Fortran. If you give someone Lisp, he has any language he chooses. -- Guy Steele
Lisp is not a language, it's a building material. -- Alan Kay
These qualities of Lisp (and also of Clojure) can be the hardest to convey to programmers from other languages. Whilst I understand where the authors above are coming from, these statements are as alienating to newcomers as the suggestion that novice data scientists learn multivariate calculus before they run their first A/B test.
Instead, I'll simply provide some examples about why having a programmable programming language is an asset. Michael Nygard again:
With Clojure, it’s very easy to support multiple versions of our API. In fact, it’s so easy, we can even have an API about APIs. Now, our data-orientation approach takes everything up a level. By using data rather than classes or annotations we give ourselves the ability to “go meta” [...]. If we wanted to do that in Java, we would need to build some kind of a new language that emitted Java as its bytecode. We don’t need to do that with Clojure.
This capability is really the natural extension of Clojure's succinct way of representing data and its consistent and unambiguous syntax. Clojure code itself is data, expressed as lists of symbols which can be manipulated by those same data-manipulating functions. Through a language construct called a macro, Clojure allows programmers to extend the language itself. Functions accept values and return new values. Macros accept code and return new code.
In practice, this is used by libraries such as Tesser to provide succinct ways of expressing transforms. Within the library is a macro called deftransform which allows for the elimination of boilerplate code for specifying transforms. Below is the implementation of map:
(deftransform map
  "Takes a function `f` and an optional fold. Returns a version of the fold
  which finally calls (f input) to transform each element.

      (->> (t/map inc) (t/into []) (t/tesser [[1 2] [3 4]]))
      ; => [2 3 4 5]"
  [f]
  (assoc downstream :reducer (fn reducer [acc x]
                               (reducer- acc (f x)))))
The code which remains embodies the essence of a map transform in Tesser: it associates a downstream reducer function which calls (f x). This internal representation enables Tesser to take the same computation and target either a local or Hadoop-based context. Incidentally, the author of Tesser has written a series of articles on learning Clojure, including macros.
Macros aren't just used by advanced Clojurians to express new concepts. They enable you (even as a Clojure beginner) to get more done. For example, the threading macros make code more legible by removing nested expressions. So:
(reduce + (take 10 (filter even? (map square (range)))))
with the thread-last macro ->> becomes:
(->> (range) (map square) (filter even?) (take 10) (reduce +))
Whilst it compiles to code identical to the previous version, the latter expresses the sequence of operations more clearly. Daniel Higginbotham puts it like this in Brave Clojure:
Yes, macros are cooler than a polar bear’s toenails, but you shouldn’t think of macros as some esoteric tool you pull out when you feel like getting extra fancy with your code. In fact, macros allow Clojure to derive a lot of its built-in functionality from a tiny core of functions and special forms.
Macros drive some commentators into paroxysms of delight, but don't let that put you off. An appropriate use of macros will make your code easier, rather than harder, to comprehend.
Learning to learn
Learning a powerful language such as Clojure together with a technical discipline such as data science is an ambitious task. But don't be disheartened! U.S. Government data hacker Becky Sweger has said:
My best technical skill isn't coding. It's a willingness to ask questions, in front of everyone, about what I don't understand.
Besides, just by coming to data science via programming you bring a wealth of practical skills, borne of experience, which are very relevant. For example:
- User stories. You're used to decomposing large problems into smaller steps which deliver incremental, measurable business value. You know it's foolish to embark on a project without a clear statement of the desired outcomes and priorities first, even though the implementation may be unclear.
- Version control. You know you'll inevitably be required to repeat statistical analysis at some point in the future. Luckily you've already committed your code to version control in a branch specifying its intent. You'll also be able to point anyone who wants to scrutinise your approach at this code so they can replicate your methodology.
- Kanban boards. You track in-flight work to provide a context for discussions about progress, and archive completed work to remind yourself how far you've come (and the results of your attempts, whether they were successful or not).
- Devops. You're mindful when there's a need to be able to move algorithms into production. You can communicate with development and operations teams about how your model can be integrated into the live site in ways that they can understand.
Take heart. The diversity of your experience as a developer moving into data science is a tremendous asset.
In summary
Data scientists are expected to have a wide and deep level of technical understanding. The Clojure programming language enjoys an enormously broad array of applications and ClojureScript extends its reach even further.
If your data science project requires you to apply an off-the-shelf algorithm and move on, Clojure is not the language I would recommend. If instead you're looking to build a strong foundation in statistical thinking, Clojure's language, tooling, and community will repay your investment many times over. This approach will enable you to effectively apply fundamental principles in the diverse situations you will encounter as a practicing data scientist.
I quoted a line from Alexander Pope's Essay on Criticism at the beginning of this article. The full couplet is:
A little learning is a dangerous thing / Drink deep, or taste not the Pierian spring
The Pierian spring was the metaphorical source of knowledge of art and science. If you're seeking to learn powerful analytical techniques, do so with a language that will move up and down the ladder of abstraction with you. This conceptual grounding will serve you well even as the technical options—and perhaps even your choice of programming language itself—continue to evolve in the future.
Thanks to Luke Snape for reviewing preliminary drafts of this article.
About the author
Henry Garner is a freelance data scientist working primarily in Clojure. He’s author of the Packt book Clojure for Data Science and he managed to squeeze the buzzwords big data and machine learning onto the cover. And also into this biography.