## The Rise of the (Data Science) Machines

- 02/03/2016
- 85
- 0 Like

**Published In**

- Big Data
- Analytics
- Internet of Things (IoT)

I mentioned in a recent post the emergence of the Data Science Machine: an application that will automate the role of the data scientist, or at least put data science activities within reach of a relatively unskilled analyst. This is an attractive idea. Data scientists are expensive. If you can substitute capital for labour, history says that you will make money.

This raises two questions: how good would this Data Science Machine be? And would it make you any money? I will address the first of these questions in this piece, and leave the second for later.

Trivial machine learning problems, those that simply require applying an algorithm to a dataset and reporting a result, can be automated straightforwardly. However, problems are harder to automate if they require domain knowledge, iteration, or trial-and-error investigation to make them meaningful or tractable.

A particular problem arises when applying analytical techniques to large datasets, or to medium-sized datasets in the presence of complexity or high dimensionality. One solution, distributed computation (aka “Big Data”), is subject to bottlenecks caused by shuffle/sort and reduce/aggregation steps. Care also needs to be taken over memory use and over writing data to and from disk. If you get the optimisation wrong, execution time can increase by orders of magnitude. It is possible to optimise automatically in some straightforward use cases, but it is not easy to see how this can be generalised.

But Big Data techniques scale only linearly (you add more CPUs), and this rapidly becomes infeasible if you have an O(n³) problem (where processing time increases as the cube of the data size). Usually you need some kind of trick to make these run, and again it is hard to see how this could be automated.
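
A quick back-of-the-envelope calculation makes the point (the numbers here are purely illustrative):

```python
# If work grows as n**3, scaling the data by f multiplies the work by f**3,
# so holding runtime constant needs f**3 times the CPUs.
# Linear scale-out cannot keep up with that for long.

def cpus_needed(scale_factor, baseline_cpus=100, exponent=3):
    """CPUs required to hold runtime constant when the data grows by scale_factor."""
    return baseline_cpus * scale_factor ** exponent

for f in (2, 4, 10):
    print(f, cpus_needed(f))
# 2x the data -> 800 CPUs, 4x -> 6,400, 10x -> 100,000 (from a 100-CPU baseline)
```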

My team at Barclays has found that we have had to think up one of these “tricks” in every single project we have been involved in. Here are some examples of use cases from the last year:

- You want to run a large number of different queries over a large dataset but the time it would take to read the disk multiple times makes this impractical. But if you express your groups, filters and aggregations as monoids you can compose them all into a single function and traverse the data only once.

- You want to search for nearest-neighbour string pairs in a large list. You realise that because the pair distance lies in a metric space you can use a vantage-point tree to make the search quick enough to return.

- You are building an event-driven prediction engine but the heterogeneity of event types is giving you dimensionality problems. You realise that event sequences have commonalities with language structures and simplify the problem by borrowing from natural language processing concepts.

- You realise that there is significant repetition in your application’s functionality, but it is buried deep within compound functions. It would be impractical to break down the functions yourself, but you make use of lazy evaluation and execution graphs to identify common sub-functions and compute them only once.

- You want to compute a graph of probability paths built by a Markov transition matrix. The graph rapidly becomes large but you realise that an appropriate factorisation will reduce the computational complexity and memory requirements to make the result tractable.

- You need to join two large datasets. By bit-packing one dataset you can broadcast it across the nodes of your cluster so that joining is straightforward.
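
The monoid trick in the first example can be sketched in miniature. This is not our implementation, just an illustration of the idea: each aggregation is a monoid (an identity element plus an associative combine, with a function lifting each record into the monoid), so any number of them compose into a product monoid and the data is traversed exactly once:

```python
from functools import reduce

# Each aggregation is (identity element, associative combine, per-record lift).
count_m = (0, lambda a, b: a + b, lambda rec: 1)
sum_m   = (0.0, lambda a, b: a + b, lambda rec: rec["amount"])
max_m   = (float("-inf"), max, lambda rec: rec["amount"])

def product(monoids):
    """Compose several monoids into one that computes them all simultaneously."""
    zeros = tuple(z for z, _, _ in monoids)
    def combine(xs, ys):
        return tuple(c(x, y) for (_, c, _), x, y in zip(monoids, xs, ys))
    def lift(rec):
        return tuple(l(rec) for _, _, l in monoids)
    return zeros, combine, lift

def aggregate(records, monoid):
    """Single traversal: lift each record, then fold with the combine."""
    zero, combine, lift = monoid
    return reduce(combine, (lift(r) for r in records), zero)

records = [{"amount": 10.0}, {"amount": 25.0}, {"amount": 5.0}]
print(aggregate(records, product([count_m, sum_m, max_m])))
# (3, 40.0, 25.0)  -- count, sum and max in one pass over the data
```

Because the combine is associative, the same triple can be folded per partition and then merged, which is exactly what makes it work in a distributed setting.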
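
The vantage-point trick can be sketched too. A minimal VP-tree over Levenshtein distance (illustrative, not the production version): because edit distance obeys the triangle inequality, the search can prune whole branches instead of comparing every pair:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build(points):
    """Node = (vantage point, median distance mu, inner subtree, outer subtree)."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return (vp, 0, None, None)
    dists = [edit_distance(vp, p) for p in rest]
    mu = sorted(dists)[len(dists) // 2]
    inner = [p for p, d in zip(rest, dists) if d < mu]
    outer = [p for p, d in zip(rest, dists) if d >= mu]
    return (vp, mu, build(inner), build(outer))

def nearest(node, query, best=(None, float("inf"))):
    if node is None:
        return best
    vp, mu, inner, outer = node
    d = edit_distance(query, vp)
    if d < best[1]:
        best = (vp, d)
    # Triangle inequality: descend into a side only if it could hold a closer point.
    if d < mu:
        best = nearest(inner, query, best)
        if d + best[1] >= mu:
            best = nearest(outer, query, best)
    else:
        best = nearest(outer, query, best)
        if d - best[1] < mu:
            best = nearest(inner, query, best)
    return best

words = ["monoid", "matrix", "market", "monday", "broker"]
print(nearest(build(words), "monoids"))
# ('monoid', 1)
```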
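
And the broadcast-join trick, sketched without a cluster (the field widths and names here are hypothetical): the small side is bit-packed so a copy fits in memory on every node, and each partition of the large dataset then joins locally with no shuffle:

```python
# Hypothetical layout: each row of the small side packs into one small int
# (customer id in the high bits, an 8-bit segment code in the low bits).

def pack(cust_id, segment):
    return (cust_id << 8) | segment

def unpack(word):
    return word >> 8, word & 0xFF

# Small dimension table, packed and keyed by customer id; on a real cluster
# this compact structure is what gets broadcast to every node.
small_side = {cid: pack(cid, seg) for cid, seg in [(1, 3), (2, 7), (3, 3)]}

def map_side_join(partition, broadcast):
    """Join one partition of the big dataset against the broadcast table."""
    out = []
    for cid, amount in partition:
        if cid in broadcast:
            _, seg = unpack(broadcast[cid])
            out.append((cid, amount, seg))
    return out

big_partition = [(1, 100.0), (2, 250.0), (9, 5.0)]
print(map_side_join(big_partition, small_side))
# [(1, 100.0, 3), (2, 250.0, 7)]
```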

I guess I could go on. Any ideas on how you would spot how to do these things automatically? Me neither.

The commonality is that simple problems are often intractable without some clever trick, the idea for which comes from knowing your business, your problem, your methodology and your infrastructure. Moreover, arriving at the solution required intuition and some trial and error. This kind of work is hard to automate.

So what are these promised Data Science Machines? Most implement standard machine learning algorithms that you can get for free in R, Python, KNIME or Weka. They work well on toy problems like classifying the Iris dataset. They rarely work at large scale or low latency. And they often require ingenious data transformation and feature extraction to get the best results. So they can do the easy bit, but you are still left with the inventive bit.

Some automation is possible. There are good optimisers for linear algebra functionality, and since many machine learning methods are based on linear algebra, they can work well. I have also recently come across SystemML, a recently open-sourced project originally built by IBM, which optimises certain standard problems efficiently. And of course there is Watson (another IBM product), which can perform remarkable inference but requires enormous datasets to learn from.

So my experience is that Data Science Machines can do the easy bits of machine learning, but they tend to struggle with non-standard problems or with finding the optimal solution. And this is an issue because, as I will argue in the next piece, that's where the value is.

#### Harry Powell

Head of Advanced Data Analytics at Barclays, London

Opinions expressed by Gladwin Analytics members are their own.
