People sometimes ask what the difference is between what statisticians and machine learning researchers do. The best answer I have found so far is in

"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, 16:199-231, 2001).

According to Breiman, statisticians like to start by making modeling assumptions about how the data is generated (e.g., the response is a linear combination of the predictor variables plus noise), while people in machine learning use algorithmic models and treat the data mechanism as unknown. He estimates that (back in 2001) only about 2% of statisticians worked in the realm where the data mechanism is treated as unknown.
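
In symbols, the linear example in the parenthesis above can be written as follows. The notation ($y$ for the response, $x_1, \dots, x_p$ for the predictor variables, $\beta_j$ for the coefficients, $\varepsilon$ for the noise) is assumed here purely for illustration and is not taken from the paper:

$$
y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon
$$

The algorithmic view makes no such commitment: it only looks for some function $f$ whose predictions $f(x_1, \dots, x_p)$ are close to $y$ on data that was not used for fitting.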

It seems that there are two problems with the data model approach.

One is that this approach does not address the ultimate question, which is making good predictions: if the data does not fit the model, the approach has nothing to offer (it does not make sense to apply a statistical test if its assumptions are not valid).

The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, not to mention the computational cost of fitting such models.

The increased interest in Bayesian modeling with Markov chain Monte Carlo is viewed as the statistical community's response to this problem. True enough, this approach might be able to scale to complex data, but does it address the first issue? Aren't there computationally cheaper alternatives that can achieve the same predictive power?

He characterizes the machine learning approach as the pragmatic one: you have to solve a prediction problem, so take it seriously. Estimate the prediction error and choose the algorithm that gives the predictor with the best accuracy (but let's not forget about data snooping!).
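
As a rough illustration of "estimate the prediction error and choose the algorithm", here is a minimal sketch using k-fold cross-validation in plain Python. Everything in it is an assumption made for the example and none of it comes from the paper: the synthetic data, the two candidate predictors (`fit_mean`, `fit_line`), and the helper `cv_error`.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline: ignore x and always predict the training mean of y."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares with a single predictor variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cv_error(fit, xs, ys, k=5, seed=0):
    """Estimate the mean squared prediction error by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit only on the training part, score only on the held-out part.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(predict(xs[i]) - ys[i]) ** 2 for i in held_out]
    return mean(sq_errors)

# Synthetic data, assumed for illustration: y = 2x + 1 + Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

candidates = {"mean": fit_mean, "line": fit_line}
errors = {name: cv_error(fit, xs, ys) for name, fit in candidates.items()}
best = min(errors, key=errors.get)
print(errors, "->", best)
```

On data snooping: the cross-validated error of the winner is optimistic once it has been used for selection, so an honest final estimate needs a separate test set that played no part in the choice.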

But the paper offers more. Amongst other things it identifies three important recent lessons:

- The multiplicity of good models: If you have many variables, there can be many models with similar prediction accuracy. Use them all by combining their predictions instead of just picking one. This should increase accuracy and reduce instability (sensitivity to perturbations of the data). Boosting, bagging, and aggregation using exponential weights are the relevant recent buzzwords.
- The Occam dilemma: Occam's razor tells you to choose the simplest predictor. Aggregated predictors don't look particularly simple, yet aggregation seems to be the right choice otherwise. I would think that Occam's razor tells you only that you should have a prior preference for simple functions. I think this is rather well understood by now.
- Bellman: dimensionality -- curse or blessing: Having many features is not bad per se. If your algorithm is prepared to deal with high-dimensional inputs (SVMs, regularization, and random forests are mentioned), then extracting many features can boost accuracy considerably.
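
The first lesson, combining many comparably good models, can be sketched with bagging: fit the same simple learner on bootstrap resamples and average the predictions. This is only a toy under stated assumptions. The data is synthetic, and the one-split "stump" learner and the helpers `fit_stump`, `bagged_stumps`, `average`, and `mse` are made up for the example; it is not Breiman's own code, and it shows neither boosting nor exponential weights.

```python
import math
import random
from statistics import mean

def fit_stump(xs, ys):
    """One-split regression stump: predict the mean of y on each side of a threshold."""
    best = None
    for t in sorted(set(xs))[1:]:  # both sides are non-empty for these thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x values identical: fall back to a constant
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return models

def average(models):
    """Aggregate by plain averaging of the individual predictions."""
    return lambda x: mean(m(x) for m in models)

def mse(predict, xs, ys):
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

# Synthetic data, assumed for illustration: y = sin(x) + Gaussian noise.
rng = random.Random(1)
def make_data(n):
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return xs, [math.sin(x) + rng.gauss(0, 0.3) for x in xs]

train_x, train_y = make_data(100)
test_x, test_y = make_data(200)

models = bagged_stumps(train_x, train_y)
individual = [mse(m, test_x, test_y) for m in models]
combined = mse(average(models), test_x, test_y)
print("average individual MSE:", mean(individual), "aggregated MSE:", combined)
```

For squared error, the averaged predictor can never do worse than the average of its members, because the squared loss is convex. How much it gains depends on how much the individual models disagree, which is exactly the instability the bullet refers to.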

## Comments

Judging by comments I have read online, this gap between classical statisticians and the machine learning/data mining crowd is still very large.

On a less technical note, Breiman, in the paper you mention, warns that classical statisticians have already begun to marginalize themselves. I am not suggesting that the analytical community dispense with the strong technical legacy which the classical era has bestowed, but the historical pattern is clear regarding what happens to those who refuse to learn to use new tools.

-Will Dwinnell

Data Mining in MATLAB

Anyway, this distinction between ML and Stat. is interesting for me too. Still, I'm not sure about the "real" differences. Maybe there are none, but my current belief is that the difference between these two fields lies mainly in the emphasis they put on "prediction".

The way prediction is usually interpreted in ML is not the same as the way physical equations (e.g., Maxwell's equations, which describe electromagnetic fields) make predictions, is it?

The way I used prediction did not include this latter description, and I believe it is not the way most ML people consider it.

So, rephrasing my main sentence in the previous comment: "I doubt that the ultimate goal of analyzing data is to do machine learning/black-box style prediction".

Maybe I'm wrong!

Since 2001 there has been nothing really new, only more statistical methods.

Take microarrays as an example: a huge number of statistical methods have been invented, but in vain. The results are hard to reproduce and hard to predict, and it is hard even to tell what is biology and what is just statistics ...

"...black box approaches, which is more common in ML than Stat., is not preferable."

The term "black box" is possibly loaded in this context. While it is true that many machine learning (data mining, etc.) techniques produce models which are more opaque than their statistical cousins, such techniques need not be used blindly. Use of rigorous methods, such as error resampling, can provide ample evidence of the merit of such models.

About the prediction issue: someone can claim to have found a causal relationship in the data, but the claim has to be verified. The only way to do this is to let him make predictions and then check them against future observations.

If it cannot make a verifiable prediction about an observation, then one could say that the knowledge about the claimed causal relationship might not be valid, again because it is not verifiable.

Knowledge, in any form such as parameter estimation or causal relationship, thus could be interpreted as the ability to make predictions.

I think ML stuff and physical equations and all the sciences do the same thing in this sense.

I think statisticians care about the notion of a "population" and "sampling" from it. This forms the basis for statistical inference and prediction. Do machine learners think about these?

Statisticians use different methods depending on the goals of their applications (e.g., testing hypotheses, making decisions, estimating quantities). Knowing the application area well (e.g., a specific scientific field) is also considered important for making significant contributions. These are probably true of machine learners as well?

I am not sure if there are substantial differences between statisticians and machine learners.