Statistical Modeling: The Two Cultures

Sometimes people ask what the difference is between what statisticians and machine learning researchers do. The best answer I have found so far is in
"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, 16:199-231, 2001).
According to this, statisticians like to start by making modeling assumptions about how the data are generated (e.g., the response is noise added to a linear combination of the predictor variables), while in machine learning people use algorithmic models and treat the data mechanism as unknown. He estimates that (back in 2001) fewer than 2% of statisticians worked in the realm where the data mechanism is treated as unknown.
It seems that there are two problems with the data-modeling approach.
One is that this approach does not address the ultimate question, which is making good predictions: if the data do not fit the model, the approach has nothing to offer (it does not make sense to apply a statistical test when its assumptions are not valid).
The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, not to mention the computational cost of fitting such models.
The increased interest in Bayesian modeling with Markov chain Monte Carlo is viewed as the statistical community's response to this problem. True enough, this approach might scale to complex data, but does it address the first issue? Aren't there computationally cheaper alternatives that achieve the same predictive power?
He characterizes the machine learning approach as the pragmatic one: you have to solve a prediction problem, so take it seriously; estimate the prediction error and choose the algorithm whose predictor is more accurate (but let's not forget about data snooping!).
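To make this recipe concrete, here is a minimal sketch (Python with scikit-learn; the synthetic regression problem and the particular pair of models are my illustrative choices, not Breiman's): fit one "data model" (linear regression) and one "algorithmic model" (a random forest), estimate the prediction error of each by cross-validation, and keep whichever predicts better.

```python
# The pragmatic recipe: estimate prediction error and pick the more
# accurate predictor, without modeling the data-generating mechanism.
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic Friedman #1 data stands in for a real prediction problem.
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)

models = {
    "data model (linear regression)": LinearRegression(),
    "algorithmic model (random forest)": RandomForestRegressor(
        n_estimators=200, random_state=0),
}

for name, model in models.items():
    # 5-fold cross-validated mean squared error (scikit-learn reports it negated).
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: estimated prediction MSE = {mse:.2f}")
```

Whichever model has the smaller estimated error is the one to use; nothing in the procedure requires believing that either model describes how the data were actually generated.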
But the paper offers more. Among other things, it identifies three important lessons:
  1. The multiplicity of good models: if you have many variables, there can be many models of similar prediction accuracy. Use them all by combining their predictions instead of picking just one. This should increase accuracy and reduce instability (sensitivity to perturbations of the data). Boosting, bagging, and aggregation with exponential weights are the relevant popular buzzwords (a small sketch of aggregation follows this list).
  2. The Occam dilemma: Occam's razor tells you to choose the simplest predictor. Aggregated predictors don't look particularly simple, but aggregation otherwise seems to be the right choice. I would think that Occam's razor only tells you to have a prior preference for simple functions. I think this is rather well understood by now.
  3. Bellman: dimensionality -- curse or blessing: many features are not bad per se. If your algorithm is prepared to deal with high-dimensional inputs (SVMs, regularization, and random forests are mentioned), then extracting many features can boost accuracy considerably.
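The first lesson can be illustrated in a few lines of code. The sketch below (again Python with scikit-learn and NumPy; the data set and the bootstrap-and-average scheme are my illustrative choices, essentially bagging by hand rather than any specific method from the paper) fits many trees, each on a bootstrap resample of the training data, and averages their predictions:

```python
# Aggregation sketch: average many trees fit on bootstrap resamples
# ("bagging" by hand) and compare against a single tree.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.default_rng(1)
predictions = []
for _ in range(50):
    # Each tree sees a different bootstrap resample of the training data.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
    predictions.append(tree.predict(X_test))

single_tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
print("single tree MSE:     ",
      mean_squared_error(y_test, single_tree.predict(X_test)))
print("aggregated trees MSE:",
      mean_squared_error(y_test, np.mean(predictions, axis=0)))
```

On data like this, the averaged predictor is typically both more accurate and less sensitive to perturbations of the training sample than any single tree, which is exactly the trade-off behind the Occam dilemma of the second lesson.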
In summary, I like this characterization of the difference between (classical) statistical approaches and machine learning. However, I wonder whether these differences are still as significant as they were (or must have been) in 2001, when the article was written, and whether they will shrink over time. If so, it will become really difficult to answer the question about the difference between the statistical and machine learning approaches.

Comments

  1. You ask "...I wonder if these differences are still as significant as they were (must have been) in 2001 when the article was written..."

    Judging by comments I have read online, this gap between classical statisticians and the machine learning/data mining crowd is still very large.

    On a less technical note, Breiman, in the paper you mention, warns that classical statisticians have already begun to marginalize themselves. I am not suggesting that the analytical community dispense with the strong technical legacy which the classical era has bestowed, but the historical pattern is clear regarding what happens to those who refuse to learn to use new tools.


    -Will Dwinnell
    Data Mining in MATLAB

  2. I doubt that the *ultimate* goal of analyzing data is prediction. Sometimes we just want to understand the underlying phenomenon and find the relevant parameters and causal relations. In that case, black-box approaches, which are more common in ML than in statistics, are not preferable.

    Anyway, this distinction between ML and statistics is interesting to me too. Still, I'm not sure about the "real" differences. Maybe there are none, but my current belief is that the difference between the two fields lies mainly in the emphasis they put on "prediction".

  3. Amir massoud: Imagine you have gained some knowledge by analyzing data, such as that there exists a causal relationship between some variables. What is next? What do you do with this knowledge? How do you turn it into something practical?

  4. I see your point, and I believe it is true that "prediction" is very important in science (maybe it can even be considered the essence of science, though I'm not sure), but it depends on how you interpret the word "prediction".
    The way it is usually interpreted in ML is not the same as the kind of description physical equations (e.g., Maxwell's equations describing electromagnetic fields) provide, is it?
    The way I used "prediction" did not include this latter sense, and I believe it is not the way most ML people think of it either.
    So, rephrasing my main sentence from the previous comment: "I doubt that the ultimate goal of analyzing data is to make machine-learning/black-box-style predictions."
    Maybe I'm wrong!

  5. Random Forests, invented by the author, is a real example of a very efficient method that does not rely on "unchecked statistical assumptions".

    Since 2001, nothing really new has come along, only more statistical methods.

    Take the example of microarrays: a huge number of statistical methods have been invented, but in vain. Hard to reproduce, hard to predict, hard even to tell what is biology and what is just statistics ...

  6. Amir massoud writes:
    "...black box approaches, which is more common in ML than Stat., is not preferable."

    The term "black box" is possibly loaded in this context. While it is true that many machine learning (data mining, etc.) techniques produce models which are more opaque than their statistical cousins, such techniques need not be used blindly. Use of rigorous methods, such as error resampling, can provide ample evidence of the merit of such models.

  7. This contrast between ML and statistics is very real. Statistics could be caricatured as the interpretation of poorly-fitting (overly simple) models. Machine learning can be caricatured as fitting accurate but uninterpretable models. It isn't obvious that a poorly-fitting model provides much insight, even though it is interpretable. But the challenge for ML is to find ways of interpreting our much more accurate models. Because they are more accurate, they have presumably captured more of the real phenomena, but interpreting them is very difficult.

  8. Interesting, I must say.

    About the prediction issue: one can describe it by saying that one has found some causal relationship in the data. But that claim has to be verified, and the only way to do so is to make predictions and check them against future observations.

    If one cannot make a verifiable prediction about an observation, then one could say that the knowledge about the claimed causal relationship might not be valid, again because it is not verifiable.

    Knowledge in any form, such as a parameter estimate or a causal relationship, could thus be interpreted as the ability to make predictions.

    I think ML stuff and physical equations and all the sciences do the same thing in this sense.

  9. I think it is true that statisticians are generally concerned with data-generating probability distributions, but I am not sure it is correct to say that "statisticians start by making model assumptions" about the data-generating process. That depends on the amount of data at hand: if the data are sparse, modeling assumptions are needed; if the data are ample, they are not. Also, statisticians say that "all models are wrong, but some are useful."

    I think statisticians care about the notion of a "population" and of "sampling" from it. This forms the basis for statistical inference and prediction. Do machine learners think about these?

    Statisticians use different methods depending on the goals of their applications (e.g., testing hypotheses, making decisions, estimating quantities). Knowing the application area well (e.g., a specific scientific field) is also considered important for making important contributions. These probably hold true for machine learners as well?

    I am not sure if there are substantial differences between statisticians and machine learners.

  10. A facile answer is that statisticians are interested in different types of questions (e.g. they seem to care more about asymptotics), and have different publishing venues. The same holds true for people working in probability (e.g. percolation theory). I find that in general one can hold a meaningful conversation with either group. Why the diverging interests? Perhaps different backgrounds.

