Saturday, 29 March 2008

Statistical Modeling: The Two Cultures

Sometimes people ask what is the difference between what statisticians and machine learning researchers do. The best answer that I have found so far can be found in
"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, 16:199-231, 2001).
According to this, statisticians like to start by making modeling assumptions about how the data is generated (e.g., the response is a noise added to the linear combination of the predictor variables), while in machine learning people use algorithm models and treat the data mechanism as unknown. He estimates that (back in 2001) less than 2% of statisticians work in the realm when the data mechanism is considered as unknown.
It seems that there are two problem with the data model approach.
One is that the this approach does not address the ultimate question which is making good predictions: if the data does not fit the model, this approach has nothing to offer (it does not make sense to apply a statistical test if the assumptions are not valid).
The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, not talking about the computational complexity of fitting such models.
The increased interest in Bayesian modeling with Markov Chain Monte Carlo is viewed as the response of the statistical community to this problem. True enough, this approach might be able to scale to complex data, but does this address the first issue? Are not there computationally cheaper alternatives that can achieve the same prediction power?
He characterizes the machine learning approach, as the pragmatic approach: You have to solve a prediction problem, hence take it seriously: Estimate the prediction error and choose the algorithm that gives a predictor with the better accuracy (but let's not forget about data snooping!).
But the paper offers more. Amongst other things it identifies three important recent lessons:
1. The multiplicity of good models: If you have many variables, there can be many models of similar prediction accuracy. Use them all by combining their predictions instead of just picking one. This should increase accuracy, reduce instability (sensitivity to perturbations of the data). Boosting, bagging, aggregation using exponential weights are relevant recent popular buzzwords.
2. The Occam dilemma: Occam's razor tells you to choose the simplest predictor. Aggregated predictors don't look particularly simple. But aggregation seems to be the right choice otherwise. I would think that Occam's razor tells you only that you should have a prior preference to simple functions. I think this is rather well understood by now.
3. Bellman: dimensionality -- curse or blessing: Many features are not bad per se. If your algorithm is prepared to deal with the high-dimensional inputs (SVMs, regularization, random forests are mentioned) then extracting many features can boost accuracy considerably.
In summary, I like the characterization of the difference between (classical) statistical approaches and machine learning. However, I wonder if these differences are still as significant as they were (must have been) in 2001 when the article was written and if the differences will become smaller over time. Then it will be really difficult to answer the question on the difference between the statistical and the machine learning approaches.

Friday, 21 March 2008

Bayesian Statistics in Medical Clinical Trials

I came across a very interesting document.
The document is titled "Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials". It is a draft guidelines poster by the Center for Devices and Radiological Health of FDA, dated May 23, 2006.
Why is this interesting? The job of FDA (the US Food and Drug Administration) is to make sure that the decisions in any clinical trial are made in a scientifically sound manner. Clearly, when following the Bayesian approach the choice of the prior and the model can influence the decisions. What does FDA do in this situation?
The establish a process where they require a pre-specification (and agreement on) both the prior and the model, including an analysis of the operating characteristics of the design. This latter includes estimating the probability of erroneously approving an ineffective or unsafe device (the Type I error). This will typically be done by conducting Monte-Carlo simulations, where the Type I error is measured for the borderline cases when the device should not be approved. In the case of a large estimated Type I error, the trial will be rejected.
Is this a good procedure? If the simulations use a biased model then the estimated Type I error might be biased. Their response is that both the prior and the model should be backed up with scientific arguments and existing statistics. Yet another problem is that the calculations often use MCMC. How do you determine if your samples converged to the posterior? The samples of the posterior are not iid. How do you know that you took enough samples of the posterior? (Think of a mixture of Gaussian, with a narrow Gaussian proposal. If you sample from the mixture and then sample just a few points with Metropolis-Hastings, you will likely miss the second mode if the two modes are sufficiently far away.)
On the other hand, there are a number of potential advantages to a Bayesian design. If we accept that the model and the prior is good, then often the Bayesian analysis will require smaller sample sizes to reach a decision (if they are not, the conclusion might be wrong). It can also provide flexible methods for handling interim analyses (stopping when enough evidence is available for either approval or rejection) and sometimes good priors are available such as earlier studies on previous generations of a device or from
overseas studies. Such approaches can be used with a fequentist approach, too, but the frequentist analysis of deriving a procedure is often non-trivial, while the Bayesian "only" needs to be concerned about computational issues.
The document cites two trials that used Bayesian analysis. It appears that in both studies Bayesian analysis was used only as a supplementary information, i.e., the critical decisions (if a device is safe and minimally effective) were made using traditional,
frequentist methods.
Common to both the frequentist and the Bayesian approaches is the use of a number of unverified assumptions. In the frequentist case, if the design is simple then the typical assumption is only that there is a common underlying distribution to the outcome-patient pairs and that patients are selected uniformly at random from the population. This looks fairly minimal, but can be questioned nevertheless (drifts, environmental effects, sample biases, etc.). In a more complicated scenario there will be more assumptions. If the set of assumptions for the methods satisfy some containment relation then one naturally trusts the method that relies on less information. In the lack of containment the decision of which method to prefer is not so simple. In any case, it is very interesting to see how a regularity body (like FDA) wrestles with these fundamental issues. They look to act in a pretty reasonable manner. The existence of this document predicts that we should expect to see more decisions that used Bayesian analysis in the future. Is this good or bad? One could be concerned by the use of more unverified assumptions in the Bayesian analysis and that the probability of making an error can also increase because the calculations are non-trivial. Life is dangerous, is not it? But how dangerous will it be if Bayesian analysis is used routinely in assessing success in clinical trials? Time will tell for sure. Well, assuming some form of stationarity.

Saturday, 15 March 2008

Curse of dimensionality

I came across two papers today that discuss the curse of dimensionality. I thought this is just enough to write a short blog about the topic that definitely deserves attention. So, here we go:

The first paper is by Flip Korn, Bernd-Uwe Pagel and Christos Faloutsos, the title is On the "Dimensionality Curse" and the "Self-Similarity Blessing". This is a 2001 paper that won The paper is about nearest neighbor retrieval: You have $n$ datapoints that you can store and the task is to find the nearest neighbor among these datapoints of a query point. If the data lies in a $D$-dimensional space Euclidean space, it was a common wisdom to believe that the time required to find the nearest neighbor scales exponentially with $D$. The essence of the paper is that if the data lies on a low dimensional manifold then the complexity of search will depend only on the intrinsic dimensionality of the manifold. (The cover tree package due to Alina Beygelzimer, Sham Kakade, and John Langford should be mentioned here. This software is able to deal with data that does not necessarily lies in a vector space. The algorithms comes with guarantees similar to the above mentioned one. For details and further references see the accompanying paper. I guess there were quite a few precursors to this paper since around 1998, where a lot of excitement was generated when people started to realize this phenomenon.) This paper has some relation to our recent work that I will write about at the end of this blog.

So the good news here is that although with a little bit of a luck, a clever algorithm might run significantly faster than what is predicted by the worst case bounds. Note that a not so clever algorithm will not run faster. So you need luck for your data to display some regularities, but you need a clever design to take advantage of those regularities.

The other paper is by David Donoho from 2000, High-dimensional data analysis: The curses and blessings of dimensionality, an Aide-Memoire lecture delivered at the conference "Math Challenges of the 21st Century". Donoho mentions three curses (in optimization, Bellman's original use; in function approximation and in numerical integration). When talking about the blessings, he first talks about the concentration of measure (CoM) phenomenon, which roughly states that any Lipschitz function defined over the $D$-dimensional sphere is "nearly constant": If we place a uniform measure on the sphere then the the probability that it deviates from its expectation by more than $t$ is bounded by $C_1 exp(- C_2 t)$, where $C_i$ are universal constants (i.e., they neither depend on $D$, nor $f$)! This is important since often we are interested in an expected value of a Lipschitz function and we hope to estimate the expected value by the observed (random) value of the function. This phenomenon is important in model selection, as explained to some extent in the paper. The second blessing, related to the CoM phenomenon is "dimension asymptotics". The observation is that when the CoM applies, often as $D$ goes to infinity, the distributions converge to some limiting distribution. Sometimes then it becomes possible to obtain predictions that work for moderate dimensions but which are derived by using the limiting distributions (the example mentioned is the prediction of the top eigenvalue of a data covariance matrix). The third blessing is when the data is a sampled version of a continuous phenomenon (he calls this "approach to continuum"). Since what is measured is continuous, the space of observed data will show signs of compactness that can be exploited. Here the example is that a basis derived from some sampled data with some procedure tends to resemble wavelets, an object from the continuous world. For larger $D$ the resemblance becomes stronger and so the interpretation becomes easier. The is the bless of dimensionality in action.

Back in 2001 he mentioned 3 areas of potential interest: The first is high-dimensional approximate linear algebra (use randomization, the concentration of measure phenomenon to speed up calculations). Here he mentions rank $k$ approximation to data matrices. The second area is when the data are curves or images and the problem is to select a good basis. He mentions an interesting example here: We have a single data point and $D$ is very large. Then if the data is the realization of a stationary Gaussian stochastic process (i.e., coordinates are correlated) then we can in fact learn the full probability distribution. The third example is coming from approximation theory. Here, he cites a well-known result of Barron, that imposes a constraint on the derivative of the function: The derivative must be such that its Fourier transform is integrable. Since assuming $s$ times differentiabily typically yields that the function can be approximated only at a rate of $n^{-s/D}$, we expect to see a slow approximation rate of $n^{-1/D}$. However, Barron has shown that this type of functions can be approximated at the rate $n^{-1/2}$. What this and similar results point to that there are non-classical spaces of functions where things might work differently from what one expects. Again, the need arises for algorithms that are clever enough to take advantage if the situation is such that the curse of dimensionality can be avoided.

On another note, recently with Amir massoud Farahmand and Jean-Yves Audibert we published a paper related to these topics at ICML. The subject is dimensionality estimation of manifolds based on random samples supported by the manifold. Here again, the good news is that the embedding dimension does not play a direct role in the quality of the estimates (only through the properties of the embedding). One motivation for coming up with this method was that we wanted to modify existing regression procedures so that when the data lies on a low dimensional manifold, the methods should converge faster. We figured that many algorithms tune their parameters to the dimensionality of the data points and if we replace that dimensionality by the manifold dimensionality, we get a "manifold adaptive" method, which is clever enough to converge faster when the input lies on a manifold. Interestingly, these results can be obtained without ever estimating the manifold itself. However, this should be the subject of another post!

On a last note, quite a few references to works dealing with the curse of dimensionality can be found in the annotated bibliography of Damien Francois. Enjoy!