Constrained MDPs and the reward hypothesis

It's been a looong time since I posted on this blog. But this should not mean the blog is dead. Slow and steady wins the race, right? Anyhow, I am back and today I want to write about constrained Markov Decision Processes (CMDPs). The post is prompted by a recent visit by Eugene Feinberg, a pioneer of CMDPs, to our department, and also by a growing interest in CMDPs in the RL community (see this, this, or this paper). For impatient readers, a CMDP is like an MDP except that there are multiple reward functions, one of which is used to set the optimization objective, while the others are used to restrict what policies can do. Now, it seems to me that more often than not the problems we want to solve are easiest to specify using multiple objectives (in fact, this is a borderline tautology!). An example, which given our current sad situation is hard to escape, is deciding what interventions a government should apply to limit the spread of a virus while maintaining economic

Curse of dimensionality

I came across two papers today that discuss the curse of dimensionality. I thought this was just enough of an excuse to write a short blog post about a topic that definitely deserves attention. So, here we go:

The first paper is by Flip Korn, Bernd-Uwe Pagel and Christos Faloutsos, with the title On the "Dimensionality Curse" and the "Self-Similarity Blessing". This is an award-winning 2001 paper about nearest neighbor retrieval: You have $n$ datapoints that you can store and the task is to find, among these datapoints, the nearest neighbor of a query point. If the data lies in a $D$-dimensional Euclidean space, it was common wisdom to believe that the time required to find the nearest neighbor scales exponentially with $D$. The essence of the paper is that if the data lies on a low dimensional manifold then the complexity of the search will depend only on the intrinsic dimensionality of the manifold. (The cover tree package due to Alina Beygelzimer, Sham Kakade, and John Langford should be mentioned here. This software is able to deal with data that does not necessarily lie in a vector space. The algorithm comes with guarantees similar to the one mentioned above. For details and further references see the accompanying paper. I guess there were quite a few precursors to this paper, since a lot of excitement was generated around 1998 when people started to realize this phenomenon.) This paper has some relation to our recent work, which I will write about at the end of this post.
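
To make the setting concrete, here is a minimal sketch of nearest-neighbor retrieval over data that sits in a high-dimensional Euclidean space but is supported on a low-dimensional manifold. It uses scikit-learn's ball tree purely as a stand-in index (it is not the cover tree algorithm), and the embedding map, dimensions and sample sizes below are made up for illustration:

```python
# A minimal sketch (not the cover tree algorithm): nearest-neighbor queries on
# data embedded in R^D but supported on a low (d = 2) dimensional manifold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

n, D, d = 5000, 100, 2                  # number of points, ambient dim, intrinsic dim
t = rng.uniform(0.0, 1.0, size=(n, d))  # manifold parameters
A = rng.normal(size=(d, D))             # a fixed smooth embedding R^d -> R^D
X = np.sin(t @ A)                       # data on a 2-dimensional manifold in R^D

index = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(X)

query = np.sin(rng.uniform(0.0, 1.0, size=(1, d)) @ A)
dist, idx = index.kneighbors(query)
print("nearest neighbor:", idx[0, 0], "at distance", dist[0, 0])
```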

So the good news here is that, with a little bit of luck, a clever algorithm might run significantly faster than what is predicted by the worst-case bounds. Note that a not so clever algorithm will not run faster. So you need luck for your data to display some regularities, but you need a clever design to take advantage of those regularities.

The other paper is by David Donoho from 2000, High-dimensional data analysis: The curses and blessings of dimensionality, an Aide-Memoire of a lecture delivered at the conference "Math Challenges of the 21st Century". Donoho mentions three curses (in optimization, Bellman's original use; in function approximation; and in numerical integration). When talking about the blessings, he first talks about the concentration of measure (CoM) phenomenon, which roughly states that any Lipschitz function defined over the $D$-dimensional sphere is "nearly constant": If we place a uniform measure on the sphere, then the probability that the function deviates from its expectation by more than $t$ is bounded by $C_1 \exp(-C_2 t^2)$, where the $C_i$ are universal constants (i.e., they depend neither on $D$ nor on $f$)! This is important since often we are interested in the expected value of a Lipschitz function and we hope to estimate this expected value by the observed (random) value of the function. This phenomenon is important in model selection, as explained to some extent in the paper. The second blessing, related to the CoM phenomenon, is "dimension asymptotics". The observation is that when the CoM applies, often, as $D$ goes to infinity, the distributions converge to some limiting distribution. It then sometimes becomes possible to obtain predictions that work for moderate dimensions but which are derived using the limiting distributions (the example mentioned is the prediction of the top eigenvalue of a data covariance matrix). The third blessing is when the data is a sampled version of a continuous phenomenon (he calls this "approach to continuum"). Since what is measured is continuous, the space of observed data will show signs of compactness that can be exploited. Here the example is that a basis derived from some sampled data with some procedure tends to resemble wavelets, an object from the continuous world. For larger $D$ the resemblance becomes stronger and so the interpretation becomes easier. This is the blessing of dimensionality in action.
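
As a quick, toy numerical illustration of the CoM phenomenon (my own sketch, not an example from the paper): the coordinate function $f(x) = x_1$ is 1-Lipschitz on the unit sphere, and the fraction of uniformly drawn points for which it deviates from its expectation (zero) by more than a fixed threshold shrinks rapidly as $D$ grows. The sample size and threshold below are arbitrary choices:

```python
# Toy illustration of concentration of measure: the 1-Lipschitz function
# f(x) = x_1 on the unit sphere concentrates around 0 as D grows.
import numpy as np

rng = np.random.default_rng(0)

def uniform_on_sphere(n, D):
    g = rng.normal(size=(n, D))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

for D in (3, 30, 300, 3000):
    f = uniform_on_sphere(10000, D)[:, 0]   # f(x) = x_1, E[f] = 0
    print(f"D = {D:4d}   empirical P(|f - E[f]| > 0.1) = {np.mean(np.abs(f) > 0.1):.4f}")
```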

Back in 2000 he mentioned three areas of potential interest: The first is high-dimensional approximate linear algebra (use randomization and the concentration of measure phenomenon to speed up calculations). Here he mentions rank-$k$ approximations to data matrices. The second area is when the data are curves or images and the problem is to select a good basis. He mentions an interesting example here: We have a single data point and $D$ is very large. If the data is the realization of a stationary Gaussian stochastic process (i.e., the coordinates are correlated), then we can in fact learn the full probability distribution from this single observation. The third example comes from approximation theory. Here, he cites a well-known result of Barron that imposes a constraint on the derivative of the function: the derivative must be such that its Fourier transform is integrable. Since assuming $s$-times differentiability typically yields that the function can be approximated only at a rate of $n^{-s/D}$, with only one derivative constrained we would expect a slow approximation rate of $n^{-1/D}$. However, Barron has shown that functions of this type can be approximated at the rate $n^{-1/2}$, independently of $D$. What this and similar results point to is that there are non-classical spaces of functions where things might work differently from what one expects. Again, the need arises for algorithms that are clever enough to take advantage of the situation when the curse of dimensionality can be avoided.
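
To illustrate the first of these areas, here is a hedged sketch of a generic randomized rank-$k$ approximation based on a Gaussian sketch (a standard randomized range-finder idea, not an algorithm taken from any of the works cited above; the matrix sizes and the amount of oversampling are arbitrary):

```python
# A generic randomized rank-k approximation via a Gaussian sketch
# (randomized range finder), as a stand-in for "high-dimensional
# approximate linear algebra".
import numpy as np

rng = np.random.default_rng(0)

def randomized_rank_k(M, k, oversample=10):
    """Approximate M by a rank-k matrix using a random Gaussian sketch."""
    Omega = rng.normal(size=(M.shape[1], k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(M @ Omega)        # orthonormal basis for the sketched range
    B = Q.T @ M                           # small (k + oversample) x m problem
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U[:, :k]) @ np.diag(s[:k]) @ Vt[:k]   # lift the best rank-k part back

# A noisy, approximately rank-50 data matrix to test on.
M = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 1500))
M += 0.01 * rng.normal(size=M.shape)
M_k = randomized_rank_k(M, k=50)
print("relative error:", np.linalg.norm(M - M_k) / np.linalg.norm(M))
```

The point of the sketch is that the expensive factorizations are done on matrices with only $k$ plus a few extra columns, while the full matrix is touched only through a couple of matrix products.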

On another note, recently, together with Amir-massoud Farahmand and Jean-Yves Audibert, we published a paper related to these topics at ICML. The subject is the estimation of the dimensionality of a manifold based on random samples supported on the manifold. Here, again, the good news is that the embedding dimension does not play a direct role in the quality of the estimates (it enters only through the properties of the embedding). One motivation for coming up with this method was that we wanted to modify existing regression procedures so that when the data lies on a low dimensional manifold, the methods converge faster. We figured that many algorithms tune their parameters to the dimensionality of the data points, and if we replace that dimensionality by the manifold dimensionality, we get a "manifold adaptive" method, which is clever enough to converge faster when the input lies on a manifold. Interestingly, these results can be obtained without ever estimating the manifold itself. However, this should be the subject of another post!
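
To give a flavor of what such an estimator can look like, below is a small sketch of a generic kNN-distance-based intrinsic dimension estimator. This is only in the spirit of the paper; the actual estimator and its analysis differ in the details, and the data-generating map, sizes and the choice of $k$ are made up. The idea is that on a $d$-dimensional manifold the distance from a point to its $2k$-th nearest neighbor relates to the distance to its $k$-th nearest neighbor roughly as $2^{1/d}$, which can be inverted to get a local dimension estimate:

```python
# A generic kNN-distance-based intrinsic dimension estimator (in spirit only;
# not the exact estimator of the ICML paper).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def estimate_dimension(X, k=5):
    nn = NearestNeighbors(n_neighbors=2 * k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                 # column 0 is the point itself (distance 0)
    r_k, r_2k = dist[:, k], dist[:, 2 * k]
    local = np.log(2.0) / np.log(r_2k / r_k)   # local estimates of the dimension
    return float(np.median(local))             # robust aggregate over the points

# Data on a 2-dimensional manifold embedded in R^50.
t = rng.uniform(0.0, 1.0, size=(3000, 2))
A = rng.normal(size=(2, 50))
X = np.cos(t @ A)
print("estimated intrinsic dimension:", estimate_dimension(X))
```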

On a last note, quite a few references to works dealing with the curse of dimensionality can be found in the annotated bibliography of Damien Francois. Enjoy!

Comments

  1. Hi Csaba,

    Speaking of high-dimensional approximate linear algebra,
    Drineas and his colleagues have some new results on approximating products of huge matrices:
    http://www.cs.rpi.edu/~drinep/matrixI_SICOMP.pdf

    Why is that important? You can use it, for example, for regression ( http://www.cs.rpi.edu/~drinep/l2sample.pdf ), or, in the case of reinforcement learning, for doing value iteration in exponentially large state spaces
    (which we have been working on recently)

    cheers,
    István Szita

  2. Thanks for the comment. Yes, you can see that instead of inverting a matrix you could be a little lazy and not lose too much. This can be useful in so many ways. (A small sketch of the column-sampling idea behind such approximate matrix products appears right after the comments.)

    Re using this in RL: so you want to use the approximate regression approach to tackle the computational challenge of dealing with the very large number of possible features that you must use in multi-dimensional state spaces to ensure low approximation errors? Nice idea. I am curious how far you got and what algorithm you are trying to enhance.

  3. Thanks for your interest :-) We were attacking Value Iteration in factored MDPs.
    The number of features was kept fixed (and low). The approximate regression trick was used to avoid sweeping through the whole state space.
    Another trick was to rescale the L2 regression operator so that it becomes a max-norm non-expansion. This way, the algorithm converges (to somewhere...) in polynomial time.

    A techrep is available at http://arxiv.org/abs/0801.2069

  4. Hi Csaba,

    I have a question regarding high dimensions and binary data. In one of the research problems I'm working on, the results are a bit weird. In 7500-dimensional space, 1-NN on binary data performs better than 1-NN on the same data projected onto the principal component directions. Have you come across this phenomenon?

    With binary data we are dealing only with the axis directions and not with directions in between the axes. So my conclusion is that the so-called curse of dimensionality doesn't seem to affect this type of data. Does this make sense to you? Can you provide any insights?

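For readers curious about the approximate matrix products mentioned in the comments, here is a small illustrative sketch of the column-sampling idea (a generic importance-sampling estimator of $AB$, not the exact algorithm of the cited paper; the matrix sizes and the number of samples are arbitrary):

```python
# A generic importance-sampling estimator of a matrix product A @ B:
# sample s column/row pairs with probability proportional to the product of
# their norms and rescale so that the estimator is unbiased.
import numpy as np

rng = np.random.default_rng(0)

def approx_matmul(A, B, s):
    n = A.shape[1]
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()                       # sampling probabilities
    idx = rng.choice(n, size=s, p=p)      # sample with replacement
    return sum(np.outer(A[:, i], B[i, :]) / (s * p[i]) for i in idx)

A = rng.normal(size=(200, 5000))
B = rng.normal(size=(5000, 150))
C = approx_matmul(A, B, s=500)
print("relative error:", np.linalg.norm(A @ B - C) / np.linalg.norm(A @ B))
```

Sampling proportionally to the products of the column and row norms is what keeps the variance of the estimator under control; more samples give a better approximation.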
