Constrained MDPs and the reward hypothesis

It's been a looong time since I posted on this blog. But this should not mean the blog is dead. Slow and steady wins the race, right? Anyhow, I am back, and today I want to write about constrained Markov Decision Processes (CMDPs). The post is prompted by a recent visit of Eugene Feinberg, a pioneer of CMDPs, to our department, and also by a growing interest in CMDPs in the RL community (see this, this, or this paper). For impatient readers: a CMDP is like an MDP except that there are multiple reward functions, one of which is used to set the optimization objective, while the others are used to restrict what policies can do. Now, it seems to me that more often than not the problems we want to solve are easiest to specify using multiple objectives (in fact, this is a borderline tautology!). An example, which given our current sad situation is hard to escape, is deciding what interventions a government should apply to limit the spread of a virus while maintaining economic activity.
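For concreteness, the discounted version of this problem is commonly written as follows (the notation here is mine, chosen for illustration, and does not come from the post):

    \max_{\pi}\; \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_0(S_t, A_t)\right]
    \quad \text{subject to} \quad
    \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_i(S_t, A_t)\right] \ge c_i,
    \qquad i = 1, \dots, k,

where \(r_0\) defines the objective, \(r_1, \dots, r_k\) are the constraint rewards with thresholds \(c_1, \dots, c_k\), and \(\gamma \in (0, 1)\) is the discount factor. Other optimality criteria, such as average reward, appear in the CMDP literature as well.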

The Loss Rank Principle by Marcus Hutter

I found the paper posted by Marcus Hutter on arXiv quite interesting. The paper is about model (or rather predictor) selection. The idea is a familiar one, but the details appear to be novel: you want to find a model that yields a small loss on the available dataset, while yielding a larger loss on most other datasets.

Classification: The simplest case is supervised learning with a finite target set. Then you can count the number of target-label assignments for which the predictor's loss is smaller than its loss on the true targets. This idea sounds very similar to the way Rademacher complexity works; see, e.g., the paper of Lugosi and Wegkamp, where a localized version of Rademacher complexity is investigated.
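To make the counting concrete, here is a brute-force sketch in Python (the function names, the majority-vote predictor, and the 0-1 loss are all my own choices for illustration; any practical use would need something smarter than exhaustive enumeration):

    from itertools import product

    def loss_rank(fit_and_loss, x, y_true, labels):
        """Count label assignments on which the predictor, fitted to and
        evaluated on those labels, does at least as well as on y_true."""
        base_loss = fit_and_loss(x, list(y_true))
        rank = 0
        for y in product(labels, repeat=len(y_true)):
            if fit_and_loss(x, list(y)) <= base_loss:
                rank += 1
        return rank  # a small rank means y_true is fit unusually well

    # Toy predictor: predict the majority label; 0-1 loss (x is unused here).
    def majority_loss(x, y):
        guess = max(set(y), key=y.count)
        return sum(yi != guess for yi in y)

    print(loss_rank(majority_loss, x=[0, 1, 2, 3], y_true=[1, 1, 1, 0], labels=[0, 1]))

A predictor with a small loss rank fits the observed targets well without fitting most alternative targets equally well, which is exactly the trade-off described above.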

Regression: For continuous targets you can use a grid of increasing resolution (assuming the range of targets is bounded) and count the number of grid points for which the predictor's loss is less than its loss on the true dataset. With an appropriate normalization this count converges to the volume of the set of such target values (hopefully this set is measurable :)).
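A minimal sketch of the grid version, again in Python and again with invented names and a toy predictor (here, predicting the mean under squared loss); as the grid is refined, the normalized count approaches the volume of the sublevel set:

    import itertools

    def loss_rank_volume(fit_and_loss, x, y_true, lo, hi, pts_per_dim):
        """Approximate the volume of targets y in [lo, hi]^n on which the
        predictor's loss is at most its loss on the true targets."""
        n = len(y_true)
        base_loss = fit_and_loss(x, list(y_true))
        step = (hi - lo) / (pts_per_dim - 1)
        grid = [lo + i * step for i in range(pts_per_dim)]
        hits = sum(
            fit_and_loss(x, list(y)) <= base_loss
            for y in itertools.product(grid, repeat=n)
        )
        # Normalize: each grid point stands for a cell of volume step**n.
        return hits * step ** n

    # Toy predictor: fit the mean of y; squared loss (x is unused here).
    def mean_loss(x, y):
        m = sum(y) / len(y)
        return sum((yi - m) ** 2 for yi in y)

    print(loss_rank_volume(mean_loss, x=[0, 1, 2], y_true=[0.1, 0.0, 0.2],
                           lo=-1.0, hi=1.0, pts_per_dim=21))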

The paper does not go very far: some examples demonstrate that the criterion gives a computable procedure and that this procedure is reasonable, and a quick comparison to alternatives is given. It will be interesting to see further developments!
