Constrained MDPs and the reward hypothesis
It's been a looong time since I last posted on this blog. But this should not mean the blog is dead. Slow and steady wins the race, right? Anyhow, I am back and today I want to write about constrained Markov Decision Processes (CMDPs). The post is prompted by a recent visit of Eugene Feinberg, a pioneer of CMDPs, to our department, and also by a growing interest in CMDPs in the RL community (see this, this, or this paper). For impatient readers, a CMDP is like an MDP except that there are multiple reward functions, one of which is used to set the optimization objective, while the others are used to restrict what policies can do. Now, it seems to me that more often than not the problems we want to solve are easiest to specify using multiple objectives (in fact, this is a borderline tautology!). An example, which given our current sad situation is hard to escape, is deciding what interventions a government should apply to limit the spread of a virus while maintaining economic ...
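To make the "multiple reward functions" idea concrete, here is a minimal sketch of the standard discounted CMDP optimization problem; the notation below (objective reward $r_0$, constraint rewards $r_1,\dots,r_m$ with thresholds $c_1,\dots,c_m$, discount $\gamma$) is chosen for illustration and is not tied to any of the papers linked above:

$$
\begin{aligned}
\max_{\pi} \quad & \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r_0(S_t, A_t) \right] \\
\text{s.t.} \quad & \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t\, r_i(S_t, A_t) \right] \ge c_i, \qquad i = 1, \dots, m.
\end{aligned}
$$

A policy is feasible if it satisfies all $m$ constraints, and one looks for the best feasible policy; this is what distinguishes a CMDP from simply scalarizing everything into a single reward function up front.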
I think this question has been answered in some ways in the algorithmic complexity/probability literature and, in a more concrete instantiation, in the minimum description length literature.
There is an additional element here, since in MDL you typically just want to compress $x \in X$ (or a sequence of $x_i$). However, there might be additional gains in our case, since we only care about preserving those aspects of $x$ required to approximate $f$.
Then I suppose the question becomes in what sense you want to approximate $f$: if you want something that works uniformly over $X$, then why should the original input matter at all?
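One rough way to formalize the contrast in this thread (my notation, not the commenters'): classical MDL compresses $x$ itself via a two-part code, while a task-aware variant only needs the code $\phi(x)$ to preserve $f$ on inputs that actually occur, which is why asking for a uniform guarantee over all of $X$ would indeed remove any dependence on the particular input:

$$
\text{MDL:}\;\; \min_{H}\; L(H) + L(x \mid H),
\qquad
\text{task-aware:}\;\; \min_{\phi}\; \mathbb{E}_{x \sim P}\big[ L(\phi(x)) \big]
\;\;\text{s.t.}\;\; \mathbb{E}_{x \sim P}\big[\, |f(x) - \hat{f}(\phi(x))| \,\big] \le \varepsilon,
$$

where $\phi$ is the compression map, $\hat{f}$ a decoder/approximator, and $P$ the input distribution; this is only a sketch of the distinction, not a claim about how either commenter would set it up.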