Constrained MDPs and the reward hypothesis
Csaba Szepesvári, 2020-03-20
<br />
It's been a looong time since I last posted on this blog. But this should not mean the blog is dead. Slow and steady wins the race, right?
Anyhow, I am back and today I want to write about constrained Markov Decision Processes (CMDPs).
The post is prompted by a recent visit
of <a href="http://www.ams.sunysb.edu/~feinberg/">Eugene Feinberg</a>, a pioneer of CMDPs, to our department, and also by a growing interest in CMDPs in the RL community
(see <a href="https://arxiv.org/abs/2003.05555">this</a>,
<a href="http://proceedings.mlr.press/v70/achiam17a/achiam17a.pdf">this</a>, or
<a href="https://arxiv.org/abs/1902.04623">this</a> paper).
<br />
For impatient readers, a CMDP is like an <a href="#MDPs">MDP</a> except that there are multiple reward functions, one of which is used to set the optimization objective, while the others are used to restrict what policies can do.
Now, it seems to me that more often than not the problems we want to solve are easiest to specify using multiple objectives (in fact, this is a borderline tautology!).
An example, which given our current sad situation is hard to escape, is deciding what interventions a government should apply to limit the spread of a virus while maintaining economic productivity. A perhaps less cheesy example is what interventions to apply to protect the environment, while, again, maintaining economic productivity. But we may also think of the problem of designing the software that runs an autonomous car and that must balance safety, efficiency and comfort. Or think of a robot that must maintain a success rate of achieving some task while running cheaply and smoothly.
<br/>
The modest goal of this post is to explore the relationship between CMDPs and MDPs.
By definition, CMDPs are a superclass of MDPs: Every MDP is a CMDP (with no constraining reward functions).
But could it be that CMDPs do not actually add much to MDPs?
This relates
to <a href="http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html">Rich Sutton's reward hypothesis</a> that states
<br />
<blockquote class="tr_bq">
<span style="font-size:16px">
<span style="font-family: "courier new" , "courier" , monospace;">
"That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)."
</span>
</span>
</blockquote>
<br />
My interpretation of this hypothesis is that, given where we are now,
to maximize the speed of progress,
we should focus on single-objective problems.
A narrower interpretation of the hypothesis is that a single reward function (objective) suffices for whatever problem we want
our "intelligent agents" to solve.
Clearly, CMDPs go directly against this latter interpretation.
But perhaps multi-objective problems can be reexpressed in terms of a single objective?
Or not? The rest of this post will explain my understanding of the current state of affairs on these questions.
<br />
<h2>
Constrained MDPs: Definitions
</h2>
I will assume that readers are familiar with MDPs.
Readers who need a refresher on MDP definitions and concepts (policies, etc.)
can scroll to the <a href="#MDPs">bottom</a> of the post and then come back.
Why one would use MDPs at all should be the topic of another post; for now, the reader is asked to accept that we start with MDPs.
<br/>
As noted before,
constrained MDPs are like normal ones, except that there are multiple reward functions, say $r_0,r_1,\dots,r_n$.
These are functions that map $Z\subset S \times A$ to reals.
Here, $S$ is the set of states, $A$ is the set of actions and
$Z$ (possibly, a strict subset of $S\times A$) determines what actions are available in each state.
In particular, $Z$ is such that for each state $s\in S$, $(s,a)\in Z$ for at least one action $a\in A$.
The transition structure $P = (P_z)_{z\in Z}$ is a collection of probability distributions $P_z$ over $S$:
For $z=(s,a)\in Z$, $P_z$ gives the distribution of the state that one arrives at after taking
action $a$ in state $s$.
For the sake of simplicity and to avoid distracting technicalities, we assume that there are finitely many states and actions.
<br/>
In the <b>discounted version</b> of constrained MDPs, our focus in this post,
one considers the total expected discounted reward under the $n+1$ reward functions to create a number of objectives.
Assume for simplicity a common discount factor $0\le \gamma <1$ across all rewards.
For a policy $\pi$, state $s$ and reward function $r_i$, let
\[
V^\pi(r_i,s) = \mathbb{E}_s^{\pi}[\sum_{t=0}^\infty \gamma^t r_i(S_t,A_t)]
\]
denote the total expected discounted reward under $r_i$
collected while policy $\pi$ is followed indefinitely from state $s$.
Here,
$S_t$ is the state at time $t$, $A_t$ is the action at time $t$,
$\mathbb{E}_s^{\pi}$ is the expectation under the probability measure $\mathbb{P}_s^\pi$
over the set of infinitely long state-action trajectories where $\mathbb{P}_s^\pi$ is induced by $P$,
the policy $\pi$ (which directs what actions to take given what history),
and the initial state $s$.
By abusing notation,
we will also use $V^\pi(r_i,\mu)$ to denote the total expected discounted reward under $r_i$ when
$\pi$ is started from an initial state chosen at random from $\mu$, a distribution over the states.
We will also write $\mathbb{P}_\mu^\pi$ (and $\mathbb{E}_{\mu}^\pi$) to denote the probability measure over trajectories induced by $\mu$, $\pi$ and $P$ (and the underlying expectation operator, respectively).
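As a concrete aside (not needed for what follows): for a stationary policy $\pi$ in a finite MDP, the value vector $v(s) = V^\pi(r,s)$ solves the linear system $v = r^\pi + \gamma P^\pi v$, where $r^\pi$ and $P^\pi$ are the expected reward vector and the transition matrix induced by $\pi$. A minimal NumPy sketch of this computation; the array layout is my choice:

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Value of a stationary policy pi: solve v = r_pi + gamma * P_pi v.

    P[s, a, s'] : transition probabilities, r[s, a] : rewards,
    pi[s, a]    : action probabilities of the (stationary) policy.
    """
    r_pi = (pi * r).sum(axis=1)            # expected one-step reward per state
    P_pi = np.einsum('sa,sat->st', pi, P)  # state-to-state transition kernel
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Tiny sanity check: a single self-looping state with reward 1
# has value 1/(1 - gamma).
P = np.ones((1, 1, 1)); r = np.ones((1, 1)); pi = np.ones((1, 1))
print(policy_value(P, r, pi, 0.9))   # close to [10.] = 1/(1 - 0.9)
```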
<br/>
Now, given some fixed initial state distribution $\mu$ and reals $\theta_1,\dots,\theta_n\in \mathbb{R}$,
<b>the CMDP optimization problem</b> is to find a policy $\pi$ that maximizes $V^\pi(r_0,\mu)$ subject to the constraints $V^\pi(r_i,\mu)\ge \theta_i$, $i=1,\dots,n$:
\[
\max_\pi V^\pi(r_0,\mu) \quad \text{ s.t. } \quad V^\pi(r_i,\mu) \ge \theta_i,\, i=1,\dots,n \,. \tag{$\text{CMDP}_\mu$}
\]
Note that here the optimization is over all possible policies.
In particular, the policies can use the full history at any given stage and can also randomize.
The CMDP <b>feasibility problem</b> is to verify whether there exists any policy that satisfies the constraints in $\text{CMDP}_\mu$.
When $n=0$, we get back MDPs -- and the feasibility question becomes trivial.
<br/>
An interesting and important property of MDPs is that history is not needed to act optimally.
In fact,
in MDPs it also holds that one can find some deterministic state-feedback policy
that is <b>uniformly optimal</b> in the sense
that it maximizes the value <b>simultaneously</b> over all initial distributions.
<h2>
Main result
</h2>
To jump into the middle of things, I state a "theorem" that lists some of the ways in which CMDPs are different from MDPs.
To state this theorem, recall that a policy is called <b>stationary</b> if it uses state-feedback, but can also randomize. Stationary policies are preferred as they have a simple structure: The policy does not need to remember the past; i.e., it is memoryless.
<br />
<br />
<b>Theorem</b>:<br />
<i><ol>
<li>Uniformly optimal stationary policies may not exist and optimal policies may need to randomize;</li>
<li>Stationary policies suffice in CMDPs;</li>
<li>Feasibility can be hard to check in CMDPs;</li>
<li>CMDPs can be recast as linear programs, but they cannot be recast as MDPs with identical state-action spaces.</li>
<li>Gradient algorithms designed for MDPs can be made to work for CMDPs.</li>
</ol></i>
<br/>
<br/>
Parts 1,2, and 4 are from the classic <a href="https://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf">book</a> of <a href="https://www-sop.inria.fr/members/Eitan.Altman/CV.html">Eitan Altman</a>, while Part 3 is from a <a href="http://www.ams.sunysb.edu/~feinberg/public/paper3.pdf">paper</a> of <a href="http://www.ams.sunysb.edu/~feinberg/">Eugene Feinberg</a> (the paper appeared at MOR in 2000).
In what follows, I will briefly describe the reasoning behind these results, and then wrap up by
describing what these results imply for the reward hypothesis.
Part (5) is semi-original; while there is a sizeable literature that this part relates to,
there is no obvious single reference to point to.
Rest assured, I will list a number of relevant related works at the end of the post.
As to the technical contents of the rest of the post, the reader will meet occupancy measures, their relationship to value functions, linear programs, Hamiltonians, Lagrangians and stochastic approximation.
<br />
<h2>
Part 1: Lack of uniformly optimal stationary policies and the need for randomization
</h2>
This part only needs an example.
This is best done graphically:<br />
<br />
<div class="separator"
style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm7GX0KOwZZ2lf-m17BLHJcxWiZOjoVPcyISqNU8UngAhHiX3oljzxSos3dRCR_Hu2sqocOgLE_ViOs5aK-ni9m0rqjMy2IP7aTH234Ay0eSWjqWjYdK4NfeTOcbpFZSKxDiTwLuGPol6Q/s1600/2stateexample150.png"
imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
<img border="0" data-original-height="347" data-original-width="563" height="197"
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm7GX0KOwZZ2lf-m17BLHJcxWiZOjoVPcyISqNU8UngAhHiX3oljzxSos3dRCR_Hu2sqocOgLE_ViOs5aK-ni9m0rqjMy2IP7aTH234Ay0eSWjqWjYdK4NfeTOcbpFZSKxDiTwLuGPol6Q/s320/2stateexample150.png"
width="320" />
</a>
</div>
<div class="separator" style="clear: both; text-align: center;">
Figure: 2-state MDP example for illustrating CMDPs</div>
<br />
The figure shows a two-state MDP; the states are marked by "L" and "R" (standing for "left" and "right").
In state "L" there are two actions: Staying in the left state ("LEFT") and moving to the right ("RIGHT"). In state "R" there is only one action available: Staying in that state ("STAY"). Transitions are deterministic. The rewards under the two reward functions $r_0$ and $r_1$ are shown next to each transition. The two reward functions agree on all transitions except the one that results in moving from left to right. In particular, under both reward functions,
staying in the left state results in no reward and staying in the right state results in a reward of $+1$,
while the transition from left to right incurs a reward of $+1$ under $r_0$ and a reward of $-1$ under $r_1$, resulting in a conflict between the two rewards, as is typical for constrained problems.
<br/>
Forgetting for a moment about $r_1$, under $r_0$, regardless of $0\le \gamma<1$,
the unique optimal policy $\pi^*$ chooses to go right in state "L".
Let us, for simplicity, assume that the long-term future is discounted away entirely: $\gamma=0$.
Further, let the threshold $\theta_1$ be chosen to be zero: $\theta_1 = 0$.
Since $V^{\pi^*}(r_1,L)=-1\lt 0$, $\pi^*$ is eliminated by the constraint.
<br/>
Any stationary policy in this CMDP is characterized by a single number $p\in [0,1]$:
The probability of choosing to stay in the left state.
By abusing notation, we identify these policies with these probabilities.
Similarly, we identify the initial state distributions with the probability $\beta\in [0,1]$ of starting from the left state.
Then, for $p\in [0,1]$, it holds that
\begin{align*}
V^p(r_0,\beta) = \beta(1-p)+1-\beta\,,\\
V^p(r_1,\beta) = \beta(p-1)+1-\beta\,.
\end{align*}
Thus, for a given $\beta\in [0,1]$, the set of feasible stationary policies is the set of those where the value of $p$ satisfies
\[
\beta(p-1)+1-\beta \ge 0 \quad \Leftrightarrow
\quad
p \ge \frac{2\beta-1}{\beta}\,.
\]
When $\beta \gt 0.5$, the valid range for $p$ is $[(2\beta-1)/\beta,1]\subsetneq [0,1]$.
Thus, there is only one optimal stationary policy, which is given by the parameter $p^* = (2\beta-1)/\beta$ (this is because $V^p(r_0,\beta)$ is decreasing in $p$).
Since this depends on $\beta$, <b>there is no uniformly optimal policy among stationary policies</b>.
We also see that policies that randomize can achieve better value than policies that do not randomize.
The example also shows that by restricting policies to be deterministic one can lose as much as the difference between
the respective values of the <b>best</b> and the <b>worst</b> policies.
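These closed-form expressions are easy to check and optimize numerically; here is a quick sketch with $\beta=0.8$ (my choice), for which $p^* = (2\beta-1)/\beta = 0.75$:

```python
import numpy as np

beta = 0.8                        # probability of starting in state "L"
p = np.linspace(0, 1, 101)        # one stationary policy per value of p

v0 = beta * (1 - p) + 1 - beta    # V^p(r_0, beta), with gamma = 0
v1 = beta * (p - 1) + 1 - beta    # V^p(r_1, beta)

feasible = v1 >= -1e-9            # the constraint with theta_1 = 0
best_p = p[feasible][np.argmax(v0[feasible])]
print(best_p, (2 * beta - 1) / beta)   # both close to 0.75
```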
<h2>
Part 2: Sufficiency of stationary policies
</h2>
First, let us clarify the meaning of <b>sufficiency</b>:
we call a set of policies sufficient if for any CMDP, the set contains an optimal policy.
The sufficiency of stationary policies will follow from a close correspondence between
value functions and occupancy measures, and from the fact that stationary policies "span" the set of all occupancy measures.
But what are these occupancy measures?
Given an arbitrary policy $\pi$ and initial state distribution $\mu$,
for state-action pair $(s,a)\in Z$, let
\[
\nu_\mu^\pi(s,a) = (1-\gamma) \sum_{t=0}^\infty \gamma^t \mathbb{P}_\mu^\pi( S_t = s,A_t = a)\,.
\tag{*}
\]
We call $\nu_\mu^\pi$ the discounted normalized occupancy measure underlying $\pi$ and $\mu$:
It is the measure that counts the expected discounted number of "visits" of the individual admissible state-action pairs.
The normalizer $(1-\gamma)$ makes $\nu_\mu^\pi$ a probability measure: $\sum_{s,a} \nu_\mu^\pi(s,a)=1$.
To save space, in what follows, we will shorten the name of $\nu_\mu^\pi$ and will just say that it is an occupancy measure.
<br/>
The promised relationship between values and occupancy measures is as follows:
For any reward function $r$, policy $\pi$, initial state distribution $\mu$,
\begin{align*}
(1-\gamma) V^\pi(r,\mu)
& = (1-\gamma) \sum_{t=0}^\infty \gamma^t \mathbb{E}_\mu^\pi[r(S_t,A_t)] \\
& = (1-\gamma) \sum_{t=0}^\infty \gamma^t \sum_{s,a} \mathbb{P}_\mu^\pi(S_t=s,A_t=a) r(s,a) \\
& = \langle \nu_\mu^\pi, r \rangle\,.
\end{align*}
Here,
the last equation uses the inner product notation
($\langle f,g \rangle = \sum_z f(z) g(z)$)
to emphasize that this expression is linear in both the occupancy measure and the reward.
Based on the last expression, it is clear that
<b>to prove the sufficiency of stationary policies
it suffices to prove that the occupancy measure of any policy $\pi$ can be replicated using the occupancy of some stationary policy.</b>
<br/>
To prove the latter, pick an arbitrary policy $\pi$ and a distribution $\mu$.
We want to show that there is a stationary policy $\pi'$ such that
\begin{align*}
\nu_\mu^\pi(s,a) = \nu_\mu^{\pi'}(s,a), \qquad \forall (s,a)\in Z\,. \tag{ST}
\end{align*}
By abusing notation, define $\nu_\mu^\pi(s)=\sum_a \nu_\mu^\pi(s,a)$.
Note that this is the discounted state visitation probability distribution induced by $\mu$ and $\pi$.
(In fact, $\nu_\mu^\pi(s) = (1-\gamma) V^\pi(e_s,\mu)$
where $e_s$ is the binary-valued reward function that rewards taking any action in state $s$.)
Now, define the stationary policy $\pi'$ as follows:
For states $s$ such that $\nu_\mu^\pi(s)\ne 0$,
we let
\begin{align*}
\pi'(a|s) =
\frac{\nu_\mu^\pi(s,a)}{\nu_\mu^\pi(s)}\,,
\end{align*}
while for states with $\nu_\mu^\pi(s)=0$, we pick any probability distribution.
<br/>
We will now show that $\nu_\mu^\pi(s) = \nu_\mu^{\pi'}(s)$ for any state $s$,
which immediately implies $(\text{ST})$.
Recall that $P_{s,a}(s')$ is the probability of transitioning to state $s'$ when action $a$ is taken in state $s$.
Define $P^{\pi'}$ by $P_s^{\pi'}(s') = \sum_{a} \pi'(a|s) P_{s,a}(s')$. This is thus
the probability of transitioning to state $s'$ when policy $\pi'$
is followed from state $s$ for one time step.
Now,
\begin{align*}
\nu_\mu^\pi(s)
& = (1-\gamma) \sum_{t\ge 0} \gamma^t \mathbb{P}_\mu^\pi(S_t=s) \\
& = (1-\gamma) \mu(s) + (1-\gamma) \gamma \sum_{t\ge 0} \gamma^t \mathbb{P}_\mu^\pi(S_{t+1}=s) \\
%& = (1-\gamma) \mu(s)
% + \gamma (1-\gamma) \sum_{t\ge 0} \gamma^t
% \sum_{s',a} \mathbb{P}_\mu^\pi(S_t=s',A_t=a) P_{s',a}(s) \\
& = (1-\gamma) \mu(s)
+ \gamma
\sum_{s',a} \left\{ (1-\gamma) \sum_{t\ge 0} \gamma^t \mathbb{P}_\mu^\pi(S_t=s',A_t=a) \right\} P_{s',a}(s) \\
& = (1-\gamma) \mu(s)
+ \gamma \sum_{s',a} \nu_\mu^\pi(s',a) P_{s',a}(s) \\
& = (1-\gamma) \mu(s)
+ \gamma \sum_{s'} \nu_\mu^\pi(s') P^{\pi'}_{s'}(s)\,.
\tag{**}
\end{align*}
Fixing a canonical ordering over the states, we can view $(\nu_\mu^\pi(s))_{s}$ as a row-vector and $(P^{\pi'}_{s'}(s))_{s',s}$ as a matrix. Then, the above identity can be written in the short form
\[
\nu_\mu^\pi = (1-\gamma) \mu + \gamma \nu_\mu^\pi P^{\pi'}\,.
\]
Reordering, and solving for $\nu_\mu^\pi$, we get,
\[
\nu_\mu^\pi = (1-\gamma) \mu (I - \gamma P^{\pi'})^{-1} = \nu_\mu^{\pi'},
\]
where the second equality follows
from $(**)$ and noting that by $(*)$, $\nu_\mu^{\pi'}(s,a)= \pi'(a|s) \nu_\mu^{\pi'}(s)$
(that the inverse of $I-\gamma P^{\pi'}$ exists follows from the fact that for arbitrary $z$,
$x \mapsto z + \gamma x P^{\pi'}$ defines a max-norm contraction).
Together with our previous argument,
this finishes the proof that stationary policies are sufficient in CMDPs.
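For a stationary policy, the identities used above (that $\nu_\mu^\pi$ is a probability distribution, and that $(1-\gamma)V^\pi(r,\mu) = \langle \nu_\mu^\pi, r\rangle$) can be verified numerically on a random MDP. A sketch, with my own encoding of $P$, $r$ and $\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
mu = np.ones(nS) / nS

P_pi = np.einsum('sa,sat->st', pi, P)

# State occupancy as a row vector: nu = (1-gamma) mu (I - gamma P_pi)^{-1}.
nu_s = (1 - gamma) * mu @ np.linalg.inv(np.eye(nS) - gamma * P_pi)
nu = pi * nu_s[:, None]            # nu(s,a) = pi(a|s) nu(s)

# Occupancy measures are probability distributions over Z ...
assert abs(nu.sum() - 1) < 1e-10

# ... and values are linear in them: (1-gamma) V = <nu, r>.
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * r).sum(axis=1))
assert abs((1 - gamma) * mu @ v - (nu * r).sum()) < 1e-10
print("identities check out")
```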
<br />
<h2>
Part 3: Feasibility can be hard
</h2>
This claim is based on a connection between Hamiltonian cycles and CMDPs.
For brevity, call deterministic state-feedback policies "deterministic".
We have the following theorem:
<br />
<br />
<b>Theorem</b>:<br />
<i>Checking feasibility in CMDPs <b>among deterministic policies</b> is NP-hard.</i>
<br/>
<br/>
But why would one care about deterministic policies? We already know that the best deterministic policies may be arbitrarily worse than the best stationary policy.
Deterministic policies are still not completely out because in many applications randomization is just too hard to sell:
If every time a customer comes to a website, a die decides what they see, questions of accountability,
reproducibility and explainability will inevitably arise.
The same holds for many other engineering applications, such as autonomous driving.
While reproducibility could be mitigated by using carefully controlled pseudorandom numbers,
accountability and explainability remain a problem.
Assuming that this is a convincing argument, how does it happen that feasibility checking among deterministic policies is NP-hard?
<br/>
The construction to prove this claim is based on a reduction from the <b>Hamiltonian cycle</b> problem. Recall that the Hamiltonian cycle problem is to check whether there exists a cycle in a (say) directed graph that visits every vertex exactly once. Also recall that this is one of the classic NP-complete problems. The reduction is based on the following idea:
Given a directed graph, we construct a Markov transition structure whose state space is the set of vertices, the set of admissible actions at a given state $s$ are the outgoing edges at the vertex $s$, with the transitions deterministically moving along the corresponding edge. In this graph, every Hamiltonian cycle $H$ gives rise to a deterministic policy that at a state $s$ picks the edge that is included in the cycle $H$.
<br/>
Now, fix any state (vertex) $s_0$ and define the reward $r$ so that every transition out of state $s_0$ gives a unit reward and there are no other rewards:
\[
r(s',a) = \begin{cases} 1, & \text{ if } s'=s_0\,;\\
0, & \text{ otherwise}\,.
\end{cases}
\]
If there are $N$ states (vertices), there is a Hamiltonian cycle $H$ and $\pi$ is the deterministic policy corresponding to this cycle, then
\[
V^\pi(r,s_0) = 1 + \gamma^N + \gamma^{2N} + \dots = \frac{1}{1-\gamma^N}\,. \tag{H}
\]
The proposed problem is then as follows:
\begin{align*}
\textit{Check whether there exists a deterministic policy }\pi \textit{ such that (H) holds}. \tag{CC}
\end{align*}
For the sake of specificity, from now on we let $\gamma = 1/2$.
<br/>
To see why (CC) is sufficient for the reduction, suppose that $\pi$ is a deterministic policy that satisfies (H).
Since $\pi$ is deterministic, when started from $s_0$, it generates a deterministic sequence of states $s_0,s_1,\dots$.
Either $s_0$ occurs a single time in this sequence ("$s_0$ is transient"), or it appears infinitely many times.
If it appears a single time, $V^\pi(r,s_0) = 1$, which contradicts (H).
Thus, $s_0$ must appear infinitely many times. Let $m$ be the smallest positive integer such that $s_m=s_0$. Clearly then $V^\pi(r,s_0) = 1/(1-\gamma^m)$. Hence, if (H) holds then $m=N$. It is also clear that $(s_1,\dots,s_m)$ does not have duplicates because then $s_0$ would need to be transient, which we have already ruled out.
Hence, all states are visited exactly once in the sequence $s_0,s_1,\dots,s_{m-1}=s_{N-1}$, which means that this sequence defines a Hamiltonian cycle.
<br/>
To finish, note that $V^\pi(r,s_0) = 1/(1-\gamma^N)$ is equivalent to $V^\pi(r,s_0)\ge 1/(1-\gamma^N)$ and $V^{\pi}(-r,s_0) \ge -1/(1-\gamma^N)$, thus showing that checking feasibility among deterministic policies with $r_1 = r$, $r_2 = -r$ and respective thresholds $1/(1-\gamma^N)$, $-1/(1-\gamma^N)$ is NP-hard. Since with $\gamma=1/2$, these thresholds are also of polynomial length in the size $N$ of the input, the proof is finished.
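As a sanity check of (H), consider a graph that is itself a directed $N$-cycle: the cycle policy visits $s_0$ at times $0, N, 2N, \dots$, and its value can be computed by direct simulation and compared with $1/(1-\gamma^N)$. A toy sketch (mine):

```python
gamma, N = 0.5, 5
next_state = {s: (s + 1) % N for s in range(N)}   # deterministic cycle policy
reward = lambda s: 1.0 if s == 0 else 0.0         # unit reward out of s0 = 0

# Roll the policy out from s0 and accumulate the discounted return.
s, v, discount = 0, 0.0, 1.0
for _ in range(200):                              # gamma**200 is negligible
    v += discount * reward(s)
    discount *= gamma
    s = next_state[s]

print(v, 1 / (1 - gamma**N))   # both about 1.03226
```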
<br/>
As to whether hardness applies to feasibility testing among stationary policies,
first note that the set of occupancy measures can be described
as a subset of $\mathbb{R}^Z$ that satisfies
$|S|+1$ linear constraints.
Indeed, our previous argument shows that for arbitrary $\pi$ and $\mu$,
\begin{align*}
\sum_a \nu_\mu^\pi(s,a) =
(1-\gamma)\mu(s) +
\gamma \sum_{s',a'} \nu_\mu^\pi(s',a') P_{s',a'}(s)\,.
\end{align*}
It follows that the set $X_\gamma$ of such occupancy measures satisfies
\begin{align*}
X_\gamma & \subset \{ x \in [0,\infty)^Z \,:\,
\sum_a x(s,a)=(1-\gamma)\mu(s) +
\gamma \sum_{s',a'} x(s',a') P_{s',a'}(s), \forall s\in S
\}\,.
\end{align*}
In fact, reusing the argument used in Part (2), it is not hard to show that
this display holds with equality.
Now, the feasibility problem is to decide whether the set
\[
F = \{ x\in X_\gamma \,:\, \langle x,r_i \rangle \ge (1-\gamma) \theta_i, i=1,\dots,n \}
\]
is empty. Clearly, this is also a set that is described by $|S|+1+n$ linear constraints.
Hence, checking whether this set is empty is a linear feasibility problem. Linear feasibility problems can be solved in polynomial time -- assuming that the feasibility problem is well-conditioned.
But is there a guarantee that the above feasibility problem is well-conditioned?
When $n=1$, the answer is yes: We can just consider the single-objective MDP and solve it in polynomial time. If the objective is above $\theta_1$, we return with "Yes, the problem is feasible", otherwise we return with "No".
The situation is different when feasibility under a number of different discount factors is considered:
<br />
<b>Theorem (Feinberg, 2000)</b>:<br />
<em>
Checking feasibility for CMDPs with $m$ states and $2m-1$ constraints with distinct discount factors is NP-hard.
</em>
<br/>
<h2>
Part 4: Solving CMDPs -- the scalarization fallacy
</h2>
From the previous parts, it follows that one can solve CMDPs by solving the linear program
\begin{align*}
\min_x & -\langle x, r_0 \rangle \\
& \text{ s.t. } \qquad x\in X_\gamma\,, \\
& \qquad \qquad (1-\gamma)\theta_i - \langle x,r_i \rangle \le 0, \quad i=1,\dots,n\,.
\end{align*}
Indeed, a solution to this problem gives the occupancy measure $x^*$ underlying some stationary policy which
can be read out from $x^*$ by a simple marginalization (see Part (2)).
It remains to be seen how to solve the above linear program.
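A direct route is to hand the linear program to an off-the-shelf solver. The sketch below does this with SciPy on the two-state example of Part 1 (my encoding, with $\gamma=0$, $\beta=0.8$, $\theta_1=0$); marginalizing the optimal $x^*$ recovers the randomized policy with $p^*=0.75$:

```python
import numpy as np
from scipy.optimize import linprog

gamma, beta, theta1 = 0.0, 0.8, 0.0
# Variables: x = (x(L,stay), x(L,right), x(R,stay)).
r0 = np.array([0.0, 1.0, 1.0])
r1 = np.array([0.0, -1.0, 1.0])
mu = np.array([beta, 1 - beta])

# Flow constraints defining X_gamma; with gamma = 0 the inflow terms vanish
# and they reduce to sum_a x(s,a) = mu(s).
A_eq = np.array([[1.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
b_eq = (1 - gamma) * mu

# Maximize <x, r0> subject to <x, r1> >= (1-gamma) theta_1 and x in X_gamma.
res = linprog(-r0, A_ub=[-r1], b_ub=[-(1 - gamma) * theta1],
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
x = res.x
p = x[0] / (x[0] + x[1])           # pi'(stay | L), by marginalization
print(p)                           # close to 0.75
```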
<br/>
One possibility is
to fold the extra value constraints into the optimization objective.
For this, introduce the <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrangian function</a>
\[
L(x,\lambda) = - \langle x, r_0 \rangle + \sum_i \lambda_i ((1-\gamma) \theta_i - \langle x,r_i \rangle)\,,
\]
where $\lambda = (\lambda_i)_i$ are nonnegative reals, the so-called <b>Lagrangian multipliers</b>.
Clearly, if $x$ satisfies the value constraints in $(\text{CMDP}_\mu)$ then $\sup_{\lambda\ge 0} L(x,\lambda) = - \langle x, r_0 \rangle$,
while if it does not satisfy them then $\sup_{\lambda\ge 0} L(x,\lambda)=\infty$.
Hence, the CMDP problem is equivalent to solving
\begin{align*}
\min_{x\in X_\gamma} \max_{\lambda \ge 0} L(x,\lambda) \quad \tag{MM}
\end{align*}
in $x$.
For a fixed $x$, let $\lambda_*(x)$ denote a maximizer of $L(x,\cdot)$ over $[0,\infty]^n$ (explicitly allowing for $\lambda_{*,i}(x)=\infty$).
If $x^*$ is the solution to the above problem, by definition
$L(x^*,\lambda_*(x^*))\le L(x,\lambda_*(x^*))$.
The tempting conclusion then is
that if $\lambda^* = \lambda_*(x^*)$ were given, then it would be sufficient to solve the minimization problem
\[
\min_{x\in X_\gamma} L(x,\lambda^*)\,.
\]
This would be a great simplification of the CMDP problem,
as this optimization problem corresponds to a single-objective MDP problem with reward function
\[
r = r_0 + \sum_{i=1}^n \lambda_i^* r_i\,.
\]
One view of this is that the optimal Lagrangian multipliers choose the right conversion factors to "merge" the multiple rewards into a single reward.
<br/>
<b>Unfortunately, this argument is seriously flawed.</b>
Indeed, if it was sufficient to find a minimizer of $x \mapsto L(x,\lambda^*)$ then,
given that we are in an MDP, this minimizer could always be chosen to be induced by some deterministic policy. However, such a policy cannot be optimal in the original CMDP if in there all deterministic policies are suboptimal!
In Part (1), we saw an example of such a CMDP.
<b>Simple scalarization using optimal Lagrangian multipliers does not work.</b>
<br/>
The root of the problem is that while the solution $x^*$ of the CMDP problem is indeed one of the minimizers,
the function $x \mapsto L(x,\lambda^*)$ will in general have many other minimizers
(such as the ones that underlie deterministic policies!), and there is no guarantee that these minimizers would also solve the original problem. In fact, when $x^*$ corresponds to a non-deterministic policy, the set
$\{ x\in X_\gamma\,:\, L(x,\lambda^*)=\min_{y\in X_\gamma} L(y,\lambda^*) \}$
will have infinitely many elements (in fact, we can always find a nontrivial line segment in this set).
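To make this degeneracy concrete, consider again the two-state example of Part 1 with $\gamma=0$, $\beta=0.8$ and $\theta_1=0$; for this instance the optimal multiplier works out to $\lambda^*=1$ (my calculation). With the merged reward $r_0+\lambda^* r_1$, every stationary policy attains the same Lagrangian value, yet most of these minimizers violate the constraint:

```python
import numpy as np

beta, lam_star = 0.8, 1.0          # lam* = 1: my calculation for this instance
r0 = np.array([0.0, 1.0, 1.0])     # rewards on (L,stay), (L,right), (R,stay)
r1 = np.array([0.0, -1.0, 1.0])

def occupancy(p):                  # occupancy of "stay left with probability p"
    return np.array([beta * p, beta * (1 - p), 1 - beta])   # gamma = 0

# At lambda*, the Lagrangian L(x, lam*) = -<x, r0 + lam* r1> is flat across
# all stationary policies...
vals = [-occupancy(p) @ (r0 + lam_star * r1) for p in np.linspace(0, 1, 11)]
assert max(vals) - min(vals) < 1e-12

# ...but most of its minimizers violate the constraint <x, r1> >= 0:
print(occupancy(0.0) @ r1)         # about -0.6: infeasible deterministic policy
print(occupancy(0.75) @ r1)        # about  0.0: the actual CMDP solution
```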
<br/>
To build up our intuition about what is happening, let us take a step back and just think a little bit about min-max problems (after all, with our Lagrangian, the CMDP problem became a min-max problem).
For the sake of specificity, consider the game of Rock-Paper-Scissors.
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk4ed0wghWkQR0czygXnqG3uC3Fe9tSnDk7bgJQXzmKahMntx5riOAcPLRe-I7gaWqE4sJu9KllQ_Eb4Ee-zoF2KpOXSAkgQzISqeYHnd3SMggWq2lCmSxNxN4rQilRx35InyMWBb5EWTh/s1600/440px-Rock-paper-scissors.svg.png"
imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;">
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk4ed0wghWkQR0czygXnqG3uC3Fe9tSnDk7bgJQXzmKahMntx5riOAcPLRe-I7gaWqE4sJu9KllQ_Eb4Ee-zoF2KpOXSAkgQzISqeYHnd3SMggWq2lCmSxNxN4rQilRx35InyMWBb5EWTh/s320/440px-Rock-paper-scissors.svg.png"
width="320" height="307" data-original-width="440" data-original-height="422" />
<br/>
The mechanics of the Rock-Paper-Scissors game.
</a>
</div>
This is a game between two players (say, Alice and Bob), where each player needs to choose, simultaneously, one of rock, paper or scissors. Paper beats rock, scissors beats paper, and rock beats scissors.
The player who loses pays one "Forint" to the other player. In this situation, the equilibrium strategy, for both players, is to choose an action uniformly at random. Now, if Alice knows for sure that Bob plays uniformly at random (suggestively, $\lambda_1^* = \lambda_2^* = \lambda_3^* = 1/3$),
then Alice will achieve the same expected winnings (zero) using <b>any</b> strategy!
But if Alice chose anything but the uniform strategy, Bob could reoptimize their strategy (deviating from $\lambda^*$!) and could end up with a positive expected return.
So Alice had better stick to the uniform strategy, the only strategy that leaves no room for Bob to exploit her.
<br/>
Similarly, in CMDPs it is not sufficient to find just any minimizer of $L(x,\lambda^*)$, but one needs to find a minimizer such that the value would not move even if $\lambda^*$ was reoptimized!
As noted above, in general, $\lambda^*$ will in fact be such that there will be infinitely many minimizers of $L(\cdot,\lambda^*)$.
<h2>Part 5: Solving CMDPs -- reduction to MDPs via gradient updates</h2>
The above argument paints a pretty dark picture of the prospects of solving a CMDP by reducing it to solving some MDP.
As it turns out,
the situation is actually not so bad.
As we will now show, a reduction is possible, as long as we stick to "gradient based" MDP solvers.
How can this be?
The magic comes from what is known as Danskin's theorem.
<br/>
To introduce this theorem, we need some extra notation.
In what follows we use $L_x'$ ($L_\lambda'$) to denote the partial derivative of $L=L(x,\lambda)$ with respect to $x$ (respectively, $\lambda$).
The interested reader can explore some of the generalizations of this theorem
on <a href="https://en.wikipedia.org/wiki/Danskin%27s_theorem">Wikipedia</a>.
For the astute reader, we note in advance that its conditions will <b>not</b> be satisfied for our $L$ as defined above; but the theorem is still useful to motivate the algorithm that we want to introduce.
<br/>
<br/>
Recall that $\lambda_*(x)$ is the maximizer of $\lambda \mapsto L(x,\lambda)$.
<br/>
<b>Theorem (John M. Danskin)</b>:<br />
<em>
Let $\Phi(x) = L(x,\lambda_*(x))$.
If $\lambda_*(x)$ is in $(0,\infty)^n$ and is differentiable,
and both $L_x'$ and $L_\lambda'$ exist
then $\Phi$ is also differentiable and
$\nabla \Phi(x) = L'_x(x,\lambda_*(x))$.
</em>
<br/>
<br/>
<b>Proof:</b>
<em>
Since $\lambda_*(x)\in (0,\infty)^n$, it is an unconstrained maximizer of $L(x,\cdot)$
and thus by the first-order optimality condition for extrema of differentiable functions,
$L_\lambda'(x,\lambda_*(x))=0$.
Hence,
$\nabla \Phi(x) = L'_x(x,\lambda_*(x)) + L'_\lambda(x,\lambda_*(x)) \lambda_*'(x) = L'_x(x,\lambda_*(x))$.
<br/>
Q.E.D.
</em>
<br/>
<br/>
<br/>
By this theorem, the <a href="https://en.wikipedia.org/wiki/Ordinary_differential_equation">ordinary differential equation</a>
\[
\dot{x} = - L_x'(x,\lambda_*(x)) \tag{XU}
\]
describes a dynamics that converges to a minimizer of $\Phi(x)$
as long as along the trajectories we can ensure that $\lambda_*(x)$ is inside $(0,\infty)^n$
(to stay inside $X_\gamma$, one also needs to add projections, but for now let us just sweep this under the rug).
This is all good, but how would we know $\lambda_*(x)$? For a given fixed value of $x$, one can of course use a gradient flow
\[
\dot{\lambda} = L_\lambda'(x,\lambda)\, \tag{LU}
\]
which, under mild regularity conditions, guarantees that $\lim_{t\to\infty} \lambda(t)=\lambda_*(x)$ regardless of the initial value $\lambda(0)$.
The idea is then to put the two dynamics together, making the update of $\lambda$ work on a "faster" timescale than the update of $x$:
\begin{align*}
\dot{x} & = - \epsilon L_x'(x,\lambda)\,, \\
\dot{\lambda} & = L_\lambda'(x,\lambda)
\end{align*}
for $\epsilon$ "infinitesimally small". This must sound crazy!
Will this, for example, make the updates infinitely slow?
And what happens when $\lambda_*(x)$ is not finite?
<br/>
Interestingly, a simple Euler discretization with decreasing step-sizes, one that is often used anyway when only noisy "estimates"
of the gradients $L_x'$ and $L_\lambda'$ are available, addresses all these concerns!
We just have to think of the update of $\lambda$ as being indefinitely faster than that of $x$,
rather than thinking of $x$ as being updated indefinitely slower than $\lambda$.
This results in the updates
\begin{align*}
x_{k+1}-x_k &= - a_k (L_x'(x_k,\lambda_k)+W_k) \,,\\
\lambda_{k+1} - \lambda_k &= b_k (L_\lambda'(x_k,\lambda_k)+U_k)\,,
\end{align*}
where the positive stepsize sequences $(a_k)_k$, $(b_k)_k$ are such that
$a_k \ll b_k$ and both converge to zero not too fast and not too slow, and where $W_k$ and $U_k$ are zero-mean noise sequences.
This is what is known as a <b>two-timescale stochastic approximation</b> update.
In particular, the timescale separation happens together with "convergence in the limit"
if the stepsizes are chosen to satisfy
\begin{align*}
a_k = o(b_k) \quad k \to \infty\,,\\
\sum_k a_k = \sum_k b_k = \infty, \quad \sum_k a_k^2<\infty, \sum_k b_k^2 <\infty\,.
\end{align*}
This may make some readers worry that $a_k$ will be too small and thus that $x_k$ may converge slowly.
The trick is to make $b_k$ larger rather than making $a_k$ smaller.
For example, we can use $a_k = c/k$ if it is paired with $b_k = c/k^{0.5+\gamma}$ for $0<\gamma<0.5$.
This makes the $\lambda_k$ iterates "more noisy", but the $x_k$ updates can smooth this noise out
and since now $\lambda_k$ will also converge a little faster than usual,
the $x_k$ can converge at the usual statistical rate of $1/\sqrt{k}$.
The details are not trivial and a careful analysis is necessary to show that these ideas work; in fact, I have not seen a complete analysis of this approach. To convince the reader that these ideas will actually work, I include a figure that shows how the iterates evolve in time in a simple, noiseless case.
In fact, I include two figures:
One when $x$ is updated on the slow timescale (the approach proposed here) but also
the one when $\lambda$ is updated on the slow timescale.
Surprisingly, this second approach is the one that makes an appearance in pretty much all the CMDP learning
papers that use a two-timescale approach.
As suggested by our argument of Part 3, and confirmed by the simulation results shown, this approach just cannot work:
The $x$ variable will keep oscillating between its extreme values (representing deterministic policies).
<br />
<div class="separator"
style="clear: both; text-align: center;"
>
<br />
<table style="margin-left:auto;margin-right:auto;">
<tbody>
<tr>
<td>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1uc0jy96zwriht6hKCJev01uiHbFc7fHi7Rb7mf-6DDA5HoxO3zP1BWV3bVP5ViLGPKVlYmMnO6jQ0wWNPsR8RuUu2Yz5XrfmqmIcu6d_v-M715_Xlr4V0G93UEekWT-zNsSqYEQSyMHL/s1600/x-slow.png" alt="Snow" style="width:100%">
</td>
<td>
<img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEghywrJ-LAgHYhn5R6QG2d_plzxGnEYNHhsKTyy1aBKiQTyjl3TZdsEKfq0UZjVeT2kIlsurLZFCRTZx3YbB1EzICHbhdNAkS3ou0uF9ZzvC5ohH7nENYcJDES60s6C-xEjIcIHU63gJ4/s1600/x-fast.png" alt="Snow" style="width:100%">
</td>
</tr>
</tbody>
</table>
<em>
Simulation results for the case when there is a single state, two actions. <br/>
If $x$ is the probability of taking the first action, the payoff under $r_0$ is $-x$ (e.g., $r_0(1)=-1$, $r_0(2)=0$).
The payoff under $r_1$ is $x$ and the respective threshold is $0.5$. Altogether, this results in the constrained optimization problem
"$\min_{x\in [0,1]} x$ subject to $x\ge 1/2$".
The Lagrangian is $L(x,\lambda) = x + \lambda(1/2-x)$.
The left-hand subfigure shows $x$ and $\lambda$ as a function of the number of updates when the stepsizes are $1/k$ ($1/\sqrt{k}$)
for updating $x$ (respectively, $\lambda$).
The stepsizes are reversed for the right-hand side figure.
As is obvious from the figures, only the variable that is updated on the slow timescale converges to its optimal value.
</em>
<br />
</div>
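The experiment behind these figures is easy to reproduce. The sketch below is my own minimal reimplementation (an assumption, not the author's original script): noiseless two-timescale updates on the toy problem from the caption, $\min_{x\in[0,1]} x$ subject to $x\ge 1/2$, with stepsizes $1/k$ for $x$, $1/\sqrt{k}$ for $\lambda$, and clipping as the projection step:

```python
import numpy as np

# Noiseless two-timescale simulation of the toy CMDP from the figure:
#   min_{x in [0,1]} x  subject to  x >= 1/2,
# with Lagrangian L(x, lam) = x + lam*(1/2 - x).
# x is updated on the slow timescale (a_k = 1/k), lam on the fast one
# (b_k = 1/sqrt(k)); both iterates are projected back to their domains.

def run(iters=200_000, x0=1.0, lam0=0.0):
    x, lam = x0, lam0
    xs = np.empty(iters)
    for k in range(1, iters + 1):
        a_k, b_k = 1.0 / k, 1.0 / np.sqrt(k)
        grad_x = 1.0 - lam                        # dL/dx
        grad_lam = 0.5 - x                        # dL/dlam
        x = min(1.0, max(0.0, x - a_k * grad_x))  # descent in x, clipped to [0,1]
        lam = max(0.0, lam + b_k * grad_lam)      # ascent in lam, clipped to [0,inf)
        xs[k - 1] = x
    return xs, lam

xs, lam = run()
print(xs[-1], lam)  # x should slowly approach 0.5, with visible oscillations
```

Swapping the two stepsize sequences reproduces the non-convergent behavior of the right-hand figure.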
<br />
As the figures on the left-hand side attest, the algorithm works,
though given how simple the example considered is, the oscillations and the slow convergence are rather worrisome.
Readers familiar with the optimization literature will recognize this as a problem that has been long recognized in that literature,
where various remedies have also been proposed.
However, rather than exploring these here (for the curious reader,
we give some pointers to the literature at the end of the post), let us turn to addressing the next major limitation of the approach presented so far, which is that in its current form the algorithm needs to maintain an occupancy measure over the states,
which is usually quite problematic.
<br/>
A trivial way to avoid representing occupancy measures is to search directly in a smoothly parameterized policy class.
Let $\pi_\rho = \pi_\rho(a|s)$ be such a policy parameterization, where $\rho$ lives in some finite dimensional vector space.
Then, we can write the Lagrangian
\begin{align*}
L(\rho,\lambda)
& = - (1-\gamma) V^{\pi_\rho}(r_0,\mu) - \sum_i \lambda_i ((1-\gamma) V^{\pi_\rho}( r_i, \mu )-(1-\gamma)\theta_i) \\
& = - (1-\gamma) V^{\pi_\rho}\left(r_0+\sum_i \lambda_i r_i, \mu \right)+ (1-\gamma) \sum_i \lambda_i \theta_i\,,
\end{align*}
and our previous reasoning goes through without any changes.
The gradient of $L$ with respect to $\rho$ can then be obtained using any of the "policy search" approaches available in the literature: As long as we can argue that these updates follow the gradient $L_\rho'(\rho,\lambda)$,
we expect that the algorithm will find a locally optimal policy, which satisfies the value constraints.
<br/>
Finally, to see what it takes to update the Lagrangian multipliers $(\lambda_i)_i$, we calculate
\[
\frac{\partial}{\partial \lambda_i} L(\rho,\lambda) = (1-\gamma)\theta_i-(1-\gamma) V^{\pi_\rho}( r_i, \mu )\,.
\]
That is, $\lambda_i$ will be increased if the $i$th constraint is violated, while it will be decreased otherwise.
Reassuringly, this makes intuitive sense. The update also tells us that the full algorithm will need to estimate the value of the policy to be updated under all the reward functions $r_i$.
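The structure of the resulting algorithm can be illustrated on a hypothetical single-state problem with two actions and $\gamma=0$ (so values reduce to one-step expected rewards, and exact gradients stand in for the sampled policy-gradient and value estimates a real agent would use); the rewards and threshold below are made up for illustration:

```python
import numpy as np

# Minimal sketch of the Lagrangian policy-search loop on a hypothetical
# single-state, two-action problem with gamma = 0:
#   maximize V(r0) subject to V(r1) >= theta,
# with a softmax policy, rho updated on the slow timescale and the
# multiplier lam on the fast one.
r0 = np.array([1.0, 0.0])    # objective reward, per action
r1 = np.array([0.0, 1.0])    # constraint reward, per action
theta = 0.4                  # constraint threshold

rho = np.zeros(2)            # softmax policy parameters
lam = 0.0                    # Lagrange multiplier
for k in range(1, 300_001):
    pi = np.exp(rho - rho.max()); pi /= pi.sum()
    # slow timescale: ascend V(r0 + lam*r1);
    # for a softmax policy, dV(r)/drho_a = pi_a * (r_a - V(r)).
    r = r0 + lam * r1
    rho += (1.0 / k) * pi * (r - pi @ r)
    # fast timescale: dL/dlam = theta - V(r1), clipped to keep lam >= 0,
    # so lam grows while the constraint is violated and shrinks otherwise.
    lam = max(0.0, lam + (theta - pi @ r1) / np.sqrt(k))

pi = np.exp(rho - rho.max()); pi /= pi.sum()
print(pi, lam)  # pi should drift toward (0.6, 0.4)
```

In a full RL algorithm, the exact values $\pi^\top r_i$ above would be replaced by estimates of $V^{\pi_\rho}(r_i,\mu)$, which is exactly the extra estimation burden noted in the text.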
<br/>
Stepping back from the calculations a little, what saved the day here? Why could we reuse policy search algorithms to search for the solution of a constrained problem whereas in Part (4) we suggested that finding the optimal Lagrangian $\lambda^*$ and then solving the resulting minimization problem will not work? The magic is due to the use of gradient descent but the unsung hero is
Danskin's theorem, which suggests that updating the parameters by following the partial derivative is a sensible choice.
<br />
<h2>
Conclusions</h2>
As we saw, CMDPs are genuinely different from MDPs.
Thus, if one faces a problem with value constraints, chances are good that the problem cannot be solved
using techniques developed for MDPs <em>alone</em>.
In particular, combining the rewards alone will be insufficient.
As such, the reward hypothesis appears to be limiting unless one in fact starts with a problem with no value constraints.
However, we also saw that techniques developed for the single-objective case have the potential to be used in a more complicated setup where an outer loop optimizes the policy using a fixed linear combination of rewards, while an inner loop reoptimizes (on a faster timescale) the coefficients that combine the rewards.
This latter finding implies that focusing on developing good methods for the single-objective case may have benefits even beyond the single-objective setting. Is this a coincidence? I believe not.
All in all,
the story of the present post gives much needed ammunition to back
a variant of the <a href="https://en.wikipedia.org/wiki/KISS_principle">KISS principle</a> applied to research:
<blockquote class="tr_bq">
<span style="font-size:16px">
<span style="font-family: "courier new" , "courier" , monospace;">
Priority should be given to developing a thorough understanding of simpler problems
over working on more complex ones.
</span>
</span>
</blockquote>
<br />
Of course, as with any general advice, this rule also works best when it is applied with moderation.
In particular, the advice does not mean that we should never work on new problems,
or that we should necessarily shy away from "big, hairy problems".
Good research always dances on the boundary of what is possible.
<br/>
As to specific research questions that one can ask, it appears that while there has been some work on developing algorithms for CMDPs, much remains to be done.
Examples of open questions include but are not limited to the following:
<ol>
<li> Is there a strongly polynomial algorithm for finite CMDPs with a fixed discount factor?</li>
<li> Prove that the two-timescale algorithm as proposed above indeed "works" (see notes on what results exist in this direction below).</li>
<li> Can the two-timescale algorithm be made to work for multiple discount factors? What changes?</li>
<li> Get sample complexity results for the case
when a simulator is available (say, with reset) and linear function approximation is used to estimate value functions.</li>
<li> Design efficient exploration methods for CMDPs, with or without function approximation (the notes again list existing results).</li>
<li> Design efficient algorithms for the batch version of the CMDP problem.</li>
</ol>
<h2>
Notes (bibliographical and other)
</h2>
<ol>
<li> <b>Uniform optimality</b> can be defined either with respect to initial states (a policy is uniformly optimal if it is optimal no matter the initial state), or with respect to initial state distributions. The latter is the definition I used in this post.
For brevity, call the former S-uniform optimality, the latter D-uniform optimality.
Clearly, if a policy is D-uniformly optimal then it is also S-uniformly optimal (starting from a fixed state $s$ corresponds to choosing a special initial state distribution which assigns probability one to state $s$).
The reverse does not hold, as attested by the two-state CMDP described in Part (1):
In this MDP, there is an S-uniformly optimal stationary policy ($\pi^*$), while, there is no D-uniformly optimal stationary policy.
It is not hard to construct MDPs where the set of S-uniformly optimal stationary policies is also empty.
</li>
<li> As noted earlier, the basic reference to constrained MDPs is the
<a href="#bib:Altman">book</a> of
<a href="https://www-sop.inria.fr/members/Eitan.Altman/CV.html">
Eitan Altman</a>.
The first part of this post is based partly on Chapters 2 and 3 of this book.
</li>
<li>
One of my favorite pet topics is hunting for examples where randomness (probabilities) provably "helps".
An example of a problem where I don't think we have a clear understanding of whether stochasticity helps is building world models for the purpose of control (again, there should perhaps be a post about why we like to think of the world as an MDP).
The canonical example of where randomization provably helps is simultaneous action games (like rock-paper-scissors).
In this post we saw that CMDPs provide another simple example: The appeal of this is that this example does not even need two agents.
Of course, given what we now know about CMDPs and games, the two examples are not quite independent.
</li>
<li>
So in Part (1) we established that randomization is essential.
But "how much" randomization does one need?
It turns out that one can show that in any finite CMDP
there always exists an optimal stationary policy $\pi^*$ such that
$R(\pi^*):=\sum_s |\text{supp}(\pi^*(\cdot|s))|$ is at most $|S|+n$.
Here, $|\text{supp}(\pi^*(\cdot|s))|$ is the number of nonzero values that $\pi^*(\cdot|s)$ takes.
Note that if $\pi^*$ was deterministic, then $R(\pi^*)$ would be $|S|$.
It follows that in a CMDP with $n$ constraints, one can always find a stationary policy
that randomizes only at $n$ or fewer states.
This result is stated in Chapter 3 of Altman's book.
</li>
<li>The hardness results are taken from a <a href="#bib:Feinberg">paper</a>
of <a href="http://www.ams.sunysb.edu/~feinberg/">
Eugene Feinberg</a>.
This paper also mentions some previous hardness results.
For example, it mentions a result of Filar and Krass from 1994 that states that computing the best deterministic policy for a constrained unichain MDP with average rewards per unit time is also NP-hard.
It also considers the so-called <b>weighted discounted criteria</b> where the goal is to maximize $\sum_i w_i V^\pi(r_i,\gamma_i,\mu)$
where $V^\pi(r_i,\gamma_i,\mu)$ stands for the total expected reward under $r_i$ and $\pi$ with start distribution $\mu$ and discount factor $\gamma_i$ (the original reference to these problems is Feinberg and Shwartz from 1994 and 95). These problems are quite different from CMDPs.
For example, stationary policies are in general not sufficient for these problems.
Based on the theorem cited in this post,
Feinberg in his paper shows the hardness of finding a feasible stationary policy in such weighted discounted MDPs.
</li>
<li> Unsurprisingly, multicriteria Markovian Decision Processes are a common topic in the AI/CS literature.
A relatively recent <a href="#bib:Roijers">survey</a> by Roijers et al., just like this post, is also aimed at explaining the relationship between multi- and single-objective problems.
This paper also starts from the <b>reward hypothesis</b> and postulates
that this hypothesis implies that one should be able to convert
any multiobjective MDP into a single-objective one in a two-step process.
In the first step, the multiple value functions are combined, with the help of a "scalarization function", into
a single value function, while in the second step one should find an MDP such that for all policies,
the value function of the policy is equal to the "scalarized" value function of that policy
(this means the new MDP must have the same policies, hence, same state and action space).
This appears to me a rather specific interpretation of the reward hypothesis.
The reader who got to this point in the post
will not be surprised to learn that Roijers et al.'s main point is that this scalarization will hardly ever work.
</li>
<li> The constructive part of this paper is concerned with the construction of <b>coverage sets</b>;
a generalization of Pareto optimality.
The coverage set construction problem is as follows:
One is given a family $\mathcal{F}$ of so-called <b>scalarization</b> functions of the form $f: \mathbb{R}^n \to \mathbb{R}$.
Given $f\in \mathcal{F}$, solving the $f$-scalarized problem means
finding a policy $\pi$ that maximizes $f(V^\pi(r_1,\mu),\dots,V^\pi(r_n,\mu))$. Call the resulting policy $f$-optimal.
A coverage set is a subset $\Pi$
of all policies, such that no matter what scalarizer $f\in \mathcal{F}$ is chosen, one can always find an $f$-optimal policy in $\Pi$.
The goal is to find a minimal coverage set.
A relaxed version of the problem is to find $\Pi$ so that for any $f\in \mathcal{F}$,
an $f$-optimal policy can be computed from $\Pi$ efficiently.
For example, if $\mathcal{F}$ contains linear functions with "weights" from a nonempty set $W \subset \mathbb{R}^n$,
it suffices to compute optimal policies for the extrema of $W$: A "convex" interpolation (randomizing over elements in $\Pi$)
is then sufficient to reconstruct any optimal policy. While this is interesting, it is unclear which user would be happy with such randomizing solutions.
</li>
<li>Constraints that rule out visiting states can be incorporated in the CMDP framework by introducing a reward
that takes on $-1$ at the forbidden states and $0$ otherwise,
setting both the discount factor associated with this reward,
and the respective threshold to zero.
This example underlines the importance of studying the case when there are multiple distinct discount factors.
</li>
<li> Constraints can also be enforced by adding a <b>safety filter</b>: In this approach, actions of the policy are not sent to the environment, but are sent to the "safety filter" which modifies them as needed. This allows one to concentrate on one problem at a time:
The environment together with the "box" that implements the safety filter becomes the new environment where one only needs to be concerned with maximizing value. The burden is then on designing the "safety filter". This is a classical approach. In a recent <a href="http://www.ipam.ucla.edu/abstract/?tid=16031&pcode=LCO2020">talk</a>, which I can only warmly recommend,
Melanie Zeilinger explored the problem of using learning to design better safety filters. Another recent <a href="#bib:Dalal">work</a> on designing safety filters by my colleagues at DeepMind is based on a local linearization approach.
</li>
<li>
A pioneer of the Lagrangian approach to learning to act in CMDPs is
<a href="https://en.wikipedia.org/wiki/Vivek_Borkar">
Vivek S. Borkar</a>.
Incidentally, and not accidentally,
Borkar is also one of the main characters behind two-timescale stochastic approximation.
(His
<a href="https://www.dropbox.com/s/4strg9nqyf5y17k/Borkar%202009%20-%20Stochastic%20Approximation%20-%20A%20Dynamical%20Systems%20Viewpoint.pdf?dl=0">
short book
</a> on stochastic approximation is a true gem!)
In his <a href="#bib:Borkar">paper</a>, Borkar considers finite CMDPs.
He proposes to update the Lagrangian multipliers on the slow timescale, which, as we saw in Part 4,
cannot guarantee convergence of the policy to the optimal solution
(although, after Theorem 4.1, Borkar suggests otherwise).
A similar approach is followed in the concurrent
<a href="#bib:CDC">paper</a> of
Vikram Krishnamurthy, Katerina Martin and Felisa Vázquez-Abad,
who also explore an algorithm similar to the use of augmented Lagrangians, which may fix these issues.
In his paper,
Borkar uses the <a href="https://en.wikipedia.org/wiki/Envelope_theorem">envelope theorem</a> (often used in economics) to argue for the convergence of the Lagrangian multipliers to their optimal value.
In simple cases, this theorem essentially follows from Danskin's theorem applied to the Lagrangian of the constrained optimization problem.
The stronger versions of the envelope theorem are useful in that they do not need the differentiability of the value function $\Phi$ which often fails to hold.
A more recent <a href="#bib:Bhat">paper</a> by Shalabh Bhatnagar and K. Lakshmanan generalizes the approach of Borkar to the case when function approximators are used. The guarantees get correspondingly weaker.
The <a href="#bib:Tessler">work</a> of
Chen Tessler, Daniel J. Mankowitz and Shie Mannor allows for general path dependent constraints. Besides this, the main contribution in this paper is allowing nonlinearly parameterized value functions and policies.
Just like the previously mentioned works, curiously, the slow timescale is still used for updating the Lagrangian multipliers.
One reason that I can see for this choice is that this allows both a simpler analysis and
a straightforward reuse of <b>all</b> existing algorithms developed for MDPs (on the fast timescale), while if the MDP algorithm needs to work on the slow timescale then one is forced to use a "gradient-like" algorithm and a more complicated analysis needs to be developed.
</li>
<li>
Other early
<a href="#bib:MannorShimkin">
work
</a> on multi-objective RL
includes that of Shie Mannor and Nahum Shimkin who consider the more general problem of learning policies (even in the presence of adversaries) that steer the average reward of a vector-valued reward function into a set of the user's choice.
The algorithms here are based on
<a href="https://www.sciencesmaths-paris.fr/upload/Contenu/HM2013/perchetHM2013.pdf">Blackwell's approachability theorem</a>
which says that multiple objectives can be satisfied simultaneously in an adversarial environment if and only if they can be satisfied in a stochastic one with a known law.
Mannor and Shimkin use this result then to create algorithms for multi-criteria MDPs for "steering" the objectives
into some set that the user should specify.
</li>
<li>
The <a href="#bib:Bohez">paper</a> of my colleagues at DeepMind, which was also mentioned at the beginning, gives an example when the Lagrangian approach is combined with RL algorithms that use neural networks. This paper replaces the policy update of
<a href="#bib:Tessler">Tessler et al.</a> with a mirror-descent like update, which underlies the RL algorithm known as
<a href="https://arxiv.org/abs/1806.06920">MPO</a>.
The authors of this paper noted that the algorithm can be highly unstable (see the figures above).
As a remedy, they propose heuristic techniques such as normalizing the Lagrangian multipliers and reparameterizing them to keep them nonnegative.
The authors also suggest to impose constraints with respect to many initial states (or distributions) simultaneously.
To facilitate this, a neural network is used to predict the Lagrangians (the input to the network is the state from where the constraint should hold).
With these extra constraints, one may need to add memory to the policies to avoid losing too much on performance.
They also suggest to consider a parameterized version of the CMDP problem where the value constraint thresholds are included as parameters.
To reduce the computational cost of solving many instances and to generalize existing solutions to new values of these thresholds, they feed the thresholds as inputs to the neural networks. A pleasant side-effect of this is that the oscillations are reduced.
The approach is illustrated on several robotic tasks, where task success rate is constrained and control cost (in terms of energy) is minimized.
An alternative to this approach,
is the earlier <a href="#bib:Achiam">work</a> of Achiam et al. who present
an approach based on conservative policy iteration (and <a href="https://arxiv.org/abs/1502.05477">TRPO</a>).
Their problem setting is harder: Their goal is to maintain the constraints even during learning. Perhaps another blog should discuss the merits and limits of both MPO and TRPO (and related approaches).
</li>
<li>When we use a policy parameterisation, the objective $L$ becomes nonconvex-concave. In this case, the two-timescale approach still works. In a recent <a href="#bib:Chi">paper</a>, Chi Jin and co-workers derived finite-time results on the rate of convergence of this method to a local optimum. They point out that, in game-theoretic terms, the solution of the min-max problem corresponds to a (local)
<a href="https://en.wikipedia.org/wiki/Stackelberg_competition">Stackelberg equilibrium</a>,
rather than a Nash equilibrium. It was mentioned before that the instability of solving the min-max problem that arises from the Lagrangian approach is widely recognized in the optimization community. To stabilize the min-max solvers, a standard recommendation is to add a small penalty that penalizes constraint violations, resulting in the so-called <a href="https://en.wikipedia.org/wiki/Augmented_Lagrangian_method">augmented Lagrangian method</a>.
Averaging is another tool that is often used (on the top of, for example, a gradient algorithm).
As it was mentioned earlier, the <a href="#bib:CDC">early paper</a> of Krishnamurty et al. already explored this idea.
A nice little <a href="#bib:Zhang">paper</a> by my former PhD student <a href="https://cs.uwaterloo.ca/~y328yu/">Yaoliang Yu</a> and his student Guojun Zhang explores the (mis)behavior of gradient methods in bilinear problems, like the ones we have seen here. In another nice
<a href="#bib:Zhang2">paper</a>, the nonlinear min-max problem is looked at by the same team expanded with Pascal Poupart.
The literature on min-max problems or solving constrained problems is huge -- so I need to stop here.
</li>
<li>
Provably efficient learning to solve a CMDP recently became a hot topic.
<a href="#bib:Liyuan">Liyuan Zheng and Lillian J. Ratliff</a>
considered known transitions and rewards and the average reward criterion,
and a known initial safe exploration policy.
The <a href="http://www.jmlr.org/papers/volume11/jaksch10a/jaksch10a.pdf">UCRL2 algorithm</a>
is modified and it is assumed that the underlying controlled Markov chain has a finite diameter.
The authors prove $O(T^{3/4})$ high probability regret up to log factors for the rewards from $r_0$ while they also show that the policies chosen satisfy the constraints with high probability at all time steps. (This seems to generalize "conservative bandits", which we <a href="https://www.ualberta.ca/~szepesva/papers/ICML16-cbandit.pdf">explored</a> with my group a while back.)
The <a href="#bib:YoMaPi">paper</a> of Efroni, Mannor and Pirotta gives a nice overview of related work (including work on budgeted bandits, safe exploration and more).
They consider the significantly easier finite horizon setting, but the problem is made more challenging because the transition structure is no longer assumed to be known.
They give algorithms (also based on optimism) for which they show that both the regret with respect to the reward to be maximized and the constraints are controlled sublinearly (essentially, $\sqrt{T}$ is achieved).
The concurrent <a href="#bib:Qiu">work</a> of Qiu et al. also considers this case,
but with adversarially chosen rewards (both for the main objective and the constraints).
<a href="#bib:Dongsheng">Ding et al.</a> again considered the finite horizon setting,
the so-called linear MDP case where linear function approximation is used. They gave similar results to the previous paper.
</li>
</ol>
<h2>References</h2>
<ol>
<li id="bib:Altman">
Eitan Altman: Constrained Markov Decision Processes,
<a href="https://www.crcpress.com/Constrained-Markov-Decision-Processes/Altman/p/book/9780849303821">CRC Press</a>,
2000. Free
<a href="https://www-sop.inria.fr/members/Eitan.Altman/TEMP/h.pdf">pdf</a>
</li>
<li id="bib:Feinberg">
Eugene Feinberg: Constrained Discounted Markov Decision Process and Hamiltonian Cycles, MOR, 25:1, 2000.
Free <a href="http://www.ams.sunysb.edu/~feinberg/public/paper3.pdf">
pdf</a> from the author's homepage.
</li>
<li id="bib:Roijers">
Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley:
<a href="http://jair.org/index.php/jair/article/view/10836">
A Survey of Multi-Objective Sequential Decision-Making</a>, JAIR, 2013.
</li>
<li id="bib:Borkar">
Borkar, V. S.
<a href="https://www.dropbox.com/s/r89u6o5prmgxol5/Borkar-CMDP.pdf?dl=0">
An Actor-Critic Algorithm for Constrained Markov Decision Processes</a>,
Systems and Control Letters 54 (3): 207--213, 2005.
</li>
<li id="bib:CDC">
Krishnamurthy, V., K. Martin, and F. Vasquez Abad.
<a href="https://www.dropbox.com/s/kecu2sjvplpr7dl/Krishnamurthy-CMDP.pdf?dl=0">
Implementation of Gradient Estimation to a Constrained Markov Decision Problem</a>,
In 42nd IEEE International Conference on Decision and Control, 2003.
<a href="https://arxiv.org/pdf/1110.4946.pdf">Repost</a> to arXiv from 2008, with two authors.
</li>
<li id="bib:MannorShimkin">
Mannor, Shie, and Nahum Shimkin,
<a href="http://www.jmlr.org/papers/volume5/mannor04a/mannor04a.pdf">
A Geometric Approach to Multi-Criterion Reinforcement Learning</a>,
Journal of Machine Learning Research: JMLR 5: 325-360, 2004.
</li>
<li id="bib:Bhat">
Bhatnagar, Shalabh, and K. Lakshmanan,
An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes.
Journal of Optimization Theory and Applications 153 (3): 688–708, 2012.
</li>
<li id="bib:Chi">
Lin, Tianyi, Chi Jin, and Michael I. Jordan.
<a href="https://arxiv.org/abs/1906.00331">
On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems</a>, June, 2019, arXiv.
</li>
<li id="bib:Liyuan">
Zheng, Liyuan, and Lillian J. Ratliff.
<a href="http://arxiv.org/abs/2001.09377">
Constrained Upper Confidence Reinforcement Learning</a>, arXiv 2020.
</li>
<li id="bib:YoMaPi">
Efroni, Yonathan, Shie Mannor, and Matteo Pirotta.
<a href="http://arxiv.org/abs/2003.02189">
Exploration-Exploitation in Constrained MDPs</a>, arXiv 2020.
</li>
<li id="bib:Qiu">
Qiu, Shuang, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang.
<a href="http://arxiv.org/abs/2003.00660">
Upper Confidence Primal-Dual Optimization: Stochastically Constrained Markov Decision Processes with Adversarial Losses and Unknown Transitions</a> arXiv 2020.
</li>
<li id="bib:Dongsheng">
Ding, Dongsheng, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo R. Jovanović.
<a href="https://arxiv.org/abs/2003.00534">
Provably Efficient Safe Exploration via Primal-Dual Policy Optimization</a>, arXiv 2020.
</li>
<li id="bib:Bohez">
Bohez, Steven, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, and Raia Hadsell.
<a href="https://arxiv.org/abs/1902.04623">
Value Constrained Model-Free Continuous Control</a>, February, arXiv
2019.
</li>
<li id="bib:Tessler">
Tessler, Chen, Daniel J. Mankowitz, and Shie Mannor.
<a href="https://dblp.org/rec/conf/iclr/TesslerMM19">
Reward Constrained Policy Optimization</a>, ICLR, January,
2019.
</li>
<li id="bib:Achiam">
Achiam, Joshua, David Held, Aviv Tamar, and Pieter Abbeel.
<a href="http://proceedings.mlr.press/v70/achiam17a/achiam17a.pdf">
Constrained Policy Optimization</a>, ICML, 2017.
</li>
<li id="bib:Zhang">
Zhang, Guojun, and Yaoliang Yu.
<a href="https://arxiv.org/abs/1908.05699">
Convergence of Gradient Methods on Bilinear Zero-Sum Games</a>, March, 2020.
</li>
<li id="bib:Zhang2">
Zhang, Guojun, Pascal Poupart, and Yaoliang Yu. 2020.
<a href="https://arxiv.org/abs/2002.11875">
Optimality and Stability in Non-Convex-Non-Concave Min-Max Optimization</a>,
February, 2020
</li>
<li id="bib:Dalal">
Dalal, Gal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa.
<a href="http://arxiv.org/abs/1801.08757">
Safe Exploration in Continuous Action Spaces</a>,
arXiv, 2018.
</li>
</ol>
<h2>
Acknowledgements
</h2>
The author is grateful to Eugene Feinberg for the inspiring talk he gave, for a whole evening of entertaining stories about math in Russia and the Soviet Union, and for catching an embarrassing mistake in an early version of this writeup.
Further typos were caught by <a href="https://www.semanticscholar.org/author/Huizhen-Yu/40355806">Huizhen Yu</a> and <a href="https://scholar.google.de/citations?hl=en&user=Gc65LRwAAAAJ&view_op=list_works&sortby=pubdate">Martin Mladenov</a> -- thank you Janey and Martin for your careful reading of the post!
<br />
<h2 id="MDPs">Markov Decision Processes: Basic concepts</h2>
To make things a bit more formal, let $M = (S,A,P)$ be a <b>controlled Markov chain</b> over the state space $S$: $A = (A(s))_{s\in S}$ gives the actions admissible in each state $s\in S$; and for each pair $(s,a)\in Z = \{(s,a)\,:\, s\in S, a\in A(s)\}$, $P_{s,a}$ gives a distribution over $S$. These transition probabilities are collected by $P$: $P = (P_z)_{z\in Z}$. A <b>deterministic policy</b> is a map $\pi: S \to A$ and following it creates a Markov chain $(S_t)_{t=0,1,\dots}$ over $S$ so that $S_{t+1} \sim P_{S_t,\pi(S_t)}$. If we allow random, state-dependent choices, we arrive at what are known as stochastic stationary Markov policies. Since this is a mouthful, let's just call these <b>stationary policies</b>. More general policies can use past states and/or actions, or just rely on a time counter but depend only on the last state (these are called Markov policies). An initial distribution $\mu$ over the state space, a policy $\pi$ and a controlled Markov chain together determine a distribution $\mathbb{P}_{\mu}^\pi$ over the set of state-action sequences in the natural way:
$\mathbb{P}_{\mu}^\pi(s_1,a_1,s_2,a_2,\dots)
= \mu(s_1) \pi(a_1|s_1) P_{s_1,a_1}(s_2)
\pi(a_2|s_1,a_1,s_2) P_{s_2,a_2}(s_3) \dots
$.
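A few lines of code may help ground these definitions; the two-state controlled Markov chain below (states, actions, and probabilities all made up for illustration) samples a trajectory under a stationary policy:

```python
import numpy as np

# A tiny controlled Markov chain matching the definitions above
# (the numbers are made up): S = {0, 1}, A(s) = {0, 1}, and
# P[(s, a)] is the next-state distribution P_{s,a}.
P = {
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0],
}
pi = {0: [0.5, 0.5], 1: [1.0, 0.0]}   # a stationary (randomizing) policy
mu = [1.0, 0.0]                        # initial state distribution

rng = np.random.default_rng(0)
s = rng.choice(2, p=mu)                # S_1 ~ mu
traj = []
for _ in range(10):
    a = rng.choice(2, p=pi[s])         # A_t ~ pi(.|S_t)
    traj.append((s, a))
    s = rng.choice(2, p=P[(s, a)])     # S_{t+1} ~ P_{S_t, A_t}
print(traj)
```

Replacing the per-state distributions in `pi` with point masses recovers a deterministic policy, and letting `pi` depend on the whole of `traj` would give a general (history-dependent) policy.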
<div class="blogger-post-footer">MLReadings</div>
<h2>Numerical Errors, Perturbation Analysis and Machine Learning (2013-09-16)</h2>
Everyone hates numerical errors. We love to think that computers are machines with infinite precision. When I was a student, I really hated error analysis. It sounded like a subject that is set out to study an annoying side-effect of our imperfect computers, a boring detail that is miles away from anything that anyone would ever consider a nice part of mathematics. I will not try to convince you today that the opposite is true. However, even in error analysis there are some nice ideas and lessons to be learned. This post asks the question whether, if you are doing machine learning, you should care about numerical errors.<br />
<br />
This issue should be well understood. However, I don't think that it is as well appreciated as it should be, or that it has received the attention it deserves. In fact, I doubt that the issue is discussed in any of the recent machine learning textbooks beyond the usual caveat "beware the numerical errors" (scary!).<br />
<br />
In this post, I will illustrate the question of numerical errors and stability with the problem of learning a real-valued function on the $[0,1]$ interval. Let $f$ be this unknown function. We will assume that $f$ is square integrable, as our objective will be to produce an approximation to $f$ that is close to $f$ in the (unweighted) $2$-norm. For simplicity, let us assume that we have decided to approximate $f$ using polynomials of degree $d-1$. Thus, defining $G = \{ g \,:\, g(x) = \sum_{j=0}^{d-1} \alpha_j x^j, \alpha_0,\ldots,\alpha_{d-1}\in \mathbb{R} \}$, our goal is to find the minimizer<br />
<br />
<div style="text-align: center;">
$P(f) = \arg\min_{g\in G} \|f-g\|_2$, <span style="color: red;">(Proj)</span></div>
<br />
where $\|f\|_2^2 = \int_0^1 f^2(x) dx$. Clearly, $G$ is a $d$-dimensional subspace of $L^2([0,1])$. To represent the functions in $G$, let us choose a basis of $G$; call the $j$th basis element $p_j$ ($j=1,\ldots,d$). Thus, $p_j:[0,1] \rightarrow \mathbb{R}$ is some polynomial. One possibility, for lack of a better idea, is to choose $p_j(x) = x^{j-1}$; if you want, I use $(p_j)_{j=1}^d$ just for the sake of generality (and to play with you a bit later).<br />
<br />
Since $(p_j)_j$ is a basis, any function $g$ of $G$ can be uniquely written as $g = \sum_{j=1}^d \theta_j p_j$. Since $g^* = P(f)$ is also an element of $G$, it can also be written in this form: $g^* = \sum_{j=1}^d \theta_j^* p_j$. To figure out $g^*$, it is thus sufficient to figure out the vector $\theta^* = (\theta_j^*)_j$. To find this vector, we can find the minimum of<br />
<br />
<div style="text-align: center;">
$J(\theta) = \| f - \sum_{j=1}^d \theta_j p_j \|_2^2$ </div>
<br />
(since $u \mapsto u^2$ is monotone on $[0,\infty)$). For this, it is useful to remember that for $f\in L^2([0,1])$, $\|f\|_2^2 =\langle f,f \rangle$, where the inner product $\langle \cdot,\cdot\rangle$ is defined for all $f,g\in L^2([0,1])$ by $\langle f,g \rangle = \int_0^1 f(x)g(x) dx$. Taking the derivative of $J(\theta)$ with respect to $\theta_i$ and setting the result to zero, we get<br />
<br />
<div style="text-align: center;">
$0=\frac{d}{d\theta_i} J(\theta) = 2 \langle \sum_{j=1}^d \theta_j p_j-f, p_i \rangle$.</div>
<br />
Using the symmetry and linearity of the inner product $\langle \cdot,\cdot\rangle$, introducing the matrix $A$ whose $(i,j)$th element is $a_{ij} = \langle p_i, p_j \rangle$ and the vector $b$ whose $i$th element is $\langle f,p_i \rangle$, collecting the above equations for $i=1,\ldots,d$ and reordering the terms, we see that these equations can be written in the short form<br />
<br />
<div style="text-align: center;">
$A\theta = b$.</div>
<br />
Note that here $A$ is a positive definite matrix. Solving this linear system thus will give us $\theta^*$ and hence $g^*$.<br />
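As a sketch of the recipe so far (with a hypothetical target $f(x)=\sin(2\pi x)$ and the monomial basis $p_j(x)=x^{j-1}$), the normal equations $A\theta = b$ can be set up and solved numerically as follows; `scipy.integrate.quad` stands in for the exact inner products $\langle f, p_i\rangle$:

```python
import numpy as np
from scipy.integrate import quad

# Normal equations A theta = b for projecting an (assumed) f onto
# degree-(d-1) polynomials in the monomial basis p_j(x) = x^(j-1).
d = 5
f = lambda x: np.sin(2 * np.pi * x)       # hypothetical target function

# a_ij = <p_i, p_j> = int_0^1 x^(i+j-2) dx = 1/(i+j-1)  (the Hilbert matrix)
A = np.array([[1.0 / (i + j - 1) for j in range(1, d + 1)]
              for i in range(1, d + 1)])
# b_i = <f, p_i>, computed by numerical quadrature
b = np.array([quad(lambda x, i=i: f(x) * x ** (i - 1), 0, 1)[0]
              for i in range(1, d + 1)])

theta = np.linalg.solve(A, b)             # coefficients of g* in the monomial basis
g_star = lambda x: sum(t * x ** j for j, t in enumerate(theta))
```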
<br />
So far so good. However, in practice, we rarely have access to the true function $f$. One commonly studied set-up in machine learning assumes that we only have a noisy "sample" of the form $((X_t,Y_t);t=1,\ldots,n)$, where the pairs $(X_t,Y_t)$ are i.i.d., $\mathbb{E}[Y_t|X_t] = f(X_t)$ and $\mathrm{Var}(Y_t|X_t)\le c<\infty$. However, I do not want to go that far; instead, consider the conceptually simpler case when, instead of $f$, we can access $\hat{f}: [0,1] \rightarrow \mathbb{R}$. Think of $\hat{f}$ as the "noise corrupted" version of $f$. Later it should become clear how the reasoning that follows extends to the case when only a finite dataset of the previously mentioned form is available.<br />
<br />
Realizing that we do not have access to $f$, but only to $\hat{f}=f+\Delta f$, it is clear that we won't be able to calculate $g^* = P(f)$, but only (say) $P(\hat{f})$. Let's call the result $\hat{g}^*$: $\hat{g}^* = P(\hat{f})$. How far is $\hat{g}^*$ from $f$?<br />
<br />
By the Pythagorean theorem, $\|f-\hat{g}^*\|_2^2 = \| f- g^* \|_2^2 + \|g^* - \hat{g}^* \|_2^2$.<br />
Here, $ \| f- g^* \|_2^2$ is the error that can only be reduced by increasing the degree $d$ (or enlarging $G$), while $\|g^* - \hat{g}^* \|_2^2$ is the error due to starting with the noisy function $\hat{f}$ and not with $f$. The problem of studying the size of $\|f- g^* \|_2^2$ is the subject of approximation theory (where various nice results show how, e.g., the degree of smoothness and $d$ jointly influence the rate at which this error decreases to zero as $d$ goes to infinity), and this error is called the <b>approximation error</b>. We won't get into this nice subject here; perhaps in a future post.<br />
<br />
So it remains to calculate how big the other term is (this term is normally called the estimation error, as it measures the effect of the stochastic error $\Delta f$). One way to understand this term is via the error analysis (a.k.a. stability/perturbation analysis) of linear systems. Perhaps this is not the simplest way, but by pursuing this direction the post will be more entertaining, and hopefully we will also make some useful observations. Readers suspecting mischief are asked to stay calm.<br />
<br />
Let us start this error analysis by observing that if we write $\hat{g}^* =\sum_{j=1}^d\hat{\theta}_j^* p_j$ then the vector $\hat{\theta}^*$ will be the solution to the linear system $A \theta = b+\Delta b$, where $\Delta b_j = \langle\Delta f,p_j\rangle$, $j=1,\ldots,d$. Now, $\| \sum_{j=1}^d \theta_j p_j \|_2^2 = \theta^\top A \theta$, hence<br />
<br />
<div style="text-align: center;">
$\|g^* - \hat{g}^* \|_2^2 = ({\theta}^*-\hat{\theta}^*)^\top A ({\theta}^*-\hat{\theta}^*)$.</div>
<br />
Since $A \hat{\theta}^* = b+\Delta b$ and $A {\theta}^* = {b}$,
subtracting the two, we get $A ( \hat{\theta}^*- {\theta}^*) = \Delta
b$, from where we get that<br />
<br />
<div style="text-align: center;">
$({\theta}^*-\hat{\theta}^*)^\top A ({\theta}^*-\hat{\theta}^*) = (\Delta b)^\top A^{-1} \Delta b$.</div>
<br />
Hence,<br />
<br />
<div style="text-align: center;">
$\|g^* - \hat{g}^* \|_2^2= (\Delta b)^\top A^{-1} \Delta b\le \lambda_{\min}^{-1}(A) \|\Delta b \|_2^2$, <span style="color: red;">(*)</span></div>
<br />
where $\lambda_{\min}(A)$ denotes the minimum eigenvalue of $A$. Note that by appropriately choosing the error term $\Delta b$, the inequality can be made sharp. What inequality (*) thus means is that the error caused by $\hat{f}-f$ can be badly enlarged if $\Delta b$ has an unfortunate relation to $A$.<br />
<br />
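A quick numerical check of (*) on a small ($4\times 4$ Hilbert) Gram matrix: for a generic $\Delta b$ the inequality is strict, while aligning $\Delta b$ with the eigenvector of $\lambda_{\min}(A)$ attains it (the matrix and the perturbations below are made up for illustration).

```python
import numpy as np

# Gram matrix of the monomial basis for d = 4: the 4 x 4 Hilbert matrix.
d = 4
A = np.array([[1.0 / (i + j - 1) for j in range(1, d + 1)]
              for i in range(1, d + 1)])
lam, V = np.linalg.eigh(A)       # eigenvalues ascending, columns of V orthonormal

rng = np.random.default_rng(1)
db = rng.standard_normal(d)      # a generic perturbation Delta b
lhs = db @ np.linalg.solve(A, db)            # (Delta b)^T A^{-1} Delta b
bound = db @ db / lam[0]                     # lambda_min(A)^{-1} ||Delta b||^2

db_worst = V[:, 0]               # Delta b aligned with the minimum eigenvector
worst = db_worst @ np.linalg.solve(A, db_worst)   # equals 1/lambda_min(A)
```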
Readers familiar with the error analysis of linear systems of equations will recognize that (*) is similar to, but still different from, the inequality that shows how errors propagate in the solution of linear systems due to coefficient errors (including errors in $b$). The difference is easy to understand: it is caused by the fact that in numerical analysis the error of the parameter vector is measured in the (unweighted) 2-norm, whereas here the error is measured in the norm defined by $\|x\|_{A}^2 = x^\top A x$; in this application this error is more natural than the usual $2$-norm. Thus, this is one of the lessons: use the norm that is the most appropriate for the application (because of this change of norms, instead of the condition number of $A$, we see the inverse of the minimum eigenvalue of $A$).
<br />
To see how bad (*) can be, consider the case when $p_j(x) = x^{j-1}$, i.e., the standard polynomial basis. In this case, $a_{ij} = \int_0^1 x^{i+j-2} dx = [x^{i+j-1}/(i+j-1)]_0^1 = 1/(i+j-1)$. The resulting matrix is an old friend, the so-called Hilbert matrix. In fact, this matrix is most infamous for its horrible conditioning, but most of the blame should go to how small the minimum eigenvalue of $A$ is. Indeed, a result of Szegő (1982) shows that as $d\rightarrow \infty$, $\lambda_{\min}(A) \sim 2^{15/4} \pi^{3/2} d^{1/2} (\sqrt{2}-1)^{4d+4}$ (an amusing asymptotic relationship on its own). <b>Thus, the reciprocal of the minimum eigenvalue of $A$ is exponentially large in $d$, potentially totally defeating the purpose of increasing $d$ to reduce the approximation error.</b><br />
<br />
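A short sketch of how fast $\lambda_{\min}$ of the Hilbert matrix decays, with Szegő's asymptotic expression printed alongside (how closely the two agree at small $d$ should not be over-interpreted, since the formula is only asymptotic):

```python
import numpy as np
from scipy.linalg import hilbert

# Minimum eigenvalue of the d x d Hilbert matrix vs. Szego's asymptotic.
# (For much larger d the computed eigenvalue itself drowns in rounding error.)
lam_min = {}
for d in (4, 7, 10):
    lam_min[d] = np.linalg.eigvalsh(hilbert(d))[0]
    szego = 2 ** (15 / 4) * np.pi ** 1.5 * np.sqrt(d) * (np.sqrt(2) - 1) ** (4 * d + 4)
    print(f"d={d:2d}  lambda_min={lam_min[d]:.3e}  asymptotic={szego:.3e}")
```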
Thus, perhaps we should choose a different system of polynomials. Indeed, once we know that the goal is to keep $\lambda_{\min}(A)$ large, we can choose the polynomials so as to make this happen. One way, which also helps considerably with speeding up the calculations, is to choose $(p_j)$ such that $A$ is diagonal, and if diagonal, why not choose $(p_j)$ such that $A$ becomes the identity matrix. We can simply start with $p_1 = 1$ (because $\int_0^1 1^2 dx=1$) and then choose the coefficients of $p_2(x) = a_{20} + a_{21} x$ such that $\langle p_2,p_2 \rangle=1$ and $\langle p_1,p_2 \rangle = 0$ (this gives $[a_{20}^2 x+ a_{20}a_{21} x^2+a_{21}^2 x^3/3]_0^1=1$ and $[a_{20}x + a_{21} x^2/2]_0^1=0$, which can be solved for the coefficients $a_{20}$, $a_{21}$). Continuing the same way, we can find $p_1,\ldots,p_d$, which are (up to normalization) the <a href="http://en.wikipedia.org/wiki/Legendre_polynomials#Shifted_Legendre_polynomials">shifted Legendre polynomials</a> (the unshifted versions are defined the same way for the $[-1,1]$ interval).
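Instead of carrying out the recursion by hand, one can check numerically that the normalized shifted Legendre polynomials $p_j(x) = \sqrt{2j-1}\,P_{j-1}(2x-1)$ turn the Gram matrix into the identity; a sketch using SciPy's `eval_legendre`:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

# Normalized shifted Legendre basis on [0, 1]:
#   p_j(x) = sqrt(2j - 1) * P_{j-1}(2x - 1),  j = 1, ..., d.
d = 4
p = lambda j, x: np.sqrt(2 * j - 1) * eval_legendre(j - 1, 2 * x - 1)

# Gram matrix a_ij = <p_i, p_j>; it should come out as the identity.
A = np.array([[quad(lambda x, i=i, j=j: p(i, x) * p(j, x), 0, 1)[0]
               for j in range(1, d + 1)] for i in range(1, d + 1)])
```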
<br />
At this stage, however, the astute reader should really wonder whether I have lost my sense of reality! How on earth should the choice of the basis of $G$ influence the error caused by the "noisy measurement"??? In fact, a very simple reasoning shows that the basis (and thus $A$) should not change how big $\|\hat{g}^*-g^* \|_2^2$ is. To see why, note that $\hat{g}^*-g^* = P(\hat{f})-P(f)$. Up to now, I have carefully avoided giving a name to $P$, but in fact, $P$ does not have to be introduced: it is the well-known orthogonal projection operator. As such, it is a linear operator. Hence, $ P(\hat{f})-P(f) = P( \hat{f}-f ) = P(\Delta f)$. It is also known that projections cannot make vectors larger (they are non-expansions in the 2-norm that defines them). Hence, $\|P(\Delta f)\|_2 \le \|\Delta f\|_2$, which in summary means that <br />
<br />
<div style="text-align: center;">
$\|\hat{g}^*-g^*\|_2 \le \|\Delta f \|_2$.</div>
<br />
The "measurement" error of $f$ cannot be enlarged by projecting the measured function to the function space $G$. How do we reconcile this with (*)? One explanation is that (*) is too conservative; it is a gross over-estimate. However, we also said that the only inequality that was used could be sharp if $\Delta b$ was selected in an "unfortunate" fashion. So that inequality could be tight. Is this a contradiction? [Readers are of course encouraged to stop here and think this through before continuing with reading the rest.]<br />
<br />
No, there is no contradiction. All the derivations were sound. The answer to the puzzle is that $\Delta b$ cannot be selected arbitrarily. By writing $\Delta f = \Delta f^{\parallel}+ \Delta f^{\perp}$, where $\Delta f^{\parallel}\in G$ and $\Delta f^{\perp}$ is perpendicular to $G$, we see that $\Delta b_j = \langle \Delta f^{\parallel}, p_j\rangle$. Since $\Delta f^{\parallel}\in G$, it can be written as $\sum_j \gamma_j p_j$ with some coefficients $(\gamma_j)$. Hence, $\Delta b_j = \langle \Delta f, p_j \rangle = \sum_k \gamma_k \langle p_k,p_j\rangle$, or $\Delta b = A \gamma$! Thus $\Delta b$ is restricted to lie in the range of $A$. The rest is easy:<br />
<br />
<div style="text-align: center;">
$\|g^* - \hat{g}^* \|_2^2= (\Delta b)^\top A^{-1} \Delta b = \gamma^\top A \gamma = \|P(\Delta f) \|_2^2$.</div>
<br />
The conclusion? Was this a pointless exercise? Is the stability analysis of linear systems irrelevant to machine learning? No, not at all! What we can conclude is that although the basis chosen does not influence the error in the final approximator that can be attributed to errors in measuring the unknown function, this conclusion holds only if we assume infinite precision arithmetic. With finite precision arithmetic, our analysis shows that rounding errors can be blown up exponentially with the dimension of the subspace $G$ growing to infinity. Thus, with an improperly chosen basis, the rounding errors can totally offset or even reverse the benefit that is expected from increasing the dimensionality of the approximation space $G$.<br />
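The punchline can be reproduced numerically. In the sketch below a function $g \in G$ (with made-up coefficients) is recovered from its inner products twice: through the monomial/Hilbert route and through the orthonormal shifted Legendre route. The tiny quadrature and floating-point errors in $b$ play the role of the rounding errors discussed above, and only the ill-conditioned route amplifies them.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

# The same subspace G, two bases: monomials (A = Hilbert) vs. normalized
# shifted Legendre (A = I). The target g lies exactly in G.
d = 14
rng = np.random.default_rng(0)
gamma = rng.standard_normal(d)                     # made-up coefficients of g
leg = lambda j, x: np.sqrt(2 * j + 1) * eval_legendre(j, 2 * x - 1)
g = lambda x: sum(c * leg(j, x) for j, c in enumerate(gamma))

# Monomial basis: solve the (ill-conditioned) Hilbert system for theta.
A = np.array([[1.0 / (i + j + 1) for j in range(d)] for i in range(d)])
b_mon = np.array([quad(lambda x, i=i: g(x) * x ** i, 0, 1)[0] for i in range(d)])
theta = np.linalg.solve(A, b_mon)
g_mon = lambda x: sum(c * x ** j for j, c in enumerate(theta))

# Orthonormal basis: A = I, so the coefficients are just b.
b_leg = np.array([quad(lambda x, j=j: g(x) * leg(j, x), 0, 1)[0] for j in range(d)])
g_leg = lambda x: sum(c * leg(j, x) for j, c in enumerate(b_leg))

# Root-mean-square reconstruction error on a grid (a proxy for the L2 norm).
xs = np.linspace(0, 1, 1001)
err_mon = np.sqrt(np.mean((g(xs) - g_mon(xs)) ** 2))
err_leg = np.sqrt(np.mean((g(xs) - g_leg(xs)) ** 2))
print(f"monomial basis: {err_mon:.2e}   orthonormal basis: {err_leg:.2e}")
```

With the orthonormal basis the recovery is accurate to near machine precision, while the Hilbert route loses many orders of magnitude, exactly as (*) predicts for perturbations that need not lie in the range of $A$.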
<br />
<div style="text-align: center;">
<b>If you cannot calculate it with a computer, you cannot estimate it either.</b></div>
<br />
(The careful reader may wonder whether the rounding errors, unlike $\Delta b$ above, can lie in the dreaded eigenspaces of $A$; in this case they can.) Hence, when considering a finite-dimensional approximation, care must be taken to choose the basis functions so that the resulting numerical problem is stable.<br />
<br />
Finally, one commenter on the blog once noted that I should post more about reinforcement learning. So, let me finish by remarking that the same issues exist in reinforcement learning as well. A basic problem in reinforcement learning is to estimate the value function underlying a stationary Markov policy. Long story short, such a policy gives rise to a Markov chain with a transition probability kernel $\mathcal{P}$. We can view $\mathcal{P}$ as an operator that maps bounded measurable functions $f:X \rightarrow \mathbb{R}$ to functions of the same kind: if $g=\mathcal{P} (f)$, then $g(x) = \int_{y} f(y) \mathcal{P}(dy|x)$, where the integration is over the state space $X$. Evaluating the said policy then amounts to solving<br />
<br />
<div style="text-align: center;">
$(I-\gamma\mathcal{P}) f = r$ (PEval)</div>
<br />
in $f$, where $0\le \gamma < 1$ is the so-called discount factor and $r:X \rightarrow \mathbb{R}$ is the so-called reward function (again, $r$ is assumed to be bounded and measurable). A common method to solve linear inverse problems like the above is Galerkin's method. Long story short, this amounts to selecting an orthogonal basis $(\psi_k)$ in $L^2(X;\mu)$, with $\mu$ a probability measure, and then solving the said system in $G= \{ f\,:\, f =\sum_{k=1}^d \alpha_k \psi_k, \alpha_1,\ldots,\alpha_d\in \mathbb{R} \}$ in the sense that the solution $g\in G$ is required to satisfy (PEval) in the 'weak sense': $\langle (I-\gamma\mathcal{P})g,\psi_k\rangle = \langle r,\psi_k\rangle$, $k=1,\ldots,d$. As it turns out, there are numerically more and less stable ways of selecting the basis $(\psi_k)$, though figuring out how to do this in the practical setting when $I-\gamma\mathcal{P}$ is unknown but can be sampled remains for future work.<br />
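For a finite state space everything in (PEval) is computable, so the Galerkin scheme can be sketched directly; the chain, reward, weighting measure $\mu$ and feature vectors below are all made up for illustration (the features need not be orthogonal for the linear algebra to go through; orthogonalizing them well is exactly the stability question raised above).

```python
import numpy as np

# Galerkin policy evaluation on a finite chain with n states: solve
#   <(I - gamma P) g, psi_k>_mu = <r, psi_k>_mu,   k = 1, ..., d,
# for g in the span of d feature vectors (columns of Psi).
rng = np.random.default_rng(0)
n, d, gamma = 20, 5, 0.9
P = rng.dirichlet(np.ones(n), size=n)      # row-stochastic transition matrix
r = rng.standard_normal(n)                 # reward function
mu = np.ones(n) / n                        # weighting measure for L2(X; mu)
D = np.diag(mu)

Psi = rng.standard_normal((n, d))          # feature matrix, columns span G
# Galerkin system: Psi^T D (I - gamma P) Psi alpha = Psi^T D r
M = Psi.T @ D @ (np.eye(n) - gamma * P) @ Psi
alpha = np.linalg.solve(M, Psi.T @ D @ r)
f_galerkin = Psi @ alpha                   # the weak-sense solution in G

f_exact = np.linalg.solve(np.eye(n) - gamma * P, r)   # exact value function
```

With $d = n$ and invertible features, `f_galerkin` coincides with `f_exact`; for $d < n$ it only satisfies the $d$ weak-sense equations.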
<br />
<b>Bibliography:</b><br />
Szegő G. 1982 <i>On some Hermitian forms associated with two given curves of the complex plane</i>, Collected Papers, vol 2. (Basle: Birkhauser) p 666.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com1tag:blogger.com,1999:blog-1374054023917435198.post-1392920455950610242013-09-15T21:19:00.004-07:002013-09-15T21:19:58.635-07:00Student Response Systems -- A Year LaterI have promised to come back with my experience after the Fall semester of 2012. Though the end of that semester passed long ago, here are some thoughts:<br /><br />Overall my experience of socrative was very positive. During the first week I polled the class to see how large a percentage of the class had some kind of wifi-enabled device that they could use. 90 out of the 95 students had some kind of device that they were willing to bring to the class, so I decided to give socrative a try.
<br />Socrative helped me tremendously to stay on top of what everyone in the class knows. The way I used socrative was as follows: After every block of new material (usually 5-10 slides), I inserted a few questions to verify whether the students "got" the new material. This was all the "extra work" that I had to put in designing the class material due to socrative. And I would have done something similar without socrative anyways, so in fact I did not feel much of a difference here. Once the question was projected, I started the question on socrative by pushing a button and then the students could start entering their answers. The answers are shown on the teacher's screen. Based on the feedback received, I could then decide whether I needed to re-explain something.<br /><br />I used a tablet to connect to socrative, while I used a second computer to project my slides: This was handy as I did not have to exit the presentation to start the question in socrative. In the classroom there was a projector that could project the tablet's screen. Sometimes I showed the feedback screen even before I got all the answers in. The students seemed to like this as it created an interesting dynamic in the class.<br />
<br />I used multiple choice questions mostly, but I also used yes/no questions occasionally. In addition to these, I also experimented with "short answer" questions. This latter question-type would be very useful and I generally prefer it. It is not only less work to prepare "short answer" questions, but they are usually better at testing what the students know. To put it another way, it is really hard (and time-consuming) to design good multiple choice questions with valid-looking alternatives (if you have some ideas of how to automate this, I am all ears). Examples of the "short answer" questions I asked are: "What is the keyword that should go in this program in this place?" or "What will this program print?". <br /><br />The feedback screen after a quiz-type question shows a histogram which is updated on the fly, so if you want, you can steer things in the class by showing how the histogram evolves. The same works for yes/no questions. Unfortunately, the feedback screens for the short answer questions are less than ideal, as they show the list of answers as they are coming in. On this list the same answers can in fact be repeated many times. Needless to say, this is not the best way of presenting this information. It would be much nicer, for example, if there was a histogram representing the top answers. <br /><br />I also replaced the end-of-class questionnaire of previous years with the "exit quiz" that socrative provided. Unfortunately, the questions on the exit quiz are not configurable and some of socrative's standard questions caused permanent confusion ("solve the problem on the board" -- we did not even have a board!). Also, to my surprise, the exit-quiz appeared to be less effective than the paper-based questionnaire in facilitating comments. Later, I figured that I could leave the exit-quiz open for a few days after class to collect more feedback; this helped, but unfortunately only just a little bit. 
Reading through the exit-quiz responses is also a bit of extra work if you have not done this before, but this was not the case for me. And I actually like to read through these extra comments; they are very useful most of the time.<br /><br />Once, I also experimented with socrative's "quiz" component. This allows teachers to compile a longer list of questions (a quiz) that the students can solve in the class (either timed or not timed). This quiz was not marked. Unfortunately, there were too many technical difficulties: socrative was overloaded (the class size was around 90). Also, socrative eliminated leading whitespace characters from the answers, which was quite unfortunate as whitespace is crucially important in Python, the programming language that we used in this class. Thus, I decided not to use the quiz again. <br /><br />In conclusion, although the system is not perfect, it was usable and overall it helped both the students and me. I have received many comments from the students praising how socrative was useful to engage everyone, including the shy students. As this was my main motivation to try this system, I conclude that the experiment was successful. <br />
<br />PS: Next semester, I tried again. Unfortunately, in this semester we were in a different classroom, where first the wifi service became unreliable and then it completely stopped working. I have no idea why. No one had. I asked the technical support people to fix it but all I got was promises. This was sad. Luckily, this was a smaller class (around 50), so we could still have some true interaction with the students.<br />
<br />
PPS: In the first semester I also used a website where students could solve hundreds of little programming problems. The site provided the questions and it also gave feedback, even including things like "did you forget to initialize your variable". Again, the tool had glitches (the question bank was not very well developed), but overall I think this was also a great tool.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com3tag:blogger.com,1999:blog-1374054023917435198.post-36089190542534257002012-09-10T11:19:00.000-07:002013-09-15T20:04:41.839-07:00Student Response Systems<div dir="ltr" id="internal-source-marker_0.5499149360918648">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 48px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;"></span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">There
are plenty of Student Response Systems (SRSs) out there. Which one to
choose? This brief document summarizes what I have found on the web
before the Fall of 2012.</span></div>
<h1 dir="ltr">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 24px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Radio communication-based systems</span></h1>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The systems differ in a few characteristics. The first is what type of devices the students can use. Classical, </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">iClicker</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">
like systems require that the student buy a device which communicates with a
receiver that the teacher should possess. Since our school uses
iClickers, let me focus on them. The overall cost of first generation
iClickers is $10/device, assuming the student sells back the device to
the bookstore (which they can). This is not a major cost but who likes
to pay when you don’t have to? The first limitation of iClicker-like
systems is that they are bound to the smart classrooms and the computers
there. Thus, if you are like me and use your own computer for
projection, you will need to switch between screens to show the results
of a poll. This makes the use of iClicker quite cumbersome. A second
major limitation of iClickers is that they are limited to multiple
choice questions. In particular, they don’t allow free text entry,
numerical responses, or other types of test questions (like matching).
Free text and numerical responses are very useful for assessing the
knowledge of students and designing meaningful multiple choice questions
is hard and time-consuming. </span><br />
<h1 dir="ltr">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 24px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Systems accepting text input from non-smart phones</span></h1>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">An alternative to iClickers is to use systems that use </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Wifi</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> networks or the </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">texting</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> capabilities of conventional phones. Wifi-capable devices include smartphones, tablets and laptops.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The systems that I have found that support </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">texting</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> (i.e., receiving input from non-smart phones) are </span><a href="http://www.polleverywhere.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">Poll Everywhere</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, </span><a href="http://www.lecturetools.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">LectureTools</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"> and </span><a href="http://www.tophatmonocle.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">Top Hat 
Monocle</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.
These also support input from Web-capable devices. Their pricing
differs quite a bit. Poll Everywhere allows instructors to pay for the
semester. The price currently is USD350. LectureTools requires you to pay
for two semesters at the price of USD800. Top Hat Monocle does not allow
the instructors to pay. The per student price is USD20 for a semester, or USD38 for five years. </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">As
said before, all these systems support Wifi and texting. I had a chance
to test Poll Everywhere and LectureTools. I had some problems with
LectureTools (importing my slides did not work). The concept of
LectureTools is that you keep your slides on the web, tightly integrated
with the questions. They support during the presentation annotation of
slides, which is nice. However, overall LectureTools was not as smooth
and easy to use as Poll Everywhere. For example, I could not figure out
given the limited amount of time how the students will connect on the
web to my questions. Poll Everywhere was really easy to use, on the
other hand. It supports Powerpoint and Keynote. Compared to Top Hat
Monocle, Poll Everywhere is not as feature rich (Top Hat Monocle has
games, for example), but I was happy with the functionality Poll
Everywhere provided.</span><br />
<h1 dir="ltr">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 24px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Systems that use the Web-capable devices</span></h1>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">If one is content with </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Web-enabled devices</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">,
the number of available SRSs soars high. In fact, one can use any
Web-based solution, many of them being free. Starting with the free
options, of course, one can create a quiz in </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Moodle</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.
Controlling what the students see and what they don’t is pretty
cumbersome. Moodle is not very well suited for the purpose of live
polling. </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Another method is to use </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Twitter</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.
Students are pretty excited about Twitter (from the feedback I got),
though I would be careful projecting everything that comes in to the
screen -- some moderation might be essential to keep the class under
control. Another problem with Twitter is that a tool for analyzing
responses to questions would be needed (and I know of no such tool). </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The next option is to use </span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Google Forms</span><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.
The very idea of Google Forms is that information submitted on the web
is sent to a Google spreadsheet. Since you need the spreadsheet to get
the URL of the form, start by creating a spreadsheet, then insert a form
there. Once the form is created (give it a cool skin!), get back to the
spreadsheet to get the URL to it. You will send this to the students
(maybe shortening it using tinyurl.com, or a similar service). From the
spreadsheet, you can control the timing of when the form accepts
input. To support ad hoc questions, you can just create an empty
form and recycle it through your presentation. If you want to use a
fixed set of questions, you will need one form (and thus spreadsheet)
per question that you can store on your google drive. The downside of
Google Forms is that students can submit as many responses as they wish.
With Google Apps, presumably there is a way around this, but this would
need to be investigated. </span><br />
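Short of a Google Apps solution, the multiple-submission problem can at least be mitigated after the fact by deduplicating the collected responses. The sketch below is a hypothetical illustration (it assumes a CSV export of the spreadsheet with columns timestamp, student_id and answer, rows in submission order, and that students are asked to enter an id); it keeps only each student's last answer:

```python
import csv
import io

def latest_responses(csv_text):
    """Keep only each student's last submission; assumes rows arrive in
    chronological order with columns timestamp, student_id, answer."""
    last = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        last[row["student_id"]] = row["answer"]
    return last

export = """timestamp,student_id,answer
09:00,s1,A
09:01,s2,B
09:02,s1,C
"""
print(latest_responses(export))  # {'s1': 'C', 's2': 'B'}
```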
<a href="http://www.flisti.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">Flisti</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">
is an extremely simple web-based polling system (I guess there are
many other similar systems). You go to their webpage, create the poll
there and give the poll’s URL to the students. You can view the results
of the poll online. I think that users are tracked based on their IP
addresses, so no multiple submissions are possible from the same IP
address to the same poll. Only multiple choice questions (with multiple
answers, possibly) are supported.</span><br />
<a href="http://www.socrative.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">Socrative</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">
is a more full-featured Web-based SRS currently under beta-testing. During the
beta-testing phase, the system is free. The Web-based interface is nice
and sleek and extremely user-friendly. The teacher can control in
real-time which questions are “live” in his/her “classroom”. The
classroom is identified by a number that the students go to. The only
issue with Socrative is that every activity is limited to 50 students.</span><br />
<a href="http://www.questionpress.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">QuestionPress</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">
is another commercial system. The price for my class is $66. This seems
to be a mature system that I was truly impressed by. All interactions
are Web-based. </span><a href="http://www.clickerschool.com/Pages/SignedOutHome.aspx"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">ClickerSchool</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">
is similar to QuestionPress, the price is $95 for my class for one
semester. ClickerSchool provides specialized smartphone apps (both
for iOS and Android). </span><a href="http://eclicker.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">eClicker</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, on the other hand, requires the teacher to buy software. All their software supports Apple products only (Mac OS X and iOS). </span><a href="http://studentresponsenetwork.com/"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">SRN</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">
(studentresponsenetwork.com) offers a campuswide license for $195. This
is a client-server system that requires installing software on the
teachers’, as well as the students’ computers. </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">I could not find pricing information on the web for </span><a href="http://www.einstruction.com/products/student-response-systems/vclicker-mobile-edition"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">TurningPoint/ResponseWare</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">, which seems to be a mature product (but it cannot be “tried”). The same goes for </span><a href="http://www.einstruction.com/products/student-response-systems/vclicker-mobile-edition"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: underline; vertical-align: baseline;">vClicker</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.</span><br />
<h1 dir="ltr">
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 24px; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline;">Which type of system to choose?</span></h1>
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The
question remains: Which type of system to choose? One factor to
consider is how important it is to have question types other than multiple
choice in class. Personally, I think that multiple choice questions
exist only for historical reasons -- their pedagogical value is rather
questionable. Good multiple choice questions are extremely hard and
time-consuming to create. If this is not convincing enough and you don’t
mind switching the projector back and forth between your computer and
the one in the classroom, you can stay with iClickers. </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">In
the opposite case, the next thing to consider is whether you want to
support phones with text input. I have just polled my students and out
of the 52 responses so far, 43 can and are willing to bring their
laptops to the classroom, 37 carry a smartphone, 6 a tablet and 8 carry a
non-Web enabled phone (these are overlapping groups of students). Thus,
the vast majority of students are able to use a system that is built
around the Web. Since it will always be hard to achieve 100% coverage,
one idea is to let the students pair up or form groups of three. Based
on the statistics I gathered, overall, I am leaning towards the view that
support for text input should not be viewed as a major advantage. </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">However,
it should also be mentioned here that a potential danger of using
laptops or other fancy, Web-enabled devices is that they represent
potential sources of distraction. Thus, with the use of these devices,
the teacher will need to face the challenge of competing for the
students’ attention with the social networks, email, and ultimately the
whole Web.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">The
next factor to consider is the ease of use of the SRS. The teacher may
need to create a significant number of questions for every class.
Integration with Powerpoint and Keynote may be a plus, but switching
between a browser and a presentation software looks easy enough.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Since
I will use the chosen SRS only for formative assessment and not for
grading (the current consensus is that this would be a bad idea, not to
mention that it would indeed require full coverage of the whole
class), I don’t care about whether the SRS supports automated grading
and keeps track of student identities. </span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">Based on these considerations, I will probably go with Socrative.</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;"></span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">References</span><br />
<span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">I
have created a google spreadsheet for comparing the systems listed
here, in addition to a few more. The spreadsheet can be found </span><a href="https://docs.google.com/spreadsheet/ccc?key=0AtY9i8UP2o_JdFlyZ3M3bERpMzI3c2xmdTFEaHJGeEE"><span style="background-color: transparent; color: #1155cc; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: underline; vertical-align: baseline;">here</span></a><span style="background-color: transparent; color: black; font-family: Arial; font-size: 15px; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline;">.</span> The spreadsheet links a few other sources that I have used during my research.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-12996089465896149372012-04-14T19:53:00.000-07:002020-04-16T11:10:28.617-07:00Approximating inverse covariance matricesPhew, the last time I have posted an entry to my blog was a loong time ago.. Not that there was nothing interesting to blog about, just I always delayed things. (Btw, google changed the template which eliminated the rendering of the latex formulae, not happy.. Luckily, I could change back the template..) Now, as the actual contents:<br />
<br />
I have just read the PAMI paper "<a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5601741">Accuracy of Pseudo-Inverse Covariance Learning-A Random Matrix Theory Analysis</a>" by D Hoyle (IEEE T. PAMI, 2011 vol. 33 (7) pp. 1470--1481). <br />
<br />
The paper is about pseudo-inverse covariance matrices and their analysis based on random matrix theory, and I can say I enjoyed this paper quite a lot.<br />
<br />
In short, the author's point is this:<br />
Let $d,n>0$ be integers. Let $\hat{C}$ be the sample covariance matrix of some iid data $X_1,\ldots,X_n\in \mathbb{R}^d$ based on $n$ datapoints and let $C$ be the population covariance matrix (i.e., $C=\mathbb{E}[X_1 X_1^\top]$).
Assume that $d,n\rightarrow \infty$ while $\alpha = n/d$ is kept fixed. Assume some form of "consistency" so that the eigenvalues of $C$ tend to some distribution over $[0,\infty)$. Denote by $L_{n,d} = d^{-1} \mathbb{E}[ \| C^{-1} - \hat{C}^{\dagger} \|_F^2]$ the reconstruction error of the inverse covariance matrix when one uses the pseudo-inverse.<br />
<br />
Then, $f(\alpha) := \lim_{d,n\rightarrow\infty} L_{n,d}$ will be well-defined (often). The main thing is that $f$ becomes unbounded as $\alpha\rightarrow 1$ (also as $\alpha\rightarrow 0$, but this is not that interesting).
In fact, for $C=I$, the order of growth as $\alpha \rightarrow 1$ is known exactly:<br />
<br />
\[f(\alpha) = \Theta(\alpha^3/(1-\alpha)^3).\] <br />
Here is a figure that shows $f$ (on the figure $p$ denotes the dimension $d$).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSeeA_bcqeQaa0LWqhAoZSZLtcCR5h0Shn936tmQKA1g9_wc0FA0zXlJqLEX-7ykx5YYncyuqDfa0JsSAIkl-t4PlSEr39lsb5n6D73IiZ6Z0vI_Z93mDT1pSWJp-aWIv7E52Knj3-7NCQ/s1600/inverse-covariance.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSeeA_bcqeQaa0LWqhAoZSZLtcCR5h0Shn936tmQKA1g9_wc0FA0zXlJqLEX-7ykx5YYncyuqDfa0JsSAIkl-t4PlSEr39lsb5n6D73IiZ6Z0vI_Z93mDT1pSWJp-aWIv7E52Knj3-7NCQ/s320/inverse-covariance.png" width="320" /></a></div>
Nice.<br />
<br />
The author calls this the "peaking phenomenon": The worst case, from the point of view of estimating $C^{-1}$, is when we have as many data points as the number of dimensions (assuming full rank $C$, or this won't make sense because otherwise you could just add dummy dimensions to improve $L_{n,d}$). The explanation is that $L_{n,d}$ is just sensitive to how small the smallest positive eigenvalue of $\hat{C}$ is (this can be shown), and this smallest positive eigenvalue will become extremely small as $\alpha\rightarrow 1$ (and it does not matter whether $\alpha \rightarrow 1-$ or $\alpha\rightarrow 1+$).<br />
<br />
Now notice that having $\alpha=0.5$ is much better than having $\alpha=1$ (for large $n,d$). Thus, there are obvious ways of improving the sample covariance estimate!<br />
<br />
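The peaking phenomenon is easy to reproduce numerically. The snippet below is my own illustration, not code from the paper; it assumes Python with NumPy and estimates $L_{n,d}$ for $C=I$ by Monte Carlo at a few values of $\alpha=n/d$:

```python
import numpy as np

def pinv_error(d, n, trials=20, seed=0):
    """Monte Carlo estimate of L_{n,d} = d^{-1} E ||C^{-1} - pinv(C_hat)||_F^2
    for C = I (standard normal data, so C^{-1} = I)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        C_hat = X.T @ X / n  # sample covariance (the mean is known to be zero)
        err = np.linalg.norm(np.eye(d) - np.linalg.pinv(C_hat), "fro") ** 2 / d
        errs.append(err)
    return float(np.mean(errs))

d = 40
for alpha in (0.5, 1.0, 2.0):
    print(f"alpha = {alpha}: error ~ {pinv_error(d, int(alpha * d)):.1f}")
```

Even at this modest dimension, the error at $\alpha=1$ should come out orders of magnitude above the errors at $\alpha=0.5$ and $\alpha=2$, consistent with the figure above.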
In fact, the author then suggests that in practice, assuming $n \ge d$, one should use bagging, while for $n\le d$ one should use random projections. Then he shows experimentally that this improves things, though unfortunately this is shown only for Fisher discriminants that use such inverses, and the demonstration for $L_{n,d}$ is lacking. He also notes that the "peaking phenomenon" can also be dealt with by
other means, such as regularization where we are referred to Bickel and
Levina (AoS, 2008).<br />
<br />
Anyhow, one clear conclusion of this paper is that the pseudo-inverse
must be abandoned when it comes to approximating the inverse covariance
matrix. Why is this interesting? The issue of how well $C^{-1}$ can be estimated comes up naturally in regression, or even in reinforcement learning when using the projected Bellman error. And asymptotics is just a cool tool to study how things behave..<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com2tag:blogger.com,1999:blog-1374054023917435198.post-66006611002504650292011-05-04T08:43:00.007-07:002011-05-04T11:57:31.040-07:00Brains, Minds and MachinesFormer UofA student,<a href="http://people.csail.mit.edu/agf/Homepage/Welcome.html"> Alborz</a>, shared a <a href="http://amps-web.amps.ms.mit.edu/public/150th/may3/">link</a> to a video recording of a recent MIT150 symposium on Brains, Minds and Machines on facebook. I watched the video yesterday (guess what, I need to mark 40 something finals, hehe:)).<br />I wrote a comment back to Alborz on facebook and then I thought, why not make this a blog post? So, here it goes, edited, expanded. Warning: Spoilers ahead and the summary will be biased. Anyhow..<br /><br />The title of the panel was: "<span style="font-style: italic;">The Golden Age — A Look at the Original Roots of Artificial Intelligence, Cognitive Science, and Neuroscience</span> " and the panelist were <strong>Emilio Bizzi, </strong><strong>Sydney Brenner, </strong><strong>Noam Chomsky, </strong><strong>Marvin Minsky, </strong><strong>Barbara H. Partee and </strong><strong>Patrick H. Winston. </strong>The panel was moderated by <strong>Steven Pinker </strong>who started with a 20-30 minute introduction. Once done with this each of the panelist delivered a little speech and at the end there were like two questions asked by Pinker.<br /><br />My heroes in the panel were <span style="font-weight: bold;">Minsky</span> and <span style="font-weight: bold;">Winston</span>. They rocked! Minsky almost fell asleep during his talk, but he was well aware of this and I loved him. 
He told a story about Asimov not wanting to come to his lab to see the real robots (he did not want to get disappointed) and about von Neumann, who said that he did not know whether Minsky's thesis could qualify as a thesis in mathematics (they were both at Princeton in the math department), but that soon it would. I really enjoyed this part. Winston acted a bit like a comedian. I did not mind this either. One thing that Minsky and Winston both said is that the mistakes happened when AI became successful and everyone from then on seemed to forget the science part of AI. But they did not say much about how we can get back on track (except that we should try). Winston blamed the short-sightedness of funding agencies, and who would disagree?<br /><br /><span style="font-weight: bold;">Chomsky</span> made some interesting claims. He claimed that language is designed for thought and not for communication. This was a pretty interesting claim. He claimed other things supporting his idea about innate, universal grammars, but I did not know how much credit to give to those, as I have many linguist friends who strongly disagree. Things became interesting when, answering a question at the end, he dumped the whole of machine learning. He talked about "a novel scientific criterion" that was never heard of before, referring to being able to predict "on unanalyzed data". He said that "of course" with enough data you will do better, but it seemed that he thinks the evaluation criterion itself is ridiculous. He also said that a little statistics does not hurt, but he still seems to think that the big deal is the engineering part. (He did not say it in these words, but this is what I got from what he said.)<br /><br /><strong>Sydney Brenner</strong> (pioneer in genetics and molecular biology) was puzzled about why he was invited, though he had some good stories. 
I liked it when he said that in 50 years people will not understand why everyone talked about consciousness 50 years ago.<br /><br /><strong>Emilio Bizzi </strong>(a big shot in neuroscience, in particular in motor control) talked about modularity, "dimension reduction" and generalization, and he looked like a fine Italian gentleman, but I have to confess that I don't remember anything else, though this could have been because it was late.<br /><br /><span style="font-weight: bold;">Barbara H. Partee</span> read a script. In the first 5 minutes or so she was mostly praising Chomsky. Then she started to talk about her own work, which was foundational in semantics. She talked about how semantics is the real thing. By semantics she means formal semantics (as in logic). While in general I am fond of this work, I am not sure if anything like this is going on in our brain and if sentences really do have a meaning in a formal sense. It seems to me that the fact that sentences can have a formal meaning is more likely an illusion, a post-hoc thought, than the real thing. And it is unclear if bringing in formal logic is going to bring us anywhere. Unfortunately, there was no discussion of this at all. At one point she made the remark that "search engines do not use semantics", but then left us wondering how we could do any better. Oh, well..<br /><br />In summary, an impressive set of people, some nice stories, but little cutting-edge science. Loads of romanticism about the 50s and 60s and no advice for the young generation. The title tells it all. 
It was still nice to see these people.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-69522372984423856612011-04-13T20:09:00.006-07:002011-05-17T09:54:09.261-07:00Useful latex/svn tools (merge, clean, svn, diff)This blog is about some tools that I have developed (and yet another one that I have downloaded) which help me to streamline my latex work cycle. I make the tools available, hoping that other people will find them useful. However, they are admittedly limited (more about this) and as usual for free stuff they come with zero guarantee. Use them at your own risk.<br /><br />The first little tool is for creating a cleaned up file before submitting it to a publisher who asks for source files. I call it <span style="font-style: italic;">ltxclean.pl</span>, it is developed in Perl. It can be downloaded from <a href="http://www.ualberta.ca/%7Eszepesva/sourcecode/ltxclean.pl">here</a>.<br />The functionality is<br />(1) to remove latex comments<br />(2) to remove \todo{} commands<br />(3) to merge files included from a main file into the main file<br />(4) to merge the bbl file into the same main file<br /><br />If you make the tool executable (chmod a+x ltxclean.pl), you can use it like this:<br /><br />$ ltxclean.pl main.tex > cleaned.tex<br /><br /><span style="font-weight: bold;">How does this work?</span><br /><br />The tool reads in the source tex file, processes it line by line and produces some output to the standard output stream, which you can redirect (as shown above) to a file.<br />Thus, whatever the tool does is limited to the individual lines. This is a limitation, but this made it possible for me to write this tool in probably less time than I spend on writing about it now.<br />There are other limitations, see below. Now, how do we know that this worked? 
The advice is to run <span style="font-family:courier new;">latex+dvips</span> and then <span style="font-family:courier new;">diff original.ps new.ps</span> to see if there is any significant change. On the files I have tried, the only difference was the filename and the date.<br /><br /><span style="font-weight: bold;">Why this functionality and the gory details</span><br /><br />As it happens, <span style="font-style: italic;">removing the comments</span> before you submit a source file is <span style="font-style: italic;">crucial</span>. Not long ago, it happened to me that I submitted a source to a publisher and did not care about removing the comments. At the publisher, they loaded the file into a program, which wrapped the long lines, including the ones with comments! This created a lot of garbage in the middle of the text. We were pressed for time, so I could not check the text in detail. The result: The text was printed with a lot of garbage! Too bad!! A painful experience for me.. I will never again submit source files with the comments kept in the file! Now, the above utility is meant to handle comments correctly. It pays attention not to create empty lines (and thus new paragraphs) inadvertently, not to remove end-of-line comments, etc.<br /><br />The <span style="font-style: italic;">\todo{}</span> commands belong to the same category: They are better removed before submitting the file. For my todos, I use the <a href="http://www.ctan.org/tex-archive/macros/latex/contrib/todonotes/">todonotes</a> package, which puts todo notes on the margin (or within the text). This package supplies the \todo[XX]{ZZZ} command, where [XX] is optional. The above little script removes such todo commands, but only if they span a single line. For now, you would need to remove multi-line todos by hand.<br /><br />Another service of this little tool is to <span style="font-style: italic;">merge multiple files</span> into a single one. 
Oftentimes, we use the latex command <span style="font-style: italic;">\input</span> to break a large source file into multiple files. However, publishers typically want just one file. So this tool reads in the main file and then, recursively, whenever it sees \input{FILE} in the source, it reads in the corresponding file and processes it before it continues with the current file (just like latex would work).<br /><br />Finally, if the tool finds a \bibliography{...} command, it will take that out and open the .bbl file sharing the same base name as the input to the tool. Thus, if the tool was called on the file main.tex, when seeing a bibliography command, the tool will attempt to open main.bbl and include it in place of the \bibliography command. (If you use <a href="http://www.tug.org/applications/hyperref/manual.html">hyperref</a>, turn off <span class="ec-lmtt-10">pagebackref</span>, otherwise this functionality will not work.)<br /><br /><span style="font-weight: bold;">Managing revisions with svn</span><br /><br />Two other small utilities that I make available are <a href="http://www.ualberta.ca/%7Eszepesva/sourcecode/svnprevdiff">svnprevdiff</a> and <a href="http://www.ualberta.ca/%7Eszepesva/sourcecode/svnreviewchanges">svnreviewchanges</a>.<br />The purpose of these scripts is to help one review changes to files which are under <a href="http://en.wikipedia.org/wiki/Apache_Subversion">svn</a> control.<br />There is a third script, <a href="http://www.ualberta.ca/%7Eszepesva/sourcecode/diffmerge">diffmerge</a>, called by the above two scripts. This script takes two file arguments and loads these into the program <a href="http://www.sourcegear.com/diffmerge/">DiffMerge</a> which allows you to visually inspect the differences between the two files and make changes to the second one loaded. 
On a different platform/installation, or if you want to use a different tool for comparing/merging files, you will need to adjust this script accordingly.<br /><br />The utility <span style="font-style: italic;">svnreviewchanges</span> takes a file as an argument, compares it to its base version stored on your disk and opens up the two versions for comparison using diffmerge. The purpose is to allow one to quickly review how a file was changed before submitting a file to the svn server (so that you can write meaningful comments in the commit message).<br /><br />The utility <span style="font-style: italic;">svnprevdiff</span> takes a filename as an argument, compares it to its <span style="font-style: italic;">previous</span> version <span style="font-style: italic;">stored on the svn server</span> and then opens up the two versions using diffmerge. The purpose of this is to check the changes implemented by your pals <span style="font-style: italic;">after</span> an update. A future version will take an optional argument which when present will be interpreted as a revision number. Maybe.<br /><br /><span style="font-weight: bold;">Advice on using latex when working in a team: Break long lines<br /></span><br />A small, but useful thing is to put every sentence on its own line and generally to avoid long lines (even when writing equations). The reason is that this will make the job of diff much easier. And believe me, diffing is something people will end up doing for good or bad (mostly good) when they are on a team.<br /><br />Some of my friends, like <a href="http://www.szit.bme.hu/%7Eantos/">Antoska</a>, would recommend breaking up the individual sentences into multiple lines. 
You can do this, but if you overdo it, you will find yourself fiddling way too much with what goes into which line.<br /><br />Finally, a tool which does this, written by <a href="http://www.math.ntnu.no/%7Estacey/">Andrew Stacey</a>, is <a href="http://www.math.ntnu.no/%7Estacey/HowDidIDoThat/LaTeX/fmtlatex">fmtlatex.pl</a>.<br />This is also in Perl and its documentation will be printed to the screen if you use <code>perldoc fmtlatex</code>. I still have to try this.<br /><br /><div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com7tag:blogger.com,1999:blog-1374054023917435198.post-60060394754701165612011-04-13T19:22:00.005-07:002011-04-13T19:53:17.621-07:00I can run Matlab on my Mac again!After much struggling today I managed to make Matlab run again on my Mac.<br />The major problem was that Matlab complained that I had the wrong version of X11 installed on my system and would not start. As I have finished teaching for the semester today, I thought that I would celebrate this by resolving this issue, which I had been struggling with for a year or so by now. On the internet you will see a lot of advice on what to do, and as they say, the truth is indeed out there, however, it is not so easy to find. In a nutshell, what seems to happen is this:<br /><br />Why does Matlab not start when other applications do start (say, Gimp, Gnuplot using X11, etc.)?<br />Matlab seems to make the assumption that the X11 libraries are located at /usr/X11/lib and it <span style="font-style: italic;">sticks to this assumption no matter how your system is configured</span>. I use XQuartz and macports' X11 and they put stuff elsewhere. I had some legacy code sitting in /usr/X11/, which I did not use. It was a remnant of some version of X11 that I used probably 2 or 3 laptops ago. Matlab reported that the lib was found, but the "architecture was wrong". 
The error message had something like:<br /><br />..<span style="font-size:85%;"><span style="font-family: courier new;"> </span><span style="font-family: courier new;"> Did find: </span></span><span style="font-size:85%;"><span style="font-family: courier new;">/usr/X11R6/lib/libXext.6.dylib: mach-o, but wrong architecture</span></span>..<br /><br />Anyhow, here is one solution.<br />You have to arrange that /usr/X11 points to a directory that has a working X11 copy.<br />It is probably a good idea to first clean up the old X11 installation. You can do this by following the advice on the <a href="http://xquartz.macosforge.org/trac/wiki/X11-UsersFAQ">XQuartz FAQ</a> page by issuing the following commands in the terminal:<br /><pre class="wiki"><span style="font-family: courier new;">sudo rm -rf /usr/X11* /System/Library/Launch*/org.x.* /Applications/Utilities/X11.app /etc/*paths.d/X11
sudo pkgutil --forget com.apple.pkg.X11DocumentationLeo
sudo pkgutil --forget com.apple.pkg.X11User
sudo pkgutil --forget com.apple.pkg.X11SDKLeo
sudo pkgutil --forget org.x.X11.pkg</span><br /></pre>Then I reinstalled the latest XQuartz copy (not all these steps might be necessary, but in order to stay on the safe side, I will describe everything I did).<br />I also have <a href="http://www.macports.org/">macports</a>, and the xorg-libX11, xorg-libXp and xorg-server ports seem necessary for the following steps to succeed (but possibly other xorg-* ports are also needed). 
I am guessing that XQuartz does not install all the libraries, but after installing enough xorg-* ports through macports, all the libraries used by Matlab will be installed.<br /><br />Now, my X11 is located at /opt/X11 and some additional libs are found at /opt/local/lib.<br /><br />So I created a bunch of symbolic links:<br /><br /><span style="font-size:85%;"><span style="font-family: courier new;">sudo ln -s /opt/X11 /usr/X11</span><br /><span style="font-family: courier new;">for i in /opt/local/lib/libX* ; do sudo ln -s $i /usr/X11/lib; done</span><br /></span><br />The first line creates a symbolic link to /opt/X11, while the second is necessary because of the additional libX* libraries which, for some reason, macports puts into /opt/local/lib instead of putting them into /opt/X11/lib. Initially I did not know that I needed these libs, and then Matlab complained that it did not find the image for some lib (it was /usr/X11/lib/libXp.6.dylib).<br /><br />Anyhow, I am really happy that this worked!<br />I hope people who run into the same trouble will find this post useful.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com4tag:blogger.com,1999:blog-1374054023917435198.post-71279226860465498792009-11-17T08:55:00.008-07:002009-11-18T09:12:11.397-07:00Djvu vs. PdfLong blog again, so here is the <span style="font-weight: bold;">executive summary</span>: Djvu files are typically smaller than Pdf files. Why? Can we further compress pdf files? Yes, we can, but the current best solution has limitations. And you can forget all "advanced" commercial solutions. They are not as good as a free solution.<br /><br /><span style="font-weight: bold;">Introduction</span><br /><br /><a href="http://en.wikipedia.org/wiki/DjVu">DJVU</a> is a proprietary file format by LizardTech.
Incidentally, it was invented by some machine learning researchers, <a href="http://yann.lecun.com/" title="Yann LeCun">Yann LeCun</a>, <a href="http://leon.bottou.org/" title="Léon Bottou">Léon Bottou</a>, <a href="http://www2.research.att.com/%7Ehaffner/" class="new" title="Patrick Haffner (page does not exist)">Patrick Haffner</a> and the image compression researcher <a href="http://www.informatik.uni-trier.de/%7Eley/db/indices/a-tree/h/Howard:Paul_G=.html" class="new" title="Paul G. Howard (page does not exist)">Paul G. Howard</a> at AT&T back in 1996. The <a href="http://djvu.sourceforge.net/">DJVULibre</a> library provides a free implementation, but is GPLd and hence is not suitable for certain commercial software, like <a href="http://mekentosj.com/papers/">Papers</a>, which I am using to organize my electronic paper collection. Hence, Papers might not support djvu in the near future (the authors of Papers do not want to make it free, and, well, this is their software, their call).<br />Djvu files can be converted to Pdf files using <a href="http://djvu.sourceforge.net/doc/man/ddjvu.html">ddjvu</a>, a command line tool which is part of DJVULibre (<a href="http://freshmeat.net/projects/djvu2pdf/">djvu2pdf</a> is a script that calls this tool). Djvu can also be converted into PS files using <a href="http://djvu.sourceforge.net/doc/man/djvups.html">djvups</a> (then use ps2pdf). However, all these leave us with pretty big files compared to the originals and, on top of it, if there was an OCR layer in the Djvu file, it gets lost; but this is another story. How much bigger? Here is an illustration:<br /><br />Original djvu file: 9.9MB<br />djvu2pdf file: 427.6MB(!)<br />djvu2ps file: 1.0GB<br />djvu2ps, ps2pdf file: 162.6MB<br /><br />Note that I have turned on compression in the conversion process (-quality=50). (The quality degradation was not really noticeable at this level.) So, at best, I got more than 16 times the original file size.
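For reference, the direct djvu-to-pdf conversion can be scripted with a few lines of Python around ddjvu. This is only a sketch of my own; it assumes a DjVuLibre build whose ddjvu accepts -format=pdf, and the -quality flag is the one I mentioned above:

```python
import shutil
import subprocess

def djvu_to_pdf_cmd(src, dst, quality=50):
    # Command line for a direct djvu -> pdf conversion with lossy
    # compression (quality=50 degraded quality only barely in my tests).
    return ["ddjvu", f"-quality={quality}", "-format=pdf", src, dst]

def djvu_to_pdf(src, dst, quality=50):
    if shutil.which("ddjvu") is None:
        raise RuntimeError("ddjvu (part of DjVuLibre) is not installed")
    subprocess.run(djvu_to_pdf_cmd(src, dst, quality), check=True)

print(" ".join(djvu_to_pdf_cmd("book.djvu", "book.pdf")))
# -> ddjvu -quality=50 -format=pdf book.djvu book.pdf
```

This is essentially what the djvu2pdf script does for you, so treat it as a convenience wrapper, not a new capability.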
Going mad about it, I started to search the internet for better solutions. I have spent almost a day on this (don't do this, especially if you are a student!).<br /><br /><span style="font-weight: bold;">JBig2 and the tale of commercial solutions</span><br /><br />First, I figured, the difference is that these use general image compression techniques (like jpeg), while djvu is specialized to text and black-and-white images. Thus, for example, it can recognize if the same character appears multiple times on the page, and store a template and a reference to the template. This is clever. I then figured that PDF files support the so-called <a href="http://en.wikipedia.org/wiki/JBIG2">jbig2</a> encoding standard, which is built around this idea. Hence, the quest for software that would support encoding a document using a jbig2 encoder and putting the result into a pdf format. The easiest would be if such software just existed out there. A few commercial packages indeed mention jbig2. I felt lucky (especially seeing that there were a few cheap ones). So, I started to download trial versions. Here are the results:<br /><br /><a href="http://www.imagepdf.com/">PDFJB2</a>: 34.1MB<br /><a href="http://www.cvisiontech.com/products/general/pdfcompressor.html">CVision PdfCompressor</a>: 48MB<br />CVision PdfCompressor with OCR: 49MB<br /><a href="http://www.a-pdf.com/">A-PDF</a>: 106.8MB<br />A-PDF + PDFCompress: 106.8MB<br />djvu2pdf + <a href="http://www.bureausoft.com/products.html#PDF%20Compress">PDFCompress</a>: conversion failed<br /><br />Hmm, interesting. 34MB is much better than 160MB, but it is still a long way from 9.9MB. (After a superficial look at the resulting files I concluded that only the A-PDF compressed file lost quality.
What happened with this file is that on some page, in some line containing a mathematical formula, the tops of the letters got chopped off.)<br /><br /><span style="font-weight: bold;">Free, open source solutions</span><br /><br />Becoming desperate, I continued hunting for better solutions. Searching around, I found <a href="http://www.lowagie.com/iText/">iText</a>, which is an open source, free Java library supporting all kinds of manipulations of Pdf files. I figured that it "uses" Jbig2, but it was not clear if it uses it for compression or just knows how to handle the encoding. So, here I go, I wrote a java program opening a pdf file and then writing it out in "compressed" mode. Hmm, these few lines of code allowed me to create a file of size 26MB, smaller than what I could ever get previously. Exciting! Unfortunately, opening the file revealed the 'secret': Quality was gone. The file looked to be seriously downsampled (i.e., the resolution was decreased). Not good.<br /><br />Then I found <a href="http://code.google.com/p/pdfsizeopt" style="text-decoration: none; color: rgb(0, 0, 0);">pdfsizeopt</a> on google code, which aims exactly at reducing the size of pdf files! The Holy Grail? Well, installing pdfsizeopt on my mac was far from easy (I use a Mac, which also runs Windows; quite handy as some of the above software runs only under Windows...). However, finally, I was able to run pdfsizeopt. Unfortunately, it seems to crash, without even looking at my pdf file (I hope the bug will be corrected soon and then I can report results using it). Along the way, I had to install <a href="http://github.com/agl/jbig2enc">jbig2enc</a>. For this, I just had to install <a href="http://www.leptonica.com/">leptonica</a> (version 1.62, not the latest one), which is the part that really does the image processing.
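To fix ideas, the whole djvu-to-jbig2-pdf route boils down to three commands; the sketch below just collects them (the flags and the pdf.py helper shipped with jbig2enc are the ones I used, but treat the exact paths and basenames as assumptions about your installation):

```python
import shutil
import subprocess

def jbig2_pipeline(src_djvu, workdir="."):
    # The three steps: djvu -> tiff (ddjvu), tiff -> jbig2 fragments
    # (jbig2enc's jbig2 binary in symbol mode with pdf-ready output),
    # and fragments -> single pdf (pdf.py, shipped with jbig2enc).
    tif = f"{workdir}/pages.tif"
    return [
        ["ddjvu", "-quality=75", "-format=tiff", src_djvu, tif],
        ["jbig2", "-p", "-s", "-b", f"{workdir}/page", tif],
        ["python", "pdf.py", f"{workdir}/page"],  # writes the assembled pdf to stdout
    ]

def run_pipeline(src_djvu, workdir="."):
    for cmd in jbig2_pipeline(src_djvu, workdir):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found; install DjVuLibre / jbig2enc first")
        subprocess.run(cmd, check=True)
```

This is the small streamlining script I complain about not having below; it does nothing clever beyond chaining the tools.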
JBig2Enc expects a tif file and produces "pdf" ready output (every page is put in a separate file), which can be concatenated into a single pdf file by a provided python script. Having jbig2enc on my system, I gave it a shot. I first used ddjvu to transform the input to a tif file (using the command line option "-quality=75", resulting in a file of size 1GB). Then I used the jbig2 encoder with the command line arguments "-p -s". The result is this:<br /><br />jbig2enc: 3.8MB<br /><br />Wow!! Opening the file revealed a dirty little secret: Color images are gone, and the quality of some halftoned gray-scale images got degraded. However, line drawings were kept nicely and, in general, the quality was good (comparable to the original djvu file). Conversion to tif took 5 minutes, conversion from tif to jbig2 took ca. 4 minutes, altogether making the whole process take close to 10 minutes. (The other solutions were no faster either. And the tests were run on a resourceful MacBook Pro.)<br /><br /><span style="font-weight: bold;">Conclusions</span><br /><br />jbig2enc seems to work, but you will lose colors. If you are happy with this, jbig2enc is the solution, though the process should be streamlined a bit (a small script could do this). Oh yes, I should stress that these processes are not fast; as the timings above show, conversion takes a lot of time. Jbig2Enc is maybe on the faster end of the spectrum.<br /><br /><span style="font-weight: bold;">Future work</span><br /><ol><li>pdfsizeopt is a good idea. It should be made to work.<br /></li><li>It would be nice to create a jbig2enc wrapper. </li><li>ddjvu is open source: Maybe it can be rewritten to support jbig2 directly. The added benefit could be that one could also keep the OCR layer in the original djvu file if one existed.</li><li>Along the way, I have found a cool google code project, <a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a>, which is an open source OCR engine.
How cool would it be if we had an OCR engine that helps the compression algorithm and eventually also puts an OCR layer on top of documents that lack text information (think of scanned documents, or documents converted from an old postscript file). Currently, I am using Nuance's <a href="http://www.nuance.com/imaging/products/pdfconverter.asp">Pdf Converter Professional</a> (yes, I paid for it...), which I am generally very satisfied with apart from its speed. However, this could be the subject of another post.</li></ol>PS: I have tested Nuance's Pdf Converter Professional and Abbyy's product in terms of their compression capabilities:<br />Nuance: 132MB<br />Abbyy: 129MB<br />Yes, I tried their advanced "MRC" compression; in Nuance I explicitly selected jbig2. No luck.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com3tag:blogger.com,1999:blog-1374054023917435198.post-90341176169337219452009-11-14T12:47:00.006-07:002009-11-14T16:46:49.769-07:00Keynote vs. Powerpoint vs. BeamerA few days ago I decided to give <a href="http://www.apple.com/iwork/keynote/">Keynote</a>, Apple's presentation software, a try (part of iWork '09). Beforehand I used MS Powerpoint 2003, Impress from <a href="http://www.neooffice.org/neojava/en/index.php">NeoOffice 3.0</a> (OpenOffice's native Mac version) and LaTeX with <a href="http://latex-beamer.sourceforge.net/">beamer</a>. Here is a comparison of the ups and downs of these programs, mainly to remind myself when I reconsider my choice in half a year, and also to help people decide what to use for their presentations. Comments, suggestions and criticism are absolutely welcome, as usual. Btw, while preparing this note I have learned that <a href="http://go-oo.org/">go-oo.org</a> has a native Mac Aqua version of OpenOffice. Maybe I will try it some day and update the post.
It would also be good to include a recent version of Powerpoint in the comparison.<br /><h3>Stability</h3><ul><li><span style="font-weight: bold;">Keynote: Excellent </span><br />This is after only a few days of usage, so take this statement with a grain of salt...</li><li><span style="font-weight: bold;">MS Powerpoint 2003: Excellent</span> </li><li><span style="font-weight: bold;">Impress: Poor</span><br />Save your work very often</li><li style="font-weight: bold;">Beamer: Excellent</li></ul><h3>Creating visually appealing slides, graphics on slides<br /></h3><ul><li><span style="font-weight: bold;">Keynote: Excellent</span><br />Positioning rulers help a lot. The process is really smooth. Keynote forces you to use less text. Built-in templates are professional looking. Adding presentation graphics (tables, basic charts) is very easy. Cooler technical drawings are better done with <a href="http://www.omnigroup.com/applications/OmniGraffle/">OmniGraffle</a>. You can also easily animate the graphics and tables. Overall, very impressive.<br /></li><li><span style="font-weight: bold;">MS Powerpoint 2003: Good</span><br />Aligning to other objects is more cumbersome than in Keynote. The quality of fonts, color palettes and templates is not as good as in Keynote.<br /></li><li><span style="font-weight: bold;">Impress: Good</span><br />Same as MS Powerpoint, maybe somewhat below (but the difference is not big).<br /></li><li><span style="font-weight: bold;">Beamer: Poor</span><br />The fonts and styles (templates) are great. However, creating slides with lively graphics is a nightmare (due to the lack of a GUI): You will end up with a few standard layouts, you will in general not use graphics, let alone animated graphics (or you will spend days on creating your slides).
Also, departing from the styles is difficult and I am just bored of some of these styles that everyone seems to use.<br /></li></ul><h3>LaTeX (math) support</h3><ul><li><span style="font-weight: bold;">Keynote: Poor</span><br />Supported through <a href="http://pierre.chachatelier.fr/programmation/latexit_en.php">LatexIt</a> (free), but overall a cumbersome process. Details below.</li><li><span style="font-weight: bold;">MS Powerpoint 2003: </span><span style="font-weight: bold;">Medium</span><br />Supported through <a href="http://texpoint.necula.org/buy.html">TexPoint</a> (commercial, USD30); the process is roughly the same as with LatexIt and Keynote, with slightly better integration.<br /></li><li><span style="font-weight: bold;">Impress: Medium</span><br />Supported through <a href="http://ooolatex.sourceforge.net/">OOoLatex</a> (free); same as MS Powerpoint + TexPoint, the integration is slightly better.<br /></li><li style="font-weight: bold;">Beamer: Excellent<br /><span style="font-weight: normal;">Beamer is built for this!</span><br /></li></ul><h3>Animations</h3><ul><li><span style="font-weight: bold;">Keynote: Near perfect</span><br />The magic slide transition helps a lot with continuity across slides. What does this do? If you have the same object on two consecutive slides, Keynote will create an animation, keeping the object on screen and flying it to its new position. Works with multiple objects, too. I have found this very helpful for presenting a multi-slide argument. In general, Keynote animations are slick and polished, and the flexibility is great. I miss some features of Beamer, such as animated highlighting or in-place replacement of some text (these can all be simulated with the existing tools, but only with difficulty).<br /></li><li><span style="font-weight: bold;">MS Powerpoint 2003: Basic</span><br />I miss Keynote's magic transitions. In general, Keynote is richer in animations.
Again, some features of Beamer would be nice to have.</li><li><span style="font-weight: bold;">Impress: Weak</span><br />Impress is inferior to MS Powerpoint in terms of its animation capabilities </li><li><span style="font-weight: bold;">Beamer: Good</span><br />If only someone added support for magic transitions between slides. Some other cool effects would also come in handy.</li></ul><h3>Dual screen presentation support</h3>The idea is to show notes and the time left, in addition to the current and next slides, on your screen, while showing only the current slide on the big screen.<br /><ul><li><span style="font-weight: bold;">Keynote: Excellent</span><br />Keynote supports dual screen presentations natively. If the presenter notes come up on the big screen, you can swap the two displays from the options menu.<br /></li><li><span style="font-weight: bold;">MS Powerpoint 2003: Not available</span><br />I have no experience with this feature of MS Powerpoint. Maybe you can use an add-on or something, but the basic software does not support it. I am pretty sure newer versions of Powerpoint must support this.<br /></li><li><span style="font-weight: bold;">Impress: Excellent(?)</span><br />The "<a href="http://extensions.services.openoffice.org/project/presenter-screen">Sun Presenter Console</a>" extension supposedly supports dual screen presentations just like Keynote, but I have never had the chance to test it. Hence, the question mark. Some posts on the internet indicate that the extension might leak memory.<br /></li><li><span style="font-weight: bold;">Beamer: Basic support</span><br />Use <a href="http://code.google.com/p/splitshow/">Splitshow</a> for this purpose.
However, as far as I know, you cannot show the current time or the time remaining on the notes screen.<br /></li></ul><h3>Interoperability</h3>I want to put my presentations on the web so that people can look at them no matter what (major) operating system they use, without losing animations or any other features. Another desired feature is the ability to create a compact, printable version of the slides: That is, if you have animations spanning multiple slides, somehow they should get handled intelligently. There is a tradeoff here: The more animation rich your slides are, the more bloated/complicated your printout will be.<br /><ul><li><span style="font-weight: bold;">Keynote: OK</span><br />Proprietary file format. This is my biggest complaint. A keynote presentation is a keynote presentation. Apple likes to lock you in. Export to PDF and PPT works relatively well, but will lose some features of the presentation, like the cool animations. Exporting to PDF without animations to create printable versions seems to work well.<br /></li><li><span style="font-weight: bold;">MS Powerpoint 2003: Good</span><br />Free powerpoint viewers exist that can play any PPT file. Export to PDF will again lose some features.</li><li><span style="font-weight: bold;">Impress: Good</span><br />Same as powerpoint.<br /> </li><li><span style="font-weight: bold;">Beamer: Excellent</span><br />Produces PDF outputs: The presentations can be viewed on any computer! Also, the source is LaTeX, and beamer is available on all systems. Add [handout] to the document class options and beamer will create an animation-free version of your slides that works in almost all cases.<br /></li></ul> <h3>More about using formulae in Keynote (and why it sucks)</h3>I used LatexIt, which produces a PDF that can be embedded into the presentation. Style is not matched automatically. The PDF contains the latex source of the formula; to edit it, copy-paste it back into LatexIt.
When done with the edit, you need to drag and drop the formula back into Keynote. This sucks, since you need to delete the original that you have edited, reposition the new formula and reapply animations if you had any. Horrible.<br /><br />Another issue is that the source saved with the formula by default does not have the preamble, thus using a command set specific to a presentation is difficult to achieve (you have to set this up manually). Another major headache is that you will not be able to use inline formulae (text is either in LaTeX or in Keynote; the fonts in general do not match or mix well, and alignment is a nightmare), nor will you be able to easily animate formulae (e.g., displaying a multiline formula line by line requires you to split the formula into multiple PDFs and use Keynote animations to show them one by one; this is problematic because formula alignment by hand is time consuming).<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com5tag:blogger.com,1999:blog-1374054023917435198.post-474168759213646352009-10-31T11:01:00.002-07:002009-10-31T11:03:05.244-07:00OptogeneticsThis is a little deviation from the usual topic.<br />Scientists are able to genetically modify neurons so that they respond to light. They are in fact able to do this in a targeted manner. A patient would then have some LEDs inside his skull, emitting some light. In response, the selected neurons start to fire. They demonstrated the technology by making mice run counterclockwise when they turn on the light. This is input to the brain. Earlier, it was demonstrated that neurons can be genetically modified to emit light when they are firing. Are we heading towards rewiring the brain and turning it into a light computer?<br />The motivation for the research is to cure diseases like Parkinson's disease, where the patient has all the circuitry and muscles but is just unable to make the movements.
In fact, the researchers are already testing this technology on primates. Source: Wired Nov. 2009, "Powered by Photons" pp. 109--113. The wikipedia entry for optogenetics is <a href="http://en.wikipedia.org/wiki/Optogenetics">here</a>.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-78569182794165764132009-10-25T19:28:00.006-07:002020-04-16T11:13:55.959-07:00Pitfalls of optimality in statisticsI was reading a little bit about robust statistics, as we are in the process of putting together a paper about entropy estimation where robustness comes up as an issue. While searching on the net for the best material to understand this topic (I am thinking about posting another article about what I have found), I have bumped into a nice paper (downloadable from <a href="http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.lnms/1249305323">here</a>) by Peter J. Huber, one of the main figures in robust statistics, where he talks about a bunch of <span style="font-style: italic;">pitfalls</span> around pursuing optimality in statistics. Huber writes eloquently -- he gives plenty of examples, motivates definitions. He is just great. I can only recommend this paper or his <a href="http://www.amazon.ca/gp/product/0470129905/">book</a>. Now, what are the pitfalls he writes about? He distinguishes 4 types with the following syndromes:<br />
<ul>
<li><span style="font-style: italic;">The fuzzy concepts syndrome: sloppy translation of concepts into mathematics</span>. Think about uniform vs. non-uniform convergence (sloppy asymptotics). In statistics a concrete example is the concept of efficiency which is defined in a non-uniform manner with respect to the estimable parameters, which allows for (weird) "super-efficient" estimators that pay special attention to some distinguished element of the parameter-space.</li>
<li><span style="font-style: italic;">The straitjacket syndrome</span>:<span style="font-style: italic;"> the use of overly restrictive side conditions</span>, such as requiring that an estimator is unbiased or equivariant (equivariant estimates in high dimensions are inadmissible in very simple situations). In Bayesian statistics another example might be the convenient but potentially inappropriate conjugate priors.</li>
<li><span style="font-style: italic;">The scapegoat syndrome: confusing the model with reality </span>(offering the model for the gods of statistics instead of the real thing, hoping that they will accept it). The classic example is the Eddington-Fisher argument. Eddington advocated the mean-absolute-deviation (MAD) instead of the root-mean-square (RMS) deviation as a measure of scale. Fisher argued that MAD estimates are highly inefficient (converge slowly) relative to the RMS deviation estimates if the sample comes from a normal distribution. Tukey has shown that the situation gets reversed even under small deviations from a normal model. The argument that under narrow conditions one estimator is better than some other should not even be made. Another example is perhaps classical optimal design and the fundamentalist approach in Bayesian statistics. Of course, there is nothing wrong with assumptions, but the results should be robust.</li>
<li><span style="font-style: italic;">The souped-up car syndrome: by optimizing for speed we can end up with an elaborate gas-guzzler</span>. Optimizing for one quantity (efficiency) may degrade another one (robustness). Practical solutions must find a balance between such contradicting requirements.</li>
</ul>
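The Eddington-Fisher-Tukey episode is easy to replay numerically. Below is a minimal sketch (pure Python; the 5% contamination level, the contaminating scale of 10, and the sample sizes are my arbitrary choices): it compares how much the RMS deviation and the rescaled mean-absolute-deviation fluctuate as scale estimates, with and without gross errors.

```python
import math
import random

random.seed(1)

def scale_estimates(xs):
    # RMS deviation, and mean-absolute-deviation rescaled by sqrt(pi/2)
    # so that both estimate the standard deviation under normality.
    n = len(xs)
    rms = math.sqrt(sum(x * x for x in xs) / n)
    mad = math.sqrt(math.pi / 2) * sum(abs(x) for x in xs) / n
    return rms, mad

def cv(samples):
    # Coefficient of variation of an estimator across replications.
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return math.sqrt(v) / m

def simulate(eps, n=100, reps=2000):
    rms_vals, mad_vals = [], []
    for _ in range(reps):
        # Each observation is N(0,1), except with probability eps it is
        # a gross error drawn from N(0, 10^2).
        xs = [random.gauss(0, 10 if random.random() < eps else 1) for _ in range(n)]
        r, m = scale_estimates(xs)
        rms_vals.append(r)
        mad_vals.append(m)
    return cv(rms_vals), cv(mad_vals)

cv_rms_pure, cv_mad_pure = simulate(eps=0.0)   # exact normal model
cv_rms_cont, cv_mad_cont = simulate(eps=0.05)  # 5% gross errors
print("pure normal:  ", cv_rms_pure, cv_mad_pure)
print("contaminated: ", cv_rms_cont, cv_mad_cont)
```

Under the exact normal model the two behave similarly (the MAD-based estimate is the slightly less efficient one, per Fisher), but a small fraction of gross errors makes the RMS estimate far more volatile, which is Tukey's reversal.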
These syndromes are also everywhere in machine learning research. Wear protective gear as needed!<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-42676014870407515962009-09-07T10:06:00.004-07:002009-09-07T10:15:46.229-07:00How to make Thunderbird delete temporary filesIf you are using <span style="font-weight: bold;">Thunderbird</span> (TB) on <span style="font-weight: bold;">Mac OSX</span>, you might be annoyed that when TB opens an attachment (like a pdf file) it creates the file on the Desktop and then just leaves it there! I have finally found a solution, which seems to work, at least for me, assuming that you also have Firefox. The solution is <a href="http://www.zagz.com/pdf-files-left-by-firefox-on-mac-os-x-desktop/">here</a>, but I duplicate it to make sure the idea spreads:<br /><br />Simply open Firefox, in the address bar type in <span style="font-family: courier new;">about:config</span>, then add a <span style="font-weight: bold;">boolean</span> variable <span style="font-family: courier new;">browser.helperApps.deleteTempFileOnExit</span> and set its value to <span style="font-family: courier new;">true</span>.<br /><br />Now, this works to the extent that when you exit Firefox(!!)
(after quitting Thunderbird), it will remove the cluttering files.<br /><br />Enjoy!<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com7tag:blogger.com,1999:blog-1374054023917435198.post-44456404103370910032008-04-05T14:01:00.001-07:002008-04-05T14:02:43.664-07:00Ninja Catburglars<a href="http://icanhascheezburger.com/2008/03/15/funny-pictures-ninja-catburglars/"><img src="http://icanhascheezburger.wordpress.com/files/2008/03/funny-pictures-black-cats-ninja-burglers.jpg" alt="Humorous Pictures" /></a><br />see more <a href="http://icanhascheezburger.com">crazy cat pics</a><div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-31920808461598800632008-03-29T16:55:00.004-07:002008-03-29T17:46:27.419-07:00Statistical Modeling: The Two CulturesSometimes people ask what the difference is between what statisticians and machine learning researchers do. The best answer that I have found so far can be found in <br />"<a href="http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.ss/1009213726">Statistical Modeling: The Two Cultures</a>" by Leo Breiman (Statistical Science, 16:199-231, 2001).<br />According to this, statisticians like to start by making modeling assumptions about how the data is generated (e.g., the response is noise added to a linear combination of the predictor variables), while in machine learning people use algorithmic models and treat the data mechanism as unknown.
He estimates that (back in 2001) less than 2% of statisticians worked in the realm where the data mechanism is considered unknown.<br />It seems that there are two problems with the data model approach.<br />One is that this approach does not address the ultimate question, which is making good predictions: if the data does not fit the model, this approach has nothing to offer (it does not make sense to apply a statistical test if the assumptions are not valid).<br />The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, not to mention the computational complexity of fitting such models.<br />The increased interest in Bayesian modeling with Markov Chain Monte Carlo is viewed as the response of the statistical community to this problem. True enough, this approach might be able to scale to complex data, but does this address the first issue? Are there not computationally cheaper alternatives that can achieve the same prediction power?<br />He characterizes the machine learning approach as the pragmatic approach: You have to solve a prediction problem, hence take it seriously: Estimate the prediction error and choose the algorithm that gives a predictor with the best accuracy (but let's not forget about data snooping!).<br />But the paper offers more. Amongst other things it identifies three important recent lessons:<br /><ol><li>The multiplicity of good models: If you have many variables, there can be many models of similar prediction accuracy. Use them all by combining their predictions instead of just picking one. This should increase accuracy and reduce instability (sensitivity to perturbations of the data). Boosting, bagging and aggregation using exponential weights are relevant recent popular buzzwords.<br /></li><li>The Occam dilemma: Occam's razor tells you to choose the simplest predictor.
Aggregated predictors don't look particularly simple. But aggregation seems to be the right choice otherwise. I would think that Occam's razor only tells you that you should have a prior preference for simple functions. I think this is rather well understood by now.<br /></li><li>Bellman: dimensionality -- curse or blessing: Many features are not bad per se. If your algorithm is prepared to deal with high-dimensional inputs (SVMs, regularization, random forests are mentioned) then extracting many features can boost accuracy considerably. </li></ol>In summary, I like the characterization of the difference between (classical) statistical approaches and machine learning. However, I wonder if these differences are still as significant as they were (must have been) in 2001 when the article was written, and if the differences will become smaller over time. Then it will be really difficult to answer the question about the difference between the statistical and the machine learning approaches.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com10tag:blogger.com,1999:blog-1374054023917435198.post-2005677108301989762008-03-21T19:48:00.004-07:002008-03-21T21:54:23.384-07:00Bayesian Statistics in Medical Clinical TrialsI came across a very interesting document.<br />The document is titled <span style="font-size:100%;">"<a href="http://www.fda.gov/cdrh/osb/guidance/1601.html">Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials</a>". It is a draft guidance posted by the Center for Devices and Radiological Health of </span><span style="font-size:100%;">FDA, dated May 23, 2006.<br />Why is this interesting? The job of FDA (the US Food and Drug Administration) is to make sure that the decisions in any clinical trial are made in a scientifically sound manner. Clearly, when following the Bayesian approach, the choice of the prior and the model can influence the decisions.
What does FDA do in this situation?<br />They establish a process where they require pre-specification of (and agreement on) both the prior and the model, including an analysis of the operating characteristics of the design. The latter includes estimating the probability of erroneously approving an ineffective or unsafe device (the Type I error). This will typically be done by conducting Monte-Carlo simulations, where the Type I error is measured for the borderline cases when the device should not be approved. In the case of a large estimated Type I error, the trial will be rejected.<br />Is this a good procedure? If the simulations use a biased model then the estimated Type I error might be biased. Their response is that both the prior and the model should be backed up with scientific arguments and existing statistics. Yet another problem is that the calculations often use MCMC. How do you determine if your samples converged to the posterior? The samples of the posterior are not iid. How do you know that you took enough samples of the posterior? (Think of a mixture of Gaussians, with a narrow Gaussian proposal. If you sample from the mixture and then sample just a few points with Metropolis-Hastings, you will likely miss the second mode if the two modes are sufficiently far away.)<br />On the other hand, there are a number of potential advantages to a Bayesian design. If we accept that the model and the prior are good, then often the Bayesian analysis will require smaller sample sizes to reach a decision (if they are not, the conclusion might be wrong). It can also provide flexible methods for handling interim analyses (stopping when enough evidence is available for either approval or rejection) and sometimes good priors are available, such as earlier studies on previous generations of a device or overseas studies.
Such approaches can be used within a frequentist approach, too, but the frequentist analysis needed to derive a procedure is often non-trivial, while the Bayesian "only" needs to be concerned about computational issues.<br />The document cites two trials that used Bayesian analysis. It appears that in both studies Bayesian analysis was used only as supplementary information, i.e., the critical decisions (whether a device is safe and minimally effective) were made using traditional, </span><span style="font-size:100%;">frequentist</span><span style="font-size:100%;"> methods.<br />Common to both the frequentist and the Bayesian approaches is the use of a number of unverified assumptions. In the frequentist case, if the design is simple then the typical assumption is only that there is a common underlying distribution to the outcome-patient pairs and that patients are selected uniformly at random from the population. This looks fairly minimal, but can be questioned nevertheless (drifts, environmental effects, sample biases, etc.). In a more complicated scenario there will be more assumptions. If the sets of assumptions of the methods satisfy some containment relation then one naturally trusts the method that relies on less information. In the lack of containment the decision of which method to prefer is not so simple. In any case, it is very interesting to see how a regulatory body (like the FDA) wrestles with these fundamental issues. They appear to act in a pretty reasonable manner. The existence of this document suggests that we should expect to see more decisions based on Bayesian analysis in the future. Is this good or bad? One could be concerned by the use of more unverified assumptions in the Bayesian analysis, and the probability of making an error can also increase because the calculations are non-trivial. Life is dangerous, isn't it? But how dangerous will it be if Bayesian analysis is used routinely in assessing success in clinical trials? Time will tell for sure. 
Well, assuming some form of stationarity.<br /></span><div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com1tag:blogger.com,1999:blog-1374054023917435198.post-57103828523681788332008-03-15T20:41:00.005-07:002008-03-16T00:46:55.258-07:00Curse of dimensionalityI came across two papers today that discuss the curse of dimensionality. I thought this is just enough of an excuse to write a short blog post about a topic that definitely deserves attention. So, here we go:<br /><br />The first paper is by Flip Korn, Bernd-Uwe Pagel and Christos Faloutsos, the title is <a href="http://www.informedia.cs.cmu.edu/pubs/abstract.asp?id=98">On the "Dimensionality Curse" and the "Self-Similarity Blessing"</a>. This 2001 paper is about nearest neighbor retrieval: You have $n$ datapoints that you can store and the task is to find, among these datapoints, the nearest neighbor of a query point. If the data lies in a $D$-dimensional Euclidean space, it was common wisdom to believe that the time required to find the nearest neighbor scales exponentially with $D$. The essence of the paper is that if the data lies on a low dimensional manifold then the complexity of search will depend only on the intrinsic dimensionality of the manifold. (The <a href="http://hunch.net/%7Ejl/projects/cover_tree/cover_tree.html">cover tree package</a> due to <a href="http://hunch.net/%7Ebeygel/">Alina Beygelzimer</a>, <a href="http://ttic.uchicago.edu/%7Esham/">Sham Kakade</a>, and <a href="http://hunch.net/%7Ejl">John Langford</a> should be mentioned here. This software is able to deal with data that does not necessarily lie in a vector space. The algorithm comes with guarantees similar to the one mentioned above. For details and further references see the <a href="http://hunch.net/%7Ejl/projects/cover_tree/icml_final/final-icml.pdf">accompanying paper</a>. 
I guess there were quite a few precursors to this paper going back to around 1998, when people started to realize this phenomenon and a lot of excitement was generated.) This paper has some relation to our recent work, which I will write about at the end of this post.<br /><br />So the good news here is that, with a little bit of luck, a clever algorithm might run significantly faster than what is predicted by the worst case bounds. Note that a not so clever algorithm will not run faster. So you need luck for your data to display some regularities, but you need a clever design to take advantage of those regularities.<br /><br />The other paper is by <a href="http://www-stat.stanford.edu/%7Edonoho/">David Donoho</a> from 2000, <a href="http://www-stat.stanford.edu/%7Edonoho/Lectures/CBMS/Curses.pdf">High-dimensional data analysis: The curses and blessings of dimensionality</a>, an Aide-Memoire lecture delivered at the conference "Math Challenges of the 21st Century". Donoho mentions three curses (in optimization, Bellman's original use; in function approximation; and in numerical integration). When talking about the blessings, he first talks about the concentration of measure (CoM) phenomenon, which roughly states that any Lipschitz function $f$ defined over the $D$-dimensional sphere is "nearly constant": If we place a uniform measure on the sphere then the probability that $f$ deviates from its expectation by more than $t$ is bounded by $C_1 \exp(-C_2 t^2)$, where the $C_i$ are universal constants (i.e., they depend neither on $D$ nor on $f$)! This is important since often we are interested in the expected value of a Lipschitz function and we hope to estimate this expected value by the observed (random) value of the function. This phenomenon is important in model selection, as explained to some extent in the paper. The second blessing, related to the CoM phenomenon, is "dimension asymptotics". 
The observation is that when the CoM applies, often, as $D$ goes to infinity, the distributions converge to some limiting distribution. It then sometimes becomes possible to obtain predictions that work for moderate dimensions but are derived using the limiting distributions (the example mentioned is the prediction of the top eigenvalue of a data covariance matrix). The third blessing arises when the data is a sampled version of a continuous phenomenon (he calls this "approach to continuum"). Since what is measured is continuous, the space of observed data will show signs of compactness that can be exploited. Here the example is that a basis derived from some sampled data with some procedure tends to resemble wavelets, an object from the continuous world. For larger $D$ the resemblance becomes stronger and so the interpretation becomes easier. This is the blessing of dimensionality in action.<br /><br />Back in 2000 he mentioned 3 areas of potential interest: The first is high-dimensional approximate linear algebra (use randomization and the concentration of measure phenomenon to speed up calculations). Here he mentions rank $k$ approximations to data matrices. The second area is when the data are curves or images and the problem is to select a good basis. He mentions an interesting example here: We have a single data point and $D$ is very large. Then if the data is the realization of a stationary Gaussian stochastic process (i.e., the coordinates are correlated) then we can in fact learn the full probability distribution. The third example comes from approximation theory. Here, he cites a well-known result of Barron, which imposes a constraint on the derivative of the function: The derivative must be such that its Fourier transform is integrable. Since assuming $s$ times differentiability typically yields that the function can be approximated only at a rate of $n^{-s/D}$, we expect to see a slow approximation rate of $n^{-1/D}$. 
However, Barron has shown that this type of function can be approximated at the rate $n^{-1/2}$. What this and similar results point to is that there are non-classical spaces of functions where things might work differently from what one expects. Again, the need arises for algorithms that are clever enough to take advantage of the situation when the curse of dimensionality can be avoided.<br /><br />On another note, recently with <a href="http://www.cs.ualberta.ca/%7Eamir/">Amir massoud Farahmand</a> and <a href="http://cermics.enpc.fr/%7Eaudibert/">Jean-Yves Audibert</a> we published a <a href="http://www.sztaki.hu/%7Eszcsaba/papers/dimicml.pdf">paper</a> related to these topics at ICML. The subject is dimensionality estimation of manifolds based on random samples supported by the manifold. Here again, the good news is that the embedding dimension does not play a direct role in the quality of the estimates (only through the properties of the embedding). One motivation for coming up with this method was that we wanted to modify existing regression procedures so that when the data lies on a low dimensional manifold, the methods converge faster. We figured that many algorithms tune their parameters to the dimensionality of the data points, and if we replace that dimensionality by the manifold dimensionality, we get a "manifold adaptive" method, which is clever enough to converge faster when the input lies on a manifold. Interestingly, these results can be obtained without ever estimating the manifold itself. However, this should be the subject of another post!<br /><br />On a last note, quite a few references to works dealing with the curse of dimensionality can be found in the <a href="http://www.inma.ucl.ac.be/%7Efrancois/these/papers/">annotated bibliography</a> of <a href="http://www.inma.ucl.ac.be/%7Efrancois">Damien Francois</a>. 
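To make the idea of manifold dimension estimation concrete, here is a minimal sketch of a generic kNN-ratio estimator (in the spirit of, but not necessarily identical to, the method of the ICML paper above). The data, the value of $k$, and all constants below are invented for illustration: points are sampled from a closed curve (intrinsic dimension 1) embedded in $R^3$, and for uniformly sampled manifold data $r_k \propto (k/n)^{1/d}$, so $\ln 2 / \ln(r_{2k}/r_k)$ estimates the intrinsic dimension $d$.

```python
import math
import random

rng = random.Random(42)

def sample_curve_point():
    # a smooth closed curve (intrinsic dimension 1) embedded in R^3
    t = rng.uniform(0.0, 2.0 * math.pi)
    return (math.cos(t), math.sin(t), 0.5 * math.cos(t))

points = [sample_curve_point() for _ in range(1000)]

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_dimension_estimate(points, k, n_queries):
    # d_hat(x) = ln 2 / ln( r_{2k}(x) / r_k(x) ), averaged over query points,
    # where r_k(x) is the distance from x to its k-th nearest neighbor
    estimates = []
    for x in points[:n_queries]:
        ds = sorted(dist(x, p) for p in points if p is not x)
        r_k, r_2k = ds[k - 1], ds[2 * k - 1]
        estimates.append(math.log(2.0) / math.log(r_2k / r_k))
    return sum(estimates) / len(estimates)

d_hat = knn_dimension_estimate(points, k=10, n_queries=200)
# d_hat should come out close to 1 (the manifold dimension),
# not 3 (the embedding dimension)
print(round(d_hat, 2))
```

Note that, as in the post above, nothing here ever reconstructs the manifold; only nearest-neighbor distances are used.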
Enjoy!<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com4tag:blogger.com,1999:blog-1374054023917435198.post-56909091875689155472007-08-12T06:12:00.000-07:002007-08-27T13:24:10.969-07:00Discriminative vs. generative learning: which one is more efficient?I just came across a paper by <a href="http://research.google.com/pubs/author116.html">Philip M. Long</a>, <a href="http://www.cs.columbia.edu/%7Erocco/">Rocco Servedio</a> and <a href="http://www.ruhr-uni-bochum.de/lmi/simon/">Hans Ulrich Simon</a>. (<a href="http://www.cs.columbia.edu/%7Erocco/papers/colt06discgen.html">Here</a> is a link to the paper titled "Discriminative Learning can Succeed where Generative Learning Fails".) The question investigated in the paper is the following:<br />We are in a classification setting and the learning problem is defined by a pair of jointly distributed random variables, <span style="font-style: italic;">(X,Y)</span>, where <span style="font-style: italic;">Y</span> can take on the values +1 and -1. 
Question: How many iid copies of this pair does an algorithm need to <span style="font-style: italic;">(i)</span> find a classifier that yields close to optimal performance with high probability <span style="font-style: italic;">(ii)</span> find two score functions, one trained with the positive examples <span style="font-style: italic;">only</span>, the other with the negative examples <span style="font-style: italic;">only</span>, such that the sign of the difference of the two score functions gives a classifier that is almost optimal with high probability?<br />The result in the paper is that there exists a class of distributions, parameterized by <span style="font-style: italic;">d</span> (determining the dimension of samples), such that there is a discriminative algorithm (tuned to this class) that can learn the correct classifier with only $2\log(2/\delta)$ samples, while the number of samples required by <span style="font-style: italic;">any</span> generative classifier is at least <span style="font-style: italic;">d</span>.<br />Since it is clear that the requirements of generative learning are stronger than those of discriminative learning, <span style="font-style: italic;">it follows that in the above framework discriminative learning is strictly "easier" than generative learning</span>.<br />The distribution concentrates on <span style="font-style: italic;">O(d)</span> points and the main idea is that the joint knowledge of positive and negative samples suffices for the easy identification of the target distribution (hence, classifier), while knowing either the positive or the negative examples alone is insufficient. 
Two special inputs, both marked for ease of recognition, jointly determine the full distribution, but one of the inputs is in the positive set, the other is in the negative set, and the knowledge of only one of them is insufficient for learning the otherwise "difficult" to learn distribution.<br />Although the construction given in the paper is simple and works as intended, it is arguably "artificial and contrived", as was also noted by the authors. In particular, does a similar result hold when we consider a continuous domain of a fixed dimension, and/or restrict the class of algorithms to consistent ones? Further, the example shows more the limitation of algorithms that learn from positive and negative samples independently of each other than the limitation of generative algorithms (generative algorithms traditionally refer to learners that estimate the joint distribution of the inputs and outputs).<br />The question of whether generative or discriminative algorithms are more efficient is particularly interesting in light of an <a href="http://ai.stanford.edu/%7Eang/papers/nips01-discriminativegenerative.ps">earlier paper</a> by <a href="http://ai.stanford.edu/%7Eang/">Andrew Y. Ng</a> and <a href="http://www.cs.berkeley.edu/%7Ejordan/">Michael Jordan</a> ("On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes", NIPS-14, 2002). In this paper the authors compare one particular discriminative algorithm with another particular algorithm that is "generative". The two algorithms are <span style="font-style: italic;">logistic regression</span> and<span style="font-style: italic;"> naive Bayes</span>, and together they form what is called a "Generative-Discriminative" pair. 
The meaning of this is that while naive Bayes maximizes the total joint log-likelihood, $\sum_{i=1}^n \log p_\theta(x_i,y_i)$, over the samples, logistic regression maximizes the total conditional log-likelihood, $\sum_{i=1}^n \log p_\theta(y_i|x_i)$, over the <span style="font-style: italic;">same </span>parametric model. In the case of these two particular algorithms the parametric model is written in terms of $p(x|y)$ and asserts independence of the individual components of the feature-vector $x$. (For continuous spaces the individual feature distributions, $p(x_k|y)$, are modeled as Gaussians with unknown mean and variance.) The agnostic setting is considered (the input distribution is not restricted). Since both learners pick a classifier from the very same class of classifiers and the discriminative learner in the limit of an infinite number of samples converges to the best classifier from this class, it follows that the ultimate loss suffered by the discriminative learner is never higher than that suffered by the generative learner. Hence, it seems that if the naive Bayes assumption made by the generative method is not met, <span style="font-style: italic;">the discriminative method can have an edge</span> -- at least ultimately (open issue: give an example that shows positive separation!). However, this is just half of the story: the generative model may converge faster! In particular, the authors state an upper bound on the convergence of loss for the <span style="font-style: italic;">generative model</span> that scales with $\sqrt{(1/n)\log d}$ ($d$ is the number of components of $x$), while, as follows from standard uniform convergence results, the corresponding rate for the <span style="font-style: italic;">discriminative method</span> is $\sqrt{(d/n)\log(d/n)}$. They argue that this result follows since the hypothesis class has a VC-dimension of $d$. 
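As an aside, the generative-discriminative pair just described is easy to set up on synthetic data. The sketch below is not the authors' experimental setup: the data model (which happens to satisfy the naive Bayes assumption), the sample sizes, and the step size are all invented for illustration, and the logistic fit uses plain stochastic gradient ascent on the conditional log-likelihood.

```python
import math
import random

rng = random.Random(0)
d = 20                     # number of features (made up)
mu = [0.3] * d             # class-conditional mean shift per feature (made up)

def sample(n):
    # x_j | y ~ N(y * mu_j, 1): features are conditionally independent,
    # so the naive Bayes assumption holds for this data
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        data.append(([rng.gauss(y * m, 1.0) for m in mu], y))
    return data

train, test = sample(30), sample(2000)

# Generative member: Gaussian naive Bayes, fit by per-class moment matching.
def fit_nb(data):
    stats = {}
    for y in (-1, 1):
        xs = [x for x, label in data if label == y]
        cols = list(zip(*xs))
        means = [sum(c) / len(c) for c in cols]
        var = [sum((v - m) ** 2 for v in c) / len(c) + 1e-6
               for c, m in zip(cols, means)]
        stats[y] = (means, var, len(xs) / len(data))
    return stats

def nb_predict(stats, x):
    def joint_loglik(y):
        means, var, prior = stats[y]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, var))
    return 1 if joint_loglik(1) > joint_loglik(-1) else -1

# Discriminative member: logistic regression over the same linear form,
# fit by stochastic gradient ascent on the conditional log-likelihood.
def fit_lr(data, epochs=200, lr=0.05):
    w = [0.0] * (d + 1)    # last coordinate is the bias
    for _ in range(epochs):
        for x, y in data:
            xb = x + [1.0]
            z = sum(wi * xi for wi, xi in zip(w, xb))
            z = max(-500.0, min(500.0, z))        # guard against overflow
            g = y / (1.0 + math.exp(y * z))       # gradient of log sigmoid(y*z)
            for i, xi in enumerate(xb):
                w[i] += lr * g * xi
    return w

def lr_predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else -1

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

nb, w = fit_nb(train), fit_lr(train)
acc_nb = accuracy(lambda x: nb_predict(nb, x), test)
acc_lr = accuracy(lambda x: lr_predict(w, x), test)
print("naive Bayes test accuracy:      ", acc_nb)
print("logistic regression test accuracy:", acc_lr)
```

On data like this, where the generative model is well specified and $n$ is small relative to $d$, the generative member tends to do well, which is the regime Ng and Jordan highlight; the sketch is only meant to make the pair concrete, not to reproduce their experiments.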
Note the difference in the way the two bounds scale with $d$, the dimension of the input space: In the case of the discriminative algorithm the scaling is (log-)linear in $d$, while in the case of the generative algorithm it is logarithmic in $d$. (Strictly speaking, the upper bound would call for a proof since logistic regression is <span style="font-style: italic;">not</span> a risk-minimization algorithm for the <span style="font-style: italic;">0/1 loss</span> and the cited theory has been worked out for risk-minimization algorithms.) Since comparing upper bounds is not really appropriate, they point to existing lower bounds which show that the worst-case sample complexity is <span style="font-style: italic;">lower bounded</span> by $\Omega(d)$. (My favorite paper on this is by Antos and Lugosi: "Strong minimax lower bounds for learning": <a href="http://www.szit.bme.hu/%7Eantos/ps/anlu_strlower.ps.gz">link</a>.) Hence, the conclusion of Ng and Jordan is that, contrary to the widely held belief, generative learning can sometimes be more efficient than discriminative learning; at least when the number of features is large compared to the number of samples and the ultimate loss of the generative learner is not much higher than that of the discriminative learner. They also performed experiments on UCI datasets to validate this claim. The experiments show that the performance of naive Bayes is indeed often better when the number of examples is small. Unfortunately, the figures show the loss as a function of the number of samples only, and hence validate the theory only half-way, since for a full validation we should know how the dimension of the dataset influences the performance.<br />A similar conclusion to that of Ng and Jordan was shown to hold in an earlier KDD-97 paper titled "Discriminative vs. Informative Learning" by Y. 
Dan Rubinstein and <a href="http://www-stat.stanford.edu/%7Ehastie/">Trevor Hastie</a> (<a href="http://www-stat.stanford.edu/%7Ehastie/Papers/kdd97.ps">link</a>) who did a case study with simulated and real examples.<br />The convergence rate comparison looks quite intriguing at first sight. However, some more thinking reveals that the story is far from finished. To see this, consider the "rate" function $r(n,d) = L_{gen}$ if $n\le d-1$ and $r(n,d) =\min(L_{gen},\sqrt{d/n})$ if $n\ge d$. Imagine that $r(n,d)$ is an upper bound on the loss of some learner, where $L_{gen}$ is the loss of the generative learner. The rate of convergence then is $\sqrt{d/n}$, not contradicting the lower bounds, but clearly, this discriminative learner will be competitive with the generative learner.<br />So is discriminative learning more (sample) efficient than generative learning? It seems that sometimes having both the positive and negative samples at hand can be useful. However, generative learning might be advantageous when the model considered fits the data well. Hence, the case is not yet settled. Many interesting open questions!<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com3tag:blogger.com,1999:blog-1374054023917435198.post-34874040365404436572007-07-30T04:20:00.001-07:002007-07-30T05:14:42.450-07:00Learning Symbolic ModelsIt is common sense that successful generalization is the key to efficient learning in difficult environments. 
It appears to me that this must be especially true for reinforcement learning.<br />One potentially very powerful idea for achieving successful generalization is to learn symbolic models. Why? Because a symbolic model (almost by definition) allows for very powerful generalizations (e.g., actions with parameters, state representations of environments with a variable number of objects of different object types, etc.).<br />JAIR just published a paper on this topic by H. M. Pasula, L. S. Zettlemoyer and L. P. Kaelbling, with the title <a href="http://www.jair.org/papers/paper2113.html">"Learning Symbolic Models of Stochastic Domains"</a>. A brief glance reveals that the authors propose a greedy learning method, assuming a particular representation. The learning problem itself was shown earlier to be NP-hard, hence this sounds like a valid approach. <br />However, one thing is badly missing from this approach: learning the representation itself. It appears that the authors' assumption is that the state representation of the environment is given in an appropriate symbolic form. This is, in my opinion, a very strong assumption -- at least when it comes to dealing with certain real-world problems where only noisy sensory information about the environment is available (think of robotics). <br />However, I have to admit that I am disappointed only because I wrongly assumed (given the title) that the paper would be about the problem of learning representations. Will we ever be able to learn (symbolic) representations in an efficient manner? What are the conditions that allow efficient learning? Do we need the flexibility of symbolic representations to scale up reinforcement learning at all? Here at the UofA several people are interested in these questions (well, who is not??). 
More work needs to be done, but hopefully, some day you will hear more about our efforts.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com2tag:blogger.com,1999:blog-1374054023917435198.post-31659616151490550962007-05-21T14:26:00.000-07:002007-05-21T14:29:24.871-07:00Minimize!Matlab code by Carl Rasmussen for minimizing a multivariate function whose partial derivatives are available can be downloaded from <a href="http://www.kyb.tuebingen.mpg.de/bs/people/carl/code/minimize/">here</a>. The routine looks pretty efficient, at least on the classical Rosenbrock function.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-4524993750662463582007-05-20T10:07:00.000-07:002007-05-20T13:05:31.212-07:00A notion of function compressionThe following compressibility concept is introduced by Harnik and Naor in their recent <a href="http://www.wisdom.weizmann.ac.il/%7Enaor/PAPERS/compressibility.pdf">paper</a>: Given a function $f$ over some domain, a compression algorithm for $f$ should efficiently compress an input $x$ in a way that preserves the information needed to compute $f(x)$. Note that if $f$ is available (and efficiently computable) then compression is trivial, as then $y=f(x)$ will serve as the compact representation of $x$. The concept was actually originally studied in the framework of NP decision problems, where, unless $P=NP$, $f$ is not efficiently computable, hence the trivial solution is not available.<br /><br />I am wondering if this compressibility notion could be used in learning theory or function approximation? Consider e.g. classification problems so that the output of $f$ is $\{0,1\}$. 
In order to prevent the trivial solution we may require that $x$ be compressed to some $y$ such that for some fixed (efficiently computable) function $g$, $f(g(y))=f(x)$. We do not require that $g(y)=x$, though if $g$ satisfies this then we certainly get a solution: if a compact representation of $\{ x| f(x)=1\}$ is available then $f$ can be well-compressed. We might of course allow for some error and study approximation rates. Of course, the notion applies to real-valued or more general functions, too. What is the class of functions that can be efficiently compressed? How is this class related to classical smoothness spaces (in the case of real-valued functions)? One obvious thought is that the smoother the target function is, the better the obvious discretization approach would be. It would be interesting to find out whether this new approach could lead to new regularity descriptions by varying the restrictions on the compression.<br /><br />Another related thought is to compress the training samples in a learning problem such that the solution is kept (roughly) the same.<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com1tag:blogger.com,1999:blog-1374054023917435198.post-79267981281693796422007-04-17T00:38:00.000-07:002007-08-12T10:01:37.923-07:00LaTeX supportI added the following two lines to the header section of the template of this page:<br /><br /><pre> <script type="text/javascript"<br />src="http://www.maths.nottingham.ac.uk/personal/drw/LaTeXMathML.js"><br /></script><br /></pre><br />Suddenly, I can type (almost) anything in LaTeX, e.g. \$a_n\$ becomes $a_n$. Fancy!<br />(If you do not see anything fancy then either Javascript is disabled in your browser or you are using Internet Explorer without MathML support. 
In the latter case you may want to download <a href="http://www.dessci.com/en/products/mathplayer/download.htm">MathPlayer</a> by Design Science.)<br /><br />Many thanks to <a href="http://www1.chapman.edu/%7Ejipsen/">Peter Jipsen</a> and the folks who developed <a href="http://www1.chapman.edu/%7Ejipsen/asciimath.html">ASCIIMathML</a>, which serves as the basis of <a href="http://www.maths.nottingham.ac.uk/personal/drw/lm.html">LaTeXMathML</a> by <a href="http://www.maths.nottingham.ac.uk/personal/drw/">Douglas R. Woodall</a>. Examples showing what is possible with LaTeXMathML can be found <a href="http://www.maths.nott.ac.uk/personal/drw/lmtest.html">here</a>. This is an indispensable tool!<br /><br />The nice thing is that MathML is scalable:<br /><blockquote><span style="font-size:180%;">$E=m c^2$</span></blockquote><div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com6tag:blogger.com,1999:blog-1374054023917435198.post-54831815401153876442007-04-16T18:05:00.000-07:002007-04-16T18:34:02.855-07:00The Fastest Mixing Markov Chain on a GraphThe paper can be found <a href="http://www.stanford.edu/%7Eboyd/reports/fmmc.pdf">here</a>. The authors are <a href="http://www.stanford.edu/%7Eboyd/">Stephen Boyd</a>, Persi Diaconis and Lin Xiao. I found the paper while looking at the papers of <a href="http://www-stat.stanford.edu/%7Ecgates/PERSI/">Persi Diaconis</a>, a notable mathematician and magician.<br /><br />The paper talks about exactly what the title suggests: You are given a finite graph and you can set up a random walk on this graph by choosing the transition probabilities between vertices that are connected by an edge. The walk must be symmetric so that the uniform distribution is a stationary distribution of this walk. Assuming that the associated Markov chain is irreducible and aperiodic, the state distribution will converge to the uniform one. 
The task is to maximize this rate of convergence. The solution is the Fastest Mixing Markov Chain (FMMC) on the graph.<br /><br />The authors show that this is a convex optimization problem and give a polynomial algorithm (based on semidefinite programming) to find the solution. A subgradient method is given that can be more effective for larger graphs. The solution is generalized to the non-symmetric case by considering reversible Markov chains.<br /><br />Why do I find this paper interesting? Well, it is always nice to find out that a problem is convex, hence can be solved in an "efficient" manner. But the main reason is that if an agent wanted to find out the most about the states of its environment, it would need to find exactly the FMMC (assuming a finite state environment). Questions: What if the transition probabilities cannot be changed in an arbitrary manner? The case I have in mind is when you have a Markovian Decision Problem and you want to find the fastest mixing policy, where the goal is to converge, say, to a distribution which is close to uniform (the uniform distribution might not be achievable). Is there a well-defined solution to this problem? Do the ideas generalize to this problem? To what extent? Can such a solution be found while interacting with the environment?<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-42272032786432993552007-04-15T19:43:00.000-07:002007-04-15T20:35:50.177-07:00The Loss Rank Principle by Marcus HutterI found the <a href="http://arxiv.org/PS_cache/math/pdf/0702/0702804v1.pdf">paper</a> posted by Marcus Hutter on arxiv quite interesting. The paper is about model (or rather predictor) selection. 
The idea is a familiar one, but the details appear to be novel: You want to find a model which yields small loss on the dataset available, while yielding a larger loss on most other datasets.<br /><br /><span style="font-weight: bold;">Classification</span>: The simplest case is when we consider supervised learning and the target set is finite. Then you can count the number of target label variations such that the predictor's loss is smaller than its loss when the true targets are used. This idea sounds very similar to the way Rademacher complexity works, see e.g. the <a href="http://www.econ.upf.es/%7Elugosi/penaltynewrev.pdf">paper</a> of Lugosi and Wegkamp, where a localized version of Rademacher complexity is investigated.<br /><br /><span style="font-weight: bold;">Regression</span>: For continuous targets you can use a grid with an increasing resolution (assume that the range of targets is bounded) and count the number of gridpoints such that the predictor's loss is less than its loss on the true dataset.<br />With an appropriate normalization this converges to the volume of such target values (hopefully this set is measurable:)).<br /><br />The paper does not go very far: Some examples are given that demonstrate that the criterion gives a computable procedure and that this procedure is reasonable. A quick comparison to alternatives is given. It will be interesting to see the further developments!<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0tag:blogger.com,1999:blog-1374054023917435198.post-21738545307324279152007-04-14T19:48:00.000-07:002007-04-14T20:41:51.522-07:00Why this Blog??I am struggling with organizing the notes I make after lectures or after reading a paper. Hence I will experiment with this fancy way of keeping track of my thoughts. 
Of course, I will be happy to receive feedback from the occasional readers.<br /><br />We will see how well it goes!<div class="blogger-post-footer">MLReadings</div>Csaba Szepesvárihttp://www.blogger.com/profile/13790307935040509983noreply@blogger.com0