Skip to main content

Numerical Errors, Perturbation Analysis and Machine Learning

Everyone hates numerical errors. We love to think that computers are machines with infinite precision. When I was a student, I really hated error analysis. It sounded like a subject that is set out to study an annoying side-effect of our imperfect computers, a boring detail that is miles away from anything that anyone would ever consider a nice part of mathematics. I will not try to convince you today that the opposite is true. However, even in error analysis there are some nice ideas and lessons to be learned. This post asks the question whether, if you are doing machine learning, you should care about numerical errors.

This issue should be well understood. However, I don't think that it is as well appreciated as it should be, or that it received the attention it should. In fact, I doubt that the issue is discussed in any of the recent machine learning textbooks beyond the usual caveat "beware the numerical errors" (scary!).

In this blog, I will illustrate the question of numerical errors and stability with the problem of learning a real-valued function on the $[0,1]$ interval. Let $f$ be this unknown function. We will assume that $f$ is square integrable as our objective will be to produce an approximation to $f$ that is close to $f$ in the (unweighted) $2$-norm. For simplicity, let as assume that we have decided to approximate $f$ using polynomials of degree $d-1$. Thus, defining $G = \{ g \,:\, g(x) = \sum_{j=0}^{d-1} \alpha_j x^j, \alpha_0,\ldots,\alpha_{d-1}\in \mathbb{R} \}$, our goal is to find the minimizer

$P(f) = \arg\min_{g\in G} \|f-g\|_2$,                                 (Proj)

where $\|f\|_2^2 = \int_0^1 f^2(x) dx$. Clearly, $G$ is a $d$-dimensional subspace of $L^2([0,1])$. To represent the functions in $G$, let us choose a basis of $G$; call the $j$th basis element $p_j$ ($j=1,\ldots,d$). Thus, $p_j:[0,1] \rightarrow \mathbb{R}$ is some polynomial. One possibility, in the lack of better idea, is to choose $p_j(x) = x^j$; if you want, I use $(p_j)_{j=1}^d$ just for the sake of increasing generality (and to play with you a bit later).

Since $(p_j)_j$ is a basis, any function $g$ of $G$ can be uniquely written as $g = \sum_{j=1}^d \theta_j p_j$. Since $g^* = P(f)$ is also an element of $G$, it can also be written in this form: $g^* = \sum_{j=1}^d \theta_j^* p_j$. To figure out $g^*$, it is thus sufficient to figure out the vector $\theta^* = (\theta_j^*)_j$. To find this vector, we can find the minimum of

$J(\theta) = \| f - \sum_{j=1}^d \theta_j p_j \|_2^2$ 

(since $u \mapsto u^2$ is monotone on $[0,\infty)$). For this, it is useful to remember that for $f\in L^2([0,1])$, $\|f\|_2^2 =\langle f,f \rangle$, where the inner produce $\langle \cdot,\cdot\rangle$ is defined for all $f,g\in L^2([0,1])$ using $\langle f,g \rangle = \int_0^1 f(x)g(x) dx$. Taking the derivative of $J(\theta)$ with respect to $\theta_i$ and equating the result with zero, we get,

$0=\frac{d}{d\theta_i} J(\theta) = 2 \langle \sum_{j=1}^d \theta_j p_j-f, p_i \rangle$.

Using the symmetry and linearity of the inner product $\langle \cdot,\cdot\rangle$,  introducing the matrix $A$ whose $(i,j)$th element is $a_{ij} = \langle p_i, p_j \rangle$ and the vector $b$ whose $i$th element is $\langle f,p_i \rangle$, collecting the above equations for $i=1,\ldots,d$ and reordering the terms, we see that these equations can be written in the short form

$A\theta = b$.

Note that here $A$ is a positive definite matrix. Solving this linear system thus will give us $\theta^*$ and hence $g^*$.

So far so good. However, in practice, we rarely have access to the true function $f$. One commonly studied set-up in machine learning assumes that we have a noise "sample" of the form $((X_t,Y_t);t=1,\ldots,n)$ only, where $(X_t,Y_t)$ is i.i.d. and $\mathbb{E}[Y_t|X_t] = f(X_t)$ and $\mathrm{Var}(Y_t|X_t)\le c<\infty$. However, I do not want to go this far, but consider a conceptually simpler case when instead of $f$, we can access $\hat{f}: [0,1] \rightarrow \mathbb{R}$. Think of $\hat{f}$ as the "noise corrupted" version of $f$. Later it should become clear how the reasoning that follows extends to the case when only a finite dataset of the previously mentioned form is available.

Realizing that we do not have access to $f$, but only to $\hat{f}=f+\Delta f$, it is clear that we won't be able to calculate $g^* = P(f)$, but only (say) $P(\hat{f})$. Let's call the result $\hat{g}^*$: $\hat{g}^* = P(\hat{f})$. How far is $\hat{g}^*$ from $f$?

By the Pythagorean theorem, $\|f-\hat{g}^*\|_2^2 = \| f- g^* \|_2^2 + \|g^* - \hat{g}^* \|_2^2$.
Here, $ \| f- g^* \|_2^2$ is the error that can only be reduced by increasing the degree $d$ (or increasing $G$), while  $\|g^* - \hat{g}^* \|_2^2$ is the error due to starting with the noisy function $\hat{f}$ and not with $f$. The problem of studying the size of $\|f- g^* \|_2^2$ is the subject of approximation theory (where various nice results show how, e.g., the degree of smoothness and $d$ jointly influence the rate at which this error decreases to zero as $d$ goes to infinite) and this error is called the approximation error. We won't get into this nice subject, perhaps in a future post.

So it remains to calculate how big the other term is (which is normally called the estimation error, as it measures the effect of the stochastic error $\Delta f$). One way to understand this term is by using the error analysis (a.k.a. stability/perturbation analysis) of linear systems. Perhaps this is not the simplest way, but by pursuing this direction, the article will be more entertaining perhaps and hopefully we will also make some useful observations. Readers suspecting mischief are asked to stay calm..

Let us start this error analysis by observing that if we write $\hat{g}^* =\sum_{j=1}\hat{\theta}_j^* p_j$ then the vector $\hat{\theta}^*$ will be the solution to the linear system $A \theta = b+\Delta b$, where $\hat{\Delta b}_j = \langle\Delta f,p_j\rangle$, $j=1,\ldots,d$. Now, $\| \sum_{j=1}^d \theta_j p_j \|_2^2 = \theta^\top A \theta$, hence

$\|g^* - \hat{g}^* \|_2^2 = ({\theta}^*-\hat{\theta}^*)^\top A ({\theta}^*-\hat{\theta}^*)$.

Since $A \hat{\theta}^* = b+\Delta b$ and $A {\theta}^* = {b}$, subtracting the two, we get $A ( \hat{\theta}^*- {\theta}^*)  = \Delta b$, from where we get that

$({\theta}^*-\hat{\theta}^*)^\top A ({\theta}^*-\hat{\theta}^*) = (\Delta b)^\top A^{-1} \Delta b$.


 $\|g^* - \hat{g}^* \|_2^2= (\Delta b)^\top A^{-1} \Delta b\le \lambda_{\min}^{-1}(A) \|\Delta b \|_2^2$,                            (*)

where $\lambda_{\min}(A)$ denotes the minimum eigenvalue of $A$. Note that by appropriately choosing the error term $\Delta b$, the inequality can be made sharp. What inequality (*) thus means is that the error caused by $\hat{f}-f$ can be badly enlarged if $\Delta b$ has an unfortunate relation to $A$.

Readers familiar with the error-analysis of linear systems of equation will recognize that (*) is similar but still different to the inequality that shows how errors propagate in the solution of linear systems due to coefficient errors (including errors in $b$). The difference is easy to understand: It is caused by the fact that in numerical analysis the error of the parameter-vector is measured in the (unweighted) 2-norm, whereas here the error is shown for the norm defined using $\|x\|_{A}^2 = x^\top A x$; in this application this error is more natural than the usual $2$-norm. Thus, this is one of the lessons: Use the norm that is the most appropriate for the application (because of this change in the norms, instead of the conditioning number of $A$, we see the inverse of the minimum eigenvalue of $A$).

To see how bad (*) can be, consider the case when $p_j(x) = x^{j-1}$, i.e., the standard polynomial basis. In this case, $a_{ij} = \int_0^1 x^{i+j-2} dx = [x^{i+j-1}/(i+j-1)]_0^1 = 1/(i+j-1)$. The resulting matrix is an old friend, the so-called Hilbert matrix. In fact, this matrix is most infamous for its horrible conditioning, but most of the blame should go to how small the minimum eigenvalue of $A$ is. In fact, a result of Szegő (1975) shows that as $d\rightarrow \infty$, $\lambda_{\min}(A) \sim 2^{15/4} \pi^{3/2} d^{1/2} (\sqrt{2}-1)^{4d+4}$ (an amusing asymptotic relationship on its own). This the reciprocal value of the minimum eigenvalue of $A$ is exponentially large in $d$, thus, potentially totally defeating the purpose of increasing $d$ to reduce the approximation error.

Thus, perhaps we should choose a different system of polynomials. Indeed, once we know that the goal is to keep $\lambda_{\min}(A)$ large, we can choose the polynomials so as to make this happen. One way, which also helps considerably with speeding up the calculations, is to choose $(p_j)$ such that $A$ is diagonal and if diagonal, why not choose $(p_j)$ such that $A$ becomes the identity matrix. We can simply start by $p_1 = 1$ (because $\int_0^1 1^2 dx=1$) and then choose the coefficients of $p_2(x) = a_{20} + a_{21} x$ such that $\langle p_2,p_2 \rangle=1$ and $\langle p_1,p_2 \rangle = 0$ (this gives $[a_{20} x+ 2 a_{20}a_{21} x^2/2+a_{21}^2 x^3/3]_0^1=1$ and $[a_{20}x + a_{21} x^2/2]_0^1=0$, which can be solved for the coefficients $a_{20}$, $a_{21}$). Continuing the same way, we can find $p_1,\ldots,p_d$, which are known as the shifted Legendre polynomials (the unshifted versions are defined with the same way for the $[-1,1]$ interval).

At this stage, however, the astute reader should really wonder whether I have lost my sense of reality! How on the earth should the choice of the basis of $G$ influence the error caused by the "noisy measurement"??? In fact, a very simple reasoning shows that the basis (and thus $A$) should not change how big $\|\hat{g}^*-g^* \|_2^2$ is. To see why note that $\hat{g}^*-g^* =   P(\hat{f})-P(f)$. Up to know, I have carefully avoided giving a name to $P$, but in fact, $P$ does not have to be introduced: It is the well-known orthogonal projection operator. As such, it is a linear operator. Hence, $ P(\hat{f})-P(f) = P(  \hat{f}-f ) = P(\Delta f)$. It is also known, that projections cannot make vectors larger (they are non-expansions in the 2-norm that defines them). Hence, $\|P(\Delta f)\|_2 \le \|\Delta f\|_2$, which in summary means that

$\|\hat{g}^*-g^*\|_2 \le \|\Delta f \|_2$.

The "measurement" error of $f$ cannot be enlarged by projecting the measured function to the function space $G$. How do we reconcile this with (*)? One explanation is that (*) is too conservative; it is a gross over-estimate. However, we also said that the only inequality that was used could be sharp if $\Delta b$ was selected in an "unfortunate" fashion. So that inequality could be tight. Is this a contradiction? [Readers are of course encouraged to stop here and think this through before continuing with reading the rest.]

No, there is no contradiction. All the derivations were sounds. The answer to the puzzle is that $\Delta b$ cannot be selected arbitrarily. By writing $\Delta f = \Delta f^{\parallel}+ \Delta f^{\perp}$, where $\Delta f^{\parallel}\in G$ and $\Delta f^{\perp}$ is perpendicular to $G$, we see that $\Delta b_j = \langle \Delta f^{\parallel}, p_j\rangle$. Since $\Delta f^{\parallel}\in G$, it can be written as $\sum_j \gamma_j p_j$ with some coefficients $(\gamma_j)$. Hence, $\Delta b_j = \langle \Delta f, p_j \rangle  = \sum_k \gamma_k \langle p_k,p_j\rangle$, or $\Delta b = A \gamma$! Thus $\Delta b$ is restricted to lie in the range of $A$. The rest is easy:

$\|g^* - \hat{g}^* \|_2^2= (\Delta b)^\top A^{-1} \Delta b = \gamma^\top A \gamma = \|P(\Delta f) \|_2^2$.

The conclusion? Was this a pointless exercise? Is the stability analysis of linear systems irrelevant to machine learning? No, not at all! What we can conclude is that although the basis chosen does not influence the error in the final approximator that can be attributed to errors in measuring the unknown function, this conclusion holds only if we assume infinite precision arithmetic. With finite precision arithmetic, our analysis shows that rounding errors can be blown up exponentially with the dimension of the subspace $G$ growing to infinity. Thus, with an improperly chosen basis, the rounding errors can totally offset or even reverse the benefit that is expected from increasing the dimensionality of the approximation space $G$.

 If you cannot calculate it with a computer, you cannot estimate it either.

 (The careful reader may wonder about whether the rounding errors $\tilde{\Delta b}$ can lie in the dreaded subspace of $A$, but in this case they can.) Hence, when consider a finite dimensional approximation, care must be taken to choose the basis functions so that the resulting numerical problem is stable.

Finally, one commenter on the blog once noted that I should post more about reinforcement learning. So, let me finish by remarking that the same issues exist in reinforcement learning, as well. A basic problem in reinforcement learning is to estimate the value function underlying a stationary Markov policy. Long story short, such a policy gives rise to a Markov chain, with a transition probability kernel $\mathcal{P}$. We can view $\mathcal{P}$ as an operator that maps bounded measurable $f:X \rightarrow \mathbb{R}$ functions to akin functions: if $g=\mathcal{P} (f)$, $g(x) = \int_{y} f(y) \mathcal{P}(dy|x)$, where the integration is over the state space $X$. Evaluating the said policy then amounts to solving

$(I-\gamma\mathcal{P}) f = r$                           (PEval)

in $f$, where $0\le \gamma < 1$ is the so-called discount factor and $r:X \rightarrow \mathbb{R}$ is the so-called reward function (again, $r$ is assumed to be bounded, measurable). A common method to solve linear inverse problems like the above is Galerkin's method. Long story short, this amounts to selecting an orthogonal basis $(\psi_k)$ in $L^2(X;\mu)$ with $\mu$ being a probability measure and then solving the said system in $G= \{ f\,:\, f =\sum_{k=1}^d \alpha_k \psi_k, \alpha_1,\ldots,\alpha_d\in \mathbb{R} \}$ in the sense that the solution $g\in G$ is required to satisfy (PEval) in the 'weak sense': $\langle ((I-\gamma\mathcal{P})g,\psi_k\rangle = \langle r,\psi_k\rangle$, $k=1,\ldots,d$. As it turns out, there are numerically less and more stable ways of selecting the basis $(\psi_k)$, though figuring out how to do this in the practical settings when $I-\gamma\mathcal{P}$ is unknown but can be sampled remains for future work.

Szegő G. 1982 On some Hermitian forms associated with two given curves of the complex plane, Collected Papers, vol 2. (Basle: Birkhauser) p 666.


Unknown said…
This comment has been removed by a blog administrator.
PC Optimizer said…
This comment has been removed by a blog administrator.

Popular posts from this blog

Keynote vs. Powerpoint vs. Beamer

A few days ago I decided to give Keynote, Apple's presentation software, a try (part of iWork '09). Beforehand I used MS Powerpoint 2003, Impress from NeoOffice 3.0 (OpenOffice's native Mac version) and LaTeX with beamer. Here is a comparison of the ups and downs of these software, mainly to remind myself when I will reconsider my choice in half a year and also to help people decide what to use for their presentation. Comments, suggestions, critics are absolutely welcome, as usual. Btw, while preparing this note I have learned that has a native Mac Aqua version of OpenOffice. Maybe I will try it some day and update the post. It would also be good to include a recent version of Powerpoint in the comparison.
StabilityKeynote: Excellent
After a few days of usage, so take this statement with a grain of salt..MS Powerpoint 2003: ExcellentImpress: Poor
Save your work very oftenBeamer: ExcellentCreating visually appealing slides, graphics on slides
Keynote: Excellent

Approximating inverse covariance matrices

Phew, the last time I have posted an entry to my blog was a loong time ago.. Not that there was nothing interesting to blog about, just I always delayed things. (Btw, google changed the template which eliminated the rendering of the latex formulae, not happy.. Luckily, I could change back the template..) Now, as the actual contents:

I have just read the PAMI paper "Accuracy of Pseudo-Inverse Covariance Learning-A Random Matrix Theory Analysis" by D Hoyle (IEEE T. PAMI, 2011 vol. 33 (7) pp. 1470--1481).

The paper is about pseudo-inverse covariance matrices and their analysis based on random matrix theory analysis and I can say I enjoyed this paper quite a lot.

In short, the author's point is this:
Let \$d,n>0\$ be integers. Let $\hat{C}$ be the sample covariance matrix of some iid data $X_1,\ldots,X_n\in \mathbb{R}^d$ based on $n$ datapoints and let $C$ be the population covariance matrix (i.e., $\hat{C}=\mathbb{E}[X_1 X_1^\top]$). Assume that $d,n\rightarrow \infty$ …

Useful latex/svn tools (merge, clean, svn, diff)

This blog is about some tools that I have developed (and yet another one that I have downloaded) which help me to streamline my latex work cycle. I make the tools available, hoping that other people will find them useful. However, they are admittedly limited (more about this) and as usual for free stuff they come with zero guarantee. Use them at your own risk.

The first little tool is for creating a cleaned up file before submitting it to a publisher who asks for source files. I call it, it is developed in Perl. It can be downloaded from here.
The functionality is
(1) to remove latex comments
(2) to remove \todo{} commands
(3) to merge files included from a main file into the main file
(4) to merge the bbl file into the same main file

If you make the tool executable (chmod a+x, you can use it like this:

$ main.tex > cleaned.tex

How does this work?

The tool reads in the source tex file, processes it line by line and produces some output to the standard ou…