## Monday, 16 September 2013

### Numerical Errors, Perturbation Analysis and Machine Learning

Everyone hates numerical errors. We love to think that computers are machines with infinite precision. When I was a student, I really hated error analysis. It sounded like a subject that is set out to study an annoying side-effect of our imperfect computers, a boring detail that is miles away from anything that anyone would ever consider a nice part of mathematics. I will not try to convince you today that the opposite is true. However, even in error analysis there are some nice ideas and lessons to be learned. This post asks the question whether, if you are doing machine learning, you should care about numerical errors.

This issue should be well understood. However, I don't think that it is as well appreciated as it should be, or that it received the attention it should. In fact, I doubt that the issue is discussed in any of the recent machine learning textbooks beyond the usual caveat "beware the numerical errors" (scary!).

In this blog, I will illustrate the question of numerical errors and stability with the problem of learning a real-valued function on the $[0,1]$ interval. Let $f$ be this unknown function. We will assume that $f$ is square integrable as our objective will be to produce an approximation to $f$ that is close to $f$ in the (unweighted) $2$-norm. For simplicity, let as assume that we have decided to approximate $f$ using polynomials of degree $d-1$. Thus, defining $G = \{ g \,:\, g(x) = \sum_{j=0}^{d-1} \alpha_j x^j, \alpha_0,\ldots,\alpha_{d-1}\in \mathbb{R} \}$, our goal is to find the minimizer

$P(f) = \arg\min_{g\in G} \|f-g\|_2$,                                 (Proj)

where $\|f\|_2^2 = \int_0^1 f^2(x) dx$. Clearly, $G$ is a $d$-dimensional subspace of $L^2([0,1])$. To represent the functions in $G$, let us choose a basis of $G$; call the $j$th basis element $p_j$ ($j=1,\ldots,d$). Thus, $p_j:[0,1] \rightarrow \mathbb{R}$ is some polynomial. One possibility, in the lack of better idea, is to choose $p_j(x) = x^j$; if you want, I use $(p_j)_{j=1}^d$ just for the sake of increasing generality (and to play with you a bit later).

Since $(p_j)_j$ is a basis, any function $g$ of $G$ can be uniquely written as $g = \sum_{j=1}^d \theta_j p_j$. Since $g^* = P(f)$ is also an element of $G$, it can also be written in this form: $g^* = \sum_{j=1}^d \theta_j^* p_j$. To figure out $g^*$, it is thus sufficient to figure out the vector $\theta^* = (\theta_j^*)_j$. To find this vector, we can find the minimum of

$J(\theta) = \| f - \sum_{j=1}^d \theta_j p_j \|_2^2$

(since $u \mapsto u^2$ is monotone on $[0,\infty)$). For this, it is useful to remember that for $f\in L^2([0,1])$, $\|f\|_2^2 =\langle f,f \rangle$, where the inner produce $\langle \cdot,\cdot\rangle$ is defined for all $f,g\in L^2([0,1])$ using $\langle f,g \rangle = \int_0^1 f(x)g(x) dx$. Taking the derivative of $J(\theta)$ with respect to $\theta_i$ and equating the result with zero, we get,

$0=\frac{d}{d\theta_i} J(\theta) = 2 \langle \sum_{j=1}^d \theta_j p_j-f, p_i \rangle$.

Using the symmetry and linearity of the inner product $\langle \cdot,\cdot\rangle$,  introducing the matrix $A$ whose $(i,j)$th element is $a_{ij} = \langle p_i, p_j \rangle$ and the vector $b$ whose $i$th element is $\langle f,p_i \rangle$, collecting the above equations for $i=1,\ldots,d$ and reordering the terms, we see that these equations can be written in the short form

$A\theta = b$.

Note that here $A$ is a positive definite matrix. Solving this linear system thus will give us $\theta^*$ and hence $g^*$.

So far so good. However, in practice, we rarely have access to the true function $f$. One commonly studied set-up in machine learning assumes that we have a noise "sample" of the form $((X_t,Y_t);t=1,\ldots,n)$ only, where $(X_t,Y_t)$ is i.i.d. and $\mathbb{E}[Y_t|X_t] = f(X_t)$ and $\mathrm{Var}(Y_t|X_t)\le c<\infty$. However, I do not want to go this far, but consider a conceptually simpler case when instead of $f$, we can access $\hat{f}: [0,1] \rightarrow \mathbb{R}$. Think of $\hat{f}$ as the "noise corrupted" version of $f$. Later it should become clear how the reasoning that follows extends to the case when only a finite dataset of the previously mentioned form is available.

Realizing that we do not have access to $f$, but only to $\hat{f}=f+\Delta f$, it is clear that we won't be able to calculate $g^* = P(f)$, but only (say) $P(\hat{f})$. Let's call the result $\hat{g}^*$: $\hat{g}^* = P(\hat{f})$. How far is $\hat{g}^*$ from $f$?

By the Pythagorean theorem, $\|f-\hat{g}^*\|_2^2 = \| f- g^* \|_2^2 + \|g^* - \hat{g}^* \|_2^2$.
Here, $\| f- g^* \|_2^2$ is the error that can only be reduced by increasing the degree $d$ (or increasing $G$), while  $\|g^* - \hat{g}^* \|_2^2$ is the error due to starting with the noisy function $\hat{f}$ and not with $f$. The problem of studying the size of $\|f- g^* \|_2^2$ is the subject of approximation theory (where various nice results show how, e.g., the degree of smoothness and $d$ jointly influence the rate at which this error decreases to zero as $d$ goes to infinite) and this error is called the approximation error. We won't get into this nice subject, perhaps in a future post.

So it remains to calculate how big the other term is (which is normally called the estimation error, as it measures the effect of the stochastic error $\Delta f$). One way to understand this term is by using the error analysis (a.k.a. stability/perturbation analysis) of linear systems. Perhaps this is not the simplest way, but by pursuing this direction, the article will be more entertaining perhaps and hopefully we will also make some useful observations. Readers suspecting mischief are asked to stay calm..

Let us start this error analysis by observing that if we write $\hat{g}^* =\sum_{j=1}\hat{\theta}_j^* p_j$ then the vector $\hat{\theta}^*$ will be the solution to the linear system $A \theta = b+\Delta b$, where $\hat{\Delta b}_j = \langle\Delta f,p_j\rangle$, $j=1,\ldots,d$. Now, $\| \sum_{j=1}^d \theta_j p_j \|_2^2 = \theta^\top A \theta$, hence

$\|g^* - \hat{g}^* \|_2^2 = ({\theta}^*-\hat{\theta}^*)^\top A ({\theta}^*-\hat{\theta}^*)$.

Since $A \hat{\theta}^* = b+\Delta b$ and $A {\theta}^* = {b}$, subtracting the two, we get $A ( \hat{\theta}^*- {\theta}^*) = \Delta b$, from where we get that

$({\theta}^*-\hat{\theta}^*)^\top A ({\theta}^*-\hat{\theta}^*) = (\Delta b)^\top A^{-1} \Delta b$.

Hence,

$\|g^* - \hat{g}^* \|_2^2= (\Delta b)^\top A^{-1} \Delta b\le \lambda_{\min}^{-1}(A) \|\Delta b \|_2^2$,                            (*)

where $\lambda_{\min}(A)$ denotes the minimum eigenvalue of $A$. Note that by appropriately choosing the error term $\Delta b$, the inequality can be made sharp. What inequality (*) thus means is that the error caused by $\hat{f}-f$ can be badly enlarged if $\Delta b$ has an unfortunate relation to $A$.

Readers familiar with the error-analysis of linear systems of equation will recognize that (*) is similar but still different to the inequality that shows how errors propagate in the solution of linear systems due to coefficient errors (including errors in $b$). The difference is easy to understand: It is caused by the fact that in numerical analysis the error of the parameter-vector is measured in the (unweighted) 2-norm, whereas here the error is shown for the norm defined using $\|x\|_{A}^2 = x^\top A x$; in this application this error is more natural than the usual $2$-norm. Thus, this is one of the lessons: Use the norm that is the most appropriate for the application (because of this change in the norms, instead of the conditioning number of $A$, we see the inverse of the minimum eigenvalue of $A$).

To see how bad (*) can be, consider the case when $p_j(x) = x^{j-1}$, i.e., the standard polynomial basis. In this case, $a_{ij} = \int_0^1 x^{i+j-2} dx = [x^{i+j-1}/(i+j-1)]_0^1 = 1/(i+j-1)$. The resulting matrix is an old friend, the so-called Hilbert matrix. In fact, this matrix is most infamous for its horrible conditioning, but most of the blame should go to how small the minimum eigenvalue of $A$ is. In fact, a result of Szegő (1975) shows that as $d\rightarrow \infty$, $\lambda_{\min}(A) \sim 2^{15/4} \pi^{3/2} d^{1/2} (\sqrt{2}-1)^{4d+4}$ (an amusing asymptotic relationship on its own). This the reciprocal value of the minimum eigenvalue of $A$ is exponentially large in $d$, thus, potentially totally defeating the purpose of increasing $d$ to reduce the approximation error.

Thus, perhaps we should choose a different system of polynomials. Indeed, once we know that the goal is to keep $\lambda_{\min}(A)$ large, we can choose the polynomials so as to make this happen. One way, which also helps considerably with speeding up the calculations, is to choose $(p_j)$ such that $A$ is diagonal and if diagonal, why not choose $(p_j)$ such that $A$ becomes the identity matrix. We can simply start by $p_1 = 1$ (because $\int_0^1 1^2 dx=1$) and then choose the coefficients of $p_2(x) = a_{20} + a_{21} x$ such that $\langle p_2,p_2 \rangle=1$ and $\langle p_1,p_2 \rangle = 0$ (this gives $[a_{20} x+ 2 a_{20}a_{21} x^2/2+a_{21}^2 x^3/3]_0^1=1$ and $[a_{20}x + a_{21} x^2/2]_0^1=0$, which can be solved for the coefficients $a_{20}$, $a_{21}$). Continuing the same way, we can find $p_1,\ldots,p_d$, which are known as the shifted Legendre polynomials (the unshifted versions are defined with the same way for the $[-1,1]$ interval).

At this stage, however, the astute reader should really wonder whether I have lost my sense of reality! How on the earth should the choice of the basis of $G$ influence the error caused by the "noisy measurement"??? In fact, a very simple reasoning shows that the basis (and thus $A$) should not change how big $\|\hat{g}^*-g^* \|_2^2$ is. To see why note that $\hat{g}^*-g^* = P(\hat{f})-P(f)$. Up to know, I have carefully avoided giving a name to $P$, but in fact, $P$ does not have to be introduced: It is the well-known orthogonal projection operator. As such, it is a linear operator. Hence, $P(\hat{f})-P(f) = P( \hat{f}-f ) = P(\Delta f)$. It is also known, that projections cannot make vectors larger (they are non-expansions in the 2-norm that defines them). Hence, $\|P(\Delta f)\|_2 \le \|\Delta f\|_2$, which in summary means that

$\|\hat{g}^*-g^*\|_2 \le \|\Delta f \|_2$.

The "measurement" error of $f$ cannot be enlarged by projecting the measured function to the function space $G$. How do we reconcile this with (*)? One explanation is that (*) is too conservative; it is a gross over-estimate. However, we also said that the only inequality that was used could be sharp if $\Delta b$ was selected in an "unfortunate" fashion. So that inequality could be tight. Is this a contradiction? [Readers are of course encouraged to stop here and think this through before continuing with reading the rest.]

No, there is no contradiction. All the derivations were sounds. The answer to the puzzle is that $\Delta b$ cannot be selected arbitrarily. By writing $\Delta f = \Delta f^{\parallel}+ \Delta f^{\perp}$, where $\Delta f^{\parallel}\in G$ and $\Delta f^{\perp}$ is perpendicular to $G$, we see that $\Delta b_j = \langle \Delta f^{\parallel}, p_j\rangle$. Since $\Delta f^{\parallel}\in G$, it can be written as $\sum_j \gamma_j p_j$ with some coefficients $(\gamma_j)$. Hence, $\Delta b_j = \langle \Delta f, p_j \rangle = \sum_k \gamma_k \langle p_k,p_j\rangle$, or $\Delta b = A \gamma$! Thus $\Delta b$ is restricted to lie in the range of $A$. The rest is easy:

$\|g^* - \hat{g}^* \|_2^2= (\Delta b)^\top A^{-1} \Delta b = \gamma^\top A \gamma = \|P(\Delta f) \|_2^2$.

The conclusion? Was this a pointless exercise? Is the stability analysis of linear systems irrelevant to machine learning? No, not at all! What we can conclude is that although the basis chosen does not influence the error in the final approximator that can be attributed to errors in measuring the unknown function, this conclusion holds only if we assume infinite precision arithmetic. With finite precision arithmetic, our analysis shows that rounding errors can be blown up exponentially with the dimension of the subspace $G$ growing to infinity. Thus, with an improperly chosen basis, the rounding errors can totally offset or even reverse the benefit that is expected from increasing the dimensionality of the approximation space $G$.

If you cannot calculate it with a computer, you cannot estimate it either.

(The careful reader may wonder about whether the rounding errors $\tilde{\Delta b}$ can lie in the dreaded subspace of $A$, but in this case they can.) Hence, when consider a finite dimensional approximation, care must be taken to choose the basis functions so that the resulting numerical problem is stable.

Finally, one commenter on the blog once noted that I should post more about reinforcement learning. So, let me finish by remarking that the same issues exist in reinforcement learning, as well. A basic problem in reinforcement learning is to estimate the value function underlying a stationary Markov policy. Long story short, such a policy gives rise to a Markov chain, with a transition probability kernel $\mathcal{P}$. We can view $\mathcal{P}$ as an operator that maps bounded measurable $f:X \rightarrow \mathbb{R}$ functions to akin functions: if $g=\mathcal{P} (f)$, $g(x) = \int_{y} f(y) \mathcal{P}(dy|x)$, where the integration is over the state space $X$. Evaluating the said policy then amounts to solving

$(I-\gamma\mathcal{P}) f = r$                           (PEval)

in $f$, where $0\le \gamma < 1$ is the so-called discount factor and $r:X \rightarrow \mathbb{R}$ is the so-called reward function (again, $r$ is assumed to be bounded, measurable). A common method to solve linear inverse problems like the above is Galerkin's method. Long story short, this amounts to selecting an orthogonal basis $(\psi_k)$ in $L^2(X;\mu)$ with $\mu$ being a probability measure and then solving the said system in $G= \{ f\,:\, f =\sum_{k=1}^d \alpha_k \psi_k, \alpha_1,\ldots,\alpha_d\in \mathbb{R} \}$ in the sense that the solution $g\in G$ is required to satisfy (PEval) in the 'weak sense': $\langle ((I-\gamma\mathcal{P})g,\psi_k\rangle = \langle r,\psi_k\rangle$, $k=1,\ldots,d$. As it turns out, there are numerically less and more stable ways of selecting the basis $(\psi_k)$, though figuring out how to do this in the practical settings when $I-\gamma\mathcal{P}$ is unknown but can be sampled remains for future work.

Bibliography:
Szegő G. 1982 On some Hermitian forms associated with two given curves of the complex plane, Collected Papers, vol 2. (Basle: Birkhauser) p 666.

## Sunday, 15 September 2013

### Student Response Systems -- A Year Later

I have promised to come back with my experience after the Fall semester of 2013. Though the end of that semester passed a long ago, here are some thoughts:

Overall my experience of socrative was very positive. During the first week I polled the class to see how large a percentage of the class has some kind of wifi enabled device that they could use. 90 out of the 95 students had some kind of device that they were willing to bring to the class, so I decided to give socrative a try.

Socrative helped me tremendously to stay on the top of what everyone in the class knows. The way I used socrative was as follows: After every block of new material (usually 5-10 slides), I inserted a few questions to verify whether the students "got" the new material. This was all the "extra work" that I had to put in designing the class material due to socrative. And I would have done something similar without socrative anyways, so in fact I did not feel much of a difference here. Once the question was projected, I started the question on socrative by pushing a button and then the students could start entering their answers. The answers are shown on the teacher's screen. Based on the feedback received, I could then decide whether I needed to re-explain something.

I used a tablet to connect to socrative, while I used a second computer to project my slides: This was handy as I did not have to exit the presentation to start the question in socrative. In the classroom there was a projector that could project the tablet's screen. Sometimes I have shown the feedback screen even before I got all the answers in. The students seemed to like this as it created an interesting dynamic in the class.

I used multiple choice questions mostly, but I also used yes/no questions occasionally. In addition to these, I also experimented with "short answer" questions. This latter question-type would be very useful and I generally prefer it. It is not only less work to prepare "short answer" questions, but the they are usually better at testing what the students know. To put in another way, it is really hard (and time-consuming) to design good multiple choice questions with valid-looking alternatives (if you have some ideas of how to automate this, I am all ears). Examples of the "short answer" questions I asked are: "What is the keyword that should go in this program in this place?" or "What will this program print?".

The feedback screen after a quiz-type question shows a histogram which is updated on the fly, so if you want, you can steer things in the class by showing how the histogram evolves. The same works for yes/no questions. Unfortunately, the feedback screens for the short answer questions are less than ideal as it shows the list of answers as they are coming in. On this list the same answers in fact can be repeated many times.. Needless to say, this is not the best way of presenting this information. It would be much nicer, for example, if there was a histogram representing the top answers.

I also replaced the end-of-class questionnaire of previous years with the "exit quiz" that socrative provided. Unfortunately, the questions on the exit quiz are not configurable and some of socrative's standard questions caused permanent confusion ("solve the problem on the board" -- we did not even have a board!). Also, to my surprise, the exit-quiz appeared to be less effective than the paper-based questionnaire in facilitating comments. Later, I figured that I can leave the exit-quiz open for a few days after class to collect more feedback; this helped, but unfortunately only just a little bit. Reading through the exit-quiz responses is also a bit of extra work if you have not done this before, but this was not the case for me. And I actually like to read through these extra comments; they are very useful most of the time.

Once, I also experimented with socrative's "quiz" component. This allows teachers to compile a longer list of questions (a quiz) that the students can solve in the class (either timed or not timed). This quiz was not marked. Unfortunately, there were too many technical difficulties: socrative was overloaded (the class size was around 90). Also, socrative eliminated leading whitespace characters from the answers, which was quite unfortunate as whitespaces are crucially important in Python, the programming language that we used in this class. Thus, I decided not to use the quiz again.

In conclusion, although the system is not perfect, it was usable and overall it helped both the students and me. I have received many comments from the student praising how socrative was useful to engage everyone, including the shy students. As this was my main motivation to try this system, I conclude that the experiment was successful.

PS: Next semester, I tried again. Unfortunately, in this semester we were in a different classroom, where first the wifi service became unreliable and then it completely stopped working. I have no idea why. No one had. I asked the technical support people to fix it but all I got was promises. This was sad. Luckily, this was a smaller class (around 50), so we could still have some true interaction with the students.

PPS: I have also used in the first semester a website where students could solve hundreds of little programming problems. The site provided the questions and it also have feedback, even including things like "did you forget to initialize your variable". Again, the tool had glitches (the question bank was not very well developed), but overall I think this was also a great tool.

## Monday, 10 September 2012

### Student Response Systems

There are plenty Student Response Systems (SRSs) out there. Which one to choose? This brief document summarizes what I have found on the web before the Fall of 2012.

The systems differ in a few characteristics. The first is what type of devices the students can use. Classical, iClicker like systems require the student buy a device which communicates with a receiver that the teacher should possess. Since our school uses iClickers, let me focus on them. The overall cost of first generation iClickers is $10/device, assuming the student sells back the device to the bookstore (which they can). This is not a major cost but who likes to pay when you don’t have to? The first limitation of iClicker-like systems is that they are bound to the smart classrooms and the computers there. Thus, if you are like me and use your own computer for projection, you will need to switch between screens to show the results of a poll. This makes the use of iClicker quite cumbersome. A second major limitation of iClickers is that they are limited to multiple choice questions. In particular, they don’t allow free text entry, numerical responses, or other type of test questions (like matching). Free text and numerical responses are very useful for assessing the knowledge of students and designing meaningful multiple choice questions is hard and time-consuming. # Systems accepting text input from non-smart phones An alternative to iClickers is to use systems that uses Wifi networks or the texting capabilities of conventional phones with texting capabilities. Wifi capable devices include smartphones, tablets and laptops. The systems that I have found that support texting (i.e., receiving input from non-smart phones) are Poll Everywhere, LectureTools and Top Hat Monocle. These also support input from Web-capable devices. Their pricing differs quite a bit. Poll Everywhere allows instructors to pay for the semester. The price currently is USD350. LectureTools requires you to pay for two semesters at the price of USD800. Top Hat Monocle does not allow the instructors to pay. The per student price is USD20 for a semester, or USD38 for five years. As said before, all these systems support Wifi and texting. I had a chance to test Poll Everywhere and LectureTools. I had some problems with LectureTools (importing my slides did not work). The concept of LectureTools is that you keep your slides on the web, tightly integrated with the questions. They support during the presentation annotation of slides, which is nice. However, overall LectureTools was not as smooth and easy to use as Poll Everywhere. For example, I could not figure out given the limited amount of time how the students will connect on the web to my questions. Poll Everywhere was really easy to use, on the other hand. It supports Powerpoint and Keynote. Compared to Top Hat Monocle, Poll Everywhere is not as feature rich (Top Hat Monocle has games, for example), but I was happy with the functionality Poll Everywhere provided. # Systems that use the Web-capable devices If one is contempt with Web-enabled devices, the number of available SRSs soars high. In fact, one can use any Web-based solution, many of them being free. Starting with the free options, of course, one can create a quiz in Moodle. Controlling what the students see and what they don’t is pretty cumbersome. Moodle is not very well suited for the purpose of live polling. Another method is to use Twitter. Students are pretty excited about Twitter (from the feedback I got), though I would be careful projecting everything that comes in to the screen -- some moderation might be essential to keep the class under control. Another problem with Twitter is that a tool for analyzing responses to questions would be needed (and I know of no such tool). The next option is to use Google Forms. The very idea of Google Forms is that information submitted on the web is sent to a Google spreadsheet. Since you need the spreadsheet to get the URL of the form, start by creating a spreadsheet, then insert a form there. Once the form is created (give it a cool skin!), get back to the spreadsheet to get the URL to it. You will send this to the students (maybe compressing it using tinyurl.com, or a similar service). You can control the timing of when the form is accepting input from the spreadsheet. To support ad hoc questions, you can just create an empty form and recycle it through your presentation. If you want to use a fixed set of questions, you will need one form (and thus spreadsheet) per question that you can store on your google drive. The downside of Google Forms is that students can submit as many responses as they wish. With Google Apps, presumably there is a way around this, but this would need to be investigated. Flisti is an extremely simple web-based polling systems (I guess there are many other similar systems). You go to their webpage, create the poll there and give the poll’s URL to the students. You can view the results of the poll online. I think that users are tracked based on their IP addresses, so no multiple submissions are possible from the same IP address to the same poll. Only multiple choice questions (with multiple answers, possibly) are supported. Socrative is a fuller Web-based SRS currently under beta-testing. During the beta-testing phase, the system is free. The web-based interface is nice and sleek and it was extremely user-friendly. The teacher can control in real-time which questions are “live” in his/her “classroom”. The classroom is identified by a numbe, that the students go to. The only issue with Socrative is that every activity is limited to 50 students. QuestionPress is another commercial system. The price for my class is$66. This seems to be a mature systems that I was truly impressed by. All interactions are Web-based. ClickerSchool is similar to QuestionPress, the price is $95 for my class for one semester. ClickerSchool are provided specialized smart phone apps (both for iOS and Android). eClicker, on the other hand, requires the teacher to buy a software. All their software supports Apple products only (Mac OSX and iOS). SRN (studentresponsenetwork.com) offers a campuswide license for$195. This is a client-server system that requires installing software on the teachers’, as well as the students’ computers.
I could not find pricing information on the web for TurningPoint/ResponseWare, which seems to be a mature product (but it cannot be “tried”). The same goes for vClicker.

# Which type of system to choose?

The question remains: Which type of system to choose? One factor to consider is how important it is to have other than multiple choice questions in class. Personally, I think that multiple choice questions exist only for historical reasons -- their pedagogical value is rather questionable. Good multiple choice questions are extremely hard and time consuming to create. If this is not convincing enough and you don’t mind switching the projector back and forth between your computer and the one in the classroom, you can stay with iClickers.
In the opposite case, the next thing to consider whether you want to support phones with text input. I have just polled my students and out of the 52 responses so far, 43 can and are willing to bring their laptops to the classroom, 37 carry a smartphone, 6 a tablet and 8 carry a non-Web enabled phone (these are overlapping groups of students). Thus, the vast majority of students are able to use a system that is built around the Web. Since it will always be hard to achieve 100% coverage, one idea is to let the students pair up or form groups of three. Based on the statistics I gathered, overall, I am leaning towards that support of texting input should not be viewed as a major advantage.
However, it should also be mentioned here that a potential danger of using laptops or other fancy, Web-enabled devices is that they represent potential sources of distraction. Thus, with the use of these devices, the teacher will need to face the challenge of competing for the students’ attention with the social networks, email, and ultimately the whole Web.
The next factor consider is the ease of use of the SRS. The teacher may need to create a significant number of questions for every class. Integration with Powerpoint and Keynote may be a plus, but switching between a browser and a presentation software looks easy enough.
Since I will use the chosen SRS only for formative assessment and not for grading (the present common sense is that this would be a bad idea, not talking about that this would indeed require full coverage of the whole class), I don’t care about whether the SRS supports automated grading and keeps the identities.
Based on these considerations, I will probably go with Socrative.

References
I have created a google spreadsheet for comparing the systems listed here, in addition to a few more. The spreadsheet can be found here. The spreadsheet links a few other sources that I have used during my research.

## Saturday, 14 April 2012

### Approximating inverse covariance matrices

Phew, the last time I have posted an entry to my blog was a loong time ago.. Not that there was nothing interesting to blog about, just I always delayed things. (Btw, google changed the template which eliminated the rendering of the latex formulae, not happy.. Luckily, I could change back the template..) Now, as the actual contents:

I have just read the PAMI paper "Accuracy of Pseudo-Inverse Covariance Learning-A Random Matrix Theory Analysis" by D Hoyle (IEEE T. PAMI, 2011 vol. 33 (7) pp. 1470--1481).

The paper is about pseudo-inverse covariance matrices and their analysis based on random matrix theory analysis and I can say I enjoyed this paper quite a lot.

In short, the author's point is this:
Let $d,n>0$ be integers. Let $\hat{C}$ be the sample covariance matrix of some iid data $X_1,\ldots,X_n\in \mathbb{R}^d$ based on $n$ datapoints and let $C$ be the population covariance matrix (i.e., $\hat{C}=\mathbb{E}[X_1 X_1^\top]$). Assume that $d,n\rightarrow \infty$ while $\alpha = n/d$ is kept fixed. Assume some form of "consistency" so that the eigenvalues of $C$ tend to some distribution over $[0,\infty)$. Denote by $L_{n,d} = d^{-1} \mathbb{E}[ \| C^{-1} - \hat{C}^{\dagger} \|_F^2]$ the reconstruction error of the inverse covariance matrix when one uses the pseudo-inverse.

Then, $f(\alpha) := \lim_{d,n\rightarrow\infty} L_{n,d}$ will be well-defined (often). The main things is that $f$ becomes unbounded as $\alpha\rightarrow 1$ (also $\alpha\rightarrow 0$, but this is not that interesting). In fact, for $C=I$, there is an exact formula for $\alpha \rightarrow 1$:

$f(\alpha) = \Theta(\alpha^3/(1-\alpha)^3)$.
Here is a figure that shows $f$ (on the figure $p$ denotes the dimension $d$).

Nice.

The author calls this the "peaking phenomenon": The worst case, from the point of view estimating $C^{-1}$, is when we have as many data points as is the number of dimensions (assuming full rank $C$, or this won't make sense because otherwise you could just add dummy dimensions to improve $L_{n,d}$). The explanation is that $L_{n,d}$ is just sensitive to how small the smallest positive eigenvalue of $\hat{C}$ is (this can be shown), and this smallest positive eigenvalue will become extremely small as $\alpha\rightarrow 1$ (and it does not matter whether $\alpha \rightarrow 1-$ or $\alpha\rightarrow 1+$).

Now notice that having $\alpha=0.5$ is much better than having $\alpha=1$ (for large $n,d$). Thus, there are obvious ways of improving the sample covariance estimate!

In fact, the author then suggests that in practice, assuming $n \ge d$, one should use bagging, while for $n\le d$ one should use random projections. Then he shows experimentally that this improves things though unfortunately this is shown for Fisher discriminants that use such inverses and the demonstration for $L_{n,d}$ is lacking. He also notes that the "peaking phenomenon" can also be dealt with by other means, such as regularization where we are referred to Bickel and Levina (AoS, 2008).

Anyhow, one clear conclusion of this paper is that the pseudo-inverse must be abandoned when it comes to approximate the inverse covariance matrix. Why is this interesting? The issue of how well $C^{-1}$ can be estimated comes up naturally in regression, or even in reinforcement learning when using the projected Bellman error. And asymptotics is just a cool tool to study how things behave..

## Wednesday, 4 May 2011

### Brains, Minds and Machines

Former UofA student, Alborz, shared a link to a video recording of a recent MIT150 symposium on Brains, Minds and Machines on facebook. I watched the video yesterday (guess what, I need to mark 40 something finals, hehe:)).
I wrote a comment back to Alborz on facebook and then I thought, why not make this a blog post? So, here it goes, edited, expanded. Warning: Spoilers ahead and the summary will be biased. Anyhow..

The title of the panel was: "The Golden Age — A Look at the Original Roots of Artificial Intelligence, Cognitive Science, and Neuroscience " and the panelist were Emilio Bizzi, Sydney Brenner, Noam Chomsky, Marvin Minsky, Barbara H. Partee and Patrick H. Winston. The panel was moderated by Steven Pinker who started with a 20-30 minute introduction. Once done with this each of the panelist delivered a little speech and at the end there were like two questions asked by Pinker.

My heroes in the panel were Minsky and Winston. They rocked! Minsky almost fell asleep during his talk, but he was well aware of this and I loved him. He told a story about Asimov not wanting to come to his lab to see the real robots (he did not want to get disappointed) and about von Neumann who said that he does not know if Minsky's thesis could qualify as a thesis on mathematics (they were both at Princeton in the math department), but soon it will be. I really enjoyed this part. Winston acted a bit like a comedian. I did not mind this either. One thing that Minsky and Winston both said is that the mistakes happened when AI became successful and everyone from that on seemed to forget the science part of AI. But they did not say much about how we can get back on the track (except that we should try). Winston blamed the short-sightedness of funding agencies and who would disagree.

Chomsky made some interesting claims. He claimed that language is designed for thoughts and not for communication. This was a pretty interesting claim. He claimed other things supporting his idea about innate, universal grammars, but I did not know how much credit to give to those as I have many linguist friends who strongly disagree. Things became interesting when answering a question at the end, he dumped the whole of machine learning. He talked about "a novel scientific criterion" that was never heard of before referring to being able to predict "on unanalyzed data". He said that "of course" with enough data you will do better, but it seemed that he thinks that the evaluation criterion is already ridiculous. He also said that a little statistics does not hurt, but he still seems that the big deal is the engineering part. (He did not say with these words, but this is what I got from what he said).

Sydney Brenner (pioneer in genetics and molecular biology) was puzzled about why he was invited, though he had some good stories. I liked when he said that in 50 years people will not understand why everyone talked about consciousness 50 years ago.

Emilio Bizzi (a big shot in neuroscience, in particular in motor control) talked about modularity, "dimension reduction" and generalization, and he looked like a fine Italian gentleman, but I have to confess that I don't remember anything else, though this could have been because it was late.

Barbara H. Partee read a script. In the first 5 minutes or so she was mostly praising Chomsky. Then she started to talk about her own work, which was foundational in semantics. She talked about how semantics is the real thing. By semantics she means formal semantics (like in logics). While in general I am fond of this work, I am not sure if anything like this is going on in our brain and if sentences really do have a meaning in a formal sense. It seems to me that the fact that sentences can have a formal meaning is more likely an illusion, a post-hoc thought than the real thing. Ad it is unclear if bringing in formal logics is going to bring us anywhere. Unfortunately, there was no discussion of this at all. At one point she made the remark that "search engines do not use semantics", but then left us in vain about how we could do any better. Oh, well..

In summary, an impressive set of people, some nice stories, but little cutting edge science. Loads of romanticism about the 50s and 60s and no advice for the young generation. The title tells it all. It was still nice to see these people.

## Wednesday, 13 April 2011

### Useful latex/svn tools (merge, clean, svn, diff)

This blog is about some tools that I have developed (and yet another one that I have downloaded) which help me to streamline my latex work cycle. I make the tools available, hoping that other people will find them useful. However, they are admittedly limited (more about this) and as usual for free stuff they come with zero guarantee. Use them at your own risk.

The first little tool is for creating a cleaned up file before submitting it to a publisher who asks for source files. I call it ltxclean.pl, it is developed in Perl. It can be downloaded from here.
The functionality is