We are in a classification setting and the learning problem is defined by a pair of jointly distributed random variables, (X,Y), where Y can take on the values +1 and -1. Question: How many iid copies of this pair does an algorithm need to (i) find a classifier that yields close to optimal performance with high probability (ii) find two score functions, one trained with the positive examples only, the other with the negative examples only such that the sign of the difference of the two score functions gives a classifier that is almost optimal with high probability?

The result in the paper is that there exists a class of distributions, parameterized by d (determining the dimension of samples) such that there is a discriminative algorithm (tuned to this class) that can learn the correct classifier with only $2log(2/\delta)$ samples, while the number of samples required for any generative classifier is at least d.

Since it is clear that the requirements of generative learning are stronger than those of discriminative learning, it follows that in the above framework discriminative learning is strictly "easier" than generative learning.

The distribution concentrates on O(d) samples and the main idea is that the joint knowledge of positive and negative samples suffices for the easy identification of the target distribution (hence, classifier), while knowing only either the positive or negative examples alone is insufficient. Two special inputs, both marked for easy of recognition, determine the full distribution jointly but one of the inputs is in the positive set, the other is in the negative set and the knowledge of only one of them is insufficient to learning the otherwise "difficult" to learn distribution.

Although the construction given in the paper is simple and works as intended, it is arguably "artificial and contrived", as it was also noted by the authors. In particular, does a similar result hold when we consider a continuous domain of a fixed dimension, and/or we restrict the class of algorithms to consistent ones? Further, the example shows more the limitation of algorithms that learn from positive and negative samples independently of each other than the limitation of generative algorithms (generative algorithms traditionally refer to learners that estimate the joint distribution of the inputs and outputs).

The question of whether generative or discriminative algorithms are more efficient are particularly interesting in light of an earlier paper by Andrew Y. Ng and Michael Jordan ("On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes", NIPS-14, 2002). In this paper the authors compare one particular discriminative algorithm with another particular algorithm that is "generative". The two algorithms are logistic regression and naive Bayes and together they form what is called a "Generative-Discriminative" pair. The meaning of this is that while naive Bayes maximizes the total joint log-likelihood, $\sum_{i=1}^n \log p_\theta(x_i,y_i)$ over the samples, logistic regression maximizes the total conditional likelihood, $\sum_{i=1}^n \log p_\theta(y_i|x_i)$ over the same parametric model. In the case of these two particular algorithms the parametric model is written in terms of $p(x|y)$ and asserts independence of the individual components of the feature-vector $x$. (For continuous spaces the individual feature distributions, $p(x_k|y)$, are modeled as Gaussians with unknown mean and variance.) The agnostic setting is considered (the input distribution is not restricted). Since both learners pick a classifier from the very same class of classifiers and the discriminative learner in the limit of an infinite number of samples converges to the best classifier from this class, it follows that the ultimate loss suffered by the discriminative learner is never higher than that suffered by the generative learner. Hence, it seems that if the naive Bayes assumption made by the generative method is not met, the discriminative method can have an edge -- at least ultimately (open issue: give an example that shows positive separation!). However, this is just half of the story: the generative model may converge faster! In particular, the authors state an upper bound on the convergence of loss for the generative model that scales with $\sqrt{1/n \log(d)}$ ($d$ is the number of components of $x$), while as follows from standard uniform convergence results, the same convergence rate for the discriminative method is $\sqrt{d/n \log(d/n)}$. They argue that this result follows since the hypothesis class has a VC-dimension of $d$. Note the difference in the way the two bounds scale with $d$, the dimension of the input space: In the case of the discriminative algorithm the scaling is (log-)linear with $d$, while in the case of the generative algorithm it is logarithmic in $d$. (Strictly speaking, the upper bound would call for a proof since logistic regression is not a risk-minimization algorithm for the 0/1 loss and the cited theory has been worked out for risk-minimization algorithms.) Since comparing upper bounds is not really appropriate, they point to existing lower bounds that show that the worst-case sample complexity is lower bounded by $O(d)$. (My favorite paper on this is by Antos and Lugosi: "Strong minimax lower bounds for learning": link.) Hence, the conclusion of Ng and Jordan is that, contrary to the widely held belief, generative learning can sometimes be more efficient than discriminative learning; at least when the number of features is large compared to the number of samples and the ultimate loss of the generative learner is not much higher than that of the discriminative learner. They also performed experiments on UCI datasets to validate this claim. The experiments show that the performance of naive Bayes is indeed often better when the number of examples is smaller. Unfortunately, the figures show the loss as a function of the number of samples only and hence validate the theory only half-way since for a full validation we should know how the dimension of the dataset influences the performance.

A similar conclusion to that of Ng and Jordan was shown to hold in an earlier KDD-97 paper titled "Discriminative vs. Informative Learning" by Y. Dan Rubinstein and Trevor Hastie (link) who did a case study with simulated and real examples.

The convergence rate comparison looks quite intriguing at the first sight. However, some more thinking reveals that the story is far from being finished. To see this consider the "rate" function $r(n,d) = L_{gen}$ if $n\le D-1$, $r(n,d) =\min(L_{gen},\sqrt{d/n})$ if $n\ge D$. Imagine that $r(n,d)$ is an upper bound on the loss of some learner, where $L_{gen}$ is the loss of the generative learner. The rate of convergence then is $\sqrt{d/n}$, not contradicting with the lower bounds, but clearly, this discriminative learner will be competitive with the generative learner.

## 5 comments:

Interesting post!

You could be interested by this paper.

PS: I think you forgot a "log" in the formula after "The meaning of this is that while naive Bayes maximizes the total joint log-likelihood"...

Best regards,

Pierre

Thanks for the link -- the paper is indeed interesting and relevant. I am not sure that the term discriminative training should be abolished though, as suggested at the end of the paper. In particular, I strongly disagree with the statement that "there is only one correct way to train [it]". However, the interpretation of discriminative training as a maximum likelihood method with a more flexible model is interesting. Rubinstein and Hastie also mention this.

And you were right: the log was missing. Now I corrected the post.

If you have some time, this article extends the other one.

Pierre

gesges

Some of the formulas you've mentioned throughout the article were a bit out of my league. I'm a newbie when it comes to logistic regression. Despite that, it was fascinating reading your post.

Post a Comment