Tuesday, November 17, 2009

Djvu vs. Pdf

Long blog again, so here is the executive summary: Djvu files are typically smaller than Pdf files. Why? Can we further compress pdf files? Yes, we can, but the current best solution has limitations. And you can forget all "advanced" commercial solutions. They are not as good as a free solution.

Introduction

DJVU is a proprietary file format by LizardTech. Incidentally, it was invented by the machine learning researchers Yann LeCun, Léon Bottou and Patrick Haffner, together with the image compression researcher Paul G. Howard, at AT&T back in 1996. The DJVULibre library provides a free implementation, but it is GPLed and hence not suitable for certain commercial software, like Papers, which I am using to organize my electronic paper collection. Hence, Papers might not support djvu in the near future (the authors of Papers do not want to make it free, and, well, this is their software, their call).
Djvu files can be converted to Pdf files using ddjvu, a command line tool which is part of DJVULibre (djvu2pdf is a script that calls this tool). Djvu can also be converted into PS files using djvups (then use ps2pdf). However, all of these leave us with pretty big files compared to the originals and, on top of that, if there was an OCR layer in the Djvu file, it gets lost, but this is another story. How much bigger? Here is an illustration:

Original djvu file: 9.9MB
djvu2pdf file: 427.6MB(!)
djvu2ps file: 1.0GB
djvu2ps, ps2pdf file: 162.6MB

Note that I turned on compression in the conversion process (-quality=50); the quality degradation was not really noticeable at this level. So, at best, I got more than 16 times the original file size. Going mad about it, I started to search the internet for better solutions. I spent almost a day on this (don't do this, especially if you are a student!).
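To make the "more than 16 times" figure concrete, here is a small sketch that recomputes the blow-up factors from the sizes listed above. The only assumption is that the 1.0GB figure is taken as 1024MB.

```python
# Sizes in MB as reported above (1.0GB taken as 1024MB, an assumption).
ORIGINAL_MB = 9.9
converted_mb = {
    "djvu2pdf": 427.6,
    "djvu2ps": 1024.0,
    "djvu2ps + ps2pdf": 162.6,
}


def blowup_factors(original, sizes):
    """Return how many times larger each converted file is than the original."""
    return {name: size / original for name, size in sizes.items()}


factors = blowup_factors(ORIGINAL_MB, converted_mb)
# Print from the smallest blow-up to the largest.
for name, factor in sorted(factors.items(), key=lambda kv: kv[1]):
    print(f"{name}: {factor:.1f}x the original size")
```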

JBig2 and the tale of commercial solutions

First, I figured, the difference is that these conversions use general image compression techniques (like jpeg), while djvu is specialized to text and black&white images. Thus, for example, it can recognize if the same character appears multiple times on the page, and store a template once plus a reference to the template for each occurrence. This is clever. I then figured that PDF files support the so-called jbig2 encoding standard, which is built around this idea. Hence the quest for software that would encode a document using a jbig2 encoder and put the result into a pdf file. The easiest would be if such software already existed out there. A few commercial packages indeed mention jbig2. I felt lucky (especially seeing that there are a few cheap ones). So, I started to download trial versions. Here are the results:
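The template-plus-reference idea can be illustrated with a toy sketch. This is not how jbig2 or djvu are actually implemented: real encoders match shapes that are only approximately equal and then entropy-code the result, whereas this sketch assumes exact matches and uses made-up 3x3 "bitmaps" purely for illustration.

```python
def encode_page(glyphs):
    """Encode (bitmap, x, y) glyphs as a dictionary of unique bitmaps
    plus a list of (symbol_id, x, y) placements."""
    dictionary = []   # unique bitmaps, in order of first appearance
    index_of = {}     # bitmap -> position in `dictionary`
    placements = []
    for bitmap, x, y in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements


def decode_page(dictionary, placements):
    """Rebuild the original glyph list from the dictionary and placements."""
    return [(dictionary[i], x, y) for i, x, y in placements]


# Hypothetical tiny bitmaps, stored as tuples of row strings so they are hashable.
A = ("010", "111", "101")
B = ("110", "111", "110")
page = [(A, 0, 0), (B, 4, 0), (A, 8, 0), (A, 12, 0)]

dictionary, placements = encode_page(page)
print(len(page), "glyphs stored as", len(dictionary), "templates")
```

The saving comes from storing each distinct shape once: a page of text has thousands of glyphs but only a small number of distinct shapes.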

PDFJB2: 34.1MB
CVision PdfCompressor: 48MB
CVision PdfCompressor with OCR: 49MB
A-PDF: 106.8MB
A-PDF + PDFCompress: 106.8MB
djvu2pdf + PDFCompress: conversion failed

Hmm, interesting. 34MB is much better than 160MB, but it is still a long way from 9.9MB. (After a superficial look at the resulting files I concluded that only the A-PDF compressed file lost quality. What happened with this file is that on some page in some line containing a mathematical formula, the top of the letters got chopped.)

Free, open source solutions

Becoming desperate, I continued hunting for better solutions. Searching around, I found iText, which is an open source, free Java library supporting all kinds of manipulations of Pdf files. I figured that it "uses" Jbig2, but it was not clear whether it uses it for compression or just knows how to handle the encoding. So, here I go: I wrote a Java program that opens a pdf file and then writes it out in "compressed" mode. Hmm, these few lines of code allowed me to create a file of size 26MB, smaller than anything I could get previously. Exciting! Unfortunately, opening the file revealed the `secret': Quality was gone. The file looked to be seriously downsampled (i.e., the resolution was decreased). Not good.

Then I found pdfsizeopt on google code, which aims exactly at reducing the size of pdf files! The Holy Grail? Well, installing pdfsizeopt was far from easy (I use a Mac, which also runs Windows; quite handy, as some of the above software runs only under Windows). However, finally, I was able to run pdfsizeopt. Unfortunately, it seems to crash without even looking at my pdf file (I hope the bug will be corrected soon and then I can report results using it). Along the way, I had to install jbig2enc. For this, I just had to install leptonica (version 1.62, not the latest one), which is the part that does the actual image processing. JBig2Enc expects a tif file and produces "pdf" ready output (every page is put in a separate file), which can be concatenated into a single pdf file by a python script that is provided. Having jbig2enc on my system, I gave it a shot. I first used ddjvu to transform the input to a tif file (using the command line option "-quality=75", resulting in a file of size 1GB). Then I ran the jbig2 encoder with the command line arguments "-p -s". The result is this:

jbig2enc: 3.8MB

Wow!! Opening the file revealed a dirty little secret: Color images are gone, and the quality of some halftoned gray-scale images is degraded. However, line drawings were kept nicely and, in general, the quality was good (comparable to the original djvu file). Conversion to tif took 5 minutes and conversion from tif to jbig2 took ca. 4 minutes, making the whole process take close to 10 minutes. (The other solutions were not faster either. And the tests were run on a resourceful MacBook Pro.)

Conclusions

jbig2enc seems to work, but you will lose colors. If you are happy with this, jbig2enc is the solution, though the process should be streamlined a bit (a small script could do this). Oh yes, these processes are not fast: I did not measure the speed carefully, but conversion takes a lot of time. Jbig2Enc is maybe on the faster end of the spectrum.
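Here is a minimal sketch of what such a small script might look like. It only strings together the steps described above (ddjvu to tif, then jbig2 with "-p -s", then the python script that assembles the pdf). The file names are hypothetical, and three things are assumptions on my part: that the assembling script is called pdf.py, that jbig2 writes its per-page files under the base name "output", and that the script prints the pdf to standard output.

```python
import subprocess


def build_steps(djvu_path, pdf_path, tiff_path="pages.tif", quality=75,
                pdf_script="pdf.py", basename="output"):
    """Return (argv, stdout_file_or_None) pairs; nothing is executed here.

    pdf_script and basename are assumptions about the jbig2enc setup.
    """
    return [
        # Step 1: djvu -> (multi-page) tif, as in the experiment above.
        (["ddjvu", "-format=tiff", f"-quality={quality}", djvu_path, tiff_path], None),
        # Step 2: tif -> jbig2 data in pdf-ready form, using symbol coding.
        (["jbig2", "-p", "-s", tiff_path], None),
        # Step 3: assemble the per-page files into one pdf (stdout is redirected).
        (["python", pdf_script, basename], pdf_path),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)


steps = build_steps("book.djvu", "book.pdf")
for argv, stdout_path in steps:
    print(" ".join(argv) + (f" > {stdout_path}" if stdout_path else ""))
```

Calling run_steps(steps) would actually execute the commands; the loop above only prints them, so they can be checked first.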

Future work

  1. pdfsizeopt is a good idea. It should be made to work.
  2. It would be nice to create a jbig2enc wrapper.
  3. ddjvu is open source: Maybe it can be rewritten to support jbig2 directly. The added benefit could be that one could also keep the OCR layer in the original djvu file if one existed
  4. Along the way, I have found a cool google code project, Tesseract, which is an open source OCR engine. How cool would it be if we had an OCR engine that helps the compression algorithm and eventually also puts an OCR layer on the top of documents which lack text information (think of scanned documents, or documents converted from an old postscript file). Currently, I am using Nuance's Pdf Converter Professional (yes, I paid for it..), which I am generally very satisfied with apart from its speed. However, this could be the subject of another post.
PS: I have tested the capabilities of Nuance's Pdf Converter Professional and Abbyy's in terms of their compression capabilities:
Nuance: 132MB
Abbyy: 129MB
Yes, I tried their advanced "MRC" compression, and in Nuance I explicitly selected jbig2. No luck.

Saturday, November 14, 2009

Keynote vs. Powerpoint vs. Beamer

A few days ago I decided to give Keynote, Apple's presentation software (part of iWork '09), a try. Beforehand I used MS Powerpoint 2003, Impress from NeoOffice 3.0 (OpenOffice's native Mac version) and LaTeX with beamer. Here is a comparison of the ups and downs of these programs, mainly to remind myself when I reconsider my choice in half a year, and also to help people decide what to use for their presentations. Comments, suggestions and criticism are absolutely welcome, as usual. Btw, while preparing this note I learned that go-oo.org has a native Mac Aqua version of OpenOffice. Maybe I will try it some day and update the post. It would also be good to include a recent version of Powerpoint in the comparison.

Stability

  • Keynote: Excellent
    This is based on only a few days of usage, so take this statement with a grain of salt.
  • MS Powerpoint 2003: Excellent
  • Impress: Poor
    Save your work very often
  • Beamer: Excellent

Creating visually appealing slides, graphics on slides

  • Keynote: Excellent
    Positioning rulers help a lot. The process is really smooth. Keynote forces you to use less text. Built-in templates are professional looking. Adding presentation graphics (tables, basic charts) is very easy. Fancier graphics (technical drawings) are better done with OmniGraffle. You can also easily animate the graphics and tables. Overall, very impressive.
  • MS Powerpoint 2003: Good
    Aligning to other objects is more cumbersome than in Keynote. The quality of fonts, color palettes and templates is not as good as in Keynote.
  • Impress: Good
    Same as MS Powerpoint, maybe somewhat below (but the difference is not big).
  • Beamer: Poor
    The fonts and styles (templates) are great. However, creating slides with lively graphics is a nightmare (due to the lack of a GUI): You will end up with a few standard layouts, and you will in general not use graphics, let alone animated graphics (or you will spend days on creating your slides). Also, departing from the styles is difficult, and I am just bored of some of these styles that everyone seems to use.

LaTeX (math) support

  • Keynote: Poor
    Supported through LatexIt (free), but overall a cumbersome process. Details below.
  • MS Powerpoint 2003: Medium
    Supported through TexPoint (commercial, USD30). The process is roughly the same as with LatexIt and Keynote, with slightly better integration.
  • Impress: Medium
    Supported through OOoLatex (free), same as MSPowerPoint + TexPoint, the integration is slightly better.
  • Beamer: Excellent
    Beamer is built for this!

Animations

  • Keynote: Near perfect
    The magic slide transition helps a lot with continuity across slides. What does this do? If you have the same object on two consecutive slides, Keynote will create an animation, keeping the object on screen and flying it to its new position. It works with multiple objects, too. I have found this very helpful for presenting a multi-slide argument. In general, Keynote animations are slick and polished, and the flexibility is great. I miss some features of Beamer, such as animated highlighting and in-place replacement of some text (these can all be simulated with the existing tools, but only with difficulty).
  • MS Powerpoint 2003: Basic
    I miss Keynote's magic transitions. In general, Keynote is richer in animations. Again, some features of Beamer would be nice to have.
  • Impress: Weak
    Impress is inferior to MS Powerpoint in terms of its animation capabilities.
  • Beamer: Good
    If only someone added support for magic transitions between slides. Some other cool effects would also come in handy.

Dual screen presentation support

The idea is to show notes, time left in addition to the current and next slide on your screen, while showing the current slide on the big screen.
  • Keynote: Excellent
    Keynote supports dual screen presentations natively. If the displays come up the wrong way around, swap them from the options menu on the notes screen (which, in this case, will obviously be the one showing on the big screen).
  • MS Powerpoint 2003: Not available
    I have no experience with this feature of MS Powerpoint. Maybe you can use an add-on or something, but the basic software does not support it. I am pretty sure newer versions of Powerpoint must support this.
  • Impress: Excellent(?)
    The "Sun Presenter Console" extension supposedly supports dual screen presentations just like Keynote, but I have never had the chance to test it. Hence, the question mark. Some posts on the internet indicate that the extension might leak memory.
  • Beamer: Basic support
    Use Splitshow for this purpose. However, as far as I know, you cannot show the current time or the time remaining on the notes screen.

Interoperability

I want to put my presentations on the web so that people can look at them no matter what (major) operating system they use, without losing animations or any other features. Another desired feature is the ability to create a compact, printable version of the slides: That is, if you have animations spanning multiple slides, somehow they should get handled intelligently. There is a tradeoff here: The more animation rich your slides are, the more bloated/complicated your printout will be.
  • Keynote: OK
    Proprietary file format. This is my biggest complaint. A keynote presentation is a keynote presentation. Apple likes to lock you in. Export to PDF and PPT works relatively well, but will lose some features of the presentation, like the cool animations. Exporting to PDF without animations to create printable versions seems to work well.
  • MS Powerpoint 2003: Good
    Free powerpoint viewers exist that can play any PPT file. Export to PDF will again lose some features.
  • Impress: Good
    Same as powerpoint.
  • Beamer: Excellent
    Produces PDF output: The presentations can be viewed on any computer! Also, the source is LaTeX, and beamer is available on all systems. Add the handout option to the document class (\documentclass[handout]{beamer}) and beamer will create an animation-free version of your slides that works in almost all cases.

More about using formulae in Keynote (and why it sucks)

I used LatexIt, which produces a PDF that can be embedded into the presentation. Style is not matched automatically. The PDF contains the latex source for the formula; to edit it, copy and paste it back into LatexIt. When done with the edit, you need to drag and drop the formula back into Keynote. This sucks, since you need to delete the original that you have edited, reposition the new formula and reapply animations if you had any. Horrible.

Another issue is that the source saved with the formula by default does not have the preamble, thus using a command set specific to a presentation is difficult to achieve (you have to set this up manually). Another major headache is that you will not be able to use inline formulae (a piece of text is either in LaTeX or in Keynote; the fonts in general do not match and mix well, and alignment is a nightmare), nor will you be able to animate formulae easily (e.g., displaying a multiline formula line by line requires you to split the formula into multiple PDFs and use Keynote animations to show them one by one; this is problematic because aligning formulae by hand is time consuming).

Saturday, October 31, 2009

Optogenetics

This is a little deviation from the usual topic.
Scientists are able to genetically modify neurons so that they respond to light. They are in fact able to do this in a targeted manner. A patient would then have some LEDs inside his skull, emitting some light. In response, the selected neurons start to fire. They demonstrated the technology by making mice run counterclockwise when they turn on the light. This is input to the brain. Earlier, it was demonstrated that neurons can be genetically modified to emit light when they are firing. Are we heading towards rewiring the brain and turning it into a light computer?
The motivation for the research is to cure diseases like Parkinson's disease, where the patient has all the circuitry and muscles but is just unable to make the movements. In fact, the researchers are already testing this technology on primates. Source: Wired Nov. 2009, "Powered by Photons" pp. 109--113. The wikipedia entry for optogenetics is here.

Sunday, October 25, 2009

Pitfalls of optimality in statistics

I was reading a little bit about robust statistics, as we are in the process of putting together a paper about entropy estimation where robustness comes up as an issue. While searching on the net for the best material to understand this topic (I am thinking about posting another article about what I have found), I have bumped into a nice paper (downloadable from here) by Peter J. Huber, one of the main figures in robust statistics, where he talks about a bunch of pitfalls around pursuing optimality in statistics. Huber writes eloquently -- he gives plenty of examples, motivates definitions. He is just great. I can only recommend this paper or his book. Now, what are the pitfalls he writes about? He distinguishes 4 types with the following syndromes:

  • The fuzzy concepts syndrome: sloppy translation of concepts into mathematics. Think about uniform vs. non-uniform convergence (sloppy asymptotics). In statistics a concrete example is the concept of efficiency which is defined in a non-uniform manner with respect to the estimable parameters, which allows for (weird) "super-efficient" estimators that pay special attention to some distinguished element of the parameter-space.
  • The straitjacket syndrome: the use of overly restrictive side conditions, such as requiring that an estimator is unbiased or equivariant (equivariant estimates in high dimensions are inadmissible in very simple situations). In Bayesian statistics another example might be the convenient but potentially inappropriate conjugate priors.
  • The scapegoat syndrome: confusing the model with reality (offering the model for the gods of statistics instead of the real thing, hoping that they will accept it). The classic example is the Eddington-Fisher argument. Eddington advocated the mean-absolute-deviation (MAD) instead of the root-mean-square (RMS) deviation as a measure of scale. Fisher argued that MAD estimates are highly inefficient (converge slowly) relative to the RMS deviation estimates if the sample comes from a normal distribution. Tukey has shown that the situation gets reversed even under small deviations from a normal model. The argument that under narrow conditions one estimator is better than some other should not even be made. Another example is perhaps classical optimal design and the fundamentalist approach in Bayesian statistics. Of course, there is nothing wrong with assumptions, but the results should be robust.
  • The souped-up car syndrome: by optimizing for speed we can end up with an elaborate gas-guzzler. Optimizing for one quantity (efficiency) may degrade another one (robustness). Practical solutions must find a balance between such contradicting requirements.
These syndromes are not too hard to identify in machine learning research. Wear protective gear as needed!
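The Eddington-Fisher-Tukey story is easy to try out in a small simulation. The sketch below is my own illustration, not something taken from Huber's paper: the contamination model (a fraction eps of the points drawn from a normal with three times the standard deviation), the sample size and the number of repetitions are all assumptions. It compares the two scale estimates by their standardized variance (variance divided by squared mean); a ratio below 1 favors the RMS deviation, a ratio above 1 favors the mean absolute deviation.

```python
import random


def scale_estimates(sample):
    """Return (mean absolute deviation, RMS deviation) about the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    mean_abs_dev = sum(abs(x - mean) for x in sample) / n
    rms_dev = (sum((x - mean) ** 2 for x in sample) / n) ** 0.5
    return mean_abs_dev, rms_dev


def standardized_variance(values):
    """Variance of the estimates divided by their squared mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var / mean ** 2


def relative_efficiency(eps, n=100, reps=2000, seed=0):
    """Efficiency of the mean absolute deviation relative to the RMS deviation.

    Each point is N(0, 1) with probability 1 - eps and N(0, 3^2) otherwise.
    """
    rng = random.Random(seed)
    mads, rmss = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 3.0 if rng.random() < eps else 1.0)
                  for _ in range(n)]
        d, s = scale_estimates(sample)
        mads.append(d)
        rmss.append(s)
    return standardized_variance(rmss) / standardized_variance(mads)


clean = relative_efficiency(eps=0.0)    # exactly normal data
dirty = relative_efficiency(eps=0.05)   # 5% of the points are "bad"
print(f"normal model: {clean:.2f}, 5% contamination: {dirty:.2f}")
```

On exactly normal data the RMS deviation should come out ahead, as Fisher argued, while a few percent of contamination should be enough to reverse the ranking.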

Monday, September 7, 2009

How to make Thunderbird delete temporary files

If you are using Thunderbird (TB) on Mac OSX, you might be annoyed that when TB opens an attachment (like a pdf file) it creates the file on the Desktop and then just leaves it there! I have finally found a solution, which seems to work at least for me, assuming that you also have Firefox. The solution is here, but I duplicate it to make sure the idea spreads:

Simply open Firefox, in the address bar type in about:config, then add a boolean variable browser.helperApps.deleteTempFileOnExit and set its value to true.

Now, this works to the extent that when you exit Firefox(!!) (after quitting Thunderbird), it will remove the cluttering files.

Enjoy!

Saturday, April 5, 2008

Ninja Carburglars

Humorous Pictures
see more crazy cat pics

Saturday, March 29, 2008

Statistical Modeling: The Two Cultures

Sometimes people ask what is the difference between what statisticians and machine learning researchers do. The best answer that I have found so far can be found in
"Statistical Modeling: The Two Cultures" by Leo Breiman (Statistical Science, 16:199-231, 2001).
According to this, statisticians like to start by making modeling assumptions about how the data is generated (e.g., the response is noise added to a linear combination of the predictor variables), while in machine learning people use algorithmic models and treat the data mechanism as unknown. He estimates that (back in 2001) less than 2% of statisticians work in the realm where the data mechanism is considered unknown.
It seems that there are two problems with the data model approach.
One is that this approach does not address the ultimate question, which is making good predictions: if the data does not fit the model, this approach has nothing to offer (it does not make sense to apply a statistical test if the assumptions are not valid).
The other problem is that as data become more complex, data models become more cumbersome. Then why bother? With complex models we lose the advantage of easy interpretability, not to mention the computational complexity of fitting such models.
The increased interest in Bayesian modeling with Markov Chain Monte Carlo is viewed as the response of the statistical community to this problem. True enough, this approach might be able to scale to complex data, but does this address the first issue? Are there not computationally cheaper alternatives that can achieve the same prediction power?
He characterizes the machine learning approach as the pragmatic approach: You have to solve a prediction problem, hence take it seriously: Estimate the prediction error and choose the algorithm that gives a predictor with the better accuracy (but let's not forget about data snooping!).
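As a minimal sketch of this pragmatic recipe (my illustration, not code from Breiman's paper), one can estimate the prediction error of each candidate by k-fold cross-validation and keep the one with the smaller estimate. The synthetic data, the two candidate predictors and the choice of k=5 are all assumptions made for the example.

```python
import random


def fit_mean(xs, ys):
    """Baseline: always predict the mean response."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares straight line through the training points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cv_error(fit, xs, ys, k=5):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        predict = fit(train_x, train_y)
        total += sum((predict(xs[i]) - ys[i]) ** 2 for i in held_out)
    return total / n


# Hypothetical data: a noisy linear response.
rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

errors = {"mean": cv_error(fit_mean, xs, ys), "line": cv_error(fit_line, xs, ys)}
best = min(errors, key=errors.get)
print(errors, "-> choose", best)
```

The data snooping caveat applies here as well: the error estimate of the selected predictor is optimistic, since it was used for the selection, so a final held-out set is needed for an honest number.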
But the paper offers more. Amongst other things it identifies three important recent lessons:

  1. The multiplicity of good models: If you have many variables, there can be many models of similar prediction accuracy. Use them all by combining their predictions instead of just picking one. This should increase accuracy, reduce instability (sensitivity to perturbations of the data). Boosting, bagging, aggregation using exponential weights are relevant recent popular buzzwords.
  2. The Occam dilemma: Occam's razor tells you to choose the simplest predictor. Aggregated predictors don't look particularly simple. But aggregation seems to be the right choice otherwise. I would think that Occam's razor tells you only that you should have a prior preference to simple functions. I think this is rather well understood by now.
  3. Bellman: dimensionality -- curse or blessing: Many features are not bad per se. If your algorithm is prepared to deal with the high-dimensional inputs (SVMs, regularization, random forests are mentioned) then extracting many features can boost accuracy considerably.
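The first lesson (use all good models by combining their predictions) can be sketched in a few lines. The three predictors and the data below are made up for illustration. The one thing that is not an assumption is the inequality the sketch checks: for squared error, the averaged predictor is never worse than the average error of the individual predictors (this follows from convexity).

```python
def average_prediction(predictors, x):
    """Combine the predictors by a plain (unweighted) average."""
    return sum(p(x) for p in predictors) / len(predictors)


def mse(predict, data):
    """Mean squared error of a predictor on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical models of similar quality, each wrong in a different way.
predictors = [
    lambda x: 2.0 * x + 0.3,
    lambda x: 2.0 * x - 0.2,
    lambda x: 1.8 * x + 0.1,
]
# Hypothetical noise-free data from y = 2x on a grid.
data = [(i / 10, 2.0 * (i / 10)) for i in range(11)]

individual = [mse(p, data) for p in predictors]
combined = mse(lambda x: average_prediction(predictors, x), data)
print("individual:", individual, "combined:", combined)
```

Boosting, bagging and exponential weights differ in how the models are built and weighted; the plain average is just the simplest instance of the idea.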
In summary, I like the characterization of the difference between (classical) statistical approaches and machine learning. However, I wonder if these differences are still as significant as they were (must have been) in 2001 when the article was written and if the differences will become smaller over time. Then it will be really difficult to answer the question on the difference between the statistical and the machine learning approaches.