Djvu vs. Pdf

Long blog again, so here is the executive summary: Djvu files are typically smaller than Pdf files. Why? Can we further compress pdf files? Yes, we can, but the current best solution has limitations. And you can forget all "advanced" commercial solutions. They are not as good as a free solution.

Introduction

DJVU is a proprietary file format by LizardTech. Incidentally, it was invented by some machine learning researchers, Yann LeCun, Léon Bottou, Patrick Haffner and the image compression researcher Paul G. Howard at AT&T back in 1996. The DJVULibre library provides a free implementation, but is GPLd and hence is not suitable for certain commercial softwares, like Papers, which I am using to organize my electronic paper collection. Hence, Papers, might not support djvu in the near future (the authors of Papers do not want to make it free, and, well, this is their software, their call).
Djvu files can converted to Pdf files using ddjvu, a command line tool which is part of DJVULibre (djvu2pdf is a script that calls this tool). Djvu can also be converted into PS files using djvups (then use ps2pdf). However, all these leave us with pretty big files compared to the originals and, on the top of it, if there was an OCR layer in the Djvu file, it gets lost, but this is another story. How much bigger? Here is an illustration:

Original djvu file: 9.9MB
djvu2pdf file: 427.6MB(!)
djvu2ps file: 1.0GB
djvu2ps, ps2pdf file: 162.6MB

Note that I have turned on compression in the conversion process (-quality=50). (The quality degradation was not really noticeable at this level.) So, at best, I got more than 16 times the original file size. Going mad about it, I started to search the internet for better solutions. I have spent almost a day on this (don't do this, especially if you are a student!)..

JBig2 and the tale of commercial solutions

First, I figured, the difference is that these use general image compression techniques (like jpeg), while djvu is specialized to text and black&white images. Thus, for example, it can recognize if the same character appears multiple times on the page, store a template and a reference to the template. This is clever. I then figured that PDF files support the so-called jbig2 encoding standard, which is built around this idea. Hence, the quest for software that would support encoding a document using a jbig2 encoder and put the result into a pdf format. The easiest would be, if such a software just existed out there. A few commercial packages indeed mention jbig2. I felt lucky (especially, seeing that there are a few cheap ones). So, I started to download trial versions. Here are the results:

PDFJB2: 34.1MB
CVision PdfCompressor: 48MB
CVision PdfCompressor with OCR: 49MB
A-PDF: 106.8MB
A-PDF + PDFCompress: 106.8MB
djvu2pdf + PDFCompress: conversion failed

Hmm, interesting. 34MB is much better than 160MB, but it is still a long way from 9.9MB. (After a superficial look at the resulting files I concluded that only the A-PDF compressed file lost quality. What happened with this file is that on some page in some line containing a mathematical formula, the top of the letters got chopped.)

Free, open source solutions

Becoming desperate, I continued hunting for better solutions. Searching around, I have found iText, which is an open source, free Java library supporting all kinds of manipulations of Pdf files. I have figured that it "uses" Jbig2, but it was not clear if it uses it for compression or just knows how to handle the encoding. So, here I go, I wrote a java program opening a pdf file and then writing it out in "compressed" mode. Hmm, this few lines of coding allowed me to create a file of size 26MB, smaller than what I could ever get previously. Exciting! Unfortunately, opening the file revealed the `secret': Quality was gone. The file looked to be seriously downsampled (i.e., the resolution was decreased). Not good.

Then I have found pdfsizeopt on google code, which aims exactly at compressing the size of pdf files! The Holy Grail? Well, installing pdfsizeopt on my mac was far from easy (I use a Mac, which also runs Windows; quite handy as some of the above software runs only under Windows..). However, finally, I was able to run pdfsizeopt. Unfortunately, it seems to crash, without even looking at my pdf file (I hope the bug will be corrected soon and then I can report results using it). Along the way, I had to install jbig2enc. For this, I just had to install leptonica (version 1.62, not the latest one), which is really the part that is doing the image processing part of the process. JBig2Enc expects a tif file and produces "pdf" ready output (every page is put in a separate file), which can be concatenated into a single pdf file by a python script provided. Having jbig2enc on my system, I gave it a shot. I first used ddjvu to transform the input to a tif file (using the command line option, "-quality=75", resulting in a file of size 1GB). Then I used the jbig2 encoded with the command line arguments "-p -s". The result is this:

jbig2enc: 3.8MB

Wow!! Opening the file revealed a dirty little secret: Color images are gone, as well as the quality of some halftoned gray-scale images got degraded. However, line drawings were kept nicely and, in general, the quality was good (comparable to the original djvu file). Conversion to tif took 5 minutes, conversion from tif to jbig2 took ca. 4 minutes, altogether making the whole process take close to 10 minutes. (Other solutions were not faster at all either. And the tests were run on a resourceful MacBook Pro.)

Conclusions

jbig2enc seems to work, but you will lose colors. If you are happy with this, jbig2enc is the solution, though the process should be streamlined a bit (a small script good do this). Oh yes, I did not mention that these processes are not fast. I did not attempt to measure the speed, but conversion takes a lot of time. Jbig2Enc is maybe on the faster end of the spectrum.

Future work

pdfsizeopt is a good idea. It should be made work.
It would be nice to create a jbig2enc wrapper
ddjvu is open source: Maybe it can be rewritten to support jbig2 directly. The added benefit could be that one could also keep the OCR layer in the original djvu file if one existed
Along the way, I have found a cool google code project, Tesseract, which is an open source OCR engine. How cool would it be if we had an OCR engine that helps the compression algorithm and eventually also puts an OCR layer on the top of documents which lack text information (think of scanned documents, or documents converted from an old postscript file). Currently, I am using Nuance's Pdf Converter Professional (yes, I paid for it..), which I am generally very satisfied with apart from its speed. However, this could be the subject of another post.

PS: I have tested the capabilities of Nuance's Pdf Converter Professional and Abbyy's in terms of their compression capabilities:
Nuance: 132MB
Abbyy: 129MB
Yes, I tried their advance "MRC" compression, in Nuance I have explicitly selected jbig2. No luck.

Comments

absarforex1 February 2011 at 04:51
good information on this blog.nice job
Dov Grobgeld29 September 2012 at 23:30
Obviously the best solution is to have a convertor that has intrinsic knowledge both of the DJVU format and the PDF/JBIG2 formats. If I understand it correctly the solutions that you mention first rasterizes DJVU and then converts the resulting raster through JBIG2 encoding. This discards critical info from the DJVU file like layer and color info. What you would really like to do is to translate each djvu layer according to its optimal algorithm and then merge the result on the pdf page.

Thanks also for the info about the fact that JBIG2 uses the same character template optimization as in DJVU. I was previously under the (now I know erroneous) impression that DJVU of scanned documetns would always be smaller than PDF files, because of this optimization.
Csaba Szepesvári30 September 2012 at 00:02
Dov,
You are absolutely right about the rasterizing part. I have tried to go directly from djvu to pdf (by writing some code), but I had limited time so I could not finish. Needless to say, I'd be very interested in a solution that would do this.
- Csaba

Search This Blog

Musings about machine learning and other things

Constrained MDPs and the reward hypothesis

Djvu vs. Pdf

Comments

Post a Comment

Popular posts from this blog

Constrained MDPs and the reward hypothesis

Useful latex/svn tools (merge, clean, svn, diff)

Keynote vs. Powerpoint vs. Beamer