Risks of DJVU/lossy compression - Re: If you OCR, always archive the bitmaps too

Toby Thain toby at telegraphics.com.au
Sun Sep 27 15:14:43 CDT 2015

On 2015-09-27 2:33 PM, Fred Cisin wrote:
> On Sun, 27 Sep 2015, Pontus Pihlgren wrote:
>> It seems to me that a better tool could solve the issue. One that
>> could display the OCR:ed content only and the scanned content
>> only when desired, for instance when you suspect an error.
>> Is there such a reader? Is the content organised to make it
>> possible.
> I haven't seen one.
> I did start trying to write an heuristic probabilistic OCR one 25 years
> ago.  The idea being to overlay the OCR'd (displayed with matching
> fonts) over the scanned content.  Besides visual confirmation and
> indication of probability of accuracy with each character, it lends
> itself well to hiring neighborhood kids to type in just the "wrong"
> characters to clean up the OCR'd file, and heuristically tune the font
> database, including adding new fonts - EVERY character is "wrong" until
> it repeats a few times in the document.  ("clean up" a NYT article, and
> the OCR now has their font).

DjVu compression is somewhat analogous to this process: font-like, it
builds a set of master glyphs, then uses them as a compression
dictionary (if everyone will forgive my simplistic explanation). Being
lossy, like OCR, it inherently carries the risk of picking the wrong
(but visually almost indistinguishable) glyph -- the WORST kind of
typo, because it is so insidious.
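
To make the failure mode concrete, here is a toy sketch of a glyph
dictionary with lossy matching. It is NOT the actual DjVu algorithm;
the 3x5 bitmaps, the pixel-difference threshold, and every name in it
(hamming, compress, decompress, max_diff) are made up for illustration.

```python
# Toy glyph-dictionary compressor. Hypothetical illustration only,
# not the real DjVu encoder.

# Tiny 3x5 bitmaps; '#' is ink, '.' is paper.
SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")
ONE = (".#.", "##.", ".#.", ".#.", "###")


def hamming(a, b):
    """Count the pixels that differ between two same-sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(glyphs, max_diff):
    """Replace each glyph by the index of the first master glyph that
    differs in at most max_diff pixels, adding a new master otherwise."""
    masters, indices = [], []
    for g in glyphs:
        for i, m in enumerate(masters):
            if hamming(g, m) <= max_diff:
                indices.append(i)  # "close enough": reuse this master
                break
        else:
            masters.append(g)  # nothing close enough: new master
            indices.append(len(masters) - 1)
    return masters, indices


def decompress(masters, indices):
    """Rebuild the page from the dictionary and the index stream."""
    return [masters[i] for i in indices]


page = [SIX, EIGHT, ONE, EIGHT]  # the digits "6818"

# max_diff=0 only merges identical glyphs, so the page survives intact.
exact = decompress(*compress(page, max_diff=0))

# max_diff=1 treats the 8 as "close enough" to the 6 seen earlier,
# so the page comes back as "6616".
lossy = decompress(*compress(page, max_diff=1))
```

The 6 and the 8 differ by a single pixel here, so any tolerance at all
merges them, and the decompressed page looks perfectly clean while
showing the wrong digits.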

There was a somewhat scary case study on the web a few years ago (not
sure if it's still out there, haven't been able to find it) where the
JBIG2 compression in a Xerox copier, which uses the same kind of glyph
matching as DjVu, was quietly changing digits on scanned schematics to
different digits. Close enough for the compressor -- but wrong. The
risks are obvious(*).


* - Hat tip to PGN and the comp.risks digest.
