Risks of DJVU/lossy compression - Re: If you OCR, always archive the bitmaps too
toby at telegraphics.com.au
Sun Sep 27 15:14:43 CDT 2015
On 2015-09-27 2:33 PM, Fred Cisin wrote:
> On Sun, 27 Sep 2015, Pontus Pihlgren wrote:
>> It seems to me that a better tool could solve the issue. One that
>> could display the OCR:ed content only and the scanned content
>> only when desired, for instance when you suspect an error.
>> Is there such a reader? Is the content organised to make it
> I haven't seen one.
> I did start trying to write an heuristic probabilistic OCR one 25 years
> ago. The idea being to overlay the OCR'd (displayed with matching
> fonts) over the scanned content. Besides visual confirmation and
> indication of probability of accuracy with each character, it lends
> itself well to hiring neighborhood kids to type in just the "wrong"
> characters to clean up the OCR'd file, and heuristically tune the font
> database, including adding new fonts - EVERY character is "wrong" until
> it repeats a few times in the document. ("clean up" a NYT article, and
> the OCR now has their font).
DJVU compression is somewhat analogous to this process, because,
font-like, it builds a set of master glyphs then uses them as a
compression dictionary (if everyone will forgive my simplistic
explanation). Being lossy, like OCR, it inherently adds the risk of
picking the wrong (but visually almost indistinguishable) glyph -- the
WORST kind of typo for being so insidious.
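To make the risk concrete, here is a toy sketch of dictionary-based glyph matching. It is not the actual DJVU encoder: the 5x7 bitmaps, the Hamming-distance test, and the names hamming, compress and decompress are all assumptions made up for illustration.

```python
# Toy glyph-dictionary compressor. NOT the real DJVU/JB2 algorithm; the
# bitmaps, the distance measure and the threshold are illustrative assumptions.

SIX = (
    ".###.",
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    ".###.",
)

EIGHT = (
    ".###.",
    "#...#",
    "#...#",
    ".###.",
    "#...#",
    "#...#",
    ".###.",
)


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(glyphs, threshold):
    """Replace each glyph with the index of the first 'close enough' master glyph.

    threshold is the largest pixel difference still treated as a match;
    0 means only exact duplicates share a dictionary entry.
    """
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, master in enumerate(dictionary):
            if hamming(glyph, master) <= threshold:
                indices.append(i)  # reuse an existing master glyph
                break
        else:
            dictionary.append(glyph)  # no match: this glyph becomes a master
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary, indices):
    """Rebuild the page from master glyphs only."""
    return [dictionary[i] for i in indices]


page = [SIX, EIGHT, SIX]

exact_dict, exact_idx = compress(page, threshold=0)
loose_dict, loose_idx = compress(page, threshold=3)

restored = decompress(loose_dict, loose_idx)
print(len(exact_dict), exact_idx)
print(len(loose_dict), loose_idx)
print("8 turned into 6:", restored[1] == SIX)
```

In this sketch the 6 and the 8 differ by only three pixels, so a matching threshold of 3 silently stores the 8 as a second copy of the 6, and the decompressed page looks perfectly clean. A stricter threshold keeps them apart at the cost of a larger dictionary.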
There was a somewhat scary case study on the web a few years ago (not
sure if it's still out there, haven't been able to find it) where the
lossy glyph-matching compression in a Xerox copier (JBIG2, if I recall
correctly, which works on the same principle as DJVU) was quietly
changing digits on scanned schematics to different digits. Close enough
for the compressor -- but wrong. The risks are obvious(*).
* - Hat tip to PGN and the comp.risks digest.