Documentation

Jules Richardson julesrichardsonuk at yahoo.co.uk
Thu Nov 30 10:12:59 CST 2006


Richard wrote:
> In article <456EEA43.7050601 at yahoo.co.uk>,
>     Jules Richardson <julesrichardsonuk at yahoo.co.uk>  writes:
> 
>> [...] it's easy
>> to make sure that the page was scanned straight etc., but easy to miss things
>> which might hinder some future OCR process.
> 
> To be honest, almost every time I have tried to OCR something (even a
> pristine original), it was simply faster and more accurate to type it
> in myself.  I don't know why but I have been singularly unimpressed
> with OCR software.  Obviously lots of people do OCR, but the amount of
> rework and editing necessary to get high accuracy is just as much work
> as typing it in yourself for someone like me that is a fast touch
> typer.

Oh, I agree. Twenty years down the line I expect it'll be a lot better though,
but by then the original paper copies of some of the material out there
might be long gone - hence my concern about improving the quality of some scans.

I suppose a vague rule of thumb might be that if it's not readable by a human 
then it's never going to be readable via OCR :-) Thing is, to maximise 
chances, every single letter in every single scan would have to be proof-read 
for legibility - which is obviously unrealistic.

Hence my feeling that bi-level just isn't good enough for some docs, because 
it won't necessarily discriminate between real text and a hair / dirt / pen 
mark where greyscale *might*. It's not infallible either of course - a blue 
biro mark might be indistinguishable from the faded text below it after 
scanning; give it five years and I'll probably be advocating full-colour scans :-)
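
To put rough numbers on the bi-level point, here's a minimal sketch. The grey
levels and the threshold are made-up values for illustration, not measurements
from any real scanner, and a fixed threshold is only one way a scanner might
reduce a page to bi-level:

```python
# Hypothetical 8-bit grey levels (0 = black, 255 = white); all values assumed.
PAPER, INK, SMUDGE = 230, 40, 110
THRESHOLD = 128  # assumed fixed cut-off for the bi-level conversion

# One scan line: two pixels of text, paper, two pixels of hair/dirt, paper.
greyscale_row = [INK, INK, PAPER, SMUDGE, SMUDGE, PAPER]

def to_bilevel(row, threshold=THRESHOLD):
    """Reduce grey levels to 1 (black) or 0 (white) with a fixed threshold."""
    return [1 if value < threshold else 0 for value in row]

bilevel_row = to_bilevel(greyscale_row)

# Distinct values among the dark pixels in each representation.
dark_grey_levels = {v for v in greyscale_row if v < THRESHOLD}
dark_bilevel_levels = {v for v in bilevel_row if v == 1}

print(bilevel_row)               # [1, 1, 0, 1, 1, 0]
print(sorted(dark_grey_levels))  # [40, 110]
print(len(dark_bilevel_levels))  # 1
```

In the greyscale row the ink and the smudge are still two different values, so
something later on at least has a chance of telling them apart; in the bi-level
row both are just "black" and that information is gone for good.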

cheers

Jules



