Documentation
Jules Richardson
julesrichardsonuk at yahoo.co.uk
Thu Nov 30 10:12:59 CST 2006
Richard wrote:
> In article <456EEA43.7050601 at yahoo.co.uk>,
> Jules Richardson <julesrichardsonuk at yahoo.co.uk> writes:
>
>> [...] it's easy
>> to make sure that the page was scanned straight etc., but easy to miss things
>> which might hinder some future OCR process.
>
> To be honest, almost every time I have tried to OCR something (even a
> pristine original), it was simply faster and more accurate to type it
> in myself. I don't know why but I have been singularly unimpressed
> with OCR software. Obviously lots of people do OCR, but the amount of
> rework and editing necessary to get high accuracy is just as much work
> as typing it in yourself for someone like me that is a fast touch
> typer.
Oh, I agree. Twenty years down the line I expect it'll be a lot better,
but by then the original paper copies of some of the material out there
might be long gone - hence my concern about improving the quality of some scans.
I suppose a vague rule of thumb might be that if it's not readable by a human
then it's never going to be readable via OCR :-) Thing is, to maximise the
chances of a good OCR result, every single letter in every single scan would
have to be proof-read for legibility - which is obviously unrealistic.
Hence my feeling that bi-level just isn't good enough for some docs, because
it won't necessarily discriminate between real text and a hair / dirt / pen
mark where greyscale *might*. It's not infallible either of course - a blue
biro mark might be indistinguishable from the faded text below it after
scanning; give it five years and I'll probably be advocating full-colour scans :-)
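
To illustrate the bi-level vs. greyscale point, here's a rough sketch in Python.
The pixel values and the threshold of 128 are made up purely for illustration
(they don't come from any real scan or scanner), and to_bilevel is just a
hypothetical helper, not any particular package's API:

```python
# Hypothetical 8-bit greyscale samples along one scan line
# (0 = black, 255 = white). All values are invented for illustration.
PAPER, INK, FADED_INK, HAIR = 235, 40, 150, 110

scan_line = [PAPER, INK, PAPER, FADED_INK, PAPER, HAIR, PAPER]


def to_bilevel(pixels, threshold=128):
    """Reduce greyscale pixels to bi-level: 1 = black mark, 0 = white paper."""
    return [1 if p < threshold else 0 for p in pixels]


def distinct_levels(pixels):
    """Count how many different pixel values survive in the data."""
    return len(set(pixels))


bilevel = to_bilevel(scan_line)

# After thresholding, the hair is recorded exactly like solid ink, and the
# faded ink is recorded exactly like blank paper.
print(bilevel)
print(distinct_levels(scan_line), "levels in greyscale,",
      distinct_levels(bilevel), "levels in bi-level")
```

With those made-up numbers the hair ends up looking just like real ink and the
faded text vanishes into the paper, whereas the greyscale data still keeps all
four apart, so a later process at least has something to work with.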
cheers
Jules