If you OCR, always archive the bitmaps too - Re: Regarding Manuals

Toby Thain toby at telegraphics.com.au
Sat Sep 26 20:41:52 CDT 2015


On 2015-09-26 5:51 PM, Johnny Billquist wrote:
> On 2015-09-26 23:42, Toby Thain wrote:
>> On 2015-09-26 4:28 PM, Johnny Billquist wrote:
>>> On 2015-09-26 12:16, Johnny Billquist wrote:
>>>> On 2015-09-25 22:35, Al Kossow wrote:
>>>>> I have been going back and applying OCR to the ones on bitsavers.
>>>>> Are there some in particular that you have a problem with?
>>>>
>>>> Aha. I wasn't aware of that. I downloaded copies many years ago that
>>>> I've been keeping locally. I'll check out the current versions on
>>>> bitsavers then.
>>>
>>> Al, exactly how have they been OCRed? Looking at them, it would appear
>>> that what you see is still the bitmaps of all the pages, but then you
>>> have the basic text also available for selection/searching.
>>>
>>> My issue with that is that the documents are huge, and the experience
>>> just scrolling through them is pretty bad.
>>
>> Imho, though I am sure I am not alone:
>>
>> Software which "recreates" the typography of a document from OCR does
>> not produce an acceptable substitute. I've yet to see a book that wasn't
>> ruined by it.
>>
>> Just worth mentioning for anyone who might be tempted: for this reason
>> and others, the bitmaps must NEVER be discarded (although of course
>> bitmaps can be archived in a different file if people want to supply OCR
>> as well).
>
> Look at the results in the link I posted. I was more than happy with
> that result.

I've seen plenty of technical books ruined by this technique, which is 
why I beg anyone doing this to not divorce the bitmaps from the OCR'd 
result.

I suppose some books might be relatively immune, but technical texts 
seem to be quite sensitive to poor interpretation by OCR, logically enough.

--Toby

>
> But sure, for those who like bitmaps, I'm certainly not going to take
> them away. But for me, I'm actually interested in the content, and not
> the pixels.
>
>      Johnny
>


