If you OCR, always archive the bitmaps too - Re: Regarding Manuals

Johnny Billquist bqt at update.uu.se
Sun Sep 27 09:08:07 CDT 2015

On 2015-09-27 15:38, Dave G4UGM wrote:
>> -----Original Message-----
>> From: cctalk [mailto:cctalk-bounces at classiccmp.org] On Behalf Of Johnny
>> Billquist
>> Sent: 27 September 2015 13:18
>> To: cctalk at classiccmp.org
>> Subject: Re: If you OCR, always archive the bitmaps too - Re: Regarding
>> Manuals
>> On 2015-09-27 03:41, Toby Thain wrote:
>>> On 2015-09-26 5:51 PM, Johnny Billquist wrote:
>>>> On 2015-09-26 23:42, Toby Thain wrote:
>>>>> On 2015-09-26 4:28 PM, Johnny Billquist wrote:
>>>>>> On 2015-09-26 12:16, Johnny Billquist wrote:
>>>>>>> On 2015-09-25 22:35, Al Kossow wrote:
>>>>>>>> I have been going back and applying OCR to the ones on bitsavers.
>>>>>>>> Are there some in particular that you have a problem with?
>>>>>>> Aha. I wasn't aware of that. I've downloaded copies many years ago
>>>>>>> that I've been keeping locally. I'll check out the current
>>>>>>> versions on bitsavers then.
>>>>>> Al, exactly how have they been OCRed? Looking at them, it would
>>>>>> appear that what you see is still the bitmaps of all the pages, but
>>>>>> then you have the basic text also available for selection/searching.
>>>>>> My issue with that is that the documents are huge, and the
>>>>>> experience just scrolling through them is pretty bad.
>>>>> Imho, though I am sure I am not alone:
>>>>> Software which "recreates" the typography of a document from OCR
>>>>> does not produce an acceptable substitute; I've yet to see a book
>>>>> that wasn't ruined by it.
>>>>> Just worth mentioning for anyone who might be tempted - For this
>>>>> reason and others, the bitmaps must NEVER be discarded (Although of
>>>>> course bitmaps can be archived in a different file if people want to
>>>>> supply OCR as well.)
>>>> Look at the results in the link I posted. I was more than happy with
>>>> that result.
>>> I've seen plenty of technical books ruined by this technique, which is
>>> why I beg anyone doing this to not divorce the bitmaps from the OCR'd
>>> result.
>>> I suppose some books might be relatively immune, but technical texts
>>> seem to be quite sensitive to poor interpretation by OCR, logically enough.
>> I suppose it is the eternal argument between preservation and use. I use
>> these documents every day. I don't care about the pixels, but the content.
>> Museums and the like are obviously more interested in the preservation.
>> I get the feeling you didn't actually check the text that I OCRed from a book.
>> That text is an example of what I'm looking for.
> I did. I found it hard to read as it was OCR'd with mixed typefaces.

That's because it looked exactly that way in the original as well. Blame 
the author/typesetter.

> It is also only basically non-technical English. Try a couple of pages from any of the VM/370R6 manuals.

Like I said, I don't even remember what software I used. :-(
I got it with a scanner I bought in the 90s, and ran it on Windows ME. 
Otherwise I would love to apply it to something bigger/more technical.

> I have tried to OCR without the bitmaps with little success. These are especially badly reproduced (originally not as scanned). I can read the text from the bitmaps and know it's right.
> One error in the OCR and I can be scratching my head for ages. I also don't have problems reading them on a laptop...

Errors are always bad. Agreed. That is not something we're discussing here.

I don't have problems reading the current scans, as such. But when 
having ten of these open at the same time, and scrolling through them, 
it becomes obvious that the bitmaps are heavy. It can take a while for 
the screen to be updated. Not to mention the problems you sometimes hit 
with searching...


Johnny Billquist                  || "I'm on a bus
                                   ||  on a psychedelic trip
email: bqt at softjar.se             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol
