If you OCR, always archive the bitmaps too - Re: Regarding Manuals
bqt at update.uu.se
Sun Sep 27 07:17:48 CDT 2015
On 2015-09-27 03:41, Toby Thain wrote:
> On 2015-09-26 5:51 PM, Johnny Billquist wrote:
>> On 2015-09-26 23:42, Toby Thain wrote:
>>> On 2015-09-26 4:28 PM, Johnny Billquist wrote:
>>>> On 2015-09-26 12:16, Johnny Billquist wrote:
>>>>> On 2015-09-25 22:35, Al Kossow wrote:
>>>>>> I have been going back and applying OCR to the ones on bitsavers.
>>>>>> Are there some in particular that you have a problem with?
>>>>> Aha. I wasn't aware of that. I downloaded copies many years ago that
>>>>> I've been keeping locally. I'll check out the current versions on
>>>>> bitsavers then.
>>>> Al, exactly how have they been OCRed? Looking at them, it would appear
>>>> that what you see is still the bitmaps of all the pages, but then you
>>>> have the basic text also available for selection/searching.
>>>> My issue with that is that the documents are huge, and the experience
>>>> just scrolling through them is pretty bad.
>>> Imho, though I am sure I am not alone:
>>> Software which "recreates" the typography of a document from OCR does
>>> not produce an acceptable substitute; I've yet to see a book that wasn't
>>> ruined by it.
>>> Just worth mentioning for anyone who might be tempted. For this reason
>>> and others, the bitmaps must NEVER be discarded (although of course
>>> bitmaps can be archived in a different file if people want to supply OCR
>>> as well).
>> Look at the results in the link I posted. I was more than happy with
>> that result.
> I've seen plenty of technical books ruined by this technique, which is
> why I beg anyone doing this to not divorce the bitmaps from the OCR'd
> I suppose some books might be relatively immune, but technical texts
> seem to be quite sensitive to poor interpretation by OCR, logically enough.
I suppose it is the eternal argument between preservation and use. I use
these documents every day. I don't care about the pixels, but the
content. Museums and the like are obviously more interested in the
I get the feeling you didn't actually check the text that I OCRed from a
book. That text is an example of what I'm looking for.
I will not prevent people who want pixel preservation from continuing to
have that. But for me, it is a problem. The experience of actually using
these documents is pretty poor. And, as has also been noted,
information has been lost in these scans, as they have not preserved
Johnny Billquist       || "I'm on a bus
                       ||  on a psychedelic trip
email: bqt at softjar.se || Reading murder books
pdp is alive!          ||  tryin' to stay hip" - B. Idol