If you OCR, always archive the bitmaps too - Re: Regarding Manuals
dave.g4ugm at gmail.com
Sun Sep 27 08:38:27 CDT 2015
> -----Original Message-----
> From: cctalk [mailto:cctalk-bounces at classiccmp.org] On Behalf Of Johnny
> Sent: 27 September 2015 13:18
> To: cctalk at classiccmp.org
> Subject: Re: If you OCR, always archive the bitmaps too - Re: Regarding
> On 2015-09-27 03:41, Toby Thain wrote:
> > On 2015-09-26 5:51 PM, Johnny Billquist wrote:
> >> On 2015-09-26 23:42, Toby Thain wrote:
> >>> On 2015-09-26 4:28 PM, Johnny Billquist wrote:
> >>>> On 2015-09-26 12:16, Johnny Billquist wrote:
> >>>>> On 2015-09-25 22:35, Al Kossow wrote:
> >>>>>> I have been going back and applying OCR to the ones on bitsavers.
> >>>>>> Are there some in particular that you have a problem with?
> >>>>> Aha. I wasn't aware of that. I downloaded copies many years ago
> >>>>> that I've been keeping locally. I'll check out the current
> >>>>> versions on bitsavers then.
> >>>> Al, exactly how have they been OCRed? Looking at them, it would
> >>>> appear that what you see is still the bitmaps of all the pages, but
> >>>> then you have the basic text also available for selection/searching.
> >>>> My issue with that is that the documents are huge, and the
> >>>> experience just scrolling through them is pretty bad.
> >>> Imho, though I am sure I am not alone:
> >>> Software which "recreates" the typography of a document from OCR
> >>> does not produce an acceptable substitute. I've yet to see a book
> >>> that wasn't ruined by it.
> >>> Just worth mentioning for anyone who might be tempted - For this
> >>> reason and others, the bitmaps must NEVER be discarded (Although of
> >>> course bitmaps can be archived in a different file if people want to
> >>> supply OCR as well.)
> >> Look at the results in the link I posted. I was more than happy with
> >> that result.
> > I've seen plenty of technical books ruined by this technique, which is
> > why I beg anyone doing this to not divorce the bitmaps from the OCR'd
> > result.
> > I suppose some books might be relatively immune, but technical texts
> > seem to be quite sensitive to poor interpretation by OCR, logically enough.
> I suppose it is the eternal argument between preservation and use. I use
> these documents every day. I don't care about the pixels, but the content.
> Museums and the like are obviously more interested in the preservation.
> I get the feeling you didn't actually check the text that I OCRed from a book.
> That text is an example of what I'm looking for.
I did. I found it hard to read, as it has been OCR'd with mixed typefaces. It is also basically only non-technical English. Try a couple of pages from any of the VM/370R6 manuals.
I have tried to OCR these without keeping the bitmaps, with little success. They are especially badly reproduced (in the originals, not as scanned). I can read the text from the bitmaps and know it's right.
One error in the OCR and I can be scratching my head for ages. I also don't have problems reading them on a laptop...
> I will not prevent people who want pixel preservation from continuing to
> have that. But for me, it is a problem. The experience in actually using these
> documents is pretty poor. And, as has also been noted, information has
> been lost in these scans, as they have not preserved color codings.
> Johnny Billquist || "I'm on a bus
> || on a psychedelic trip
> email: bqt at softjar.se || Reading murder books
> pdp is alive! || tryin' to stay hip" - B. Idol