OCR old software listing

Mon Dec 31 19:33:09 CST 2018

On the other hand, just for the FUN of it, 
can you write some software to find and fix (or simply flag) the most 
common errors?

When I had to terminate my publisher, I was s'posed to receive a copy of 
their customer database.
They deleted all delimiters (spaces, commas, periods, and other 
punctuation), and then printed it out on greenbar with a font that used 
the same character for zero and letter 'O'; and same character for 
one, lower case 'l', and upper case 'I'.  Surprisingly, they did NOT use a 
bad ribbon and printhead!

An acquaintance OCR'ed it. They were able to get what they thought was "80 
to 90%" from the originals, but not from a xerox copy that visually 
seemed to be just as good.

I spent a little time writing some simple code to parse and fix most of 
it. 
Mostly simple context, such as a zero between two letters is likely an 
'O', or an 'O' between two numerals is likely a zero.  Similarly with one, 
lower case 'L' and upper case 'I'.   Some OCR software now pays attention 
to context.
Five consecutive numerals following two capital letters is likely to be a 
zip code, and end of the record.  USUALLY. Comparison of those digits 
with the two letters in a zipcode database provided partial confirmation.
. . . and so forth . . .
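
For anyone who wants to play along, here is a minimal sketch of the "simple context" rules described above. It is not the original code, which is long gone; the function name and the exact rules are my assumptions, and it only looks at the single character on each side.

```python
import re

def fix_confusables(text: str) -> str:
    """Apply neighbor-based guesses for 0/O and 1/l/I (illustrative rules only)."""
    # A zero with a letter on each side is probably the letter 'O'.
    text = re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "O", text)
    # A letter 'O' with a digit on each side is probably a zero.
    text = re.sub(r"(?<=[0-9])O(?=[0-9])", "0", text)
    # A one between two lower-case letters is probably a lower-case 'l'.
    text = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", text)
    # A lower-case 'l' or upper-case 'I' between two digits is probably a one.
    text = re.sub(r"(?<=[0-9])[lI](?=[0-9])", "1", text)
    return text

print(fix_confusables("J0HN sma1l 9O210 4I5"))
```

Runs of confusable characters (two zeros in a row inside a word, or one at the very start or end of a field) are not touched by these rules; those are better flagged for a human than guessed at.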

Not a practical use of time, but a fun exercise in parsing.
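
The zip code trick can be sketched the same way. This assumes the text is upper case with all delimiters stripped, and it uses a tiny hypothetical table (KNOWN_ZIPS) as a stand-in for a real zipcode database; the names are mine, for illustration only.

```python
import re

# Hypothetical stand-in for a real ZIP code database: a few (state, ZIP) pairs.
KNOWN_ZIPS = {("CA", "90210"), ("NY", "10001")}

# Two capital letters followed by five digits: a likely state + ZIP, ending a record.
RECORD_END = re.compile(r"([A-Z]{2})(\d{5})")

def split_records(blob: str):
    """Yield (record, confirmed) pairs from delimiter-free text."""
    start = 0
    for m in RECORD_END.finditer(blob):
        state, zip_code = m.group(1), m.group(2)
        # Partial confirmation: does this state/ZIP pair exist in the table?
        confirmed = (state, zip_code) in KNOWN_ZIPS
        yield blob[start:m.end()], confirmed
        start = m.end()
    if start < len(blob):
        # Trailing text with no ZIP-like ending: hand it to a human.
        yield blob[start:], False

blob = "JOHNSMITH123MAINSTBEVERLYHILLSCA90210JANEDOE45ELMSTNEWYORKNY10001"
for record, confirmed in split_records(blob):
    print(confirmed, record)
```

A five digit street number right after a name will also match the pattern, which is the "USUALLY" above. Those show up as unconfirmed records, since the two letters in front of them are not a state.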

Another time, the .SRT file that I found for "Company Man" used upper case 
'I' instead of lower case 'L'!  (AND had a three minute offset for the 
start time)    Did not take very long to fix.
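
A rough sketch of that kind of .SRT repair, for the curious. The direction of the offset, the sample cue, and the rule that an upper case 'I' right after a lower case letter is really an 'l' are all assumptions for illustration; a word that STARTS with the wrong letter is not caught.

```python
import re
from datetime import timedelta

# SRT timestamps look like 00:03:05,500 (hours:minutes:seconds,milliseconds).
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_timestamp(match: re.Match, offset: timedelta) -> str:
    h, m, s, ms = (int(g) for g in match.groups())
    t = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) + offset
    # Work in whole milliseconds and clamp at zero so nothing goes negative.
    total_ms = max(0, t // timedelta(milliseconds=1))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def fix_srt(text: str, offset: timedelta) -> str:
    out = []
    for line in text.splitlines():
        if "-->" in line:
            # Timing line: shift both the start and end timestamps.
            line = TIMESTAMP.sub(lambda mt: shift_timestamp(mt, offset), line)
        else:
            # Assumed rule: a run of 'I' right after a lower-case letter is 'l'.
            line = re.sub(r"(?<=[a-z])I+", lambda mt: "l" * len(mt.group()), line)
        out.append(line)
    return "\n".join(out)

sample = "1\n00:03:05,500 --> 00:03:07,000\nI wiII caII you.\n"
print(fix_srt(sample, timedelta(minutes=-3)))
```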

--
Grumpy Ol' Fred     		cisin at xenosoft.com

On Tue, 1 Jan 2019, dwight wrote:

> Fred is right, OCR is only worth it if the document is in perfect condition. I just finished getting an old 4004 listing working. I made only two mistakes on the 4K of code that were not the fault of the poorness of the listing. Twice I put LDM instead of LD. LDM was the most commonly used.
> There were still some 15 or so other errors due to the printing. It looked to be done on an ASR33 with poor registration of the print drum. Cs and 0s were often missing the right 1/3. Expecting OCR to do much would have been folly, even though some 85% to 90% could be read properly. It took me about 3 weeks of evenings to make heads or tails of the code. I've finally got it running correctly.
> If it had been done with OCR, in many cases it would have simply put a C instead of a 0. I'd have had to go through the listing, checking each C to make sure it was right. It was easier in many cases to analyse what I could see and make a judgement, based on that and the general context, as I was typing it in.
> Dwight
>
> ________________________________
> From: cctalk <cctalk-bounces at classiccmp.org> on behalf of Fred Cisin via cctalk <cctalk at classiccmp.org>
> Sent: Monday, December 31, 2018 9:46 AM
> To: General Discussion: On-Topic and Off-Topic Posts
> Subject: Re: OCR old software listing
>
> On Mon, 31 Dec 2018, Larry Kraemer via cctalk wrote:
>> I used the libtiff-tools (Debian 8.x - 32 Bit) to extract all 61 .TIF's
>> from the Multipage .tif file.  While the .tif's look descent, and
>> RasterVect shows the .tif properties to be Group 4 Fax (1bpp) with 5100
>> x 6600 pixels - 300 DPI, I can't get tesseract 3.x, TextBridge Classic
>> 2.0, or Irfanview with KADMOS Plugin to OCR any of the .tif files, with
>> descent results.  I'd expect an OCR of 85 to 90 % correct conversion to
>> ASCII text.
>
> Software listings need more accuracy than that.
> How many wrong characters does it take for a program not to work?
> "desCent" isn't good enough.
>
> 85 to 90 % correct is a character wrong in every 6 to 10 characters.
> How many errors is that PER LINE?
>
> "But, you can start with that, and just fix the errors, without retyping
> the rest."  Doing it that way is a desCent into madness.
> BTDT.  wore out the T-shirts.
>
>
> A competent typist can retype the whole thing faster than fixing an error
> in every six to ten characters.
> Only if there is less than one error for every several hundred characters
> does "patching it" save time for a competent typist.
> In general, for a competent typist, the fastest way to reposition the
> cursor to the next error in the line is to simply hit the keys of the
> intervening letters.
> It is NOT to move the cursor with the mouse, then put your hand back on
> the keys to type a character.
> Using cursor motion keys is no faster for a competent typist than hitting
> the keys of the letters to skip over.
>
>
> TIP: display the OCR'ed text that is to be corrected in a font that
> exaggerates the difference between zero and the letter 'O', and between
> one and lower case 'l'.  There are some programs that will attempt to
> select those based on context.
>
> --
> Grumpy Ol' Fred                  cisin at xenosoft.com
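
To put numbers on the "PER LINE" question in the quoted message, here is a back-of-the-envelope sketch. The 40 character line length is an assumption (pick your own), and treating each character as an independent coin flip is a simplification, since real OCR errors cluster.

```python
def errors_per_line(accuracy: float, line_length: int) -> float:
    """Expected number of wrong characters in a line of the given length."""
    return (1.0 - accuracy) * line_length

def chance_line_is_clean(accuracy: float, line_length: int) -> float:
    """Probability that every character in the line is right (independence assumed)."""
    return accuracy ** line_length

LINE_LENGTH = 40  # assumed typical listing line; not from the original scans

for accuracy in (0.85, 0.90):
    print(
        f"{accuracy:.0%} correct: "
        f"{errors_per_line(accuracy, LINE_LENGTH):.1f} errors per line, "
        f"{chance_line_is_clean(accuracy, LINE_LENGTH):.2%} chance of a clean line"
    )
```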