PDF to Text Conversion (Was: Manual scanning: TIFF-to-PDF software with greyscale support?)

Brad Parker brad at heeltoe.com
Tue Dec 22 09:22:36 CST 2009


On Dec 21, 2009, at 9:38 PM, Al Kossow wrote:

> On 12/21/09 5:35 PM, Jerome H. Fine wrote:
>
>> I have about 100,000 lines of code in over 3 dozen PDF files that were
>> scanned from the hard copy listings. Unfortunately, the original text
>> source files were lost, so the PDF files are a last resort. Other than
>> typing in the code by hand from the PDF file, are there any good
>> freeware programs to convert a PDF back to a text file?
>>
>
> sounds like the TSX-Plus listings I scanned for Lyle.

I spent a little time playing with OCRopus and then Tesseract, trying to
turn scanned PDP-11 diags back into text.  I didn't have good luck.  I'd
be interested if others have a working formula.

I did have a little fun "training" Tesseract on the line printer font.
I think that technique holds promise, but it needed more data to do a
good job (my initial sample was too small, but it did improve things a
lot).

Just curious if anyone else has tried training one of the OCR programs
to read line printer fonts.

-brad



