Scanning docs for bitsavers

3 Dec 2019

actually, we scan to pdf with back ocr, also text, also tiff, also jpeg, with the slooowww
hp 11x17 scan fax print thing. i can scan the entire document, then save 1, save 2, save 3, save 4
without rescanning each time.
ed at smecc
In a message dated 12/3/2019 2:16:01 AM US Mountain Standard Time, cctalk at
classiccmp.org writes:
Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk <cctalk at classiccmp.org>
wrote:
...
  At 01:57 PM 2/12/2019 -0700, you wrote:
 On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk <cctalk at classiccmp.org> wrote:
 > When I corresponded with Al Kossow about format several years ago, he
 > indicated that CCITT Group 4 lossless compression was their standard.
 As for G4 bilevel encoding, the only reasons it isn't treated with the same
 disdain as JBIG2 are:
 1. Bandwagon effect - "It must be OK because so many people use it."
 2. People with little or zero awareness of typography, the visual quality of
    text, and anything to do with preservation of historical character of
    printed works. For them "I can read it OK" is the sole requirement.
 G4 compression was invented for fax machines. No one cared much about visual
 quality of faxes, they just had to be readable. Also the technology of fax
 machines was only capable of two-tone B&W reproduction, so that's what G4
 encoding provided. 
So it boils down to two distinct tasks:
  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts, b/w or 16 gray levels for black-and-white
    text, tables and the like).
  * Make these images accessible as usable documentation.
The first step is the one that's work-intensive; the second step can probably
be easily redone every time we "learn" something about how to make the
documents more useful.
For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images to TIFF for example, possibly downsample,
possibly OCR and overlay it.
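
As a rough illustration of treating the scans as the source and everything else as a derived product, here is a minimal sketch of the "possibly downsample" step plus a reduction to 16 gray levels. It works on plain lists of 8-bit gray values so it needs nothing beyond the standard library; a real pipeline would presumably use an imaging library instead. The function names `downsample_2x` and `quantize` are made up for this example, and the 2x2 box filter is just one simple choice, not anything prescribed in this thread.

```python
def downsample_2x(pixels):
    """Halve width and height by averaging each 2x2 block (simple box filter).

    `pixels` is a list of rows, each row a list of 8-bit gray values (0..255).
    A trailing odd row or column is dropped. The input is not modified.
    """
    height, width = len(pixels), len(pixels[0])
    out = []
    for y in range(0, height - 1, 2):
        row = []
        for x in range(0, width - 1, 2):
            total = (pixels[y][x] + pixels[y][x + 1]
                     + pixels[y + 1][x] + pixels[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out


def quantize(pixels, levels=16):
    """Map 8-bit gray values onto `levels` evenly spaced values in 0..255.

    levels=2 gives a bilevel (pure black/white) image, levels=16 keeps
    some of the gray shading around character edges.
    """
    step = 255 / (levels - 1)
    return [[round(round(v / step) * step) for v in row] for row in pixels]


# The master scan stays as it is; derived versions are new objects.
master = [[0, 0, 255, 255],
          [0, 0, 255, 255]]
preview = quantize(downsample_2x(master), levels=16)
```

Because both functions return new lists, the derived copy can be regenerated with different settings at any time without rescanning.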
...
  But PDF literally cannot be used as a wrapper for the results, since
 it doesn't incorporate the required image compression formats.
 This is why I use things like html structuring, wrapped as either a zip
 file or RARbook format. Because there is no other option at present.
 There will be eventually. Just not yet. PDF has to be either greatly
 extended, or replaced. 
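
To make the "html structuring, wrapped as a zip file" idea concrete, here is a minimal sketch using only the Python standard library. It is an assumption about what such a bundle could look like, not the actual layout used by the poster: the function name `build_bundle`, the `index.html` file name and the one-image-per-paragraph markup are all invented for illustration, and the RARbook variant is not covered.

```python
import io
import zipfile
from html import escape


def build_bundle(title, pages):
    """Pack page images plus a tiny HTML index into one zip archive.

    `pages` is a list of (filename, image_bytes) tuples in reading order.
    The image bytes are stored untouched, in whatever format they were
    scanned or encoded in. Returns the archive as bytes.
    """
    items = "\n".join(
        f'<p><img src="{escape(name)}" alt="page {i}"></p>'
        for i, (name, _) in enumerate(pages, start=1)
    )
    index = (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n{items}\n</body></html>\n"
    )
    buf = io.BytesIO()
    # ZIP_STORED: the images are already compressed, so no second pass.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("index.html", index)
        for name, data in pages:
            zf.writestr(name, data)
    return buf.getvalue()


# Placeholder bytes stand in for a real image file here.
bundle = build_bundle("Example manual", [("page001.png", b"\x89PNG")])
```

The point of the wrapper is that it places no restriction on the image format inside, which is the limitation of PDF being complained about here.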
I think that PDF actually is a quite well-working output format, but
we'd see it as a compilation product of our actual source (images),
not as the final (and only) product.
...
  And that's why I get upset when people physically destroy rare old documents
 during or after scanning them currently. It happens so frequently, that by
 the time we have a technically adequate document coding scheme, a lot of old
 documents won't have any surviving paper copies.
 They'll be gone forever, with only really crap quality scans surviving. 
:-(  Too bad, but that happens all the time.
Thanks,
  Jan-Benedict
--
