Question about PDF manipulation
bv at norbionics.com
Fri Jun 3 04:38:46 CDT 2005
On 2 Jun, 2005, at 19:40, Barry Watzman wrote:
> I really don't think that you understand the nature and capabilities
> of the product (Acrobat) that you are criticizing.
I have used Acrobat since the first beta.
Your comment just proves to me that you seem to be totally clueless
about the important issues in modern electronic dissemination of
information, so I guess I have to explain it in more detail.
I am writing this with the help of Display PDF, the engine used for the
display in Mac OS X. It is much superior to the inelegant kludges that
make up other graphical display systems. Adobe is a company that
understands font rendering and the presentation of things.
That is also what Acrobat is about: it is a way to make sure that the
print shop prints your document the way you intended. That is what
a PDF file is good for, and in every other respect it is an inferior
I have been working with communication and transfer of information
between dissimilar systems for the last 30 years, so I have some idea
about what is beneficial and what is harmful. Doing it the simple way
is often more useful than trying to add features. Using a proprietary
solution is usually bad, even if it is supposed to be the most
widespread and "everybody" is using it. There are exceptions, but they
need to be researched and documented in each individual case.
If you scan a book, you end up with bitmaps of the pages. If you stuff
these bitmaps into a PDF container, the only value you add is that they
are kept together in sequence. The value you subtract is that they are
no longer readily available for everybody, and anybody who wants to OCR
them to make any kind of index or cross reference will have to use
proprietary software or get them extracted from the container they are
Now, if you do it the simple way, you use a suitably named directory
instead of the PDF file. In that directory, you can keep individual
PNGs for each page you scanned, named P000 and up (use whatever
starting value is suitable to reflect the page numbers in the book). If
the book is divided into chapters and appendices, you make a subdirectory
for each of them and place the page scans in there instead. Thus you retain
the structure of the original without using any proprietary format, and
everybody with a graphical display will be able to use the scanned
book. There are numerous image viewers available for all platforms;
many are graphical browsers which will navigate your pages and
directories rather better than an Acrobat reader. Many suitable viewers
are available from open source projects, so you can build one even on
platforms which have been neglected for years.
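
As a rough illustration of the directory layout described above, here is a minimal Python sketch. The function names, the "chapter03" directory name and the three-digit zero padding are assumptions made for the example; the text only specifies PNG files named P000 and up, optionally grouped in one subdirectory per chapter or appendix.

```python
from pathlib import Path
from typing import List, Optional


def page_path(root: Path, page: int, chapter: Optional[str] = None) -> Path:
    """Return the path of one scanned page, e.g. root/chapter03/P052.png."""
    folder = root / chapter if chapter else root
    # Zero-padded numbers keep alphabetical order equal to page order.
    return folder / f"P{page:03d}.png"


def pages_in_order(root: Path) -> List[Path]:
    """List every page scan under root, sorted by directory and file name."""
    return sorted(root.rglob("P*.png"))
```

Because the structure lives in the file names and directories, any file manager, image viewer or script can walk the book without special software.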
The most elegant solution, though, is to use an ordinary web browser to
access the pages. It is trivially simple to make a website out of this
directory structure, and there are many free server-side products
available to make access user friendly without any effort by the
It is in the web scenario we most clearly see the "value subtracted"
nature of PDF. If I want to look at the information on page 52, I have
to get the whole document. That will waste bandwidth, and it makes the
server more expensive to operate. Besides, I get thrown out of the
normal working mode for my web browser and into the different mode of
the Acrobat plugin (if that is supported on the platform I use,
otherwise it will be the standalone reader or GhostView or something).
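
To make the per-page access concrete, here is a minimal Python sketch that wraps each scan in its own small HTML page with previous/next links, so a browser only has to fetch the one page it wants. The names page_html and build_site and the file names P051.png etc. are hypothetical; a real site might just as well rely on the web server's own directory listing.

```python
from html import escape
from typing import Dict, List, Optional


def page_html(image: str, prev_page: Optional[str], next_page: Optional[str]) -> str:
    """Build a tiny HTML page showing one scan with previous/next links."""
    links = []
    if prev_page:
        links.append(f'<a href="{escape(prev_page)}">previous</a>')
    if next_page:
        links.append(f'<a href="{escape(next_page)}">next</a>')
    nav = " | ".join(links)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(image)}</title></head>\n"
        f'<body><p>{nav}</p><img src="{escape(image)}" alt="{escape(image)}"></body></html>\n'
    )


def build_site(names: List[str]) -> Dict[str, str]:
    """Map each scan name (e.g. 'P052.png') to the HTML page that wraps it."""
    site = {}
    for i, name in enumerate(names):
        stem = name.rsplit(".", 1)[0]
        # The first page has no predecessor, the last page no successor.
        prev_page = names[i - 1].rsplit(".", 1)[0] + ".html" if i > 0 else None
        next_page = names[i + 1].rsplit(".", 1)[0] + ".html" if i + 1 < len(names) else None
        site[stem + ".html"] = page_html(name, prev_page, next_page)
    return site
```

Looking at page 52 then costs one small HTML file and one PNG, not the whole document.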
The next important feature of the open solution is that it encourages a
collaborative effort to add useful things like indexes, cross references
and even full text versions. Take a look at Wikipedia to see what it is
possible to accomplish when things are kept in open, universally
accessible form. A repository of technical information could be set up
the same way, and it would become gradually more useful as people added
their comments and index hints for the scanned pages as
metainformation. To OCR the pages just for use as aids to searching and
indexing would be simple: the raw OCR output could be given the same
name as the scanned page, just with a .txt extension. If somebody later
on were to proofread and mark it up, that would lead to a .xml
document. These kinds of possibilities are only available if the
documents are kept in a simple, logical structure that is accessible to
as many as possible, not just for reading but for further refinement.
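
The naming convention for OCR output can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names sidecar and search are hypothetical, the .txt files are assumed to be UTF-8 text, and the search is a plain case-insensitive substring match, not a real index.

```python
from pathlib import Path
from typing import List


def sidecar(scan: Path, extension: str) -> Path:
    """Return the companion file for a scan, e.g. P052.png -> P052.txt."""
    return scan.with_suffix(extension)


def search(root: Path, word: str) -> List[Path]:
    """Return the scans whose raw OCR text contains word (case-insensitive)."""
    hits = []
    for scan in sorted(root.rglob("P*.png")):
        text_file = sidecar(scan, ".txt")
        if not text_file.exists():
            continue  # nobody has OCRed this page yet
        text = text_file.read_text(encoding="utf-8", errors="replace")
        if word.lower() in text.lower():
            hits.append(scan)
    return hits
```

Pages without a .txt companion are simply skipped, so the collection stays usable while contributors add OCR text page by page.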
In order to avoid technical lock-in today, my preferred document format
is XML with CSS styling, either with an XHTML DTD or, ideally, a DTD
tailored to the usage area and reflected by the stylesheets. Bitmap
images are ideally PNGs, photographs JPEG 2000, and vector images are
SVG. There exists a plethora of free tools to work with, transform and
generate this kind of document.
The most harmful things for anybody who wants a usable, syntactical
web are lock-in formats. The worst by far is Flash, with Microsoft
Office formats closely following (even when the output is supposed to
be HTML), but PDF is a good third.
More information about the cctalk