Question about PDF manipulation
julesrichardsonuk at yahoo.co.uk
Thu Jun 2 16:15:33 CDT 2005
On Thu, 2005-06-02 at 22:49 +0200, Jan-Benedict Glaw wrote:
> On Thu, 2005-06-02 16:39:58 -0400, Paul Koning <pkoning at equallogic.com> wrote:
> > >>>>> "Jan-Benedict" == Jan-Benedict Glaw <jbglaw at lug-owl.de> writes:
> > Jan-Benedict> Do we actually *have* the tools? We've got tumble to
> > Jan-Benedict> assemble a PDF file, but do we have proper tools to
> > Jan-Benedict> disassemble one? ...and I really mean exporting the
> > Jan-Benedict> initial TIFF, not something that looks like it.
> > Ghostscript reads PDF files every bit as well as PS files, and it's
> > open source...
> You didn't answer my question:-) Consider I prepare a TIFF file that
> contains (with additional tags) eg. some raw OCRed text, not
> read-checked. Now I prepare a PDF from this and use gs to get the image
> back. Is my text still there? Or do I get an image that "looks" almost
> the original, but doesn't contain my extra-data?
Hmm, 'no' seems to be the answer. Or at least when I use ImageMagick
(which seems to call into ghostscript in the case of manipulating PDF
files) it's not preserving the TIFF comment field.
I did the following:
- Created a small TIFF image with Gimp, and saved it with "this is a
comment" in the comment field.
- Verified that the comment was in place by running 'identify'
on the TIFF file.
- Converted the single TIFF file to a PDF using ImageMagick's convert
utility (which calls into Ghostscript libraries AFAIK)
- Converted the resulting PDF file back to a single TIFF image with
- Ran identify again on the resulting TIFF file, and the comment's now
changed to: "Image generated by ESP Ghostscript (device=pnmraw)"
... so it looks like any TIFF 'metadata' isn't getting preserved.
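The round trip above can be scripted roughly as follows. This is only a sketch under assumptions: it assumes ImageMagick's `convert` and `identify` are on the PATH, it assumes `convert` is also used for the PDF-to-TIFF step, and the function names and default file names are hypothetical. `%c` is ImageMagick's format escape for the image comment.

```python
import shutil
import subprocess


def roundtrip_commands(src_tiff, pdf, out_tiff):
    """Build the ImageMagick command lines for a TIFF -> PDF -> TIFF round trip."""
    return [
        ["identify", "-format", "%c", src_tiff],  # comment of the original TIFF
        ["convert", src_tiff, pdf],               # wrap the TIFF in a PDF
        ["convert", pdf, out_tiff],               # assumed tool for the way back
        ["identify", "-format", "%c", out_tiff],  # comment after the round trip
    ]


def run_roundtrip(src_tiff, pdf="roundtrip.pdf", out_tiff="roundtrip.tif"):
    """Run the round trip and return (comment_before, comment_after)."""
    if shutil.which("convert") is None or shutil.which("identify") is None:
        raise RuntimeError("ImageMagick (convert/identify) not found on PATH")
    outputs = []
    for cmd in roundtrip_commands(src_tiff, pdf, out_tiff):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs[0], outputs[-1]
```

If the two returned strings differ, the comment field did not survive the trip through PDF.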
Looking at the PDF file, I'm not convinced there's any TIFF data in
there to be honest. It looks more like the image is re-encoded from the
input TIFF to PDF's own way of storing bitmap data - in other words it's
not simply a wrapper for a bunch of TIFF images, but merely a wrapper
for bitmap data in PDF's own format. That's something of a
disappointment; I always thought PDF just encapsulated the input images
rather than re-encoding in any way...
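One rough way to check how a PDF stores its bitmaps is to look for image objects (dictionaries with `/Subtype /Image`) and read their `/Filter` entry, which names the stream encoding (for example `FlateDecode`, `DCTDecode` or `CCITTFaxDecode`). The sketch below is a crude byte scan rather than a real PDF parser; the function name is hypothetical, and it assumes an unencrypted file whose stream data does not happen to contain the text `endobj`.

```python
import re

# Matches "N G obj ... endobj" bodies; good enough for a quick look.
OBJ = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)


def image_filters(pdf_bytes):
    """Return the /Filter names of each image object found in pdf_bytes."""
    found = []
    for match in OBJ.finditer(pdf_bytes):
        # Only inspect the dictionary part, not the binary stream data.
        header = match.group(1).split(b"stream", 1)[0]
        if not re.search(rb"/Subtype\s*/Image\b", header):
            continue
        # /Filter is either a single name or an array of names.
        flt = re.search(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", header)
        names = re.findall(rb"/([A-Za-z0-9]+)", flt.group(1)) if flt else []
        found.append([name.decode("ascii") for name in names])
    return found
```

Calling `image_filters(open("some.pdf", "rb").read())` on a converted file would list the encodings in use; an image with no `/Filter` entry shows up as an empty list, meaning raw uncompressed samples.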