Re: PDF Metadata -- PDF Explorer & PDF-ShellTools


Subject: Re: PDF Metadata
Author: RTT


Date Posted: 18:01:43 04/10/06 Mon
In reply to: PC's message, "PDF Metadata" on 17:22:04 04/10/06 Mon

>1) Is there any way to determine whether the pdf
>documents were originally from scanned documents or
>not? Does your product provide any pdf metadata
>information which can determine this?

No, and there is no standard way to determine that. The only way to be sure is to check every page of the document for text words; if no text is found, the document is probably image-only and was likely produced by scanning.
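As a rough sketch of that page-by-page check done outside PDFE, something like the following Python could work. It assumes the third-party pypdf library is installed for the extraction step; the function names (count_words, looks_scanned, extract_page_texts) are made up for illustration and are not part of PDFE or PDF-ShellTools.

```python
from typing import Iterable


def count_words(page_texts: Iterable[str]) -> int:
    """Total number of whitespace-separated words across all pages."""
    return sum(len(text.split()) for text in page_texts)


def looks_scanned(page_texts: Iterable[str], min_words: int = 1) -> bool:
    """Heuristic only: True when fewer than min_words words were found."""
    return count_words(page_texts) < min_words


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the extractable text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    import sys

    # Usage: python check_scanned.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        if looks_scanned(extract_page_texts(path)):
            print(f"{path}: no text found, probably scanned")
        else:
            print(f"{path}: contains text")
```

This is only a heuristic. A scanned document that has been through OCR carries a text layer and will be reported as containing text, so "no words found" suggests a scan, but "words found" does not rule one out.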

>
>2) (OR) If there is no such field, can I deduce
>if the pdf "Creator" Metadata field is blank that
>means the document is a scanned document?

You can presume almost nothing from the info fields, especially from empty ones. These fields are filled in by the software or person that creates the document, and an empty field only tells you that the person or software left it empty. Software that creates PDF documents by scanning can fill in that field too.

>3) (OR) If I cannot make the deduction from 2)
>can I at least deduce that if the "Creator" Metadata
>field is not blank that it is definitely not a scanned
>document? (and if it is blank it may or may not be)

No.

Indirectly, with PDFE you can determine whether a document's pages contain no text, and so probably hold only image content.
Run the IndexTextWords batch tool on these documents and parse the batch output log. If PDFE finds no words to index for a particular document, it is very probable that the file was produced by scanning.

Let me know if you do not understand something.

This feature is requested so often that I will probably implement a batch tool to deal with it. Implementation ideas are welcome. What other flags would also be important?
