[UFO Chicago] grep on a PDF
Neil R. Ormos
ormos at ripco.com
Tue Mar 18 06:29:14 PDT 2014
Carl Karsten wrote:
> Neil R. Ormos <ormos at ripco.com> wrote:
>> Brian Sobolak wrote:
>>> I have a daily report that is delivered to me as a PDF, on which I'd
>>> like to do two simple operations:
>>> - 'wc' the whole thing to count roughly the number of lines
>>> - 'grep' to count the number of certain words.
>>> I'm reasonably certain that the underlying document is parseable as it
>>> isn't a scanned image or anything like that.
>>> I looked around briefly for 'pdf2txt' or
>>> something similar and didn't find it in my
>>> toolkit; I also started the gs man page before
>>> being interrupted.
>>> Has anyone tried this?
>> Try pdftotext(1). It comes in the poppler-utils
>> package, at least in Debian.
>> Output can be ordered/organized in three different
>> ways, depending on whether you use the -layout
>> option, the -raw option, or the default (none of
>> these options). If the default output is
>> unusable, try the other options.
> pdftotext would be my first choice, but if you
> want to get lost in options and be ready to take
> on whatever pdf you ever encounter:
> http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
> https://packages.debian.org/wheezy/pdftk
> You can use it to import a data file and fill
> out a PDF form. (because I couldn't figure out how
> to fill it out and save it.)
PDFtk is great for some tasks.
I wasn't aware it would extract the document text
(other than text already encoded in "fillable"
form fields). Which pdftk Operation do you use for
that?
--Neil Ormos
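
For reference, the pdftotext suggestion quoted above could be scripted
roughly as follows. This is only a minimal sketch, assuming pdftotext
(from poppler-utils) is on the PATH and that the report has extractable
text. The function names extract_text, count_lines, and count_word are
made up for illustration. The line count mirrors "wc -l" (it counts
newline characters), and the word count is the number of whole-word,
case-insensitive occurrences rather than the number of matching lines
that "grep -c" would report.

```python
import re
import subprocess
import sys


def extract_text(pdf_path, mode=None):
    """Return the text of pdf_path as produced by pdftotext.

    mode is None (pdftotext's default ordering), "layout", or "raw",
    matching the -layout and -raw options mentioned in the thread.
    """
    cmd = ["pdftotext"]
    if mode in ("layout", "raw"):
        cmd.append("-" + mode)
    elif mode is not None:
        raise ValueError("mode must be None, 'layout', or 'raw'")
    # A trailing "-" asks pdftotext to write the text to stdout.
    cmd += [pdf_path, "-"]
    result = subprocess.run(cmd, check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")


def count_lines(text):
    """Count newline characters, as 'wc -l' does."""
    return text.count("\n")


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


if __name__ == "__main__" and len(sys.argv) >= 2:
    # Hypothetical usage: python report_counts.py report.pdf word1 word2
    report_text = extract_text(sys.argv[1])
    print("lines:", count_lines(report_text))
    for w in sys.argv[2:]:
        print(w + ":", count_word(report_text, w))
```

If the counts look wrong, passing mode="layout" or mode="raw" to
extract_text changes how pdftotext orders the text, which may help.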