[UFO Chicago] grep on a PDF

Tue Mar 18 06:29:14 PDT 2014

Carl Karsten wrote:
> Neil R. Ormos <ormos at ripco.com> wrote:
>> Brian Sobolak wrote:

>>> I have a daily report that is delivered to me via PDF that I'd like to do
>>> two simple operations:

>>>  - 'wc' the whole thing to count roughly the number of lines
>>>  - 'grep' to count the number of certain words.

>>> I'm reasonably certain that the underlying document is parseable as it
>>> isn't a scanned image or anything like that.

>>> I looked around briefly for 'pdf2txt' or
>>> something similar and didn't find it in my
>>> toolkit; I also started the gs man page before
>>> being interrupted.

>>> Has anyone tried this?

>> Try pdftotext(1).  It comes in the poppler-utils
>> package, at least in Debian.

>> Output can be ordered/organized in three different
>> ways, depending on whether you use the -layout
>> option, the -raw option, or the default (none of
>> these options).  If the default output is
>> unusable, try the other options.
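
For the two operations asked about, a minimal sketch might look like the
following. The function names are made up for illustration. It assumes
pdftotext from poppler-utils is on the PATH and that grep supports -o
(GNU and BSD grep both do). Passing "-" as the output file sends the
text to stdout. -layout is just one choice here; -raw or the default
may give better results on a particular report.

```shell
# Hypothetical helpers, not part of poppler-utils.

# Rough line count of the text extracted from a PDF.
pdf_lines() {
    pdftotext -layout "$1" - | wc -l
}

# Count occurrences of a whole word, case-insensitively.
# grep -o prints each match on its own line, so wc -l counts matches;
# grep -c would count matching lines instead.
pdf_count_word() {
    pdftotext -layout "$1" - | grep -o -i -w -- "$2" | wc -l
}

# Usage (report.pdf is a placeholder name):
#   pdf_lines report.pdf
#   pdf_count_word report.pdf overdue
```

Since the counts depend on how pdftotext orders the text, it is probably
worth eyeballing the extracted output once before trusting the numbers.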

> pdftotext would be my first choice, but if you
> want to get lost in options and be ready to take
> on whatever pdf you ever encounter:

> http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

> https://packages.debian.org/wheezy/pdftk

> You can use it to import a data file and fill
> out a PDF form. (because I couldn't figure out how
> to fill it out and save it.)

PDFtk is great for some tasks.

I wasn't aware it would extract the document text
(other than text already encoded in "fillable"
form fields). Which pdftk Operation do you use for
that?
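
For reference, the form-related operations I have in mind are
dump_data_fields and fill_form. A rough sketch, with made-up function
and file names, assuming pdftk is installed:

```shell
# Hypothetical wrappers around two pdftk operations.

# Print the fillable form fields (names and current values) of a PDF.
pdf_form_fields() {
    pdftk "$1" dump_data_fields
}

# Fill a form from an FDF data file and write the result to a new PDF.
pdf_fill_form() {
    pdftk "$1" fill_form "$2" output "$3"
}

# Usage (placeholder file names):
#   pdf_form_fields form.pdf
#   pdf_fill_form form.pdf data.fdf filled.pdf
```

Neither of these touches the page text itself, only the form fields.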

--Neil Ormos