[UFO Chicago] grep on a PDF

Mon Mar 17 20:19:21 PDT 2014

On Mon, Mar 17, 2014 at 9:53 PM, Neil R. Ormos <ormos at ripco.com> wrote:

> Brian Sobolak wrote:
>
> > Hi --
> >
> > I have a daily report, delivered to me as a PDF, on which I'd like to do
> > two simple operations:
> >
> >  - 'wc' the whole thing to count roughly the number of lines
> >  - 'grep' to count the number of certain words.
> >
> > I'm reasonably certain that the underlying document is parseable as it
> > isn't a scanned image or anything like that.
> >
> > I looked around briefly for 'pdf2txt' or something similar and didn't find
> > it in my toolkit; I also started the gs man page before being interrupted.
> >
> > Has anyone tried this?
>
> Try pdftotext(1).  It comes in the poppler-utils
> package, at least in Debian.
>
> Output can be ordered/organized in three different
> ways, depending on whether you use the -layout
> option, the -raw option, or the default (none of
> these options).  If the default output is
> unusable, try the other options.
>

pdftotext would be my first choice too, but if you want to get lost in
options and be ready to take on whatever PDF you encounter, there is pdftk:

http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

https://packages.debian.org/wheezy/pdftk

You can use it to import a data file and fill out a PDF form. (I went that
route because I couldn't figure out how to fill the form out and save it.)
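
Back to the original question: here is a rough sketch of the wc and grep
steps on top of pdftotext, not a tested recipe for your particular report.
The file name and the word "error" are placeholders I made up, the helper
function names are hypothetical, and I am assuming a grep that supports -o
and -w (GNU and BSD grep both do). Passing "-" as the output name makes
pdftotext write the text to stdout.

```shell
#!/bin/sh
# Sketch only: the PDF path and search word come from the command line;
# "error" is a placeholder default, not something from the real report.

# Count lines of text arriving on stdin (tr strips wc's padding).
count_lines() {
    wc -l | tr -d ' '
}

# Count case-insensitive whole-word occurrences of $1 in stdin.
# grep -o prints each match on its own line, so wc -l counts matches
# rather than matching lines.
count_word() {
    grep -o -i -w -- "$1" | wc -l | tr -d ' '
}

# Only touch a PDF when one is given and pdftotext is installed.
if [ -n "${1:-}" ] && command -v pdftotext >/dev/null 2>&1; then
    # "-" as the output file sends the extracted text to stdout.
    pdftotext -layout "$1" - | count_lines
    pdftotext -layout "$1" - | count_word "${2:-error}"
fi
```

Saved as, say, pdfcount.sh, it would be run as "sh pdfcount.sh report.pdf
error". If the counts look off, swapping -layout for -raw (or dropping the
option) changes how the text is ordered, as Neil described.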

-- 
Carl K