[UFO Chicago] grep on a PDF

Neil R. Ormos ormos at ripco.com
Mon Mar 17 19:53:40 PDT 2014


Brian Sobolak wrote:

> Hi --
> 
> I have a daily report that is delivered to me as a PDF, and I'd like to do
> two simple operations on it:
> 
>  - 'wc' the whole thing to count roughly the number of lines
>  - 'grep' to count the number of certain words.
> 
> I'm reasonably certain that the underlying document is parseable as it
> isn't a scanned image or anything like that.
> 
> I looked around briefly for 'pdf2txt' or something similar and didn't find
> it in my toolkit; I also started the gs man page before being interrupted.
> 
> Has anyone tried this?

Try pdftotext(1).  It comes in the poppler-utils
package, at least in Debian.

Output can be ordered/organized in three different
ways: the -layout option tries to preserve the
physical layout of the page, the -raw option keeps
the text in content-stream order, and the default
(neither option) emits text in reading order.  If
the default output is unusable, try the other options.
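
Here is a minimal sketch of how the two operations could be
wired up, assuming pdftotext is installed and that your grep
supports -o (GNU and BSD grep do).  The function names and the
file name report.pdf are made up for illustration.  Giving "-"
as the output file makes pdftotext write to standard output.

```shell
#!/bin/sh

# Count the lines of text arriving on stdin.
count_lines() {
    wc -l
}

# Count occurrences of a whole word on stdin, ignoring case.
# grep -c would count matching *lines*; -o prints every match on
# its own line, so wc -l then counts individual occurrences.
count_word() {
    grep -o -i -w -- "$1" | wc -l
}

# Convert a PDF to text on stdout.
#   $1 = PDF file
#   $2 = optional mode flag: -layout or -raw (default mode if omitted)
pdf_text() {
    if [ -n "$2" ]; then
        pdftotext "$2" "$1" -
    else
        pdftotext "$1" -
    fi
}
```

Usage would look roughly like this:

    pdf_text report.pdf | count_lines
    pdf_text report.pdf -layout | count_word error

The line count will differ between the three modes, so pick one
mode and stick with it if you compare counts from day to day.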

--Neil

