[UFO Chicago] grep on a PDF
Carl Karsten
carl at personnelware.com
Tue Mar 18 08:48:31 PDT 2014
On Tue, Mar 18, 2014 at 8:29 AM, Neil R. Ormos <ormos at ripco.com> wrote:
> Carl Karsten wrote:
> > Neil R. Ormos <ormos at ripco.com> wrote:
> >> Brian Sobolak wrote:
>
> >>> I have a daily report that is delivered to me via PDF on which I'd like
> >>> to do two simple operations:
>
> >>> - 'wc' the whole thing to count roughly the number of lines
> >>> - 'grep' to count the number of certain words.
>
> >>> I'm reasonably certain that the underlying document is parseable as it
> >>> isn't a scanned image or anything like that.
>
> >>> I looked around briefly for 'pdf2txt' or
> >>> something similar and didn't find it in my
> >>> toolkit; I also started the gs man page before
> >>> being interrupted.
>
> >>> Has anyone tried this?
>
> >> Try pdftotext(1). It comes in the poppler-utils
> >> package, at least in Debian.
>
> >> Output can be ordered/organized in three different
> >> ways, depending on whether you use the -layout
> >> option, the -raw option, or the default (none of
> >> these options). If the default output is
> >> unusable, try the other options.
>
> > pdftotext would be my first choice, but if you
> > want to get lost in options and be ready to take
> > on whatever pdf you ever encounter:
>
> > http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
>
> > https://packages.debian.org/wheezy/pdftk
>
> > You can use it to import a data file and fill
> > out a PDF form. (I did that because I couldn't
> > figure out how to fill the form out and save it.)
>
> PDFtk is great for some tasks.
>
> I wasn't aware it would extract the document text
> (other than text already encoded in "fillable"
> form fields). Which pdftk Operation do you use for
> that?
>
> --Neil Ormos
>
Huh, looks like pdftk doesn't extract text.
I am surprised it does all that other stuff but can't dump text.
Never mind.
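
For the original question, here is a rough sketch of the pdftotext route.
It assumes poppler-utils is installed and a grep that supports -o (GNU and
BSD grep both do). The file name report.pdf, the search word, and the
function names are made up for illustration.

```shell
# Hypothetical helpers; the names are illustrative, not from any package.

# Dump a PDF's text to stdout. "-" as the output file means stdout.
# Swap -layout for -raw, or drop it, if the text comes out in a poor order.
pdf_text() {
    pdftotext -layout "$1" -
}

# Count lines of text read from stdin.
# tr strips the padding that some wc builds put in front of the number.
count_lines() {
    wc -l | tr -d ' '
}

# Count occurrences of a whole word on stdin, ignoring case.
# grep -o prints one match per line, so wc -l counts matches,
# not matching lines.
count_word() {
    grep -o -i -w -- "$1" | wc -l | tr -d ' '
}

# Usage (report.pdf is a placeholder):
#   pdf_text report.pdf | count_lines
#   pdf_text report.pdf | count_word error
```

If a count of matching lines is enough, grep -c -i -w would do, but it
undercounts when the word shows up twice on one line.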
--
Carl K