<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 18, 2014 at 8:29 AM, Neil R. Ormos <span dir="ltr"><<a href="mailto:ormos@ripco.com" target="_blank">ormos@ripco.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="">Carl Karsten wrote:<br>
> Neil R. Ormos <<a href="mailto:ormos@ripco.com">ormos@ripco.com</a>> wrote:<br>
>> Brian Sobolak wrote:<br>
<br>
</div><div><div class="h5">>>> I have a daily report that is delivered to me via PDF that I'd like to do<br>
>>> two simple operations:<br>
<br>
>>> - 'wc' the whole thing to count roughly the number of lines<br>
>>> - 'grep' to count the number of certain words.<br>
<br>
>>> I'm reasonably certain that the underlying document is parseable as it<br>
>>> isn't a scanned image or anything like that.<br>
<br>
>>> I looked around briefly for 'pdf2txt' or<br>
>>> something similar and didn't find it in my<br>
>>> toolkit; I also started the gs man page before<br>
>>> being interrupted.<br>
<br>
>>> Has anyone tried this?<br>
<br>
>> Try pdftotext(1). It comes in the poppler-utils<br>
>> package, at least in Debian.<br>
<br>
>> Output can be ordered/organized in three different<br>
>> ways, depending on whether you use the -layout<br>
>> option, the -raw option, or the default (none of<br>
>> these options). If the default output is<br>
>> unusable, try the other options.<br>
<br>
> pdftotext would be my first choice, but if you<br>
> want to get lost in options and be ready to take<br>
> on whatever pdf you ever encounter:<br>
<br>
> <a href="http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/" target="_blank">http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/</a><br>
<br>
> <a href="https://packages.debian.org/wheezy/pdftk" target="_blank">https://packages.debian.org/wheezy/pdftk</a><br>
<br>
> You can use it to import a data file and fill<br>
> out a PDF form (because I couldn't figure out how<br>
> to fill it out and save it).<br>
<br>
</div></div>PDFtk is great for some tasks.<br>
<br>
I wasn't aware it would extract the document text<br>
(other than text already encoded in "fillable"<br>
form fields). Which pdftk Operation do you use for<br>
that?<br>
<span class=""><font color="#888888"><br>
--Neil Ormos<br>
</font></span></blockquote></div><br></div><div class="gmail_extra">Huh, looks like it doesn't extract text. <br><br></div><div class="gmail_extra">I am surprised it does all that other stuff but doesn't dump text.<br></div>
<div class="gmail_extra"><br></div><div class="gmail_extra">Never mind.<br></div><div class="gmail_extra"><br>-- <br>Carl K
</div></div>
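<div>A minimal sketch of the pdftotext suggestion from earlier in the thread, assuming poppler-utils is installed and the report sits at the hypothetical path report.pdf. The helper names (pdftotext_command, extract_text, count_lines, count_word) are made up for illustration; only the -layout and -raw flags and the "-" output argument (write to stdout) come from pdftotext(1). It extracts the text, then does rough equivalents of 'wc -l' and a word-counting 'grep'.</div>
<pre>
```python
import re
import subprocess

# Flags documented in pdftotext(1); "default" passes neither flag.
MODE_FLAGS = {"default": [], "layout": ["-layout"], "raw": ["-raw"]}


def pdftotext_command(pdf_path, mode="default"):
    """Build the argv for pdftotext; the trailing "-" sends text to stdout."""
    return ["pdftotext", *MODE_FLAGS[mode], pdf_path, "-"]


def extract_text(pdf_path, mode="default"):
    """Run pdftotext (assumed to be on PATH) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        check=True,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    return result.stdout


def count_lines(text):
    """Rough equivalent of 'wc -l': the number of newline characters."""
    return text.count("\n")


def count_word(text, word):
    """Count whole-word, case-insensitive matches, roughly 'grep -oiw WORD | wc -l'."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))


# Hypothetical usage (requires pdftotext and a real file):
#   text = extract_text("report.pdf", mode="layout")
#   print(count_lines(text), count_word(text, "error"))
```
</pre>
<div>If the default ordering gives unusable counts, switching mode to "layout" or "raw" is the first thing to try, as suggested above.</div>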