<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 18, 2014 at 8:29 AM, Neil R. Ormos <span dir="ltr"><<a href="mailto:ormos@ripco.com" target="_blank">ormos@ripco.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="">Carl Karsten wrote:<br>

> Neil R. Ormos <<a href="mailto:ormos@ripco.com">ormos@ripco.com</a>> wrote:<br>

>> Brian Sobolak wrote:<br>

<br>

</div><div><div class="h5">>>> I have a daily report that is delivered to me via PDF that I'd like to do<br>

>>> two simple operations:<br>

<br>

>>>  - 'wc' the whole thing to count roughly the number of lines<br>

>>>  - 'grep' to count the number of certain words.<br>

<br>

>>> I'm reasonably certain that the underlying document is parseable as it<br>

>>> isn't a scanned image or anything like that.<br>

<br>

>>> I looked around briefly for 'pdf2txt' or<br>

>>> something similar and didn't find it in my<br>

>>> toolkit; I also started the gs man page before<br>

>>> being interrupted.<br>

<br>

>>> Has anyone tried this?<br>

<br>

>> Try pdftotext(1).  It comes in the poppler-utils<br>

>> package, at least in Debian.<br>

<br>

>> Output can be ordered/organized in three different<br>

>> ways, depending on whether you use the -layout<br>

>> option, the -raw option, or the default (none of<br>

>> these options).  If the default output is<br>

>> unusable, try the other options.<br>

<br>

> pdftotext would be my first choice, but if you<br>

> want to get lost in options and be ready to take<br>

> on whatever pdf you ever encounter:<br>

<br>

> <a href="http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/" target="_blank">http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/</a><br>

<br>

> <a href="https://packages.debian.org/wheezy/pdftk" target="_blank">https://packages.debian.org/wheezy/pdftk</a><br>

<br>

> You can use it to import a data file and fill<br>

> out a PDF form. (because I couldn't figure out how<br>

> to fill it out and save it.)<br>

<br>

</div></div>PDFtk is great for some tasks.<br>

<br>

I wasn't aware it would extract the document text<br>

(other than text already encoded in "fillable"<br>

form fields). Which pdftk Operation do you use for<br>

that?<br>

<span class=""><font color="#888888"><br>

--Neil Ormos<br>

</font></span></blockquote></div><br></div><div class="gmail_extra">Huh, looks like it doesn't extract text.  <br><br></div><div class="gmail_extra">I am surprised it does all that other stuff but doesn't dump text.<br></div>


<div class="gmail_extra"><br></div><div class="gmail_extra">Never mind.<br></div><div class="gmail_extra"><br>-- <br>Carl K

</div></div>
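<div>For the original question (a rough line count plus a count of certain words), here is a minimal sketch of the pdftotext route suggested above. It assumes pdftotext from poppler-utils is on the PATH. The function names and the file name report.pdf are made up for illustration. Unlike 'grep -c', which counts matching lines, this counts individual whole-word occurrences, and the line count will only roughly match 'wc -l'.</div>

```python
import re
import subprocess


def pdf_to_text(pdf_path, layout=False):
    """Return the text of a PDF by running pdftotext (assumed to be installed)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to preserve the physical page layout; try it (or -raw)
        # if the default ordering is unusable.
        cmd.append("-layout")
    # A text-file argument of "-" tells pdftotext to write to stdout.
    cmd += [pdf_path, "-"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def count_lines(text):
    """Rough line count of the extracted text."""
    return len(text.splitlines())


def count_word(text, word):
    """Count case-insensitive whole-word occurrences of word in text."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file and word): python pdfcount.py report.pdf error
    if len(sys.argv) == 3:
        text = pdf_to_text(sys.argv[1])
        print("lines:", count_lines(text))
        print(sys.argv[2] + ":", count_word(text, sys.argv[2]))
```

<div>The counting helpers work on any string, so they can be tried without a PDF at hand.</div>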