[UFO Chicago] which is faster when dealing with large data sets

Brian Sobolak brian at planetshwoop.com
Mon Nov 16 21:01:30 PST 2015


On Mon, November 16, 2015 12:55 am, Neil R. Ormos wrote:
> Brian Sobolak wrote:
>
>> I have a list of 4400 elements.  I have to find those items in a
>> list of 4M elements.  It's not deterministic, ie, I won't have keys
>> that match exactly, but have to use simple checks (e.g. do the first
>> 3 characters match?  Do I find 2 words from X in the other set
>> Y?) etc.
>
>> I'm using awk, of course.
>
>> I'm sure the precise answer depends on the exact operations I
>> choose, but is it generally faster to compare 4400 vs 4M, or 4M vs
>> 4400?  I'm guessing which one you start with matters.
>
> The answer would depend, at least in part, on the size of the records
> in each file, and the complexity of the tests you are using to
> determine similarity.
>
> It would also depend on whether you intend (and have enough RAM) to
> read all the data into awk arrays when you start the program, as
> opposed to making plural passes across one or more files.
>
> Without knowing those things, but assuming records and files of
> manageable size, I would suggest that it's likely to be faster to have
> the inner loop iterate over the smaller set of 4400 records, as there
> is some possibility that the code and data needed for that can reside
> in cache.  Also, this structure would make it convenient to read and
> process each record from the larger file sequentially, in a single
> pass, using awk's main pattern-action loop, avoiding the need to store
> all of the larger file in RAM. (If whatever you're doing involves
> Windows' paging mechanism, you'll pay a huge performance penalty.)
>
> Also, you may be able to get better performance if you can find a way
> to code your similarity tests to exploit awk's associative array
> features, via simple array look-ups, or searches of array indices
> (e.g., if (x in array) ...). Those internal index matching operations
> are much faster than iterating over the array in a loop (at least in
> GNU Awk).
>
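
[The look-up structure described above can be sketched like this.  A
minimal, illustrative example: the file names, sample data, and the
3-character key length are assumptions, not from the thread.  The small
file is loaded into an associative array in one pass, then the large file
is streamed through awk's main loop and tested with (key in array):]

```shell
# Build two tiny sample files (illustrative data only).
cat > small.txt <<'EOF'
Johnson
Johansson
Smith
EOF
cat > large.txt <<'EOF'
Johnson-Baker
Smythe
Johanssen
EOF

# NR == FNR is true only while reading the first file: build the lookup
# array keyed on the first 3 characters.  For the second file, a simple
# membership test on the array index replaces any inner loop.
awk '
  NR == FNR { seen[substr($0, 1, 3)] = 1; next }  # small file: build keys
  substr($0, 1, 3) in seen                        # large file: hash lookup
' small.txt large.txt
```

[This prints the two lines of large.txt whose first 3 characters match a
key from small.txt, while reading the 4M-row file only once.]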

This was helpful.  I'm trying it now (and having decent results).  Rather
than retain the full record, I built two arrays with the keys and the
fields I wanted to match, and am trying various parameters to see what
works.  (Match on 3 words vs. 2, compare the first 5 characters, etc.)
Lots of brute-force guesswork to find how names match, especially when
non-English values are involved.
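
[A word-overlap check like the one described could be sketched as below;
the sample names and the idea of comparing the count against a 2- or
3-word threshold are illustrative assumptions:]

```shell
# count_shared(a, b): number of whitespace-separated words of a that
# also appear in b.  Compare the result against a threshold (2 vs. 3)
# to decide whether two name strings "match".
awk '
  function count_shared(a, b,    i, n, w, inb, c) {
    n = split(b, w)                        # index the words of b
    for (i = 1; i <= n; i++) inb[w[i]] = 1
    n = split(a, w)                        # count words of a found in b
    for (i = 1; i <= n; i++) if (w[i] in inb) c++
    return c
  }
  BEGIN {
    a = "Jose Maria Garcia Lopez"
    b = "Garcia Lopez Jose"
    print count_shared(a, b)               # prints 3: Jose, Garcia, Lopez
  }
'
```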

I am doing this in Cygwin on Windows with 8GB of RAM.  It takes a few
minutes to build the array of 4M rows, and... a while to run through the
whole thing.  I've been letting it run in the background and thus haven't
paid much attention.

brian

