[UFO Chicago] which is faster when dealing with large data sets

Neil R. Ormos ormos at ripco.com
Sun Nov 15 22:55:53 PST 2015


Brian Sobolak wrote:

> I have a list of 4400 elements.  I have to find those items in a
> list of 4M elements.  It's not deterministic, i.e., I won't have keys
> that match exactly, but have to use simple checks (e.g., do the first
> 3 characters match?  Do I find 2 words from X in the other set
> Y?), etc.

> I'm using awk, of course.

> I'm sure the precise answer depends on the exact operations I
> choose, but is it generally faster to compare 4400 vs 4M, or 4M vs
> 4400?  I'm guessing which one you start with matters.

The answer would depend, at least in part, on the size of the records
in each file, and the complexity of the tests you are using to
determine similarity.

It would also depend on whether you intend (and have enough RAM) to
read all the data into awk arrays when you start the program, as
opposed to making multiple passes over one or more files.

Without knowing those things, but assuming records and files of
manageable size, I would suggest that it's likely to be faster to have
the inner loop iterate over the smaller set of 4400 records, as there
is some possibility that the code and data needed for that can reside
in cache.  Also, this structure would make it convenient to read and
process each record from the larger file sequentially, in a single
pass, using awk's main pattern-action loop, avoiding the need to store
all of the larger file in RAM. (If whatever you're doing involves
Windows' paging mechanism, you'll pay a huge performance penalty.)
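
Something along these lines might serve as a starting point (untested;
the names small.txt and big.txt, the key in field 1, and a
first-3-characters test are only placeholder assumptions on my part):

    # Pass over the first file (NR == FNR): keep the 4400 keys in memory.
    NR == FNR { key[++n] = $1; next }

    # Pass over the second file: stream the 4M records once,
    # with the inner loop running over the small in-memory set.
    {
        for (i = 1; i <= n; i++)
            if (substr($1, 1, 3) == substr(key[i], 1, 3)) {
                print FNR ": " $0
                break
            }
    }

and run it as something like:  awk -f match.awk small.txt big.txt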

You may also be able to get better performance if you can code your
similarity tests to exploit awk's associative arrays, via simple
array look-ups or searches of array indices (e.g., if (x in array)
...). Those internal index-matching operations are much faster than
iterating over the array in a loop (at least in GNU Awk).

Best regards,

--Neil

