Tuesday, 10 September 2013

Algorithm for finding matching set for a given set of inputs

Algorithm for finding matching set for a given set of inputs

I have a set of input sets, each element is consisting of up to 50 chars.
Think of a database table with varchar(50) columns, each row being one
input set. The column count is usually about within 5-8 range. The number
of rows is usually about 2-3 millions but can be up to 150 million. Every
column must have a non-null value. Let us call this table T, and each of
its rows RT
I have a second set of inputs, which denote a matching pattern. Just like
the original input it can also be considered a database table with the
same fixed number of columns (plus metadata for matching target but that
part is not relevant). This time however, only one column has to have a
value but any of them can be null. Let us call this table M, and each of
its rows RM. The typical size of this table is 4000 - 5000 rows But can be
up to 40 000.
The task here is for each RT, we have to find the matching RM. Here are
the matching rules for a given RT:
A match is made if all the elements of RM is the same as the corresponding
element in RT. (a string exact match is considered as same). If an element
is null in RM, RT is not checked for a match.
Only one match is possible. If multiple matches are identified, the one
with longest RM is considered the correct one. The case of multiple RM
having same length can be ignored here. We pick just the first one
identified.
The code will run on a typical windows client with 4 GB of RAM. So, MT can
be held in memory but T cannot.
I am looking for a technique to reduce the number of comparisons. More
specifically, are you aware of a technique where the whole RT can be
checked against for a given RT. If not, what is the most efficient way to
handle this problem?
This is going to be coded in .NET and any existing code or library for
.NET is very much appreciated.
Thanks,
Kemal
NOTES:
This is not a homework. I graduated about 20 years ago :)
There is a variety with regular expressions. But for the current release,
a match with simple string equivalence is sufficient.
Here is some examples:
RT = { A, B, C, D, E }
RM1 = { A,,,, } RM2 = { A, B,,, } RM3 = {A,,,D, E}
For this RT, all of these match. But due to the rule 2, we pick RM3 as the
match. Hope this example clarifies. Thanks again

No comments:

Post a Comment