Introduction and the Old Approach
We want to revamp the base dedupe engine in CiviCRM 2.1. - the code used in CiviCRM 2.0,
Dedupe works based on a set of rules that say what does it mean that two contacts 'match' in the given rule's context; every rule has a weight; if the sum of weights crosses the set's threshold, these contacts are considered possible dupes of each other.
Three-rule example: (a) first_name equals - weight 5, (b) last_name equals - weight 7, (c) email equals - weight 10; (threshold) two contacts are dupes if the sum of weights is over 10 (i.e., at least two rules 'fired').
The dedupe functionality from 2.0 is built from the grounds up: first a mechanism that generates a query saying 'find me people with the same $field as contact of $id', then a mechanism that, for a given contact_id, runs all such queries (in the above three-rule example, three queries per contact_id) and lefts only these dupes that cross the defined threshold, and, finally, runs this whole set of queries for every single contact_id there is.
Not surprisingly, this does not scale too well - if the number of contacts is 10,000 and the number of rules is three, the above approach means deduping requires 30,000 SQL queries and then the results are compared on the PHP side to see whether the threshold is crossed.
The New Approach to Be Implemented in 2.1
After having some heavier installations try out dedupe, it turns out most (if not all) people using dedupe use it in quite the opposite fashion: the use is not based on retrieving possible dupes of a given contact, but rather on getting all the possible dupe pairs across the whole databse.
As such, we decided to approach dedupe from the other way around: for every dedupe rule there is (say, equal first_name with weight 5), find all contact_id pairs and associate them with the rule's weight. Repeat this for every rule (say, three times). Finally, from all the (id1, id2, weight) triples leave only these id1, id2 pairs that have the weight sum crossing the threshold. This basically means four queries total and no PHP work on the results.
A Proof-of-Concept Example
A query that returns contact_id pairs for first_name (weight 5):
A query that returns contact_id pairs for last_name (weight 7):
A query that returns contact_id pairs for email (weight 10):
Given that we know the above queries, instead of running them and trying to work on the results on the PHP side, we may easily create one temporary dedupe table with all the data still neatly packed in three columns (thanks to UNION on the selects). In pseudo-PHP-code, where $queries is an array with the above three strings:
Now, we can run a second query that filters only pairs with the sum of weights above the threshold (10 in the example):
This way, in only two queries, we have all the contact_id pairs that fulfill the three rules (and only these that cross the threshold). The temporary dedupe table is kept in memory and dropped as soon as the db connection finishes.
An example for the current trunk db seed:
Extensions to the Approach
Substring-Based Rules
Simply change the USING (last_name) condition to ON SUBSTR(c1.last_name, 1, 5) = SUBSTR(c2.last_name, 1, 5).
Per-Contact Limitation
Simply append the required contact_id to the rules (3 in this example):
Further Improvements
To limit the data needed to be handled, the rules could be checked in decreasing weight order and the SELECTs could consider only contacts that have a chance to 'still make it' - e.g., if a contact, during the previous rule checks, accumulated a sum o weights 5, the threshold is 10, and the remaining rules are for a total sum of 3, then there's no point in taking that contact under consideration.
I'm not sure whether the added complexity of such an approach wouldn't actually hamper the overall process (simple JOINs should be easily optimisable by the MySQL engine).

14 Comments
Hide/Show CommentsApr 24, 2008
David Strauss
Neither the 2.0 nor the proposed 2.1 approach uses the algorithm I created as intended. My intended use is running the algorithm for a contact whenever that contact is created or changed. This avoids all the scaling issues because you're not running the dedupe engine for every contact all at once. Unless there's a rule change for matching duplicates, there's no reason to ever run the engine for the whole database at once. The score between two contacts can only change when one of the two contacts changes.
If the concern is giving normal users dedupe control, then we just don't. We can still run the dedupe algorithm whenever a contact changes, even if we just cache the high-scoring results for later use by someone managing duplicate consolidation.
And by "running the dedupe algorithm," I mean running it for only one contact record versus all others. This is a very fast operation.
Apr 24, 2008
Dan Hassell
Just wanted to add my $0.02 to David's comment. Since most of our data comes from importing large data files rather than single contact editing, there still needs to be a more efficient method of detecting duplicates than the one in 2.0. Having a deduping method that only works one at a time would not be much help. Personally I want to encourage this line of thinking proposed above as it is an important issue for us.
In fact I want to ask if the following enhancement is possible -- or feasible -- could a table of first name synonyms be added, so that the system would treat "Mike" the same as "Michael", and "Sue" would match to "Susan"? These are the kinds of differences that result in many hard to detect duplicates in my experience.
Thanks!
Apr 25, 2008
Brian Shaughnessy
I agree with Dan, contra David. While David's point is valid (the usefulness of on-the-fly dedupe checking based on predefined rules), most of the deduping I do is immediately post-import on a large set of records. For example, I may do a large import and then run a check to ensure there are no duplicate emails (which impacts CMS-user match, and thus is an important part of my initial import/setup/config process).
While some basic on-the-fly checking would be useful, I could envision it may impact performance for the end user who is regularly editing records, so I wouldn't want it to be too aggressive (assuming another query is required upon data entry or record saving). Also, I think there will always be a need for running the dedupe process on the full database periodically for "less likely" dups (such as addresses).
On a related subject --
The mechanism for comparing and merging dups is very useful. It would be great if there were a way to manually initiate that process. As Dan pointed out, sometimes the dups are not easily found via an automated process (Sue Jones = Susan Jones), and yet the admin user may stumble upon them while using the system. It would be great if there were a way to select two users from search results and have the option to "merge records" from the activity dropdown -- taking advantage of the existing merge function in the dedupe tool. Because of the usefulness of that merge tool, I've found myself creating specific dedupe criteria to "catch" manually-found dups and be able to easily merge them (it's easy to manually merge when it's just contact info to copy over, but more difficult if there are memberships, events, contribs, activities, etc. that need to be merged).
Apr 25, 2008
Piotr Szotkowski
'The mechanism for comparing and merging dups is very useful. It would be great if there were a way to manually initiate that process.'
That's a good idea - adding some UI elements for this shouldn't be much of a problem.
It's actually already possible to do this by hand - the merge screen's URL works for any contact_id pair. For example, if you want to merge contacts 44 an 69, just go to
.../civicrm/contact/merge?reset=1&cid=44&oid=69Apr 25, 2008
Piotr Szotkowski
'In fact I want to ask if the following enhancement is possible – or feasible – could a table of first name synonyms be added, so that the system would treat "Mike" the same as "Michael", and "Sue" would match to "Susan"?'
I'll think about it; it's not as trivial as it seems to implement this in an universal and efficient way (or I can't think of one off the top of my head).
Also, on a more general note - would a soundex or metaphone approach work here?
Apr 26, 2008
David Strauss
Soundex and metaphones are mentioned in the early thesis draft I sent to Donald that describes my algorithm. I should send both of you the final version of my thesis.
In any case, phonetic algorithms work very well with my algorithm.
Apr 27, 2008
Peter Davis
One more vote for a table in which we can put the Mick=Mike=Michael(=Mikaere in New Zealand) and McDonald=MacDonald. Suspect this will need to be tables we can edit as requirements around the world will be very different.
And having just done a big import i guess I am about to find out more about the current deduping.
Apr 25, 2008
Piotr Szotkowski
'Neither the 2.0 nor the proposed 2.1 approach uses the algorithm I created as intended.'
That's true. Your algorithm is great (and I really admire its elegance) in the use case it addresses, and we hope to implement it 'properly' at some point in time. Unfortunately, at the moment it turns out the most popular use case is to run the dedupe scan across the whole database - either because it's a new functionality added to an already-existing set of contacts (i.e., somebody upgrades to CiviCRM with dedupe functionality and wants to cleanse their database) or, as Dan and Brian commented, a large import just happened and the results should be sorted out. (I see how your approach could be introduced into the import process.)
'My intended use is running the algorithm for a contact whenever that contact is created or changed. This avoids all the scaling issues because you're not running the dedupe engine for every contact all at once. Unless there's a rule change for matching duplicates, there's no reason to ever run the engine for the whole database at once. The score between two contacts can only change when one of the two contacts changes.'
I see how your approach could be based on a non-temporary table that would cache the results in the form (id1, id2, rule_id-or-rule_set_id, weight). The relevant cache entries would be deleted on rule changes and recalculated on contact changes. Does this make sense at all?
'And by "running the dedupe algorithm," I mean running it for only one contact record versus all others. This is a very fast operation.'
Agreed - still, we do need a mechanism to dedupe over the whole database. Also, as noted in the Extensions section, the above approach makes it possible to (I assume - quite quickly/cheaply) do the query for just one contact.
Apr 26, 2008
David Strauss
Well, then maybe the best approach is batched de-duping. Keep a record of all records that need duplicate checking and run the algorithm for 100 (or any other batch size) at a time. There could be an admin area for checking the status of duplicate checks. We would use the sort of table you mention to store the ongoing results.
Apr 28, 2008
xavier dutoit
Hi,
On the dedupe, I have one issue : it isn't possible to keep a low threshold AND flag nuples as "not duplicates" ?
Say that on a first step, Civi identify John doe john@doe.com and bill doe john@doe.com. I know that they are different persons. I would like to flag them as non dupes, so I won't have them that show up everytime I run the dedupe. The goal would be to lower the number of false positive, as when I run a dedupe and the first page is filled with the same contacts that I already have identified as non dupe, I feel like raising the threshold to get rid of them, albeit knowing that "new" real dupes won't be identified.
The goal being to do a first pass, merge the one that needs it to be, flag the others, lower the threshold or simplify the rules, and run a second pass.
Obviously, the second pass shouldn't display again the contacts that have been identified as non dupes.
Then we can keep a rather low threshold, but being able to raise it and do the multi-step approach after a big import or something if necessary.
Does it make sense ?
X+
Apr 30, 2008
Piotr Szotkowski
'I would like to flag them as non dupes, so I won't have them that show up every time I run the dedupe.'
This is a functionality we want to implement, yes.
Apr 29, 2008
Daragh O Brien
Hi,
Duplicate data is a classic information quality problem. The solution is fourfold:
Step 2 above avoids the scenario identified by Xavier... combining fields and conducting a mixed field match based on edit distances or other algorithms reduces the probability of duff matches. Thresholds to define what is a 'match' should be set. Step 3 basically calls for 3 levels of match threshold - NoMatch, HumanReview_Match, and Probable_Match (which may be human reviewed or may be automated.
Pre-matching data cleansing/standardisation is a useful first step to reduce variance and improve your 'odds' of matching correctly (and personally I wouldn't run a matching process that doesn't do this) .
For a more detailed discussion of some of this, check out http://blogs.informatica.com/dataquality/2007/01/matching_determinism_and_probab.html. I'm not suggestion all CiviCRM customers need to run Informatica... there are other options
One such option might be to not build from scratch what might be available as an OpenSource tool... http://www.infosolvetech.com (haven't used their data quality tool yet so this is not a recommendation but more of a suggestion- but will be kicking the tyres on it before long I'm sure) - perhaps there is someway to build a relationship there to provide enterprise scale matching/deduping on a SAAS basis?
A quick scan through sourceforge found me this: http://sourceforge.net/projects/dataquality/ , again I haven't used it and can't vouch for quality (no pun intended) or currently delivered functionality. At the very least, a simple capability to profile your data will shed light quickly on common issues that you might need to address by cleaning (perhaps with SQL statements or record by record manual tidying) to ensure better quality data. Another option is FERBI, also on SourceForge. Ferbi has slightly better documentation and more straightforward installation than the first tool I referenced. Note - both of these tools are desktop tools for analysis of data quality and are not drupal modules or similar, but they might provide a model for consideration or other options for achieving the goal.
Apr 30, 2008
Piotr Szotkowski
Thanks a lot, Daragh, for your through reply!
To address some of the raised issues/ideas:
The data standarisation is a big issue, one one hand quite important for the whole dedupe process, on the other - not so trivial to implement. Hopefully we'll tackle it for CiviCRM 2.2 in some way.
As for match distances, I was thinking about using Levenshtein distance for starters. Thanks for pointing out the other metrics!
We plan to address the multi-threshold approach by allowing creation of similar rule groups with different thresholds (for starters), and will see how this catches up.
Thanks a lot for the links (the Informatica one should drop the last 'b': http://blogs.informatica.com/dataquality/2007/01/matching_determinism_and_proba.html)! I'll make sure to check out the projects mentioned.
Apr 30, 2008
David Strauss
Another part of the paper on my algorithm describes how I really only intend for the current algorithm to provide an initial, small candidate set of duplicates. Items in that small set can have comparisons like Levenshtein distance run on them.