Skip to end of metadata
Go to start of metadata

Introduction and the Old Approach

We want to revamp the base dedupe engine in CiviCRM 2.1. - the code used in CiviCRM 2.0,

Dedupe works based on a set of rules that say what does it mean that two contacts 'match' in the given rule's context; every rule has a weight; if the sum of weights crosses the set's threshold, these contacts are considered possible dupes of each other.

Three-rule example: (a) first_name equals - weight 5, (b) last_name equals - weight 7, (c) email equals - weight 10; (threshold) two contacts are dupes if the sum of weights is over 10 (i.e., at least two rules 'fired').

The dedupe functionality from 2.0 is built from the grounds up: first a mechanism that generates a query saying 'find me people with the same $field as contact of $id', then a mechanism that, for a given contact_id, runs all such queries (in the above three-rule example, three queries per contact_id) and lefts only these dupes that cross the defined threshold, and, finally, runs this whole set of queries for every single contact_id there is.

Not surprisingly, this does not scale too well - if the number of contacts is 10,000 and the number of rules is three, the above approach means deduping requires 30,000 SQL queries and then the results are compared on the PHP side to see whether the threshold is crossed.

The New Approach to Be Implemented in 2.1

After having some heavier installations try out dedupe, it turns out most (if not all) people using dedupe use it in quite the opposite fashion: the use is not based on retrieving possible dupes of a given contact, but rather on getting all the possible dupe pairs across the whole databse.

As such, we decided to approach dedupe from the other way around: for every dedupe rule there is (say, equal first_name with weight 5), find all contact_id pairs and associate them with the rule's weight. Repeat this for every rule (say, three times). Finally, from all the (id1, id2, weight) triples leave only these id1, id2 pairs that have the weight sum crossing the threshold. This basically means four queries total and no PHP work on the results.

A Proof-of-Concept Example

A query that returns contact_id pairs for first_name (weight 5):

A query that returns contact_id pairs for last_name (weight 7):

A query that returns contact_id pairs for email (weight 10):

Given that we know the above queries, instead of running them and trying to work on the results on the PHP side, we may easily create one temporary dedupe table with all the data still neatly packed in three columns (thanks to UNION on the selects). In pseudo-PHP-code, where $queries is an array with the above three strings:

Now, we can run a second query that filters only pairs with the sum of weights above the threshold (10 in the example):

This way, in only two queries, we have all the contact_id pairs that fulfill the three rules (and only these that cross the threshold). The temporary dedupe table is kept in memory and dropped as soon as the db connection finishes.

An example for the current trunk db seed:

Extensions to the Approach

Substring-Based Rules

Simply change the USING (last_name) condition to ON SUBSTR(c1.last_name, 1, 5) = SUBSTR(c2.last_name, 1, 5).

Per-Contact Limitation

Simply append the required contact_id to the rules (3 in this example):

Further Improvements

To limit the data needed to be handled, the rules could be checked in decreasing weight order and the SELECTs could consider only contacts that have a chance to 'still make it' - e.g., if a contact, during the previous rule checks, accumulated a sum o weights 5, the threshold is 10, and the remaining rules are for a total sum of 3, then there's no point in taking that contact under consideration.

I'm not sure whether the added complexity of such an approach wouldn't actually hamper the overall process (simple JOINs should be easily optimisable by the MySQL engine).

Labels:
  1. Apr 24, 2008

    Neither the 2.0 nor the proposed 2.1 approach uses the algorithm I created as intended. My intended use is running the algorithm for a contact whenever that contact is created or changed. This avoids all the scaling issues because you're not running the dedupe engine for every contact all at once. Unless there's a rule change for matching duplicates, there's no reason to ever run the engine for the whole database at once. The score between two contacts can only change when one of the two contacts changes.

    If the concern is giving normal users dedupe control, then we just don't. We can still run the dedupe algorithm whenever a contact changes, even if we just cache the high-scoring results for later use by someone managing duplicate consolidation.

    And by "running the dedupe algorithm," I mean running it for only one contact record versus all others. This is a very fast operation.

    1. Apr 24, 2008

      Just wanted to add my $0.02 to David's comment.  Since most of our data comes from importing large data files rather than single contact editing, there still needs to be a more efficient method of detecting duplicates than the one in 2.0.  Having a deduping method that only works one at a time would not be much help.  Personally I want to encourage this line of thinking proposed above as it is an important issue for us.

       In fact I want to ask if the following enhancement is possible -- or feasible -- could a table of first name synonyms be added, so that the system would treat "Mike" the same as "Michael", and "Sue" would match to "Susan"?  These are the kinds of differences that result in many hard to detect duplicates in my experience.

       Thanks!

      1. Apr 25, 2008

        I agree with Dan, contra David. While David's point is valid (the usefulness of on-the-fly dedupe checking based on predefined rules), most of the deduping I do is immediately post-import on a large set of records. For example, I may do a large import and then run a check to ensure there are no duplicate emails (which impacts CMS-user match, and thus is an important part of my initial import/setup/config process).

        While some basic on-the-fly checking would be useful, I could envision it may impact performance for the end user who is regularly editing records, so I wouldn't want it to be too aggressive (assuming another query is required upon data entry or record saving). Also, I think there will always be a need for running the dedupe process on the full database periodically for "less likely" dups (such as addresses).

        On a related subject --
        The mechanism for comparing and merging dups is very useful. It would be great if there were a way to manually initiate that process. As Dan pointed out, sometimes the dups are not easily found via an automated process (Sue Jones = Susan Jones), and yet the admin user may stumble upon them while using the system. It would be great if there were a way to select two users from search results and have the option to "merge records" from the activity dropdown -- taking advantage of the existing merge function in the dedupe tool. Because of the usefulness of that merge tool, I've found myself creating specific dedupe criteria to "catch" manually-found dups and be able to easily merge them (it's easy to manually merge when it's just contact info to copy over, but more difficult if there are memberships, events, contribs, activities, etc. that need to be merged).

        1. Apr 25, 2008

          'The mechanism for comparing and merging dups is very useful. It would be great if there were a way to manually initiate that process.'

          That's a good idea - adding some UI elements for this shouldn't be much of a problem.

          It's actually already possible to do this by hand - the merge screen's URL works for any contact_id pair. For example, if you want to merge contacts 44 an 69, just go to .../civicrm/contact/merge?reset=1&cid=44&oid=69

      2. Apr 25, 2008

        'In fact I want to ask if the following enhancement is possible – or feasible – could a table of first name synonyms be added, so that the system would treat "Mike" the same as "Michael", and "Sue" would match to "Susan"?'

        I'll think about it; it's not as trivial as it seems to implement this in an universal and efficient way (or I can't think of one off the top of my head).

        Also, on a more general note - would a soundex or metaphone approach work here?

        1. Apr 26, 2008

          Soundex and metaphones are mentioned in the early thesis draft I sent to Donald that describes my algorithm. I should send both of you the final version of my thesis.

          In any case, phonetic algorithms work very well with my algorithm.

      3. Apr 27, 2008

        One more vote for a table in which we can put the Mick=Mike=Michael(=Mikaere in New Zealand) and McDonald=MacDonald. Suspect this will need to be tables we can edit as requirements around the world will be very different.

        And having just done a big import i guess I am about to find out more about the current deduping. 

    2. Apr 25, 2008

      'Neither the 2.0 nor the proposed 2.1 approach uses the algorithm I created as intended.'

      That's true. Your algorithm is great (and I really admire its elegance) in the use case it addresses, and we hope to implement it 'properly' at some point in time. Unfortunately, at the moment it turns out the most popular use case is to run the dedupe scan across the whole database - either because it's a new functionality added to an already-existing set of contacts (i.e., somebody upgrades to CiviCRM with dedupe functionality and wants to cleanse their database) or, as Dan and Brian commented, a large import just happened and the results should be sorted out. (I see how your approach could be introduced into the import process.)

      'My intended use is running the algorithm for a contact whenever that contact is created or changed. This avoids all the scaling issues because you're not running the dedupe engine for every contact all at once. Unless there's a rule change for matching duplicates, there's no reason to ever run the engine for the whole database at once. The score between two contacts can only change when one of the two contacts changes.'

      I see how your approach could be based on a non-temporary table that would cache the results in the form (id1, id2, rule_id-or-rule_set_id, weight). The relevant cache entries would be deleted on rule changes and recalculated on contact changes. Does this make sense at all?

      'And by "running the dedupe algorithm," I mean running it for only one contact record versus all others. This is a very fast operation.'

      Agreed - still, we do need a mechanism to dedupe over the whole database. Also, as noted in the Extensions section, the above approach makes it possible to (I assume - quite quickly/cheaply) do the query for just one contact.

      1. Apr 26, 2008

        Well, then maybe the best approach is batched de-duping. Keep a record of all records that need duplicate checking and run the algorithm for 100 (or any other batch size) at a time. There could be an admin area for checking the status of duplicate checks. We would use the sort of table you mention to store the ongoing results.

  2. Apr 28, 2008

    Hi,

    On the dedupe, I have one issue : it isn't possible to keep a low threshold AND flag nuples as "not duplicates" ?

    Say that on a first step, Civi identify John doe john@doe.com and bill doe john@doe.com. I know that they are different persons. I would like to flag them as non dupes, so I won't have them that show up everytime I run the dedupe. The goal would be to lower the number of false positive, as when I run a dedupe and the first page is filled with the same contacts that I already have identified as non dupe, I feel like raising the threshold to get rid of them, albeit knowing that "new" real dupes won't be identified.

    The goal being to do a first pass, merge the one that needs it to be, flag the others, lower the threshold or simplify the rules, and run a second pass.

    Obviously, the second pass shouldn't display again the  contacts that have been identified as non dupes.

    Then we can keep a rather low threshold, but being able to raise it and do the multi-step approach after a big import or something if necessary.

    Does it make sense ?

    X+ 

    1. Apr 30, 2008

      'I would like to flag them as non dupes, so I won't have them that show up every time I run the dedupe.'

      This is a functionality we want to implement, yes. (smile)

  3. Apr 29, 2008

    Hi,

    Duplicate data is a classic information quality problem. The solution is fourfold:

    1. Standardise your data prematch to reduce variation (Mick/Mike/Michael, upper vs lower case) - this applies to both name and address data, apply soundex or nysiis keys, group you data around a common key (country/surname), create clusters to match within/across. Inevitably, standardisation relies on 'local' and localised reference dictionaries - each user group would need to tweak and tune these. Indeed you often need to have some customised code in place to help parse & standardise address data elements (8 Foxfield St. John is my mother's street address... it is parsed/standardised variously as "8 Foxfield Street John", "8 St. JOhn's Foxfield", "8 Foxfield Saint John (the correct version)
    2. Match your data based on matching algorithms/thresholds. My experience has shown that a combination of approaches averaged out works best... matching on standard keys like Soundex will NOT give a reliable result, you will need to include Edit Distance, Jaro, Hamming Distance and other similar algorithms depending on the complexity of the data. Edit Distance is useful in distinguishing matches where strings on a character for character match are different but are close enough to be matched... e.g. "Daragh/Darragh" (A) requires 1 edit step to match while "Daragh/Darren" (B) requires more steps and "Daragh/Kevin" (C) requires a total change of the string... A scores higher than B scores higher than C on an Edit distance algorithm. Combining these match scores based on a relative weighting of each improves the total "Weighted average" of your matching which should help reduce the level of false positives in your data. Including fields like date of birth (which US airlines are finally doing on their terrorist watch lists) further helps eliminate false positives in clusters as you can have a high edit distance score on "date-of-birth" (low match probability) which could reduce the overall weighted average of your match.
    3. Human review of match clusters is important - automated processes can 'cluster' and rank the probablity of a match, but scores can be skewed and unless you have a pre-test capability to preview clusters and the quality of matches within clusters, you inevitably will mis-match records. Human preview can be either "preview thresholds and approve automated threshold application (which matches all records above X% probability of match)" or it can be a 100% manual review of each cluster.
    4. Ensure processes for creation of records can prevent duplicates being created post-deployment, either through regular housekeeping or flagging that the record may be a dupe. The earlier suggestion about having the capability to merge records 'on the fly' when an Admin is told/learns that they are duplicates would allow for less 'heavy-lifting' in terms of maintaining a single view of member (I HATE the word 'User' when applied to the people who I serve in my organisation... makes me sound like a drug dealer!).

    Step 2 above avoids the scenario identified by Xavier... combining fields and conducting a mixed field match based on edit distances or other algorithms reduces the probability of duff matches. Thresholds to define what is a 'match' should be set. Step 3 basically calls for 3 levels of match threshold - NoMatch, HumanReview_Match, and Probable_Match (which may be human reviewed or may be automated.

    Pre-matching data cleansing/standardisation is a useful first step to reduce variance and improve your 'odds' of matching correctly (and personally I wouldn't run a matching process that doesn't do this) .

    For a more detailed discussion of some of this, check out http://blogs.informatica.com/dataquality/2007/01/matching_determinism_and_probab.html. I'm not suggestion all CiviCRM customers need to run Informatica... there are other options

    One such option might be to not build from scratch what might be available as an OpenSource tool... http://www.infosolvetech.com (haven't used their data quality tool yet so this is not a recommendation but more of a suggestion- but will be kicking the tyres on it before long I'm sure) - perhaps there is someway to build a relationship there to provide enterprise scale matching/deduping on a SAAS basis?

    A quick scan through sourceforge found me this: http://sourceforge.net/projects/dataquality/ , again I haven't used it and can't vouch for quality (no pun intended) or currently delivered functionality. At the very least, a simple capability to profile your data will shed light quickly on common issues that you might need to address by cleaning (perhaps with SQL statements or record by record manual tidying) to ensure better quality data. Another option is FERBI, also on SourceForge. Ferbi has slightly better documentation and more straightforward installation than the first tool I referenced. Note - both of these tools are desktop tools for analysis of data quality and are not drupal modules or similar, but they might provide a model for consideration or other options for achieving the goal.

    1. Apr 30, 2008

      Thanks a lot, Daragh, for your through reply!

      To address some of the raised issues/ideas:

      The data standarisation is a big issue, one one hand quite important for the whole dedupe process, on the other - not so trivial to implement. Hopefully we'll tackle it for CiviCRM 2.2 in some way.

      As for match distances, I was thinking about using Levenshtein distance for starters. Thanks for pointing out the other metrics!

      We plan to address the multi-threshold approach by allowing creation of similar rule groups with different thresholds (for starters), and will see how this catches up.

      Thanks a lot for the links (the Informatica one should drop the last 'b': http://blogs.informatica.com/dataquality/2007/01/matching_determinism_and_proba.html)! I'll make sure to check out the projects mentioned.

      1. Apr 30, 2008

        Another part of the paper on my algorithm describes how I really only intend for the current algorithm to provide an initial, small candidate set of duplicates. Items in that small set can have comparisons like Levenshtein distance run on them.


Creative Commons License
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-Share Alike 3.0 United States Licence.