Deduping Contacts
(summary of conference call with Bob Schmitt)
Minimizing/eliminating duplicate contact records in a CRM database is important to many organizations. Deduping functionality may be needed in several workflows:
- Importing a batch of contacts from a legacy data store or other source
- Manual data entry: add or update of contact records
- Periodic overall 'cleanup' of data
- Output of contacts for batch communications (bulk mail, email broadcast...)
Configuring Dedupe Processing
- Matching Criteria - Records may be deduped using various combinations of properties. We should consider the following for possible dupe matching:
- first name, last name
- hash of address (limited address fields and reduce size to avoid variation)
- legacy id (e.g. a unique identifier from another application)
- internal generated database id (e.g. the CRM contact_id) - which had been exported and then reimported
- email address (probably use primary email)
NOTE: ebase currently allows only one choice (with exception of name and address hash which can be combined). We would probably allow mix and match (i.e. user can configure deduping operations to use any combination of criteria).
- Tolerance - A variable number of characters may be evaluated when checking for matches (e.g. use the first 10 bytes of the street_address). This may need to be configurable as some organizations want loose checking (smaller number of characters compared), while others want tighter constraints. Loose checking results in more false positives, but less dupes get through.
- Merging - We should be able to "merge" 2 contact records. This would include moving related artifacts (Contact_actions, Group membership, Relationships) to the "master" record. Does this need to be configurable? Merging might also support selective updating of fields from dupe to designated "master" (i.e. move address from dupe to master). Perhaps we do this only if master record field is empty? Another merge option might be to 'overwrite with newer data-Y/N' (e.g. last updated record 'wins').
- Disabling Dedupe - Probably need a configure option which turns deduping off ?
Deduping during Imports
- Need to dedupe the incoming data set prior to import (based on configured match criteria)
- For this step, probably give binary choice of "import/accept" or "reject" possible duplicates. If "reject" - allow them to export a file which contains the duplicate suspects so they can examine, cull and re-import as needed.
- Deduping incoming records against DB. Need to decide whether to do this inline - or as a separate process (simpler UI). If inline, would provide interface for user to decide on each case (reject incoming, accept-not a dupe, overwrite existing...). However, if too many suspects - would need to either accept all or reject all (w/ export) - as above. If not inline - how would they undo a large number of real dupes?? Rollback?
Dedupe checking on data entry (add/edit contact)
- Allow user to check for duplicates at various points in the entry screen (don't force them to complete everything before they can check if they're entering a dupe record).
- Initial approach probably will be adding 'check for duplicates' link/button after Name fields and again after Location to prompt user that they can do this at any point.
- If we find potential dupe, allow the user to pick the dupe record (takes them to View Contact for that record). Else, mark the suspect(s) as 'not a duplicate' relative to the current record. This gets saved in the session so user isn't shown these specific suspects again (e.g. during 'Save').
- Could popup repeated addresses (think auto completion) - altho this seems like a different piece of functionality.
- Probably will ALWAYS run the dupe check on SAVE (based on dedupe config rules).
Standalone Dedupe Checking
- Allows users to check for dupes in entire database (domain), and possibly within a selected subset (e.g. a Group or maybe even Search Result set).
- We should have a way to persist the marking of records that are false matches - so they don't show up repeatedly.
Deduping on Batch Communications
- Should have option to use dedupe logic against a postal mailing or emailing. Probably would be default behavior for broadcast emails. For postal mail, they may or may not want multiple mailings to same address (i.e. two roommates both want a newsletter).
Issues
- Should we normalize address elements like 'ST', 'RD', etc. to make dupe checking for accurate? Bob (and one other commenter?) said some addressing usages want more formal output (Street vs. ST). Perhaps we encode some of these conversions into the dedupe logic (i.e. when we create the address 'hash') - but don't transform the stored address values.
- How about normalizing punctuation usage. May be tricky since different countries' postal services have different requirements.
Soundex Resources
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
Soundex algorithm explanation with code (no PHP)
Soundex PHP function: http://docs.php.net/en/function.soundex.html

4 Comments
Hide/Show CommentsFeb 25, 2005
John Springer
This is a good start.
For batch imports, my preference is to be able to import and then dedup as a second step. When it's done in-line, it creates a very time-consuming process that's not easily interrupted.
One system I've seen that is surprisingly pretty good is the otherwise-awful FEC report program. It puts up the two records in two columns with arrows by each field to copy data from one record to the other, and a copy-all arrow at the bottom.
I also think standardizing on address elements is necessary, because we should make it easy to do CASS on records, and that's what we're going to get back. We need to fundamentally store addresses in postal std format, and then have "nice" format alternatives for output.
I'm not sure I'd use another dedup on a batch mailing. Should cover that by using householding for batch mailings. (It is possible to automate householding with fairly good accuracy.)
For on-line entry of new records, I've done some systems that work like this:
Click New to enter a new person; you get a screen with just name fields on it. Put in name and click enter.
If there are matching names already in db, then next screen gives you a list of them to choose from, or another New button. If there are no matching names, you go on to put the rest of the data in.
Apr 25, 2005
Anonymous
names work ok, unless
you have two people associated with a record as in joint members
you have different spellings of a name like "chris" and "kris"
I think you might need to look at a soundex using the type of screen described above.
If not you may want to do an email lookup and or a zip, lastname name or firstname as a possible way of finding duplicates before you enter a new person.
It just seems like a soundex would catch more.
Lisa Smith
This isn't perfect either
Mar 15, 2006
Mark Batten-Carew
A couple of years ago I implemented a deduping system for a 660K contact database. We also found that doing deduping during import slowed down import too much, so import only eliminated 100% matching records (to guard against re-importing a file or two files with lots of exact overlap).
To detect dups, we enabled users to
1) Use advanced search to select the set of records to be checked for dups
2) In addition to #1, select a separate set of fields which define a dup record. For our purposes, we generally recommended First Name, Last Name, and ONE of Phone Number, Postal Code, or Email. So a common set was Postal Code, Last Name, First Name. The dedup fields also defined the search order.
3) The user would tick the Dedup Search tick box to denote that any records for which the dudep fields were unique would NOT be displayed.
The result of this search would be a set of records, sorted by the dup fields, where it was guaranteed that there would be 2 or more of each of the dup field tuples (ie, two "Smith, Mary" followed by four "Smythe, Bruce")
To reconcile dups, the user could tick at two records, to display them vertically side-by-side. Initially every cell of the first column would be highlighted, denoting that its value would be kept, but for any row the user could click on the second column to specify that that value was better. Ultimately, when the user clicked Merge, all the highlighted values, whether from the first or second column would be used to overwrite the first record, which would be saved, and the second record would be deleted. The only missing feature was the ability to edit one of the two values and keep the edited value instead of either of the original values.
People really liked this implementation. It quickly found the records needing deduping and made the reconciliation very intuitive.
Hope this helps,
Mark
Jun 09, 2006
Walt Daniels
A few comments based mostly on Ebase Classic;
First name is not very useful, e.g Tom vs Thomas - you don't know which is in the database.
Soundex might be good (no experience) because we do see lots of spelling errors on data entered from unreadable handwriting.
Both email and phone numbers are useful. If you get a hit it is almost surely a dup, but both are prone to not being up-to-date.
For phones (and maybe email) you need to try all of the ones you have. For instance we have day, night and cell phone numbers. A common error is putting these in the wrong field so you are never sure which one to compare on.
Keeping historical addresses, emails and phone numbers might be helpful in many cases.
A useful speedup is to have preconcatenated strings (dupfield) in the database so you don't have to do joins or other slow operations at data entry time.
It is very important to bias your searches for possible duplicates to not miss any rather than worrying about showing too many.
If possible show the possible dups as you tab from field to field on data entry so that updates, which are more common than new for us, can be found early in the typing and just accepted. This means that tab order is important, e.g. last name, then zip, then street. Autofill of city/state from zip is important, but watchout for multiple cities with same zip and non-postoffice preferred city names.
I have big problems with the USPS uppercase standard addresses. The RD, ST, etc can be fixed easily but the name part is not necessarily reversible (similar problems with names, some of which appear as road names, e.g Martin Luther King Blvd.)
Part of making deduping easy is having very strict data entry standards. Of course this does not help for import from outside sources so the preferred dedup rules probably need to be selectable.
It is best to apply several independent rules before accepting a record, e.g. one based on address and one based on email.
Keeping donations with people is important, so aggressive matching is important. Best of luck on one I found only by accident, lady got married, changed name and moved, but retained her email.