Phase One Feature Span
As with other parts of CiviCRM, the dedupe module will be implemented in phases. Depending on the speed of phase one development and the reliability of the module, CiviCRM 1.8 will ship with either phase one alone or phase one plus further features.
Phase one is set to implement the core of the dedupe module, which should be easily extensible in further phases.
Dedupe over the Current Database
The core application of the dedupe mechanism should be to find duplicate contacts in the current database. Other applications (dedupe on import, dedupe on contact creation, possibly others) should follow, but once the current database dedupe is in place, their first implementations might be based on this mechanism.
Dedupe on Import
Once the dedupe across the current database is implemented, the first approach to dedupe on import can be based on importing the contacts without any dedupe checking, and then running the database dedupe based on the newly-imported set.
In later work, dedupe on import can be made more or less transparent to the user, either by adding an optional dedupe step at the end of the import or by making the dedupe one of the interim import steps.
For the first stage we could assume that on import, every non-suspect gets imported as usual, while potential duplicates get their separate contacts created, but also get added to a 'potential duplicates' group; after import, the user could run the merging process for the current database with the source contacts limited just to the 'potential duplicates' (or 'freshly imported') group.
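As a rough illustration of this first-stage import flow, the Python sketch below imports every row as a separate contact and collects the ids of suspected duplicates in a set that stands in for the 'potential duplicates' group. The function names, the in-memory dictionary used as the database, and the email-only suspicion check are all assumptions made for the example, not part of the planned implementation.

```python
def same_email(row, existing):
    # Hypothetical suspicion check: identical, non-empty email addresses.
    return bool(row.get("email")) and row.get("email") == existing.get("email")


def import_contacts(database, rows, is_suspect=same_email):
    """Import every row as a new contact; return the ids flagged as potential duplicates."""
    potential_duplicates = set()
    next_id = max(database, default=0) + 1
    for row in rows:
        # Compare against contacts already stored, including rows imported earlier in this run.
        suspect = any(is_suspect(row, existing) for existing in database.values())
        database[next_id] = dict(row)  # suspects still get their own contact
        if suspect:
            potential_duplicates.add(next_id)
        next_id += 1
    return potential_duplicates


database = {1: {"name": "Chris Smith", "email": "chris@example.org"}}
rows = [
    {"name": "Dana Jones", "email": "dana@example.org"},
    {"name": "C. Smith", "email": "chris@example.org"},
]
group = import_contacts(database, rows)
```

After the import, the database dedupe could be run with its source contacts limited to the ids in `group`.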
Dedupe Mechanism
The dedupe algorithm will be based on taking a contact and finding its possible matches; this process can be (transparently) repeated to present the user with sets of possible duplicates by running the algorithm for every contact in a group or in the whole database.
The algorithm creates a symmetric, non-transitive relation between contacts. Symmetry means that if contact A is 'similar' to contact B, then contact B is 'similar' to contact A, with the same similarity level. Non-transitivity means that if contact A is 'similar' to contact B and contact B is 'similar' to contact C, contact A is not necessarily 'similar' to contact C.
The core of the algorithm is based on the Sparse Local People Clustering paper by David Strauss and on his sample PHP code implementing simple clustering over the CiviCRM database.
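A small Python sketch may make the symmetric but non-transitive behaviour concrete. The weights, the threshold, and the two-field contact are hypothetical values chosen for the example; the real rules are configurable.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    name: str
    email: str


# Hypothetical weights and threshold, chosen so that one matching field is enough.
NAME_WEIGHT = 5
EMAIL_WEIGHT = 5
THRESHOLD = 5


def similarity(x: Contact, y: Contact) -> int:
    """Sum the weights of the fields on which the two contacts agree."""
    score = 0
    if x.name == y.name:
        score += NAME_WEIGHT
    if x.email == y.email:
        score += EMAIL_WEIGHT
    return score


def similar(x: Contact, y: Contact) -> bool:
    return similarity(x, y) >= THRESHOLD


a = Contact("Chris Smith", "chris@example.org")
b = Contact("Chris Smith", "c.smith@example.org")
c = Contact("Christian Smithers", "c.smith@example.org")
```

Here A and B share a name and B and C share an email address, so `similar(a, b)` and `similar(b, c)` both hold, and each holds in either argument order with the same score. A and C share nothing, so `similar(a, c)` is false.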
Search Parameters
The basis of the search would be formed from the core parameters defined by David Strauss - name, email, telephone and other location parameters with proper weights. The user should be able to define the weights (and turn off a field by defining a weight of 0), as well as a threshold that has to be met to consider two contacts as 'similar.' The user should also be able to define whether the parameters should match fully, or just share the same prefix (of defined length).
The search parameters will specify which contact property should be matched, whether the match should be on the full string or on a prefix, and the weight of the rule. A threshold will also be provided; two contacts count as 'similar' only when the combined weight of their matching rules crosses the provided threshold.
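The rule-and-threshold scheme described above can be sketched in Python as follows. The `Rule` class, the field names, and the concrete weights are assumptions for illustration. The sketch also assumes that empty values never match and that 'crossing' the threshold means reaching or exceeding it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    field: str                        # contact property to compare
    weight: int                       # added to the total when the rule matches; 0 turns the rule off
    prefix_len: Optional[int] = None  # None = full-string match, otherwise length of the compared prefix


def rule_matches(rule: Rule, a: dict, b: dict) -> bool:
    va, vb = a.get(rule.field), b.get(rule.field)
    if not va or not vb:
        return False  # assumption: missing or empty values never match
    if rule.prefix_len is not None:
        va, vb = va[:rule.prefix_len], vb[:rule.prefix_len]
    return va == vb


def combined_weight(rules, a: dict, b: dict) -> int:
    return sum(r.weight for r in rules if r.weight > 0 and rule_matches(r, a, b))


def is_similar(rules, threshold: int, a: dict, b: dict) -> bool:
    return combined_weight(rules, a, b) >= threshold


rules = [
    Rule("first_name", weight=5, prefix_len=3),
    Rule("last_name", weight=7),
    Rule("email", weight=10),
]
chris = {"first_name": "Chris", "last_name": "Smith", "email": "chris@example.org"}
christian = {"first_name": "Christian", "last_name": "Smith", "email": "cs@example.org"}
```

With these values the first-name prefix rule and the last-name rule both match, giving a combined weight of 12, so the two contacts count as similar at a threshold of 12 but not at 13.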
Search Algorithm
The search algorithm would work as described in David Strauss's paper and similarly to the code he wrote; for every contact, a group of similar contacts would be formed. This algorithm, based on prefix similarities, is considered to be quite fast (given that the compared properties are indexed in the database schema) and it should be feasible to run it over the whole database either for every contact or for a subset of contacts.
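The following Python sketch shows one way the per-contact grouping could work. It is a simplification written for this document, not David Strauss's code. The per-rule dictionaries stand in for database indexes on the compared columns, and the rule tuples, field names, and weights are hypothetical.

```python
from collections import defaultdict


def find_similar_groups(contacts, rules, threshold):
    """contacts: {id: {field: value}}; rules: [(field, prefix_len or None, weight)].

    Returns {id: sorted list of ids whose combined weight reaches the threshold}.
    """
    # Build one lookup table per rule, keyed by the (possibly prefixed) value.
    indexes = []
    for field, prefix_len, weight in rules:
        index = defaultdict(set)
        for cid, contact in contacts.items():
            value = contact.get(field)
            if value:
                index[value[:prefix_len]].add(cid)  # value[:None] is the full string
        indexes.append((field, prefix_len, weight, index))

    groups = {}
    for cid, contact in contacts.items():
        scores = defaultdict(int)
        for field, prefix_len, weight, index in indexes:
            value = contact.get(field)
            if not value:
                continue
            # Only contacts sharing the key are touched, as with an indexed query.
            for other in index[value[:prefix_len]]:
                if other != cid:
                    scores[other] += weight
        groups[cid] = sorted(o for o, s in scores.items() if s >= threshold)
    return groups


contacts = {
    1: {"first_name": "Chris", "last_name": "Smith", "email": "chris@example.org"},
    2: {"first_name": "Christian", "last_name": "Smith", "email": "cs@example.org"},
    3: {"first_name": "Dana", "last_name": "Jones", "email": "dana@example.org"},
}
rules = [("first_name", 3, 5), ("last_name", None, 7), ("email", None, 10)]
groups = find_similar_groups(contacts, rules, threshold=12)
```

Restricting the outer loop to a subset of ids would give the "subset of contacts" variant without changing the lookup tables.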
Backend
The backend will be implemented as a set of CRM_Dedupe_* classes, together with a set of APIs that allow dedupe queries (and merges) to be run from outside of CiviCRM.
CRM Classes
So far, two backend classes are planned: CRM_Dedupe_Engine, which will contain the bulk of the contact-finding and contact-merging code, and CRM_Dedupe_Criteria, which will hold the requested similarity criteria. Other classes (for building the user interface) are to follow, but are not designed yet.
APIs
API functions for calling the dedupe code from outside of CiviCRM will be provided.
The initial plans cover an API function that takes a contact, an array of search rules, each described by four parameters (which table, which column, whether to match the full string or just the prefix, and the weight of the rule), and a threshold, and returns the matching contact ids. A blanket function that runs the above over a specified group or the whole database will also be provided.
The initial plans also cover a merge function that, for two contact ids, would specify which contact properties should be copied/added from the second contact to the first, whether this contact's relationships, activities, etc. should be 'moved' (reassigned) to the first contact, and whether the second contact should be deleted after the merge.
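To make the shape of such a merge call concrete, here is a minimal Python sketch over in-memory data. The function name, its parameters, and the dictionaries standing in for contacts and their related records are hypothetical and do not describe the planned API signature.

```python
def merge_contacts(contacts, records, main_id, other_id,
                   copy_fields=(), move_records=True, delete_other=True):
    """Merge contact other_id into contact main_id.

    contacts: {id: {field: value}}
    records:  list of dicts, each with a 'contact_id' key (relationships, activities, etc.)
    """
    main, other = contacts[main_id], contacts[other_id]

    # Copy only the properties the caller asked for.
    for field in copy_fields:
        if field in other:
            main[field] = other[field]

    # 'Move' related records by reassigning them to the main contact.
    if move_records:
        for record in records:
            if record["contact_id"] == other_id:
                record["contact_id"] = main_id

    if delete_other:
        del contacts[other_id]


contacts = {
    1: {"first_name": "Chris", "last_name": "Smith"},
    2: {"first_name": "Christian", "last_name": "Smithers", "gender": "Male"},
}
records = [
    {"type": "activity", "contact_id": 1},
    {"type": "contribution", "contact_id": 2},
]
merge_contacts(contacts, records, main_id=1, other_id=2, copy_fields=["gender"])
```

After the call, contact 1 keeps its own name, gains the gender value from contact 2, and owns both records, while contact 2 is removed.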
User Interface
Properties Selection Screen
The admin interface will have a properties selection screen that lets the user specify which properties should be taken into account when searching for duplicate contacts, whether each property should be matched in full or just by a prefix of a specified length, what weight each property rule should have, and what threshold the cumulative weight must cross for two contacts to be considered 'similar.'
This screen should be pre-populated with sane values (for our current default implementation, that would mean three rules: First Name should match in full, Last Name should match in full, and Email should match in full).
Admin Screen
Our current idea is to have a new action in the Administer CiviCRM screen, which leads to a screen with groups of similar contacts (the user should be able to specify whether to do the search based on all of the contacts in the database, or just find contacts similar to the ones in a certain group).
In every sub-part of the screen, two contacts can be chosen to be merged: one as the main contact (the one that will stay in the system) and the other as the contact that will be merged (and deleted). After selecting these two contacts, the user will be redirected to a contact-merging screen; after merging, they will be returned to this general merging screen.
The initial idea looks like the following:
| main | merged | name |
|---|---|---|
| ( ) | ( ) | Contact A |
| ( ) | ( ) | Contact B |
| ( ) | ( ) | Contact C |
| | | [merge] |

| main | merged | name |
|---|---|---|
| ( ) | ( ) | Contact D |
| ( ) | ( ) | Contact E |
| ( ) | ( ) | Contact F |
| | | [merge] |
Contact Merge Screen
For stage one, we would implement the merging of two contacts as a screen in the contact context. As hinted above, merging means that one of the contacts (the main contact) gets updated with data from the other one, which is deleted after the merge by default. The screen (in the context of the main contact) will consist of two columns of radio-selected data; the rows will contain the properties that differ between the two contacts, and for every such property the user can select which value the main contact ends up with. There are some tricky cases, such as email addresses, of which there can be several per location, but these can be solved with checkboxes.
| property | Contact A | Contact B |
|---|---|---|
| First Name | (•) Chris | ( ) Christian |
| Last Name | (•) Smith | ( ) Smithers |
| Gender | (•) Male | ( ) none |
| Date of Birth | (•) 1977 | ( ) 1979 |
| | [merge «] | [✓] delete |
The checkbox in the last row lets the user do the merge without deleting Contact B.
This merging screen can be either a separate page in the contact context, or an Ajax-retrieved page element for other screens.
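The logic behind the merge screen (show only the properties that differ, then apply the user's radio choices to the main contact) could be sketched in Python as below. The function names, the field list, and the 'main'/'other' choice encoding are assumptions for illustration. The email value is added only to show that identical properties produce no row.

```python
def merge_rows(main, other, fields):
    """Return (field, main value, other value) for every field on which the contacts differ."""
    rows = []
    for field in fields:
        a, b = main.get(field), other.get(field)
        if a != b:
            rows.append((field, a, b))
    return rows


def apply_choices(main, other, choices):
    """choices maps a field to 'main' or 'other'; return the resulting main contact."""
    merged = dict(main)
    for field, pick in choices.items():
        if pick == "other":
            merged[field] = other.get(field)
    return merged


fields = ["first_name", "last_name", "gender", "birth_year", "email"]
contact_a = {"first_name": "Chris", "last_name": "Smith", "gender": "Male",
             "birth_year": 1977, "email": "chris@example.org"}
contact_b = {"first_name": "Christian", "last_name": "Smithers",
             "birth_year": 1979, "email": "chris@example.org"}

rows = merge_rows(contact_a, contact_b, fields)
merged = apply_choices(contact_a, contact_b, {"first_name": "other", "last_name": "main"})
```

Multi-valued properties such as email addresses per location would need a different row type (checkboxes rather than a single choice), which this sketch does not cover.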
Default behavior will also merge all of the relationships/activities/contributions/etc. of the contacts by replacing the other contact's contact_id with the contact_id of the main contact. (Does the user need the option to see these related/transactional records prior to making the decision to transfer them to the main contact? dgg)
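As a sketch of the contact_id replacement described above, the Python example below uses an in-memory SQLite database. The table names (`activity`, `contribution`) and their columns are hypothetical stand-ins and do not reflect the actual CiviCRM schema. Table names are interpolated into the SQL, so they must come from a fixed, trusted list.

```python
import sqlite3


def reassign_records(conn, tables, main_id, other_id):
    """Point every row that references other_id at main_id; return the number of rows moved."""
    moved = 0
    with conn:  # one transaction: committed on success, rolled back on error
        for table in tables:
            cursor = conn.execute(
                f"UPDATE {table} SET contact_id = ? WHERE contact_id = ?",
                (main_id, other_id),
            )
            moved += cursor.rowcount
    return moved


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, contact_id INTEGER)")
conn.execute("CREATE TABLE contribution (id INTEGER PRIMARY KEY, contact_id INTEGER)")
conn.executemany("INSERT INTO activity (contact_id) VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO contribution (contact_id) VALUES (?)", [(2,), (3,)])

moved = reassign_records(conn, ["activity", "contribution"], main_id=1, other_id=2)
```

Running the updates in a single transaction means a failure part-way through leaves neither contact with a partial set of records.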
Places that Hook to this Interface
The idea is that the above interface can be reached from several places.
The group manager can have a link to the admin screen, pre-populated with the matches found for the contacts in that group.
A contact screen can have a 'possible matches' link that links to a similar screen, but with matches just of this contact.
The admin screen could embed the search properties screen so the user can quickly tweak the parameters of the duplicate search.
Also, the import wizard could embed the duplicate search screen as one of its steps, with the imported contacts being the ones for which the duplicates are found.
