(Contributed by Mark Batten-Carew, Note that some of the issues below have been resolved in v1.5)
CiviCRM used with a Large Multi-Group Data Set
One of the applications we have implemented using CiviCRM is a large Canada-wide political database. In our specific case, this involves approximately 525K source records (a subset of the 22 million voting citizens known to Elections Canada) spread across 308 groups (ridings, the smallest political division in Canada).
Roles and Permission
In attempting to work with a dataset this large, we ran into a number of problems with CiviCRM. Part of the design of this system is that we needed to be able to restrict access to each riding's data individually, so each riding is represented by a group, and each riding-group has its own role. That way any user can be assigned whichever ridings they are allowed access to individually.
So the first problem we ran into was the Access Control page. With 308 extra groups (which added 716 rows to the access control page since each group has one View and one Edit row) crossed with 308 extra roles (one per group), the Access Control page was attempting to download and display a matix of over 210,000 tick boxes. With tweaking the PHP time limits, etc this page was finally able to be displayed after about 30 minutes of processing time, but it was wasted effort, because the attempt to submit the form failed due to trying to submit too long a string of data (the entire list of ticked and unticked boxes) back to the server.
The solution to this problem was to add group and role inclusion and exclusion parameters. That is, we defined four regular expressions, one each for inclusion/exclusion of groups/roles, and we restrict the list of groups and list of roles displayed on the Access Control page to those which fit the inclusion/exclusion criteria.
There are two implications of this solution. One is that the role-group permissions for all our groups/roles cannot be set via any GUI, so those settings have to be done manually via mySQL. This is OK because we want a one-one mapping and that never changes. The second issue is that the user role assignment screen is now over 300 entries long. This is on the edge of being unacceptable, but is still usable. The inclusion/exclusion solution does not work here because user permission must be modifiable by distributed senior users on a daily basis, not just by the central system developers. (In a newer project, each of the 308 ridings has up to 16 subgroups, and this solution does NOT work because the user-role list becomes almost 5000 entries long.)
To import the 525K records, we had to tweak the system as follows:
- Turn off any duplicate checking. The current duplicate checking mechanism is way too slow. My colleague said that it seems CiviCRM is processing one record at a time, checking for duplication, rather than batching hundreds or thousands at a time and using database set matching to find duplicate records.
- Import the records 25K at a time.
- To do (b), the PHP memory limit had to be increased to 585Mb. memory leak issue has been addressed in v1.5
- To do (b), the time limit had to be increased to 20 minutes. we plan to address this in v1.6
- To do (b), the system memory had to be increased from 1Gb to 2Gb. I believe it was discovered that system memory use was not configured correctly, so this step may not have been necessary, or was only useful after some system tweaks that I don't know the details of.
There is an outstanding problem that with CiviCRM 1.3 and 1.4, when I click on Advanced Search, the system first does a search for ALL records, displaying the beginning of that list, then allows me to select criteria to restrict the result set. This initial search takes a long time (especially if a potential customer is standing over you), probably 60 seconds or more. Switching to Advanced Search should not do an automatic retrieval. fixed in v1.5
Search first implementation of the below capability is included in v1.5
In general, the Search capabilities are not nearly adequate for dealing with a set of data this large. It should be possible to construct a general boolean _expression, using any standard or custom field, joined to other expressions with any combination of AND, OR, and NOT. The user interface I have implemented a few times for previous (non-Drupal) versions of this system is as follows:
Provide as many of the following lines as the user needs, adding more when they are used up:
( NOT <field-name> <operator> <value> ) AND/OR
where the parentheses are independently optional, the NOT is optional, <field-name> is any standard or custom field, <operator> is a list of appropriate operations dependent on the type of <field-name>, <value> is either a text field or a drop list of appropriate values, depending on <field-type> and <operator>, and finally, one of AND/OR is required.
With a series of the above lines, the user can build up any _expression they want. New empty lines are added as necessary.