Skip to end of metadata
Go to start of metadata

This is an attempt to push the discussion at http://civicrm.org/node/674 to a further level of detail.

This is intended to specify Phase I functionality and its implementation approach. A Phase II section is included as a catch-all for ideas that might be looked at later.

Phase I

Objective

To create a logging system that will provide a fairly complete record of all data changes: who made them, when they were made, and what the changes were. This logging system is intended to be the basis for the future development of user facing functionality.

Exclusions

When data structures change, eg dropping a custom field or dropping a group of custom fields, data in the eliminated structures will not be preserved.

Implementation

  1. The following tables will be dropped:
    1. civicrm_log
  2. A new page will be added Administer > Configure > Global Settings > Logging, which will allow users to enable and disable logging.
    1. When logging is disabled, the triggers for logging will be dropped.
    2. When the user requests to enable logging, a check will be made to ensure that the MySQL version and database user permissions support creating triggers. If these are not appropriate, an error message will be provided to users with instructions to upgrade, as well as the need to ensure permission grants are properly changed after a 5.0 -> 5.1 upgrade. When adequate db privileges exist, the gencode produced code to modify the database structure will be called by this page.
    3. When logging is enabled, if log tables exist they will be truncated before their base table has triggers added to it. The user will be required to approve a warning message before this action is carried out.
  3. After the database connection has been initialized, if logging is enabled, civicrm_initialize will set a MySQL variable to save the current UserID for later use in triggers set up for all loggable events, eg:
    1. $dao->query(sprintf("set @current_user = %d", $session->get('userID')));
  4. The gencode program that reads the database schema and creates the DAO layer and mysql to create the CiviCRM database will be modified to support logging in the following ways:
    1. The existing <log> tag for tables in the XML schema will be repurposed to support the new more extensive logging proposed here.
    2. Logging will be enabled for the following CiviCRM tables (generally, ones that contain contact relationship information and with records that can be edited and/or deleted through the UI), which will be indicated by adding <log>true</log> to their XML schema definitions:

      civicrm_activity

      civicrm_activity_assignment

      civicrm_activity_target

      civicrm_address

      civicrm_case

      civicrm_case_activity

      civicrm_case_contact

      civicrm_contact

      civicrm_contribution

      civicrm_contribution_page

      civicrm_contribution_product

      civicrm_contribution_recur

      civicrm_contribution_soft

      civicrm_contribution_type

      civicrm_contribution_widget

      civicrm_currency

      civicrm_custom_field

      civicrm_custom_group

      civicrm_discount

      civicrm_email

      civicrm_entity_file

      civicrm_entity_tag

      civicrm_event

      civicrm_file

      civicrm_financial_trxn

      civicrm_grant

      civicrm_group

      civicrm_group_contact

      civicrm_group_nesting

      civicrm_group_organization

      civicrm_im

      civicrm_line_item

      civicrm_loc_block

      civicrm_location_type

      civicrm_membership

      civicrm_membership_block

      civicrm_membership_payment

      civicrm_membership_status

      civicrm_membership_type

      civicrm_note

      civicrm_option_group

      civicrm_option_value

      civicrm_participant

      civicrm_participant_payment

      civicrm_participant_status_type

      civicrm_pcp

      civicrm_pcp_block

      civicrm_phone

      civicrm_pledge

      civicrm_pledge_block

      civicrm_pledge_payment

      civicrm_preferences

      civicrm_preferences_date

      civicrm_premiums

      civicrm_premiums_product

      civicrm_price_field

      civicrm_price_set

      civicrm_price_set_entity

      civicrm_product

      civicrm_project

      civicrm_relationship

      civicrm_relationship_type

      civicrm_subscription_history

      civicrm_tag

      civicrm_task

      civicrm_task_status

      civicrm_uf_field

      civicrm_uf_group

      civicrm_uf_join

      civicrm_uf_match

      civicrm_value_*

    3. Tables with logging enabled (called base tables) will have an additional log table created for them, eg civicrm_contact will have a civicrm_contact_log table created to store the log of inserts, updates and deletes to its records. Each log table will contain all the fields in its base table, and the following additional fields:
      `log_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Unique Contact Log ID',
      `log_ts` TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'Timestamp when Contact Log record created',
      `log_contact` INT(10) unsigned NOT NULL COMMENT 'ID of Contact who acted to create the Contact Log record',
      `log_action` ENUM ('Initialization', 'Insert', 'Update', 'Delete') NOT NULL COMMENT 'Action that created Contact Log record'
    4. If data exists in a base table when logging for it is enabled, corresponding records will be inserted into the log table with log_action set to 'Initialization'.
    5. For performance reasons, indexes will kept to a minimum on log tables. Indexes will be selectively added as functionality to use the log tables is developed. For example, the following indexes exist on civicrm_contact: PRIMARY KEY (`id`),
        UNIQUE KEY `UI_external_identifier` (`external_identifier`),
        KEY `index_contact_type` (`contact_type`),
        KEY `index_contact_sub_type` (`contact_sub_type`),
        KEY `index_sort_name` (`sort_name`),
        KEY `index_preferred_communication_method` (`preferred_communication_method`),
        KEY `index_hash` (`hash`),
        KEY `index_api_key` (`api_key`),
        KEY `index_first_name` (`first_name`),
        KEY `index_last_name` (`last_name`),
        KEY `UI_prefix` (`prefix_id`),
        KEY `UI_suffix` (`suffix_id`),
        KEY `UI_gender` (`gender_id`),
        KEY `index_household_name` (`household_name`),
        KEY `index_organization_name` (`organization_name`),
        KEY `FK_civicrm_contact_mail_to_household_id` (`mail_to_household_id`),
        KEY `FK_civicrm_contact_primary_contact_id` (`primary_contact_id`),
        KEY `FK_civicrm_contact_employer_id` (`employer_id`). The corresponding log table will have the following indexes (other log tables will only have the first four indexes): PRIMARY KEY (`id`, `log_id`),
        KEY `index_id` (`id`),
        KEY `index_log_id` (`log_id`),
        KEY `index_log_ts` (`log_ts`),
        KEY `index_contact_type` (`contact_type`),
        KEY `index_contact_sub_type` (`contact_sub_type`),
        KEY `index_sort_name` (`sort_name`),
        KEY `index_household_name` (`household_name`),
        KEY `index_organization_name` (`organization_name`).
    6. Constraints on base tables will not be retained for log tables to improve performance.
    7. Here are samples of the three triggers that will be created for each log table:
      1. CREATE DEFINER='civicrm_db_user'@'localhost'
        TRIGGER civicrm_contact_after_insert
        AFTER INSERT ON civicrm_contact FOR EACH ROW
        INSERT INTO civicrm_contact_log
        (log_contact, log_action, `id` , `contact_type` , `contact_sub_type`,`do_not_email` ,  `do_not_phone`,  `do_not_mail` ,  `do_not_sms` ,  `do_not_trade` ,  `is_opt_out`,  `legal_identifier`,  `external_identifier`,  `sort_name`,  `display_name` ,  `nick_name`,  `legal_name`,  `home_URL` ,  `image_URL` ,  `preferred_communication_method` ,  `preferred_mail_format` ,  `hash`,  `api_key`,  `source` ,  `first_name` ,  `middle_name` ,  `last_name` ,  `prefix_id` ,  `suffix_id` ,  `email_greeting_id` ,  `email_greeting_custom` ,  `email_greeting_display` ,  `postal_greeting_id`,  `postal_greeting_custom` ,  `postal_greeting_display`,  `addressee_id` ,  `addressee_custom`,  `addressee_display` ,  `job_title` ,  `gender_id` ,  `birth_date` ,  `is_deceased` ,  `deceased_date` ,  `mail_to_household_id`,  `household_name`,   `primary_contact_id`,  `organization_name` ,   `sic_code` ,  `user_unique_id` ,  `employer_id` )
        VALUES (@current_user, 'INSERT', NEW.`id`, NEW.`contact_type`, NEW.`contact_sub_type`, NEW.`do_not_email`, NEW.`do_not_phone`, NEW.`do_not_mail`, NEW.`do_not_sms`, NEW.`do_not_trade`, NEW.`is_opt_out`, NEW.`legal_identifier`, NEW.`external_identifier`, NEW.`sort_name`, NEW.`display_name`, NEW.`nick_name`, NEW.`legal_name`, NEW.`home_URL`, NEW.`image_URL`, NEW.`preferred_communication_method`, NEW.`preferred_mail_format`, NEW.`hash`, NEW.`api_key`, NEW.`source`, NEW.`first_name`, NEW.`middle_name`, NEW.`last_name`, NEW.`prefix_id`, NEW.`suffix_id`, NEW.`email_greeting_id`, NEW.`email_greeting_custom`, NEW.`email_greeting_display`, NEW.`postal_greeting_id`, NEW.`postal_greeting_custom`, NEW.`postal_greeting_display`, NEW.`addressee_id`, NEW.`addressee_custom`, NEW.`addressee_display`, NEW.`job_title`, NEW.`gender_id`, NEW.`birth_date`, NEW.`is_deceased`, NEW.`deceased_date`, NEW.`mail_to_household_id`, NEW.`household_name`, NEW.`primary_contact_id`, NEW.`organization_name`, NEW.`sic_code`, NEW.`user_unique_id`, NEW.`employer_id`) ;
      2. CREATE DEFINER='civicrm_db_user'@'localhost'
        TRIGGER civicrm_contact_after_update
        AFTER UPDATE ON civicrm_contact FOR EACH ROW
        INSERT INTO civicrm_contact_log
        (log_contact, log_action, `id` , `contact_type` , `contact_sub_type`,`do_not_email` ,  `do_not_phone`,  `do_not_mail` ,  `do_not_sms` ,  `do_not_trade` ,  `is_opt_out`,  `legal_identifier`,  `external_identifier`,  `sort_name`,  `display_name` ,  `nick_name`,  `legal_name`,  `home_URL` ,  `image_URL` ,  `preferred_communication_method` ,  `preferred_mail_format` ,  `hash`,  `api_key`,  `source` ,  `first_name` ,  `middle_name` ,  `last_name` ,  `prefix_id` ,  `suffix_id` ,  `email_greeting_id` ,  `email_greeting_custom` ,  `email_greeting_display` ,  `postal_greeting_id`,  `postal_greeting_custom` ,  `postal_greeting_display`,  `addressee_id` ,  `addressee_custom`,  `addressee_display` ,  `job_title` ,  `gender_id` ,  `birth_date` ,  `is_deceased` ,  `deceased_date` ,  `mail_to_household_id`,  `household_name`,   `primary_contact_id`,  `organization_name` ,   `sic_code` ,  `user_unique_id` ,  `employer_id` )
        VALUES (@current_user, 'INSERT', NEW.`id`, NEW.`contact_type`, NEW.`contact_sub_type`, NEW.`do_not_email`, NEW.`do_not_phone`, NEW.`do_not_mail`, NEW.`do_not_sms`, NEW.`do_not_trade`, NEW.`is_opt_out`, NEW.`legal_identifier`, NEW.`external_identifier`, NEW.`sort_name`, NEW.`display_name`, NEW.`nick_name`, NEW.`legal_name`, NEW.`home_URL`, NEW.`image_URL`, NEW.`preferred_communication_method`, NEW.`preferred_mail_format`, NEW.`hash`, NEW.`api_key`, NEW.`source`, NEW.`first_name`, NEW.`middle_name`, NEW.`last_name`, NEW.`prefix_id`, NEW.`suffix_id`, NEW.`email_greeting_id`, NEW.`email_greeting_custom`, NEW.`email_greeting_display`, NEW.`postal_greeting_id`, NEW.`postal_greeting_custom`, NEW.`postal_greeting_display`, NEW.`addressee_id`, NEW.`addressee_custom`, NEW.`addressee_display`, NEW.`job_title`, NEW.`gender_id`, NEW.`birth_date`, NEW.`is_deceased`, NEW.`deceased_date`, NEW.`mail_to_household_id`, NEW.`household_name`, NEW.`primary_contact_id`, NEW.`organization_name`, NEW.`sic_code`, NEW.`user_unique_id`, NEW.`employer_id`) ;
      3. CREATE DEFINER='civicrm_db_user'@'localhost'
        TRIGGER civicrm_contact_after_delete
        AFTER DELETE ON civicrm_contact FOR EACH ROW
        INSERT INTO civicrm_contact_log
        (log_contact, log_action, `id` , `contact_type` , `contact_sub_type`,`do_not_email` ,  `do_not_phone`,  `do_not_mail` ,  `do_not_sms` ,  `do_not_trade` ,  `is_opt_out`,  `legal_identifier`,  `external_identifier`,  `sort_name`,  `display_name` ,  `nick_name`,  `legal_name`,  `home_URL` ,  `image_URL` ,  `preferred_communication_method` ,  `preferred_mail_format` ,  `hash`,  `api_key`,  `source` ,  `first_name` ,  `middle_name` ,  `last_name` ,  `prefix_id` ,  `suffix_id` ,  `email_greeting_id` ,  `email_greeting_custom` ,  `email_greeting_display` ,  `postal_greeting_id`,  `postal_greeting_custom` ,  `postal_greeting_display`,  `addressee_id` ,  `addressee_custom`,  `addressee_display` ,  `job_title` ,  `gender_id` ,  `birth_date` ,  `is_deceased` ,  `deceased_date` ,  `mail_to_household_id`,  `household_name`,   `primary_contact_id`,  `organization_name` ,   `sic_code` ,  `user_unique_id` ,  `employer_id` )
        VALUES (@current_user, 'DELETE', OLD.`id`, OLD.`contact_type`, OLD.`contact_sub_type`, OLD.`do_not_email`, OLD.`do_not_phone`, OLD.`do_not_mail`, OLD.`do_not_sms`, OLD.`do_not_trade`, OLD.`is_opt_out`, OLD.`legal_identifier`, OLD.`external_identifier`, OLD.`sort_name`, OLD.`display_name`, OLD.`nick_name`, OLD.`legal_name`, OLD.`home_URL`, OLD.`image_URL`, OLD.`preferred_communication_method`, OLD.`preferred_mail_format`, OLD.`hash`, OLD.`api_key`, OLD.`source`, OLD.`first_name`, OLD.`middle_name`, OLD.`last_name`, OLD.`prefix_id`, OLD.`suffix_id`, OLD.`email_greeting_id`, OLD.`email_greeting_custom`, OLD.`email_greeting_display`, OLD.`postal_greeting_id`, OLD.`postal_greeting_custom`, OLD.`postal_greeting_display`, OLD.`addressee_id`, OLD.`addressee_custom`, OLD.`addressee_display`, OLD.`job_title`, OLD.`gender_id`, OLD.`birth_date`, OLD.`is_deceased`, OLD.`deceased_date`, OLD.`mail_to_household_id`, OLD.`household_name`, OLD.`primary_contact_id`, OLD.`organization_name`, OLD.`sic_code`, OLD.`user_unique_id`, OLD.`employer_id`) ;
    8. gencode.php will produce code to enable and disable logging, including the creation of 'Initialization' records in log tables as the log table is enabled.
    9. Where database constraints are used to cascade changes to other tables (if they are) from base tables, these cascading actions will be rewritten into triggers on the base tables. For the sake of simplicity, when logging is disabled, these cascading actions will still be implemented via triggers rather than constraints.
  5. The CiviCRM upgrade script will be modified in the following ways:
    1. If *_log tables exist, then all changes to their base tables will also be made to the *_log tables.
  6. Inserts, updates, and deletions of custom fields and custom groups will be mirrored to the corresponding log tables.

Unit Testing

  1. Gencode.php will produce the following PHP unit tests for each table that is loggable:
    1. Does enabling logging on a table for the first time create the appropriate *_log table?
    2. Does enabling logging on a table for the first time create appropriate 'Initialization' records for each record in the base table?
    3. Does enabling logging on a table after it has been disabled truncate the *_log table?
    4. Does enabling logging on a table after it has been disabled create appropriate 'Initialization' records for each record in the base table?
    5. Record level tests for insert, update and delete triggers:
      1. When logging is enabled, do the actions on the base table produce a record in the log table?
      2. When logging is disabled, do the actions on the base table produce no record in the log table?
      3. If there are cascading actions on a base table, do the cascading actions occur before logging is enabled?
      4. If there are cascading actions on a base table, do the cascading actions occur when logging is enabled?
      5. If there are cascading actions on a base table, do the cascading actions occur after logging has been enabled and disabled?
    6. Field level tests for insert, update and delete triggers:
      1. Does each base table field get accurately copied to log table?
      2. Do the values for the four log_* fields get accurately set?

Phase II

Administer > Configure > Global Settings > Logging will allow logging to be selectively enabled by table.

Functionality will be developed to make use of the logs. In particular, a Quality Assurance function will allow users with appropriate permissions (eg are members of a Data QA role ACL) to review and approve all data changes to all records since they were last QA'd. This will likely involve creating a new enum value of 'DataQA' for log_action.

  1. We will create permissions that decide who can commit changes without QA approval. If the user does not hav that permission their changes go into a queue.
    1. This permission will only affect demographic tables (like civicrm_contact, civicrm_address, civicrm_email, etc...) as we don't want to prevent Contributions, Event registrations, Actions and Memberships from hitting the database immediately.
  2. Administrators, with "commit changes" rights will have access to the queue and will have the ability to review. The queue will be a list of records that have been changed and with the date of the latest change.
    1. names in the queue will be links to the record in question or a the header on a cascading fieldset that will open to reveal a list of the fields that are changed grouped by date. So if you have changes on 1/1/10 and 12/31/09 you will have two groups of fields.
      1. the list of changes will be structured in a table/grid (using divs of course (smile)) three rows, field name, the new value as an editable field and the old value. The admin will be able to edit the 2nd row (the new value) and save the entire record. One major trick I have worked through is what to do if the same field is edited on multiple days, hence it gets rendered in multiple groups. The simple answer is to take the latest revision and train admins to only edit that one.
      2. How do we reject an edit so it doesn't show up each time the queue is visited. The logging tables will have to have a "reject" boolean field to track this.
        1. to this end we will want to save records for later consideration or for staff to ask someone else to look at while retaining the ability to mark records a reviewed to ensure they don't continue to show up.
    2. We will likely want for the admin to be able to sort, but that's low priority for now.
Labels:
  1. May 04, 2010

    It's worth noting that there is a membership_log table that does this already.

  2. May 06, 2010

    couple thoughts,

    in reference to bullet 4 h gencode.php doesn't ship with the tar ball, so we will have to rely on something else to initialize the log tables.

    We need to think through incremental updates as all alter table statements will have to update both the base and log tables. Further we need to think about what happens if logging is turned off and then back on. If you add a custom field after turning logging off your base and log tables schemas will not longer be synched.

    I'm of the opinion that the logging tables should ship and be created on the fly for custom fields as groups/fields are edited. They don't necessarily have ot be used but just having them there makes them easier to keep synched. It means a whole bunch more tables ship with core but I think it's worth the sacrifice. This is a crazy though but can turning logging on push the changes in schema from base to logging using triggers or can trigger push alter tables statements form base to logging (wishful thinking I know).

    1. May 25, 2010

      4 h) GenCode produces code that is shipped. The code to truncate a log table, then initialize it with records copying the base table values can be created without reference to the schema, so it does not need to be written by GenCode. The only exception might be a need to have a list of all of the base tables for which logging can be enabled or disabled, which can be worked out during coding.

      Incremental updates: Agreed re: updates to base and log tables. Whether logging is on or off, field modifications should be made to both tables after warning user.

      I think there is a significant overhead in logging, few will initially use it, and while there is not a lot of overhead in having unused empty log table in the db there is little advantage. I think when logging is enabled, there should be a drop table if exists followed by code to create the log table. I'd have to look if triggers can modify schema in MySQL; my sense is that no, they aren't allowed. But I don't see need for that.

      1. May 26, 2010

        disclaimer: using new cool toys is always fun and a good thing

        since logging is more of a safety valve, how about using mongodb for archival purposes? Since logging is primarily a write once (and forget in most cases), its a good use case. Other advantages:

        1. we are storing the values and a snapshot of the schema for that version. We basically dont really care too much about schema upgrades with this mechanism

        2. fits the use case for something like mongo perfectly (a loggging engine).

        3. avoids the schema explosion problem. potentially easier to code and expand what objects are logged

        4. ultimately what we really want stored is a consistent state of each object, not the low level table details (since if we need to roll back we roll back at the object level). better match for a more flexible DB

        just my initial thoughts

        lobo

        1. May 26, 2010

          D D

          Some ideas:

          1) If speed is an issue can use MySQL memory tables for logging instead of InnoDB/ISAM, then periodically write them out to whatever with a background cron task.

          2) Use subversion to track changes - serialize the object and revision it. That won't be fast if done in realtime, but maybe serialize to files which is fast and then periodic cron to svn. http://php.net/manual/en/book.svn.php

          1. Jun 15, 2010

            I like the idea of memory tables for logging in general, though it would drive up server requirements.

            With regard to svn, I'm concerned that I want to be able to easily use the data during data QA in phase II. Those who will be regularly using data are not very tech savvy. It's not to provide an occasional lookup in case of problems, but to regularly process all changes. So it's not quite a normal logging operation for svn versioning, or so it seems to me. It seems significantly more trouble to pull diff out of repo, present to user, and update current data if necessary compared to using existing structures and support in DAO layer and MySQL. But then perhaps I'm not seeing an elegant pattern for implementing this.

            1. Jun 16, 2010

              D D

              Fair enough.

              A clarification on what I meant by serialize won't answer the concern about the review/edit phase, but would simplify view-only uses. I meant serialize to some text format that makes the changes plain text obvious, not PHP's serialize() where the diff isn't really helpful.

              Although if it was PHP's serialize() then you could e.g. pull r5 and instantiate, pull r6 and instantiate, run the object's diff() method on them (that method to-be-written - a civi not an svn feature), then display in a table for review/edit. But yes maybe less efficient and not necessarily easier, and you're no longer really leveraging svn.

              Also while I'm here a clarification which on re-read isn't obvious: Repository url's would be file urls not http/svn, so there isn't a subversion server required. It would all happen on the civi server locally.

        2. Jun 15, 2010

          The logging is intended more to enable the data QA functionality for this client, though yes, in general, most will use it as a safety valve.

          This is intended to be part of regular workflow for a client that is very good at maximizing response rates in direct mail program by ensuring their data is in tip top shape, e.g. salutation is correct and appropriate for email versus mail communication. They also manually fix invalid addresses both to reduce mailing costs and also to ensure they have as big a prospect list as possible.

          I've started learning about mongodb but haven't got my head around its advantages yet versus the issues with introducing a new technology dependence to project, learning curves, etc. Seems like it would complicate the QA in phase II significantly.

  3. Jun 15, 2010

    Yea I think it's a mistake to think of it as log it and forget about it.

    Organizations that hire temp workers to do data entry en-masse or those that relay on members to keep data up to date are likely to use this kind of thing on a nearly daily basis. I've worked with orgs that add 150K names to their db every summer, you can only imagine the data entry issues we encountered.

    I agree with Joe adding Mongodb would like speed up performance but be add a learning curve. But that said it's not like record level locking is essential here, as these records won't be edited, so MyISAM isn't out of question which should speed things up.


Creative Commons License
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-Share Alike 3.0 United States Licence.