Tutorial: Using Change Data Capture

Change Data Capture allows loading data targets using only the rows that have changed since the last load. This approach is useful when dealing with large databases that would take too much time and resources to reload completely.

The source table used in this example is an inventory table called dminv that is created by running Create Sample Procedures and Data.

In this example, the source table is in DB2 or UDB.

The procedure has the following steps, which are described in the sections that follow.


Example: Create a Target Table and Perform an Initial Load

The first step is to create a data flow that creates the target table and loads it with data from the current contents of the source table.

  1. Create a new data flow in an application directory that is in the server path, like baseapp.
  2. Select dminv as a source.
  3. Select all the columns from PROD_NUM through COST.
  4. Add a new target, selecting a database type, entering newinv as the name, and specifying one key.
  5. Save the flow as dminv_init.
  6. Run the flow.
  7. Open the newinv target and click Sample Data to confirm that the data was loaded correctly.

Note: Key column(s) are required for loading a table to uniquely identify each row so that Insert/Update/Delete processing can be performed.



Example: Enable Database Table for Logging

To use change data capture, the source database must be configured to use database logging. In most organizations this must be done by a database administrator. To enable the database for logging, see Enabling Database for Logging.

This example uses dminv as the source table. To create this table see How to Create Sample Procedures and Data.

The database table must be enabled for logging, and the procedure depends on the database being used.

If using:

Microsoft SQL Server 2008

The table to be used must have CDC enabled. This can be done from Microsoft SQL Server Management Studio. You must be a member of the db_owner database role to run the stored procedure. Issue the following commands:

exec sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'dminv', @role_name = NULL
select is_tracked_by_cdc FROM sys.tables WHERE name = 'dminv'

You will see a 1 if the table is enabled, a 0 otherwise.
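Optionally, for more detail than the 0/1 flag, you can run the CDC help procedure, which returns the capture instance configuration for the table. This is a supplementary check, not a required tutorial step:

exec sys.sp_cdc_help_change_data_capture @source_schema = 'dbo', @source_name = 'dminv'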

UDB and DB2

The table to be logged must have capture mode enabled. This can be done from the DB2 Command Center or a stored procedure.

Note: For IBM i, from the user ID that will run the CDC flows, start the program strsql and issue the following commands:

CREATE COLLECTION database
CHGPRF CURLIB(database)
ALTER TABLE dminv DATA CAPTURE CHANGES

ORACLE 10g and 11g

The table to be logged must have all data columns logged. This can be done from the ORACLE SQL Developer or a stored procedure.

ALTER TABLE dminv ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;


Example: Create a Synonym for a Log Table

A synonym for a log table allows reading from the database log records for a selected table. The synonym contains the same columns as the table itself plus three additional columns that identify changes to the table.

This synonym can be used as a source in a data flow with the special load type of "IUD Processing." The name is an abbreviation for Insert/Update/Delete.

Note: A synonym for a log table is not a substitute for a synonym for the table itself. A synonym for the table is still needed in order to read from or write to the table directly.

To create a synonym for a Log table, in the DMC:

  1. Right-click an application directory in the navigation pane and choose New and then Synonym.

    The Select Adapter window opens.

  2. Select a connection and click OK.

    The Select Synonym Candidates window opens.

  3. From the Restrict object type to drop-down menu, select Table Log Records.

    Optionally, select the Filter by owner/schema check box and enter dminv as the object name. Click Next.

  4. Check the checkbox in front of DMINV.

    Under the Default synonym name, click DMINV and change the name to DMINV_LOG, as seen in the following image.

    Create Synonym window

  5. Click Create synonym. The Create Synonym Status window opens.
  6. Click Open First Synonym. The synonym opens by default to the Segment and Field View. If it does not, click the tab to switch to that view. Select the segment name if it is not already highlighted.
  7. In the DB Log Parameters settings, input CHKPT for START, YES for CHKPT_SAVE, and leave CHKPT_FILE blank.

    The window now looks like this:

    Segment and Field View of synonym

    Note that the first three columns of the synonym are added automatically and are used for DataMigrator IUD processing.



Example: Create a DataMigrator Data Direct Load Flow

In this step, we will create a DataMigrator flow to copy data from the log table to the target table. Since there is just one source table we will use a Direct Load Flow, but a Data Flow could also be used.

  1. Right-click an application directory and select New and then Direct Load Flow.
  2. Drag dminv_log to the left side of the flow.
  3. Drag newinv to the right side of the flow.
  4. Right-click newinv and select Properties.

    The target properties page opens.

  5. For the Load Type drop-down menu select IUD Processing.
  6. For the IUD Control column drop-down, select CDC_OPER, as shown in the following image.

    Target properties

    Click the X to close the properties page.

  7. Right-click newinv and select Target Transformations.

    The Transformations window opens.

  8. Click the Automap button to map the source columns to the target columns.

    Click OK to close the window.

  9. Click the Save button and save the flow as dminv_cdc.


Example: Updating the Source Table

To test the change data capture process, make changes to the dminv source table. In this example, three rows are inserted, one row is updated, and one is deleted.

  1. Right-click an application directory and select New and then Stored Procedure.

    A stored procedure window opens.

  2. Enter the following lines, where db is the engine used: SQLMSS, SQLDB2, or SQLORA.
    SQL db insert into dminv values ('2001','Compact Flash',1000,50,25) ;
    END
    SQL db insert into dminv values ('2002','Memory Stick',1000,80,40) ;
    END
    SQL db insert into dminv values ('2003','Memory Stick Pro', 1000,200,100) ;
    END
    SQL db update dminv set QTY_IN_STOCK=500 where prod_num = '2002' ;
    END
    SQL db delete from dminv where prod_num = '2001' ;
    END
    SQL db commit;
    END
  3. Click the Run button to run the stored procedure. The console log shows the number of rows affected by each command.

    Note that the CDC configuration saves a checkpoint of the last transaction processed. If you were to run a Sample Data or Test Transforms using the synonym for the log table, that would set a checkpoint after the rows that are read, so those rows would not be processed by a subsequent DataMigrator flow.

    The following image shows a Sample Data report for the log.

    Sample data for log



Example: Running the CDC Flow and Viewing the Results
  1. Open the CDC flow you created. Click the Run button and select Submit.
  2. Click the View Last Log button. The following image shows the log report for the flow.

    Log report for flow

    Note that the statistics show the number of rows inserted, updated, and deleted.

    Close the report.

  3. In the open flow, dminv_cdc, double-click the newinv target table and select the Sample Data tab. The following image shows that the changes were applied correctly.

    Sample data for flow

    Note: To return the dminv table back to its original state, enter the following lines in a stored procedure, where db is the engine used, and run them:

    SQL db delete from dminv where prod_num > '2001' ;
    END
    SQL db commit;
    END


Example: Scheduling the Flow

The final step in setting up Change Data Capture processing is scheduling the flow so that it runs periodically and processes transactions.

  1. Switch to the process flow view by clicking the Process Flow tab.
  2. Drag the Schedule button into the workflow and double-click it to open the Schedule Properties window.
  3. Select a Schedule type, Interval type, and Interval value to run the flow.
  4. Save the flow.


Reference: Enabling Database for Logging

A synonym for a log table allows reading from the database log records for a selected table. The synonym contains the same columns as the table itself plus three additional columns that identify changes to the table.

This synonym can be used as a source in a data flow with the special load type of "IUD Processing." The name is an abbreviation for Insert/Update/Delete.

In addition, there are specific requirements for supported databases.

Microsoft SQL Server 2008

Microsoft SQL Server 2008 supports Change Data Capture in the Enterprise and Developer editions.

Change Data Capture must be enabled at the database level since it is disabled by default. This can be done from the Microsoft SQL Server Management Studio. To enable CDC, you must be connected as a member of the sysadmin server role. Connect to the database you want to enable and issue the following commands:

exec sys.sp_cdc_enable_db
select name, is_cdc_enabled from sys.databases

You will see a list of all databases and a 1 if CDC is enabled for the database, a 0 otherwise.

In addition, the SQL Server Agent and two jobs must be running. The jobs are created automatically when the database is enabled. For example, if the database is called "main" the jobs would be named "cdc.main_capture" and "cdc.main_cleanup."

To check if the SQL Server Agent is running, in the Microsoft SQL Server Management Studio in the Object Explorer window, check the icon for SQL Server Agent. It should have a green arrow. If it does not, right-click SQL Server Agent and select Start.
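You can also confirm that the capture and cleanup jobs exist by querying the job catalog in the msdb database. This is a supplementary check; the cdc.* job names follow the pattern described above:

select name from msdb.dbo.sysjobs where name like 'cdc.%'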

Each table to be used must have CDC enabled. You must be a member of the db_owner database role to run the stored procedure. Issue the following commands, where tablename is the name of the table.

exec sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'tablename', @role_name = NULL
select is_tracked_by_cdc FROM sys.tables WHERE name = 'tablename'

You will see a 1 if the table is enabled, a 0 otherwise.

UDB and DB2

To use Change Data Capture the database must be configured to use database logging. In most organizations this must be done by a database administrator. Note that CDC support for UDB and DB2 requires the CLI interface.

Note: For iSeries, from the user ID that will run the CDC flows, start the program strsql and issue the following commands:

CREATE COLLECTION database
CHGPRF CURLIB(database)
UPDATE DATABASE CFG FOR database USING DFT_SQLMATHWARN YES
UPDATE DATABASE CFG FOR database USING LOGARCHMETH1 LOGRETAIN
UPDATE DBM CFG USING DISCOVER DISABLE
UPDATE DBM CFG USING DISCOVER_INST DISABLE

On all platforms, the user ID that runs the flows should be an administrator on the DataMigrator server.

Note: For DB2/400, the user that runs the flows should also be:

  • An administrator, or belong to the Administrators group, on DB2.
  • Associated with a current library on DB2.

To grant database administrator authority, issue the following command:

GRANT DBADM ON database TO USER userid

The table to be logged must have capture mode enabled. This can be done from the DB2 Command Center or a stored procedure, where tablename is the name of the table.

ALTER TABLE tablename DATA CAPTURE CHANGES

Note that if you recreate a database table, you must re-enable capture mode.

To see which tables have capture mode enabled, you can issue the following DB2 select statement.

SELECT CREATOR, NAME, TYPE
    FROM SYSIBM.SYSTABLES WHERE DATA_CAPTURE = 'Y'

Note: For z/OS, use DATACAPTURE (without underscore).

ORACLE 10g and 11g

To use Change Data Capture, the database must be configured to use archiving. This can be done from ORACLE SQL Developer or from a stored procedure. In most organizations, this must be done by a database administrator.

ALTER SYSTEM ARCHIVE LOG START;
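To confirm that archiving is active, you can query the database log mode, which should report ARCHIVELOG. This is a supplementary check, not part of the original procedure:

SELECT log_mode FROM v$database;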

The date and time formats must be altered so that they can be retrieved properly.

ALTER SESSION SET NLS_DATE_FORMAT='YYYY/MM/DD HH24:MI:SS';
ALTER SESSION SET NLS_TIMESTAMP_FORMAT='YYYY/MM/DD HH24:MI:SSXFF';

To enable logging, a user ID must be granted the following privileges, where userid represents the user ID to be used.

GRANT DBA TO userid;
GRANT EXECUTE ON dbms_logmnr TO userid;
GRANT SELECT ANY TRANSACTION TO userid;

The table to be logged must have all data columns logged. Issue the following command, where tablename represents the table to be logged.

ALTER TABLE tablename ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

Note that if you recreate a database table, you must re-issue this command.
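To see which tables currently have all-column supplemental logging enabled, you can query the DBA_LOG_GROUPS dictionary view. This is a supplementary check that requires access to the DBA views:

SELECT owner, table_name, log_group_type
    FROM dba_log_groups WHERE log_group_type = 'ALL COLUMN LOGGING';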

Note: Change Data Capture for ORACLE uses the DBMS_LOGMNR package to extract values from the ORACLE redo log files. This does not support the following data types: LONG, LONG RAW, CLOB, BLOB, NCLOB, ADT, and COLLECTION.



Reference: CDC Configuration

This section describes the CDC Configuration settings in the properties of a synonym created for a Table Log.

Note: Each parameter has a default value of a global variable whose name begins with &&CDC_ and is followed by the parameter name. Normally, you would type over or select a value to replace the default with a constant value, and then save the synonym.

Alternatively, the values can be set dynamically in a process flow using a Set Variables object.

The following options are available:

DB Log Parameters
DATA_ORIGIN

This is set to DBMSLOG to indicate the origin of the data is a DBMS Table Log.

START

Indicates the starting point for reading log records.

  • CHKPT. After last transaction retained in checkpoint file. This is the default.
  • CUR_TRAN. First transaction in the DBMS log after the job started.
  • CUR_LOG. First available transaction in the current active DBMS log.
  • timestamp. First transaction after the selected point in time (YYYY/MM/DD HH:MM:SS).
CHKPT_SAVE

Indicates if the last processed transaction should be saved in the checkpoint file.

  • YES. Retains the last processed transaction in the checkpoint file.
  • NO. Does not retain the last processed transaction in the checkpoint file.
CHKPT_FILE

Indicates the location and name of the file used to store checkpoint information. The value can be one of the following:

  • Physical location. The full directory and file name in the format for operating system used. This is the preferred option as it allows individual users to maintain their own checkpoint files.
  • Application directory. The file is created in the same application directory as the synonym.
  • Blank. If no file name is specified, a file is created with the same name as the synonym and the extension "chp." This option facilitates testing because use of Sample Data or Test Transforms does not reset the checkpoint.

The following two parameters, LOG_NAME and LOG_LOCATION, are available only on iSeries platforms with a DB2 database. They can be used when the requirement is to use a non-standard journal and/or a non-current location (for example, an alternate library), where the journal and location are explicitly defined by a database administrator (rather than by a CREATE COLLECTION). One example would be a requirement to use a remote journal instead of the local one.

LOG_NAME

Indicates the name of the journal.

LOG_LOCATION

Indicates the location name.

Listening Parameters
POLLING

Indicates the polling interval in seconds, that is, how often the database log is scanned. The default is 1 second.

TIMEOUT

Indicates the timeout interval in seconds. If there is no activity in this time interval, processing stops. A value of zero means there is no limit. The default value is 1 second.

Note that the two options above work together. For example, if POLLING is 2 and TIMEOUT is 10, the server polls the log every 2 seconds for new transactions. If there are no new transactions after 10 seconds, then polling stops.

Read Limits
MAXLUWS

Indicates the maximum number of database transactions to process before stopping the job. A value of zero (0) means all transactions. The default value is 1 transaction.

Note: For normal Change Data Capture processing, specify a permanent location for CHKPT_FILE so that checkpoint information is retained, and set CHKPT_SAVE to YES. Then set START to CHKPT so that processing starts after the last saved checkpoint.

However, note that any operation that reads from the log table, including Sample Data, resets the checkpoint. In that case, Sample Data consumes the data, and that data is no longer available to subsequent runs of the data flow.


iWay Software