Change data capture


Change data capture

In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Also, Change data capture (CDC) is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.

CDC solutions occur most often in data-warehouse environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.

Contents

Methodology

System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system layers from application logic down to physical storage.

In a simplified CDC context, one computer system has data believed to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that does not change the design patterns logically.

Not uncommonly, multiple CDC solutions can exist in a single system.

Timestamps on rows

Tables whose changes must be captured may have a column that represents the time of last change. Names such as LAST_UPDATE, etc. are common. Any row in any table that has a timestamp in that column that is more recent than the last time data was captured is considered to have changed.

Version Numbers on rows

Database designers give tables whose changes must be captured a column that contains a version number. Names such as VERSION_NUMBER, etc. are common. When data in a row changes, its version number is updated to the current version. A supporting construct such as a reference table with the current version in it is needed. When a change capture occurs, all data with the latest version number is considered to have changed. When the change capture is complete, the reference table is updated with a new version number.

Three or four major techniques exist for doing CDC with version numbers, the above paragraph is just one.

Status indicators on rows

This technique can either supplement or complement timestamps and versioning. It can configure an alternative if, for example, a status column is set up on a table row indicating that the row has changed (e.g. a boolean column that, when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or an earlier date, still shouldn't be updated on the target (for example, the data may require human validation).

Time/Version/Status on rows

This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system, however, the combination of time, version, and status provides a particularly powerful mechanism and programmers should utilize them as a trio where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for production."

Triggers on tables

May include a publish/subscribe pattern to communicate the changed data to multiple targets. In this approach, triggers log events that happen to the transactional table into another queue table that can later be "played back". For example, imagine an Accounts table, when transactions are taken against this table, triggers would fire that would then store a history of the event or even the deltas into a separate queue table. The queue table might have schema with the following fields: Id, TableName, RowId, TimeStamp, Operation. The data inserted for our Account sample might be: 1, Accounts, 76, 11/02/2008 12:15am, Update. More complicated designs might log the actual data that changed. This queue table could then be "played back" to replicate the data from the source system to a target.

[More discussion needed]

An example of this technique is the pattern known as the log trigger.

Log scanners on databases

Most database management systems manage a transaction log that records changes made to the database contents and to metadata. By scanning and interpreting the contents of the database transaction log one can capture the changes made to the database in a non-intrusive manner.

Using transaction logs for change data capture offers a challenge in that the structure, contents and use of a transaction log is specific to a database management system. Unlike data access, no standard exists for transaction logs. Most database management systems do not document the internal format of their transaction logs, although some provide programmatic interfaces to their transaction logs (for example: Oracle, DB2, SQL/MP, SQL/MX and SQL Server 2008).

Other challenges in using transaction logs for change data capture include:

  • Coordinating the reading of the transaction logs and the archiving of log files (database management software typically archives log files off-line on a regular basis).
  • Translation between physical storage formats that are recorded in the transaction logs and the logical formats typically expected by database users (e.g., some transaction logs save only minimal buffer differences that are not directly useful for change consumers).
  • Dealing with changes to the format of the transaction logs between versions of the database management system.
  • Eliminating uncommitted changes that the database wrote to the transaction log and later rolled back.
  • Dealing with changes to the metadata of tables in the database.

CDC solutions based on transaction log files have distinct advantages that include:

  • minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host).
  • no need for programmatic changes to the applications that use the database.
  • low latency in acquiring changes.
  • transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream include changes made to all tables participating in the captured transaction.
  • no need to change the database schema

Several off-the-shelf products perform change data capture using database transaction log files. These include:

  • Attunity Stream
  • Centerprise Data Integrator from Astera
  • DatabaseSync from WisdomForce
  • GoldenGate Transactional Data Integration
  • HVR from HVR Software
  • DBMoto from HiT Software
  • Shadowbase from Gravic[1]
  • IBM InfoSphere Change Data Capture (previously DataMirror Transformation Server)
  • Informatica PowerExchange CDC Option (previously Striva)
  • Oracle Streams[2]
  • Oracle Data Guard[3]
  • Replicate1 from Vision Solutions
  • SharePlex from Quest Software
  • FlexCDC, part of Flexviews for MySQL[4]

Functionality of CDC

Replication on databases

Replication on storage

Comparison to target

Full rebuild of target

Confounding factors

As often occurs in complex domains, the final solution to a CDC problem may have to balance many competing concerns.

Sub-optimal source schemas

Tracking the capture

Push versus pull

CDC Tool Comparison


See also

External links

References

  1. ^ Holenstein, Paul (August 2011). "Shadowbase white papers for Change Data Capture". Gravic, Inc.. http://www.gravic.com/shadowbase/whitepapers.html. Retrieved 2011-08-25. 
  2. ^ Van de Wiel, Mark (September 2007). "Asynchronous Change Data Capture Cookbook" (PDF). Oracle Corporation. pp. 6. http://www.oracle.com/technology/products/bi/db/10g/pdf/twp_cdc_cookbook_0206.pdf. Retrieved 2009-02-04. "Oracle Streams provides the underlying infrastructure for this CDC method." 
  3. ^ Schupmann, Vivian; et al. (2008). "What's New in Oracle Data Guard?". Oracle Data Guard Concepts and Administration 10g Release 2 (10.2). Oracle Corporation. http://download.oracle.com/docs/cd/B19306_01/server.102/b14239/whatsnew.htm. Retrieved 2009-02-04. "Data Guard enhancements in Oracle Enterprise Manager: [...] New support for Change Data Capture and Streams: [...] Distributed (heterogeneous) Asynchronous Change Data Capture" 
  4. ^ Swanhart, Justin (2011-02-01). "Flexviews Google Code Homepage". Flexviews Google Code Homepage. Justin Swanhart. http://flexvie.ws. Retrieved 2011-02-24. "Flexviews includes FlexCDC, a change data capture utility for MySQL 5.1+" 

Wikimedia Foundation. 2010.

Look at other dictionaries:

  • Data Vault Modeling — is a database modeling method that is designed to provide historical storage of data coming in from multiple operational systems. It is also a method of looking at historical data that, apart from the modeling aspect, deals with issues such as… …   Wikipedia

  • Data integrity — in its broadest meaning refers to the trustworthiness of system resources over their entire life cycle. In more analytic terms, it is the representational faithfulness of information to the true state of the object that the information represents …   Wikipedia

  • Data, context and interaction — (DCI) is a paradigm used in computer software to program systems of communicating objects. Its goals are: To improve the readability of object oriented code by giving system behavior first class status; To cleanly separate code for rapidly… …   Wikipedia

  • Data scraping — is a technique in which a computer program extracts data from human readable output coming from another program. Contents 1 Description 2 Screen scraping 3 Web scraping 4 …   Wikipedia

  • Data modeling — The data modeling process. The figure illustrates the way data models are developed and used today. A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an… …   Wikipedia

  • Data transformation (statistics) — A scatterplot in which the areas of the sovereign states and dependent territories in the world are plotted on the vertical axis against their populations on the horizontal axis. The upper plot uses raw data. In the lower plot, both the area and… …   Wikipedia

  • RODIN Data Asset Management — Infobox Software name = RODIN Data Asset Management caption = developer = Coglin Mill latest release version = 5.05 latest release date = May 15, 2008 operating system = IBM i5/OS genre = Software website = [http://www.thinkrodin.com… …   Wikipedia

  • Climate change mitigation — Fossil fuel related CO2 emissions compared to five of IPCC s emissions scenarios. The dips are related to global recessions. Data from IPCC SRES scenarios; …   Wikipedia

  • Scientific opinion on climate change — This article is about scientific opinion on climate change. For public perception and controversy about the scientific consensus, see Public opinion on climate change and Global warming controversy. For opinions of individual dissenting… …   Wikipedia

  • Climate change in Australia — Climate change has become a major issue in Australia due to drastic climate events since the turn of the 21st century that have focused government and public attention.[1] Rainfall in Australia has increased slightly over the past century,… …   Wikipedia