What is CDF:
The Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated.
How to Enable CDF on Delta Table: Following are the 2 options to enable CDF on delta table.
Option 1: On Delta Table
On new table:
CREATE TABLE <Table_Name> (Col1 INT, Col2 STRING) TBLPROPERTIES (delta.enableChangeDataFeed = true)
On the existing table:
ALTER TABLE <Table_Name> SETTBLPROPERTIES (delta.enableChangeDataFeed = true)
Option 2: On Delta Path
ALTER TABLE delta.`<mount_path>//<delta_path>`
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
Read changes in batch queries:
Data can be read using versions or timestamps. Following are the syntax of both options(versions and timestamp).
On delta table using version:
SELECT FROM table_changes(<tableName>, <min_version>, <max_version>)
or
SELECT FROM table_changes(<tableName>, <min_version>)
Note: Both min_version and max_version are included.
On delta table using timestamps:
SELECT * FROM table_changes(<tableName>, '2020–04–01 01:45:46', '2020–04–21 02:00:00')
or
SELECT * FROM table_changes(<tableName>, '2020–04–01 01:45:46')
Note: Both timestamps are included.
Path-based delta table:
SELECT * FROM table_changes_by_path('\<delta_path>', '<min_version or min_timestamp>', '<max_version or max_timestamp>')
or
SELECT * FROM table_changes_by_path('\<delta_path>', '<min_version or min_timestamp>')
Note: Both timestamps and versions are included.
Schema for CDF:
In addition to the data columns from the schema of the Delta table, the change data feed contains metadata columns that identify the type of change event:
Note: We can also read changes in streaming queries.