How to enable Change Data Capture (CDC) in Amazon AWS (Aurora) PostgreSQL Relational Data Service (RDS) using logical replication of the PostgreSQL write-ahead log (WAL)
Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real-time or near-real-time to a downstream process or system enabling seamless data synchronization between systems.
It captures changes (inserts, updates, and deletes) as they happen and delivers them to downstream systems.
Why it’s important
Keeps systems in sync: CDC ensures that data is consistent across multiple systems and environments. This is especially important for tracking changes (updates, deletes) in already replicated data in data warehouses or data lakes. E.g when you need a record of deletions in ETL/ELT processes.
Real-time data integration: CDC enables real-time data integration between systems. This makes it well suited for moving data into a stream processing solution like Apache Kafka.
Change Data Capture Methods
Log-based CDC: This method uses the database’s transaction log to capture changes. The transaction log contains a record of all changes made to the database. A CDC tool reads the transaction log and captures the changes. There are two types of log-based CDC:
- Logical: Replicating data changes (inserts, updates, deletes) at a transaction level based upon their replication identity (usually a primary key). Logical replication uses a publish and subscribe model with one or more subscribers subscribing to one or more publications on a publisher node. This method is efficient and flexible because it only replicates changes that are needed.
- Physical: Copying the entire database byte-by-byte (by using exact block addresses), ensuring data consistency at a very granular level but with less flexibility.
Query-based CDC: This method uses database queries to capture changes. The CDC tool runs queries against the database to identify changes. This method is less efficient than log-based CDC because it requires more resources to run queries against the database. You also need something like a timestamp (cursor field) in the data itself to be able to use this method.
Trigger-based CDC: This method uses triggers to capture changes in the database. When a change is made to a table, the trigger fires and captures the change. The change is then written to a change table, which is read by a downstream process.
Enabling CDC in Amazon (Aurora) PostgreSQL RDS
Prerequisites
- An AWS account with the necessary permissions to create and modify resources
- An Amazon RDS (Aurora) PostgreSQL instance
- A postgreSQL user with the necessary permissions to assign replication permissions to other users as well as create publications and subscriptions slots. We’ll assume the RDS master user named
admin
with superuser access to the database. See the disclaimer in step one below. - A PostgreSQL user to use for replication. We’ll assume a user named
cdc_user
with initially only read-only access to the database.
Step 1: Provide additional permissions to read-only user
Using the admin
user, grant REPLICATION
permissions to the cdc_user
user:
GRANT rds_replication TO cdc_user;
N/B:
SUPERUSER
andREPLICATION
attributes are not available tords_superuser
, which is the group your master user belongs to. There is a separate role calledrdsadmin
who has all the permissions but is not accessible to RDS users.- Replication on AWS PostgreSQL is only possible with the
CREATE PUBLICATION
andCREATE SUBSCRIPTION
commands, available from Pgsql v10 (v10.6 in Aurora). You cannot use the old way withALTER USER some_user REPLICATION
.
Step 2: Enable logical replication on your AWS Postgres RDS or Aurora
- On the AWS console, navigate to the
Aurora and RDS
dashboard, and select your Aurora PostgreSQL instance from theDB Instances
. - On your DB cluster’s
Configuration
tab, find yourDB cluster parameter group
. Open it and note it'sResource Type
, andParameter Group Family
.
- N/B: In AWS, you cannot modify a default parameter group and hence the need to create a new one (copy of default)
- Navigate to the
Parameter Groups
section on the left sidebar, select theCustom
tab, and click onCreate Parameter Group
.
- Enter a
parameter group name
anddescription
, selectAurora PostgreSQL
Engine type for Aurora orPostgreSQL
depending on your RDS type and select theParameter Group Family
andType
that matches the ones you noted earlier. ClickCreate
.
- Select the newly created parameter group, click on
edit
, on theModifiable parameters
tab, search forrds.logical_replication
and set it to1
then click onSave changes
.
- Navigate back to the
DB Instances
section, select your Aurora PostgreSQL instance, and click onModify
.
- In the
Additional configuration
section, select the newly created parameter group from theParameter group
dropdown. ClickContinue
andApply immediately
to apply the changes immediately without needing to wait for a maintenance window. Alternatively you can navigate back to the DB list, select your db, click on actions and then onPatch Now
orrestart
it manually.
- Verify the values for wal_level and rds.logical_replication:
SELECT name,setting FROM pg_settings WHERE name IN ('wal_level','rds.logical_replication');
name | setting
-------------------------+---------
rds.logical_replication | on
wal_level | logical
(2 rows)
Step 3: Create a replication slot on your Postgres database
- Create a replication slot called
example_slot
using thepgoutput
plugin
SELECT pg_create_logical_replication_slot('example_slot', 'pgoutput');
Step 4: Create publication and replication identities for each Postgres table
For each table you want to replicate with CDC, follow the steps below:
- Add the replication identity (the method of distinguishing between rows) for each table you want to replicate:
ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;
N/B:
- In rare cases, if your tables use data types that support TOAST or have very large field values, consider instead using replica identity type full:
ALTER TABLE tbl1 REPLICA IDENTITY FULL;
- Ensure that TOAST-able tables use non-TOAST-able primary keys (integers, varchars, etc), and there will only be a modest increase in resource utilization, in addition to increased WAL storage size.
2. Create the Postgres publication. You should include all tables you want to replicate as part of the publication:
CREATE PUBLICATION example_publication FOR TABLE <tbl1, tbl2, tbl3>;`
N/B:
- To replicate all the tables in a database, run this command:
CREATE PUBLICATION example_publication FOR ALL TABLES;
- The publication name is customizable. Refer to the Postgres docs if you need to add or remove tables from your publication in the future.
3. Verify that the source tables are added to the publication:
SELECT * FROM pg_publication_tables;
pubname | schemaname | tablename
--------------------+------------+------------
example_publication | public | tbl1
example_publication | public | tbl2
example_publication | public | tbl3
(2 rows)
4. Follow this tutorial to create subscriptions from a target (secondary) database
FIN!