Let's be clear about what CDC is. It is a mechanism that captures only the data that has changed, keeping the load on the source database minimal. For a Data Warehouse, this is the gold standard for data ingestion and even for lightweight transformations. Of course, we're not talking about querying the whole table every few minutes; we're talking about direct, surgical capture from the transaction logs (redo logs and archived redo logs, in Oracle's case).
Oracle GoldenGate is not just a CDC tool; it's a real-time replication solution dedicated to exactly that. For an Oracle-to-Oracle environment, it's arguably the best in class.
Some background information about the key capabilities of Oracle GoldenGate:
While it works best with Oracle, it can handle replication across diverse environments, including PostgreSQL, Microsoft SQL Server, and MySQL.
By using GoldenGate, we can transform data before it hits the target, a key feature for ensuring data quality early.
The architecture supports multi-thread and/or multi-process/multi-target setups, which is critical for handling massive volumes of transactional data.
From now on, I will use the acronym OGG for GoldenGate. Just noting that here.
The beauty of OGG lies in its streamlined and robust process architecture.
Let me give some information about the processes of OGG, and how we used them in our recent project.
Extract: This is the capture agent. In our project, we chose the Integrated Extract. This process is configured on the source database side and is responsible for capturing changes from the source database, writing all captured DML operations to a local trail file. We chose Integrated Extract because it uses Oracle's server-side logmining infrastructure to get the changes, and it scales automatically with the available CPUs for better performance and higher throughput.
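To make this concrete, here is a minimal Integrated Extract parameter file sketch. The group name, credential alias, and table names are hypothetical; your own parameters will of course differ.

EXTRACT EXTODS
USERIDALIAS ogg_src
-- Integrated capture: the Extract is registered with the database
-- (REGISTER EXTRACT EXTODS DATABASE in GGSCI) and uses the server-side
-- logmining server to read the redo / archived redo logs.
TRANLOGOPTIONS INTEGRATEDPARAMS (MAX_SGA_SIZE 256)
LOGALLSUPCOLS
UPDATERECORDFORMAT COMPACT
EXTTRAIL ./dirdat/lt
TABLE SALES.ORDERS;
TABLE SALES.ORDER_ITEMS;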
Pump: In our recent project, a Pump process reads the local trail files and sends them over the network to the target OGG environment. This process acts as a high-speed data distributor. (Note that this path can be secured by enabling TLS/SSL.)
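A minimal Pump (data pump Extract) parameter file might look like the sketch below. The host name, port, and trail names are hypothetical.

EXTRACT PMPODS
-- PASSTHRU: no data manipulation here, the pump just ships the local trail to the target
PASSTHRU
RMTHOST tgt-ogg-host, MGRPORT 7809
RMTTRAIL ./dirdat/rt
TABLE SALES.*;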
Replicat: This is the apply agent, configured on the target database side. It reads the remote trail files and applies the changes to the target table(s). We use parameters like ASSUMETARGETDEFS to map columns automatically, and even UPDATEDELETES to treat incoming DELETE operations as UPDATEs, which is crucial for DWH history tracking.
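The corresponding Replicat parameter file, again as a sketch with hypothetical group, schema, and table names:

REPLICAT REPODS
USERIDALIAS ogg_tgt
ASSUMETARGETDEFS
-- Convert incoming DELETEs to UPDATEs so deleted rows stay in the ODS for history tracking
UPDATEDELETES
DISCARDFILE ./dirrpt/repods.dsc, PURGE
MAP SALES.ORDERS, TARGET ODS.ORDERS;
MAP SALES.ORDER_ITEMS, TARGET ODS.ORDER_ITEMS;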
OGG has only light transformation capabilities, so at some point a crucial handover has to take place. In our case, in a recent DWH project, that handover was from OGG to ODI for data conversion.
This is where the entire CDC flow comes together for our DWH project. OGG's job is to efficiently and reliably get the changed transactional data from the critical core PROD / source system into a stable staging layer, which we call the ODS (Operational Data Store).
The flow is explicit.
Source -> ODS
Once the data hits the ODS, OGG's primary task is done. Of course, we do some lightweight transformation here: we used SQLEXEC configurations in OGG for that, and we enriched the replicated data with the operation type and timestamp, so that the change history of the records can be maintained in the target / ODS & DWH. But the heavy transformation belongs to the next stage: the complex, business-rule-heavy transformations necessary to structure the data for the final DWH targets.
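As an illustration of this kind of enrichment, a Replicat MAP statement can add the operation type and commit timestamp via @GETENV, and a SQLEXEC lookup can pull a reference value from the target side. The table and column names below are hypothetical; treat this as a sketch, not our exact configuration.

-- OP_TYPE / OP_TS record what happened on the source and when it was committed;
-- the SQLEXEC lookup (lk) enriches each row with a value from a target-side table.
MAP SALES.ORDERS, TARGET ODS.ORDERS,
SQLEXEC (ID lk, QUERY 'SELECT segment FROM ods.customer_dim WHERE cust_id = :p_cust', PARAMS (p_cust = CUST_ID)),
COLMAP (USEDEFAULTS,
CUST_SEGMENT = lk.segment,
OP_TYPE = @GETENV ('GGHEADER', 'OPTYPE'),
OP_TS = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'));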
This is the point of handover to Oracle Data Integrator (ODI).
ODS -> ODS2 / DWH
ODI then uses its trigger-based journalizing mechanism, which is essentially CDC on the ODS side: it is triggered by the changed records and reads only those changed records from the ODS. ODI's job is to manage the parsing and the heavy transformation logic.
We are talking about journalizing here. Journalizing in ODI is a change data capture mechanism that tracks inserts, updates, and deletes in source tables, allowing downstream processes to load only the changed data.
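To give a feel for what trigger-based journalizing looks like under the hood, here is a very simplified sketch of the kind of journal table and trigger involved. In reality these objects are generated by the ODI Journalizing Knowledge Module (JKM) you choose; the names, columns, and logic below are illustrative only.

-- Simplified illustration; the real J$/T$ objects are generated by the ODI JKM.
CREATE TABLE ODS.J$ORDERS (
  JRN_SUBSCRIBER VARCHAR2(400) NOT NULL, -- the consumer that will process this change
  JRN_FLAG       CHAR(1)       NOT NULL, -- 'I' for insert/update, 'D' for delete
  JRN_DATE       DATE          NOT NULL, -- when the change was journalized
  ORDER_ID       NUMBER        NOT NULL  -- primary key of the journalized table
);

CREATE OR REPLACE TRIGGER ODS.T$ORDERS
AFTER INSERT OR UPDATE OR DELETE ON ODS.ORDERS
FOR EACH ROW
BEGIN
  INSERT INTO ODS.J$ORDERS (JRN_SUBSCRIBER, JRN_FLAG, JRN_DATE, ORDER_ID)
  VALUES ('DWH_LOAD',
          CASE WHEN DELETING THEN 'D' ELSE 'I' END,
          SYSDATE,
          CASE WHEN DELETING THEN :OLD.ORDER_ID ELSE :NEW.ORDER_ID END);
END;
/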
So OGG captures the changed data and writes it to the target; there, ODI picks up the newly written records through its journals and continues processing.
So, OGG for high-speed, real-time capture and replication into a staging area, and ODI for complex, CDC-aware transformations into the final structure. In the end, we achieve an architecture that is both highly efficient and massively scalable.
This is how you build a modern DWH fit for today's needs. Real-time DWHs and Data Lakehouses are one step ahead: to have a truly real-time DWH, all data processing logic should be done in a single streaming layer, and all data should be treated as a stream. Next time, we will take a look at those as well.
That's it for today. I hope you'll find this blog post helpful.
If you want to ask a question, please don't comment here.
For your questions, please create an issue in my forum.
Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html
Register and create an issue in the related category.
I will support you from there.