The strategy of extracting, transforming and then loading data (ETL) to create a version of your data optimized for analytics has been around since the 1970s and its challenges are well understood. The time it takes to run an ETL job is dependent on the total data volume so that the time and resource costs rise as an enterprise’s data volume grows. The requirement for analytics databases to be more current even as total data volume grows has created a demand for faster and faster ETL engines along with bigger clusters to run them on, and even a growing industry of consultants expert at low level optimizations of these ETL jobs.
Streaming transformations provides an alternative to ETL that allows you to maintain an analytics database which remains current with the data in your systems of record and where the ongoing resource cost is based on the volume of data changes occurring rather than the total amount of data. Streaming transformation uses the stream of changes occurring in your systems of record to make the appropriate changes in the analytics database to keep it synchronized.
A streaming transformation pipeline will generally involve a combination of tools. As an example, an enterprise-grade database change capture technology (such as IBM’s InfoSphere Replication Server) uses log-based capture detection technology to create the stream of changes with minimal impact to your systems of record. This stream of change data is cached in Apache Kakfa. Spark Streaming then processes the stream of changes and makes the changes to the analytics database.
Apache Kafka plays a crucial role in this pipeline and you will want not just Apache Kafka but the surrounding Confluent Platform, especially the Schema Registry. You’ll store your data in Avro format and leverage the log compaction garbage collection mechanism so that the Kafka Topic will always retain a full copy of the source data along with the ongoing stream of changes. There is a great blog post about this strategy (called bottled water) at the Confluent website
I won’t go into it here, but it's useful to note that this style of data in Kafka deals well with structural changes to your relational data as well, a modern inevitability.
When there is a full copy of the source data in Kafka then you only ever need rely on the Kafka topic for maintaining your analytics database. If your analytics database ever needs to be re-initialized or you are requested to create an additional analytics representation of your system of record data you can do all that solely from the data stored in Apache Kafka avoiding the risks and complications in gaining additional access and adding additional load to your systems of record.
A key consideration in making the move to streaming analytics is that you don’t want your ETL developers to require retraining to become experts in stream processing. The standard ways to define ETL jobs provides all the information necessary to specify what the streaming transformation needs to do. You can think of the contract for a streaming transformation that it keeps the target database always looking like it would if the conventional ETL job had been run against the current data in the source systems.
As streaming transformation becomes more commonplace I expect that there will be tools created that can use an ETL-style job description as the input to define or generate the corresponding streaming transformation job but unfortunately we aren’t there yet. At this point there is still some expert work required to do this translation. Any transformations (for example, vertical or horizontal filtering) that occurs at the row level can be done identically in a streaming transformation but more complex transformations require translations. Let me provide examples of this translation for some of the standard operations:
Joins: If you are joining two tables together you will need to keep in mind that you’ll be processing the stream of changes for each table independently. The most practical option is to maintain an Outer Join in the destination database. If, for example, you see a new row inserted for Table 1 you would check if a row already exists for this key and if so just update the columns specific to Table 1. If the row doesn’t yet exist for the key then you’ll need to insert that row, providing Null values for all the columns specific to Table2. If you see a delete for Table 1 you would just update the Table1 specific columns to be Null, so that a subsequent Insert will be able to just update that row, not losing the corresponding values for that key from Table2. A simple view on that Outer Join can be used to provide a version of the data corresponding to an Inner join for your analytics queries.
Insert into Table1 (Key, Col1) values (77,”val1”); -> Insert into TargetTable (Key, Col1, Col2) values (77,”val1”,null); Insert into Table2 (Key, Col2) values (77, “val2); -> Update TargetTable Set Col2 = “val2” where Key = 77; Delete from Table1 where Key = 77; -> Update TargetTable Set Col1 = null where Key == 77; Insert Into Table1 (Key, Col1) values (77,”newValue1”); -> Update TargetTable Set Col1 = “newValue1”) where Key ==77;
Accumulations: If you are maintaining totals based on a group of rows from the source table you will need to do math on those totals as you see new data. For a Fact/Dimension model you may want to leverage the Dimension table to dynamically identify all the various fact rows a new source row corresponds to, and add the value from the new row to each of the totals. There is a potential complication since change streams may not provide the original value for an Update or Delete operation. Without that original value you aren’t able to determine the correct adjustment to the total. But generally in these cases the source data is immutable (e.g. an audit trail of purchase events) so your streaming transformation can assume that you will only see Insert operations in your change stream.
Insert Into Sales (Day, Time, Product, Amount) values (“Tuesday”,”10am”,”Apple”,2); -> Update DayFacts Set Total = Total + Amount Where Day = “Tuesday”; Update TimeFacts Set Total = Total + Amount where TimeSlice = “10-11am”; Update ProductFacts Set Total = Total + Amount where Product = “Apple”;
Sorts: Sorting is no longer relevant in streaming transformations as you will be processing each row independently. Sorts are most commonly done prior to Join operations and can just be ignored.
I hope I’ve encouraged some of you to revisit your ETL and try converting some of that to streaming transformations. Enabling your business to get their insight from more current data can make this conversion an easy sell. This can even serve as an effective strategy to get the buy-in necessary to retire some of your older ETL technology and replace it with more Spark!