Skip to content

Pull change event producer

Regunath B edited this page Jul 24, 2014 · 9 revisions

The Pull based change event producer is an alternative to the log-mining approach for sourcing data changes. This producer too however remains true to design principles of Aesop described in Aesop Architecture. This producer may be used for data sources that do not have a log-mining producer implementation. It may also be used when change data is spread across multiple data sources and consequently establishing update transaction boundaries becomes difficult with log-mining.

The pull producer makes one big tradeoff though - that of being able to detect changes in near real-time. This constraint is due to the fact that this producer works on the approach of taking periodic snapshots of the entire source data and performing diffs between snapshots to detect changes. The granularity of detected changes is still at attribute-level for individual instances of a type (or) record. Snapshot creation and diff evaluation is done using Netflix Zeno

The Pull change event producer is implemented as two Maven modules comprising:

  • Snapshot creator - Available in the runtime-snapshot-serializer module. Snapshot creation is designed as a cron-like batch job implemented on Trooper Batch. Provides components to create the Netflix Zeno State Engine and append data to it. Also manages creation of serialized snapshots and deltas. Refer to the Zeno documentation on Transporting Data for understanding the Zeno State engine, snapshots and deltas.
  • Diff based Change event producer - Available in the diff-producer module. The Diff event producer uses Zeno's libraries for reading existing State engine snapshots and deltas and performing a Diff on them. The Diff is then interpreted suitably to construct the Avro change data. The snapshots and deltas may be created using the Snapshot creator described on this page or any other suitable alternate mechanism.

The following considerations are worth keeping in mind when using this producer:

  • Staleness in data between the source and consumer systems. The latency in change propagation will be a function of frequency of data snap-shotting, the size of the source data and the throughput of snapshot creation.
  • A common way to create snapshots is using an Iterator API i.e. a Http REST like service that enables scanning through the source data in a paginated manner. The throughput of this service determines snapshotting throughput.
  • Snapshots and deltas are loaded into memory using the Zeno state engine. The size of serialized snapshots will be limited by physical memory available on the host. Partitioning data and thereby reducing the size of snapshots might be a viable option.
  • Snapshots are a point-in-time extract of source data. The version of data is therefore tied to the time of these snapshots and not to individual updates on an entity. It is therefore quite possible for multiple updates to a data record/entity to occur between snapshots. Change data consumers will however only see updates grouped by time of snapshot and not as individual updates/versions.