track per-stream record counts and records committed #3247
Labels: area/platform, team/compose, team/platform-move, type/enhancement
Goals
We currently track the total number of records emitted by the Source. We want to track two additional measures of volume: the number of records emitted per stream, and the number of records committed to the Destination.
Ideally we would have the combination of the two: the number of records per stream committed to the Destination.
In this iteration of this feature, we do not want to add a new message type to the Airbyte Protocol to track this stat. Instead we want to see how much we can track simply within the ReplicationWorker.

Proposed Solution
One solution we have discussed is the following:
Modify the MessageTracker to track the record count at the stream level. That means instead of tracking a single count for the total records, we would have it track records by stream.

In order to support tracking the number of records committed to a destination, we also need to track counts by state. That is because the Destination will emit state messages when it commits the associated records. There is no guarantee that every state message gets emitted back by the Destination. The only guarantee is that for whatever state message it does emit, the records associated with that state (and all preceding states) have been committed. Leveraging this knowledge means that if the MessageTracker tracks record counts by state, we can take whatever State message was last emitted by the Destination and know that at least the records emitted up to that State message have now been committed.

Thus the MessageTracker needs to track counts by state objects. Here is an example of what this might look like. If we saw the following messages emitted from the source:
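(The stream names and counts in this sketch are illustrative assumptions, not taken from a real sync.)

```text
RECORD (stream1)
RECORD (stream1)
RECORD (stream2)
STATE  (state1)
RECORD (stream2)
STATE  (state2)
```

then the per-state, per-stream counts would be:

```text
state1 -> { stream1: 2, stream2: 1 }
state2 -> { stream2: 1 }
```

(If stream1 is assigned index 1 and stream2 index 2, this is the same case encoded in the byte arrays under Option 2 below.)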
We would expect the MessageTracker to be able to tell us exactly those per-state, per-stream counts.

Storing the Metadata
Because the number of State messages is technically unbounded, we need to be careful to store this metadata in a way that does not cause memory problems. This section describes two potential approaches.
Option 1: Use a fixed-length array to store counts for each State message
We can store a Pair of the state object and the total record counts for each stream. Instead of storing the whole state object, we can hash the state objects to 4 bytes with murmur hash (Guava's Hashing.murmur3_32()). The counts can be stored in an array where the index in the array corresponds to the stream. For example, if we had stream1, stream2, and stream3, then given the array [10, 100, 32] we would know that at this State message, stream1 had 10 records, stream2 had 100, and stream3 had 32.

Pros
- Simple, fixed-size layout with constant-time lookup of a stream's count by its index.

Cons
- Stores a count for every stream at every state message, even when most streams saw no new records, which wastes memory when counts are sparse across streams.
Thus the final data structure could just be a long array, where the first value is the hashed representation of the state and each subsequent element corresponds to a stream. We could store each of these arrays in a List.
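A minimal sketch of that structure, assuming each stream has already been assigned a fixed index (the class and method names here are illustrative, not the actual MessageTracker API):

```java
import java.util.ArrayList;
import java.util.List;

// Option 1 sketch: one fixed-length long[] per state message.
// Index 0 holds the hashed state; index i holds the count for the stream assigned index i - 1.
final class FixedWidthStateCounts {

  private final int streamCount;
  private final List<long[]> countsByState = new ArrayList<>();

  FixedWidthStateCounts(final int streamCount) {
    this.streamCount = streamCount;
  }

  void trackState(final long stateHash, final long[] recordCountsPerStream) {
    final long[] row = new long[streamCount + 1];
    row[0] = stateHash;
    System.arraycopy(recordCountsPerStream, 0, row, 1, streamCount);
    countsByState.add(row);
  }
}
```

For the example above, trackState(hashOfState1, new long[] {2, 1}) would record state1's counts for stream1 and stream2.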
Option 2: For each state, only store counts for streams that actually had a record
Instead of storing a count for every stream for each state, we could store deltas only for the streams that actually saw at least one record. These deltas could be stored in a byte array with the following schema:
<8 bytes: state hash><4 bytes: stream index><8 bytes: records added in this state message>... the last two fields repeat once for each stream present in that state message
For example, reusing the case above, we would have 2 byte arrays:
<hash of state 1 (long)> 1 (int) 2 (long) 2 (int) 1 (long)
<hash of state 2 (long)> 2 (int) 1 (long)
We could then store each byte array in a Java List.
Pros
- Only stores entries for streams that actually saw records at a given state message, so it stays small when counts are sparse across streams.

Cons
- Variable-length entries make encoding and decoding slightly more complex, and looking up a stream's count requires scanning the array rather than indexing into it.

Note: we could actually use a short instead of an int for the index of the stream names; we do not need to support more than 32K stream names.
Based on our best guess about how dense the per-stream counts will be, we should go with the second option; a sketch of the encoding follows.
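A minimal sketch of that encoding, assuming each stream has been assigned an int index (names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.Map;

final class StateCountEncoder {

  // Layout: <8 bytes: state hash><4 bytes: stream index><8 bytes: records added>...,
  // with the index/count pair repeated once per stream that saw records for this state.
  static byte[] encode(final long stateHash, final Map<Integer, Long> recordsAddedByStreamIndex) {
    final ByteBuffer buffer = ByteBuffer.allocate(
        Long.BYTES + recordsAddedByStreamIndex.size() * (Integer.BYTES + Long.BYTES));
    buffer.putLong(stateHash);
    recordsAddedByStreamIndex.forEach((streamIndex, recordsAdded) -> {
      buffer.putInt(streamIndex);
      buffer.putLong(recordsAdded);
    });
    return buffer.array();
  }
}
```

Reusing the example, encode(hashOfState1, Map.of(1, 2L, 2, 1L)) would produce the first byte array shown above (the order of the per-stream entries within a state is not significant).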
Edge Cases
Collisions
If 2 state objects have hashes that collide, we will not be able to tell which state object the metadata belongs to. If we see two non-consecutive state messages that hash to the same value, then we should not track the metadata for the states with collisions. We would do this by:
1. not tracking the metadata for the new state that collided,
2. removing the metadata for the state that was already tracked (the one that was collided with),
3. adding this hash to a hash set of bad states so that if we see this hash again we do not store it.
See the next section for how the system should handle the case where a Destination outputs a state message that is not tracked.
There is an alternative more complicated approach where we track multiple hashes for state objects when this case arises, but because we think the case will be rare, it's not worth adding the extra complexity.
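A minimal sketch of the bookkeeping described above, keyed by the 8-byte state hash (class and method names are illustrative):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// For simplicity this treats any repeated hash as a collision; a real implementation would
// first check whether the repeat is just the immediately preceding state being emitted again.
final class CollisionAwareStateCounts {

  private final Map<Long, byte[]> countsByStateHash = new LinkedHashMap<>();
  private final Set<Long> badHashes = new HashSet<>();

  void track(final long stateHash, final byte[] encodedCounts) {
    if (badHashes.contains(stateHash)) {
      // 1. this hash collided before: do not track metadata for the new state
      return;
    }
    if (countsByStateHash.containsKey(stateHash)) {
      // 2. remove the metadata for the state that was already tracked
      countsByStateHash.remove(stateHash);
      // 3. remember the bad hash so we never store it again
      badHashes.add(stateHash);
      return;
    }
    countsByStateHash.put(stateHash, encodedCounts);
  }
}
```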
Memory Management
In pathological cases, it is still possible for the amount of memory used by this metadata tracking to get large. We do not want to risk it causing an OOM for a sync, so we should make sure it is capped (a sketch of one way to do this follows). When that cap is reached, we should start removing state metadata, oldest first. If, when the job completes and the Destination emits a state, that state is still in the tracked metadata, then we can still report the committed record count. If it is not, then we should note that we do not know how many records were committed.
We think running into memory issues will be rare; if this turns out not to be true, we can consider spilling to disk or some other non-memory persistence layer.
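A minimal sketch of the cap, evicting the oldest state first (the cap value and names are placeholders, not real configuration):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedStateCounts {

  private static final int MAX_TRACKED_STATES = 10_000; // illustrative cap only

  // Insertion-ordered map: the first entry is always the oldest tracked state.
  private final Map<Long, byte[]> countsByStateHash = new LinkedHashMap<>();

  void track(final long stateHash, final byte[] encodedCounts) {
    if (countsByStateHash.size() >= MAX_TRACKED_STATES) {
      // evict the oldest state's metadata first
      final Iterator<Map.Entry<Long, byte[]>> it = countsByStateHash.entrySet().iterator();
      it.next();
      it.remove();
    }
    countsByStateHash.put(stateHash, encodedCounts);
  }

  // Returns null if the state was never tracked or has been evicted; in that case we report
  // that the number of committed records is unknown.
  byte[] countsFor(final long stateHash) {
    return countsByStateHash.get(stateHash);
  }
}
```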
Persistence
In the StandardSyncSummary we already have a recordsSynced field. That name is a little vague, because it implies it is the number of records committed, but in practice it is really the number of records emitted. Thus we will focus on adding new, more clearly named fields to the struct. After they are added, we will populate both the new fields and the old field. Then we can run a database migration to remove recordsSynced and, for old jobs, move the value of recordsSynced into the new schema.

New fields:
- a field for records emitted (the existing recordsSynced values will go here)
- a field for records committed

Open Questions