
track per-stream record counts and records committed #3247

Closed
jrhizor opened this issue May 5, 2021 · 3 comments
Labels
area/platform issues related to the platform team/compose team/platform-move type/enhancement New feature or request

Comments


jrhizor commented May 5, 2021

Goals

We currently track the total number of records emitted by the Source. We want to track two additional metrics related to volume:

  1. The number of records emitted per stream.
  2. The number of records committed to the Destination.

Ideally we would have the combination of the two: the number of records per stream committed to the Destination.

In this iteration of the feature, we do not want to add a new message type to the Airbyte Protocol to track these stats. Instead we want to see how much we can simply track in the ReplicationWorker.

Proposed Solution

One solution we have discussed is the following:

Modify the MessageTracker to track the record count at the stream level. That means instead of tracking a single count for the total records, we would have it track records by stream.

In order to support tracking the number of records committed to a destination, we need to track counts by state message. That is because the Destination emits a state message when it commits the associated records. There is no guarantee that the Destination emits every state message it receives; the only guarantee is that the last state message it emits corresponds to records that have been committed. Leveraging this knowledge, if the MessageTracker tracks record counts by state message, we can take the last State message emitted by the Destination and know that at least the records emitted up to that State message have now been committed.

Thus the MessageTracker needs to track counts by state objects. Here is an example of what this might look like:

If we saw the following messages emitted from the source:

Message (stream 1), Message (stream 1), Message (stream 2), State 1, Message (stream 2), State 2.

We would expect the MessageTracker to be able to tell us:

state 1: 
  stream 1: 2
  stream 2: 1
state 2:
  stream 1: 2
  stream 2: 2
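The counting behavior above can be sketched as follows. This is a minimal illustration only, not the actual Airbyte MessageTracker API; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: accumulates running per-stream record counts and
// snapshots the cumulative totals each time a state message arrives.
public class StreamStateTracker {
  private final Map<String, Long> runningCounts = new HashMap<>();
  private final List<Map<String, Long>> perStateSnapshots = new ArrayList<>();

  public void acceptRecord(String streamName) {
    runningCounts.merge(streamName, 1L, Long::sum);
  }

  public void acceptState() {
    // copy the running totals so later records do not mutate this snapshot
    perStateSnapshots.add(new HashMap<>(runningCounts));
  }

  public List<Map<String, Long>> snapshots() {
    return perStateSnapshots;
  }

  public static void main(String[] args) {
    StreamStateTracker t = new StreamStateTracker();
    t.acceptRecord("stream1");
    t.acceptRecord("stream1");
    t.acceptRecord("stream2");
    t.acceptState(); // state 1: {stream1=2, stream2=1}
    t.acceptRecord("stream2");
    t.acceptState(); // state 2: {stream1=2, stream2=2}
    System.out.println(t.snapshots());
  }
}
```

Running the example message sequence through this sketch reproduces the two snapshots shown above.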

Storing the Metadata

Because the number of State messages is technically unbounded, we need to be careful to store this metadata in a way that does not cause memory problems. This section describes two potential approaches.

Option 1: Use a fixed-length array to store counts for each State message

We can store a Pair of the state object and the total record counts for each stream. Instead of storing the whole state object, we can hash it to 4 bytes with murmur hash (guava: Hashing.murmur3_32()). The counts can be stored in an array where the index in the array corresponds to the stream. For example, if we had stream1, stream2, and stream3, then given the array [10, 100, 32] we would know that at this State message, stream1 had 10 records, stream2 had 100, and stream3 had 32.

# of state messages | # of streams | memory used
1                   | 1            | 16 bytes
1 million           | 1            | 16 MB
1                   | 5,000        | 40 KB
1 million           | 5,000        | 40 GB
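The table values follow from the per-entry cost of 8 bytes for the state hash plus 8 bytes per stream count; a quick arithmetic check (illustrative helper, not part of the proposal):

```java
public class MemoryEstimate {
  // bytes for one entry: 8-byte state hash + 8 bytes per stream count
  static long entryBytes(long numStreams) {
    return 8 + 8 * numStreams;
  }

  static long totalBytes(long numStates, long numStreams) {
    return numStates * entryBytes(numStreams);
  }

  public static void main(String[] args) {
    System.out.println(totalBytes(1, 1));             // 16 bytes
    System.out.println(totalBytes(1_000_000, 1));     // 16,000,000 ~ 16 MB
    System.out.println(totalBytes(1, 5_000));         // 40,008 ~ 40 KB
    System.out.println(totalBytes(1_000_000, 5_000)); // 40,008,000,000 ~ 40 GB
  }
}
```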

Pros

  • It is easy to calculate the memory usage: each entry is 8 bytes (hash of the state) + 8 bytes × # of streams.

Cons

  • Packing into arrays is fairly wasteful, as the counts generally will not change much from one state message to the next. (The # of state messages scales with the number of streams.)

Thus the final data structure could just be a long array, where the first element is the long representation of the state hash and each subsequent element corresponds to a stream. We could store each of these arrays in a List.
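A sketch of that packing (class and method names are hypothetical):

```java
import java.util.Arrays;

public class StatePackedCounts {
  // Option 1 sketch: pack [stateHash, countStream0, countStream1, ...]
  // into a single long[] entry, one entry per state message.
  static long[] pack(long stateHash, long[] countsByStreamIndex) {
    long[] entry = new long[countsByStreamIndex.length + 1];
    entry[0] = stateHash;
    System.arraycopy(countsByStreamIndex, 0, entry, 1, countsByStreamIndex.length);
    return entry;
  }

  // stream index i maps to array slot i + 1 (slot 0 holds the state hash)
  static long countForStream(long[] entry, int streamIndex) {
    return entry[streamIndex + 1];
  }

  public static void main(String[] args) {
    long[] entry = pack(0xCAFEL, new long[] {10, 100, 32});
    System.out.println(Arrays.toString(entry));   // [51966, 10, 100, 32]
    System.out.println(countForStream(entry, 2)); // 32
  }
}
```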

Option 2: For each state, only store counts for streams that actually had a record

Instead of storing a count for every stream for each state, we could store only the deltas for the streams that actually saw at least one record since the previous state. They could be stored in a byte array with the following schema:
<8 bytes: state hash><4 bytes: index of stream><8 bytes: records added in this state message>... the last 2 repeat for as many streams are present in that state message

For example, reusing this case:

Message (stream 1), Message (stream 1), Message (stream 2), State 1, Message (stream 2), State 2.

We would have 2 byte arrays:
<hash of state 1 (long)>1(int)2(long)2(int)1(long)
<hash of state 2 (long)>2(int)1(long)

We could then store each byte array in a Java List.
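The encoding could be sketched like this, assuming a hypothetical encode helper (the real implementation would also need a matching decoder):

```java
import java.nio.ByteBuffer;

public class StateDeltaEncoding {
  // Option 2 sketch: <8-byte state hash><4-byte stream index><8-byte delta>...,
  // one (index, delta) pair per stream that saw records since the previous state.
  static byte[] encode(long stateHash, int[] streamIndexes, long[] deltas) {
    ByteBuffer buf = ByteBuffer.allocate(8 + streamIndexes.length * 12);
    buf.putLong(stateHash);
    for (int i = 0; i < streamIndexes.length; i++) {
      buf.putInt(streamIndexes[i]);
      buf.putLong(deltas[i]);
    }
    return buf.array();
  }

  public static void main(String[] args) {
    // state 1: stream 1 gained 2 records, stream 2 gained 1
    byte[] state1 = encode(1L, new int[] {1, 2}, new long[] {2, 1});
    // state 2: only stream 2 gained a record
    byte[] state2 = encode(2L, new int[] {2}, new long[] {1});
    System.out.println(state1.length); // 8 + 2*12 = 32
    System.out.println(state2.length); // 8 + 1*12 = 20
  }
}
```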

Pros

  • Generally speaking, if there are more state messages, we expect each state message to affect fewer streams, so we expect the memory usage to be more efficient.

Cons

  • In the dense case we store 12 bytes per stream (4-byte index + 8-byte count) instead of 8.

Note: we could actually use a short instead of an int for the stream index, since we do not need to support more than 32K stream names.

Based on our best guess about the density, we should go with the second option.

Edge Cases

Collisions

If 2 state objects have hashes that collide, we will not be able to tell which state object the metadata belongs to. If we see two non-consecutive state messages that hash to the same value, we should stop tracking metadata for the states with that hash:

  1. Do not track the metadata for the new state that collided.
  2. Remove the metadata for the state that was already tracked (the one that was collided with).
  3. Add this hash to a hash set of bad states so that if we see it again we do not store it.

See the next section for how the system should handle the case where a Destination outputs a state message that is not tracked.
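The three steps above could be sketched as follows. This is an assumption-laden illustration: it treats any non-consecutive repeat of a hash as a collision (a consecutive repeat could just be the same state re-emitted), and the class is hypothetical, not the actual MessageTracker.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CollisionHandling {
  private final Map<Long, long[]> countsByStateHash = new HashMap<>();
  private final Set<Long> badHashes = new HashSet<>();
  private Long mostRecentHash = null;

  public void track(long hash, long[] counts) {
    if (badHashes.contains(hash)) {
      return; // step 1: this hash collided before, never track it again
    }
    // a non-consecutive repeat of a hash means two distinct states collided
    boolean nonConsecutiveRepeat =
        countsByStateHash.containsKey(hash)
            && (mostRecentHash == null || mostRecentHash != hash);
    if (nonConsecutiveRepeat) {
      countsByStateHash.remove(hash); // step 2: drop the existing entry
      badHashes.add(hash);            // step 3: blacklist the hash
      return;
    }
    countsByStateHash.put(hash, counts);
    mostRecentHash = hash;
  }

  public boolean isTracked(long hash) {
    return countsByStateHash.containsKey(hash);
  }

  public static void main(String[] args) {
    CollisionHandling c = new CollisionHandling();
    c.track(42L, new long[] {1});
    c.track(7L, new long[] {2});
    c.track(42L, new long[] {3}); // non-consecutive repeat of 42: collision
    System.out.println(c.isTracked(42L)); // false
    System.out.println(c.isTracked(7L));  // true
  }
}
```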

There is an alternative more complicated approach where we track multiple hashes for state objects when this case arises, but because we think the case will be rare, it's not worth adding the extra complexity.

Memory Management

In pathological cases, the memory used by this metadata tracking could still get large. We do not want to risk it causing an OOM for a sync, so we should make sure it is capped. When the cap is reached, we should start removing state metadata, oldest first. When the job completes and the Destination emits a state, if that state is still in what is left of the tracked metadata, we can report the committed records count; if it is not, we should note that we do not know how many records were committed.
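One way to implement oldest-first eviction, assuming the metadata lives in an in-memory map (this is a sketch using java.util.LinkedHashMap, not a statement about the actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Keeps at most maxEntries state entries; once the cap is hit,
// LinkedHashMap's removeEldestEntry hook drops the oldest state first.
public class CappedStateMetadata extends LinkedHashMap<Long, long[]> {
  private final int maxEntries;

  public CappedStateMetadata(int maxEntries) {
    super(16, 0.75f, false); // false = iterate in insertion order (oldest first)
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<Long, long[]> eldest) {
    return size() > maxEntries;
  }

  public static void main(String[] args) {
    CappedStateMetadata m = new CappedStateMetadata(2);
    m.put(1L, new long[] {1});
    m.put(2L, new long[] {2});
    m.put(3L, new long[] {3}); // evicts the entry for state 1
    System.out.println(m.containsKey(1L)); // false
    System.out.println(m.keySet());        // [2, 3]
  }
}
```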

We think running into memory issues will be rare; if that turns out not to be true, we can consider spilling to disk or some other non-memory persistence layer.

Persistence

In the StandardSyncSummary we already have a recordsSynced field. That name is misleading: it implies the number of records committed, but in practice it is really the number of records emitted. Thus we will focus on adding new, more clearly named fields to the struct. After they are added, we will populate both the new and old fields. Then we can run a database migration to remove recordsSynced and, for old jobs, move its value into the new schema.

New fields:

  • totalRecordsEmitted - long, required (when we run the migration to remove recordsSynced the values will go here)
  • totalRecordsCommitted - long, nullable
  • streamNameToRecordsEmitted - map<string, long>, required
  • streamNameToRecordsCommitted - map<string, long>, nullable
  • totalStateMessagesEmitted - long, required

Open Questions

  • Is tracking bytes per stream useful as well? If so, is it important to track bytes per stream committed, or just bytes per stream emitted?
    • This is probably a useful thing to track, but the committed part is probably not important. It is helpful to understand the size of records in streams (as it can vary a lot from stream to stream), but understanding the total number of bytes committed doesn't seem to answer any pressing question.
@jrhizor jrhizor added the type/enhancement New feature or request label May 5, 2021

jrhizor commented May 5, 2021

We should also consider tracking dropped/failing records if there are such cases.

@cgardens cgardens changed the title track per-stream record counts track per-stream record counts and records committed Dec 11, 2021
@cgardens cgardens added this to the Platform 2021-12-23 milestone Dec 16, 2021
@cgardens
Contributor

to do:

  • migrate old values to use new columns? this should be a pretty easy query
  • UI work - to create FE issue

@cgardens
Contributor

@pmossman i think we can close this. it made it to the FE!
