Track max and mean time between state message emitted and committed #15702

alovew · 2022-08-16T20:03:17Z

Segment tracking for time between state message emitted from source and committed by destination

New class StateMetricsTracker for tracking state message data and calculating max/means for new metrics
This includes a memory limit, so that syncs don't fail because the StateMetricsTracker hashmap is hogging memory, and Datadog tracking, so we know whether this is happening
We are now tracking metrics for all types of syncs: per-stream, global, and legacy. For per-stream we are tracking state message commits per stream, but for global and legacy we are hashing all stream data together

gosusnp

I think there may be one issue with how we track timestamps here.

Also, can we have a test with more than one state to track? For example, I don't think I saw a test where we'd only expire some states.

gosusnp · 2022-08-16T21:26:29Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

+    streamDescriptorsToUpdate.forEach(streamDescriptor -> {
+      final HashMap<Integer, DateTime> stateHashToTimestamp = new HashMap<>();
+      stateHashToTimestamp.put(stateHash, timeEmitted);
+      streamDescriptorToStateMessageTimestamps.put(streamDescriptor, stateHashToTimestamp);


Aren't we erasing the previous data of the streamDescriptor in this loop? It should probably be a create if not exist instead.

yes! thank you
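The "create if not exist" suggestion above can be sketched as follows. This is a minimal illustration, not the code that was merged: the class name `StateTimestampIndex` is hypothetical, plain `String` keys stand in for `StreamDescriptor`, and `java.time.Instant` stands in for the `DateTime` type used in the diff.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the map in the diff above; names are assumptions.
final class StateTimestampIndex {

  private final Map<String, Map<Integer, Instant>> timestampsByStream = new HashMap<>();

  void recordEmitted(final Iterable<String> streamKeys, final int stateHash, final Instant timeEmitted) {
    for (final String streamKey : streamKeys) {
      // Create the inner map only when the stream has none yet, then add to it,
      // so earlier entries for the same stream are preserved.
      timestampsByStream
          .computeIfAbsent(streamKey, key -> new LinkedHashMap<>())
          .put(stateHash, timeEmitted);
    }
  }

  int trackedStateCount(final String streamKey) {
    return timestampsByStream.getOrDefault(streamKey, Map.of()).size();
  }
}
```

With `computeIfAbsent`, a second state message for the same stream is added next to the first one instead of replacing the whole inner map.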

gosusnp · 2022-08-16T21:31:57Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

+
+      // update minTime to earliest timestamp that exists for a state message for this particular stream
+      // and delete state message entries that are equal to or earlier than the destination state message
+      for (final Map.Entry<Integer, DateTime> stateMessageTime : stateMessagesForStream.entrySet()) {


OPT: Looking at this loop, it feels like we may as well track the <hash, datetime> in a queue.
In the current implementation, even though we have direct access through the hash, we have to scan through all the items to expire older messages.
It feels like having a list ordered by time would be more effective overall.

I think we'd still have to iterate through the whole list though and check the timestamp of each to see when we've reached the correct timestamp -- or are you thinking of a different kind of implementation that's not occurring to me right now?

If the list is ordered, we can stop as soon as we find a match of the hash. All the items we saw should be expired/removed. Depending on the number of states we end up having in flight, it may matter.

If I got it right, the current expectation is that for a given stream, state message order is preserved, so if we get a state from the destination, it means that everything until that state has been processed. With that in mind, it would be more robust to just keep them in order rather than rely on timestamps for that.

ah i see, so still O(n) but in practice it could be faster. so it would look like this:

stream_1: [[state_hash_1, timestamp_1], [state_hash_2, timestamp_2], [state_hash_3, timestamp_3]],
stream_2: [[etc etc]]

then when we get a state message from the destination, we can safely start at the beginning of the state hash list and use the first timestamp as the min timestamp. then we'll remove them one by one as we iterate through until we get to the state hash that matches.
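The ordered-list approach described above could look roughly like the sketch below. It assumes that state message order is preserved per stream, as discussed in this thread. The class `OrderedStateQueue`, its method names, and the `String` stream key are hypothetical, and timestamps are plain epoch milliseconds.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch: one FIFO queue of (hash, emit time) per stream.
final class OrderedStateQueue {

  private static final class Entry {
    final int stateHash;
    final long emittedAtMillis;

    Entry(final int stateHash, final long emittedAtMillis) {
      this.stateHash = stateHash;
      this.emittedAtMillis = emittedAtMillis;
    }
  }

  private final Map<String, Deque<Entry>> entriesByStream = new HashMap<>();

  void onSourceState(final String streamKey, final int stateHash, final long emittedAtMillis) {
    entriesByStream.computeIfAbsent(streamKey, key -> new ArrayDeque<>())
        .addLast(new Entry(stateHash, emittedAtMillis));
  }

  // Returns the earliest emit time covered by this commit, or empty when the
  // hash was never seen for the stream (the queue is left untouched then).
  OptionalLong onDestinationState(final String streamKey, final int stateHash) {
    final Deque<Entry> queue = entriesByStream.get(streamKey);
    if (queue == null || queue.stream().noneMatch(entry -> entry.stateHash == stateHash)) {
      return OptionalLong.empty();
    }
    final long earliest = queue.peekFirst().emittedAtMillis;
    // Everything up to and including the matching hash has been committed.
    Entry removed;
    do {
      removed = queue.pollFirst();
    } while (removed.stateHash != stateHash);
    return OptionalLong.of(earliest);
  }

  int pendingCount(final String streamKey) {
    final Deque<Entry> queue = entriesByStream.get(streamKey);
    return queue == null ? 0 : queue.size();
  }
}
```

The head of the queue always holds the oldest uncommitted state, so the minimum timestamp is read directly and no timestamp comparison is needed. The up-front membership check is an extra pass added here so that an unknown hash does not drain the queue.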

alovew · 2022-08-16T22:16:47Z

can we have a test with more than one state to track?

@gosusnp you're right about tests, will add some now.

benmoriceau · 2022-08-17T16:04:29Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

    sourceOutputState.set(new State().withState(stateMessage.getData()));
    totalSourceEmittedStateMessages.incrementAndGet();
    final int stateHash = getStateHashCode(stateMessage);
+
+    if (AirbyteStateType.LEGACY != stateMessage.getType()) {


Why do we avoid the tracking if the state is legacy?

we need to track the emit -> commit timing data per stream, and with legacy we can't track per stream state messages so it requires a separate code path, and since it's deprecated anyway we decided against supporting it.

now I'm actually getting confused about how state is stored for global vs per stream vs legacy though. in airbyte_protocol.yaml it looks like connections with GLOBAL state should look at the data field (the same as legacy) but I thought this was updated?

Global state should eventually be reading the AirbyteStateMessage::global attribute not the data. I thought reading data was part of our backward compatible implementation.

alovew · 2022-08-17T21:25:30Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

+    } else if (AirbyteStateType.STREAM == stateMessage.getType()) {
+      stateMessageData = stateMessage.getStream().getStreamState();
+    } else if (AirbyteStateType.GLOBAL == stateMessage.getType()) {
+      stateMessageData = stateMessage.getGlobal().getSharedState();


@gosusnp is shared state what I should be looking at here?

@benmoriceau was saying that this may not uniquely identify a state

gosusnp

There might be a few issues around error tracking and namespace that may cause unexpected errors.
Happy to go over my comments if that helps.

gosusnp · 2022-08-19T22:07:00Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

    } catch (final StateDeltaTracker.StateDeltaTrackerException e) {
      log.warn("The message tracker encountered an issue that prevents committed record counts from being reliably computed.");
      log.warn("This only impacts metadata and does not indicate a problem with actual sync data.");
      log.warn(e.getMessage(), e);
      unreliableCommittedCounts = true;
+    } catch (final StateMetricsTracker.StateMetricsTrackerException e) {


I don't think this is right. If a StateDeltaTrackerException is thrown, we would not update the stateMetricsTracker which would also make the metrics inaccurate.

I feel like trying to split the errors here could still lead to inaccurate metrics. If accuracy is our concern, I'd probably go for tracking exceptions as a whole to decide whether we should emit the metric. Keeping track of which exception was thrown would be good for our understanding of where it failed.

this is probably confusing since I bundled it with the StateDeltaTracker, but it actually is the only error that can occur for the StateMetricsTracker. I can reword the error message too since I think it's not clear.

gosusnp · 2022-08-19T22:09:07Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java


 @Slf4j
 public class AirbyteMessageTracker implements MessageTracker {

-  private static final long STATE_DELTA_TRACKER_MEMORY_LIMIT_BYTES = 20L * 1024L * 1024L; // 20 MiB, ~10% of default cloud worker memory
+  private static final long STATE_DELTA_TRACKER_MEMORY_LIMIT_BYTES = 10L * 1024L * 1024L; // 10 MiB, ~5% of default cloud worker memory
+  private static final long STATE_METRICS_TRACKER_MEMORY_LIMIT_BYTES = 10L * 1024L * 1024L; // 10 MiB, ~5% of default cloud worker memory


This looks fairly suspicious overall, do we know if those numbers are still accurate?

gosusnp · 2022-08-19T22:09:46Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

+   * If the StateMetricsTracker throws an exception, this flag is set to true and the metrics around
+   * max and mean time between state message emitted and committed are unreliable
+   */
+  private boolean unreliableStateTimingMetrics;


See my comment below, I think we should probably just keep track of errors instead of having a split here.

gosusnp · 2022-08-19T22:12:26Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

    try {
      if (!unreliableCommittedCounts) {
-        stateDeltaTracker.commitStateHash(getStateHashCode(stateMessage));
+        stateMetricsTracker.updateStates(stateMessage, stateHash, timeCommitted);
+        stateDeltaTracker.commitStateHash(stateHash);


Will this ever fail? I notice that we only track StateDelta exceptions and not StateMetrics.

oh good call - yes I refactored to use separate booleans here but didn't fully split the updating part.
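A rough sketch of what fully splitting the update could look like is below. The `Tracker` interface and the class name are hypothetical stand-ins for `StateMetricsTracker` and `StateDeltaTracker`; only the two flag names come from the diff. The idea is that each tracker gets its own try/catch and its own flag, so a failure in one does not stop the other from being updated.

```java
// Hypothetical sketch of independent error handling for two trackers.
final class CommitTrackingSketch {

  // Stand-in for a tracker whose update may fail.
  interface Tracker {
    void onCommit(int stateHash) throws Exception;
  }

  private final Tracker metricsTracker;
  private final Tracker deltaTracker;
  private boolean unreliableStateTimingMetrics;
  private boolean unreliableCommittedCounts;

  CommitTrackingSketch(final Tracker metricsTracker, final Tracker deltaTracker) {
    this.metricsTracker = metricsTracker;
    this.deltaTracker = deltaTracker;
  }

  void handleDestinationState(final int stateHash) {
    try {
      if (!unreliableStateTimingMetrics) {
        metricsTracker.onCommit(stateHash);
      }
    } catch (final Exception e) {
      // Only the timing metrics are marked unreliable.
      unreliableStateTimingMetrics = true;
    }
    try {
      if (!unreliableCommittedCounts) {
        deltaTracker.onCommit(stateHash);
      }
    } catch (final Exception e) {
      // Only the committed record counts are marked unreliable.
      unreliableCommittedCounts = true;
    }
  }

  boolean isStateTimingUnreliable() {
    return unreliableStateTimingMetrics;
  }

  boolean isCommittedCountUnreliable() {
    return unreliableCommittedCounts;
  }
}
```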

gosusnp · 2022-08-19T22:13:56Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

+      return hashFunction.hashBytes(Jsons.serialize(stateMessage.getStream().getStreamState()).getBytes(Charsets.UTF_8)).hashCode();
+    } else {
+      // state type is GLOBAL
+      return Objects.hashCode(stateMessage.getGlobal());


Why is this one using hashCode while the others use hashBytes(Jsons.serialize(...))?
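One way to make every state type take the same path is to hash the serialized form in all branches. The sketch below only illustrates that idea: the class name is hypothetical, a plain `String` stands in for the output of `Jsons.serialize(...)`, and `java.util.zip.CRC32` is swapped in for the `hashFunction` used in the diff so that the example needs nothing beyond the standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: one content-based hash for LEGACY, STREAM and GLOBAL alike.
final class StateHashSketch {

  private StateHashSketch() {}

  // serializedState is assumed to be the JSON text of whichever part of the
  // state message identifies it (data, stream state, or global state).
  static int hashSerialized(final String serializedState) {
    final CRC32 crc = new CRC32();
    crc.update(serializedState.getBytes(StandardCharsets.UTF_8));
    // CRC32 yields an unsigned 32-bit value in a long; keep the low 32 bits.
    return (int) crc.getValue();
  }
}
```

Because the hash depends only on the serialized bytes, the source-side and destination-side hashes match whenever the serialized state matches, regardless of how `hashCode` is implemented on the message classes.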

gosusnp · 2022-08-19T23:52:44Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/StateMetricsTracker.java

+      // do not track state message timestamps per stream for GLOBAL or LEGACY state
+      final byte[] stateTimestampByteArray = populateStateTimestampByteArray(stateHash, epochTime);
+      stateHashesAndTimestamps.add(stateTimestampByteArray);
+      remainingCapacity -= stateTimestampByteArray.length;


remainingCapacity should also be updated in the STREAM case.

gosusnp · 2022-08-19T23:54:26Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/StateMetricsTracker.java

+
+  public StateMetricsTracker(final Long memoryLimitBytes) {
+    this.stateHashesAndTimestamps = new ArrayList<>();
+    this.streamStateHashesAndTimestamps = new HashMap<>();


Would it simplify things to track LEGACY and GLOBAL with a hardcoded namespace? I suspect this would remove the need for two different code paths most of the time.

gosusnp · 2022-08-19T23:55:24Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/StateMetricsTracker.java

+                                                          final Long epochTimeEmitted) {
+
+    final StreamDescriptor streamDescriptor = stateMessage.getStream().getStreamDescriptor();
+    final String streamNameAndNamespace = streamDescriptor.getName() + streamDescriptor.getNamespace();


same as above, namespace could be null.
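To illustrate the concern: in Java, concatenating a null namespace produces the literal text "null" (for example "usersnull"), and plain concatenation cannot tell name "ab" with namespace "c" apart from name "a" with namespace "bc". A minimal sketch of a null-safe composite key follows, including a fixed key for GLOBAL and LEGACY state as suggested in the earlier comment. The class `StreamKey` and the reserved name are hypothetical.

```java
import java.util.Objects;

// Hypothetical composite key; the namespace may be null.
final class StreamKey {

  // Fixed key for GLOBAL and LEGACY state. Assumes no real stream uses this reserved name.
  static final StreamKey NON_STREAM = new StreamKey("__non_stream__", null);

  private final String name;
  private final String namespace;

  private StreamKey(final String name, final String namespace) {
    this.name = Objects.requireNonNull(name, "name");
    this.namespace = namespace;
  }

  static StreamKey of(final String name, final String namespace) {
    return new StreamKey(name, namespace);
  }

  @Override
  public boolean equals(final Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof StreamKey)) {
      return false;
    }
    final StreamKey that = (StreamKey) other;
    // Objects.equals treats two null namespaces as equal and never throws.
    return name.equals(that.name) && Objects.equals(namespace, that.namespace);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, namespace);
  }
}
```

Keeping the two fields separate avoids both the "null" text and the ambiguity of concatenation, and a single map keyed by `StreamKey` can then serve all three state types.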

gosusnp · 2022-08-20T00:01:10Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/StateMetricsTracker.java

+    while (iterator.hasNext()) {
+      final byte[] stateMessageTime = iterator.next();
+      final ByteBuffer current = ByteBuffer.wrap(stateMessageTime);
+      remainingCapacity += current.capacity();


Why do we increase the capacity by the capacity of the buffer? I suspect it should be BYTE_ARRAY_SIZE.
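The capacity accounting discussed in the two comments above can be sketched as follows. Note that `ByteBuffer.wrap(array).capacity()` equals `array.length`, so the two approaches give the same number whenever every stored array has the fixed size. Using the constant on both sides makes that symmetry explicit. The class name, the method names, and the 4-byte hash plus 8-byte timestamp layout are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of symmetric memory accounting for fixed-size entries.
final class CapacityAccountingSketch {

  // Assumed layout: 4 bytes for the int hash, 8 bytes for the long epoch time.
  static final int BYTE_ARRAY_SIZE = Integer.BYTES + Long.BYTES;

  private final List<byte[]> entries = new ArrayList<>();
  private long remainingCapacity;

  CapacityAccountingSketch(final long memoryLimitBytes) {
    this.remainingCapacity = memoryLimitBytes;
  }

  // Returns false instead of storing the entry when the limit would be exceeded.
  boolean add(final int stateHash, final long epochTime) {
    if (remainingCapacity < BYTE_ARRAY_SIZE) {
      return false;
    }
    entries.add(ByteBuffer.allocate(BYTE_ARRAY_SIZE).putInt(stateHash).putLong(epochTime).array());
    remainingCapacity -= BYTE_ARRAY_SIZE;
    return true;
  }

  // Removes entries from the front, up to and including the given hash.
  // Assumes the hash is present; a real tracker would need to handle a miss.
  void removeThrough(final int stateHash) {
    final Iterator<byte[]> iterator = entries.iterator();
    while (iterator.hasNext()) {
      final int currentHash = ByteBuffer.wrap(iterator.next()).getInt();
      iterator.remove();
      // Give back exactly what add() took for this entry.
      remainingCapacity += BYTE_ARRAY_SIZE;
      if (currentHash == stateHash) {
        break;
      }
    }
  }

  long remainingCapacity() {
    return remainingCapacity;
  }
}
```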

gosusnp

Added a small note, probably not a big thing, but worth thinking about.

gosusnp · 2022-08-24T00:18:00Z

airbyte-workers/src/main/java/io/airbyte/workers/internal/AirbyteMessageTracker.java

@@ -334,56 +354,45 @@ public Optional<Long> getTotalRecordsCommitted() {

  @Override
  public Long getTotalSourceStateMessagesEmitted() {
-    return totalSourceEmittedStateMessages.get();
+    return stateMetricsTracker.getTotalSourceStateMessageEmitted();


I think it is more about how we track than the actual counter itself. If we run into an error, we throw an exception which may prevent us from updating some counters. For example, looking at StateMetricsTracker:84, if we fail to find a state, we'd throw an exception and we would not have updated some of those counts.
Some of it feels order-dependent and a bit brittle. I wonder if it makes sense to just ignore metrics to be safe, or if a partial view is better than nothing here.

alovew · 2022-08-24T01:18:28Z

@gosusnp I don't think state message emitted counts would be affected by the errors thrown in the state metrics tracker - max & mean metrics would, but we're setting those to null to avoid issues. the counts are incremented right when a source message is emitted and this should continue to increment even if there's some other error

github-actions bot added the area/platform, area/scheduler, and area/worker labels Aug 16, 2022

alovew temporarily deployed to more-secrets August 16, 2022 20:05 Inactive

alovew requested review from evantahler, gosusnp, jdpgrailsdev and benmoriceau August 16, 2022 21:03

gosusnp requested changes Aug 16, 2022

alovew temporarily deployed to more-secrets August 16, 2022 22:02 Inactive

benmoriceau reviewed Aug 17, 2022

alovew temporarily deployed to more-secrets August 17, 2022 20:17 Inactive

alovew commented Aug 17, 2022

alovew temporarily deployed to more-secrets August 18, 2022 22:30 Inactive

alovew temporarily deployed to more-secrets August 18, 2022 23:42 Inactive

alovew temporarily deployed to more-secrets August 19, 2022 19:16 Inactive

alovew temporarily deployed to more-secrets August 19, 2022 19:54 Inactive

alovew force-pushed the anne/max-and-mean-time-state-message-to-commit branch from 23f6af9 to b4a34aa Compare August 19, 2022 20:41

alovew temporarily deployed to more-secrets August 19, 2022 20:42 Inactive

alovew requested review from gosusnp and benmoriceau August 19, 2022 20:43

alovew temporarily deployed to more-secrets August 19, 2022 20:51 Inactive

alovew temporarily deployed to more-secrets August 20, 2022 00:03 Inactive

gosusnp requested changes Aug 20, 2022

alovew temporarily deployed to more-secrets August 22, 2022 18:24 Inactive

alovew temporarily deployed to more-secrets August 22, 2022 18:30 Inactive

alovew temporarily deployed to more-secrets August 22, 2022 18:46 Inactive

alovew temporarily deployed to more-secrets August 22, 2022 19:42 Inactive

alovew temporarily deployed to more-secrets August 22, 2022 20:36 Inactive

alovew added 15 commits August 22, 2022 15:43

track oom errors in datadog

abe6607

Fix tests and rename class

1433551

tests

eca3e18

more complex per stream test

718a0a3

formatting

94cc810

pmd test

5b70252

track number of messages

9ad2514

update state timing metrics separately from state delta tracker

60379a8

use leagacy as default else option

f0e85de

add test for throwing exception

282505f

use stream descriptor key

92bc17a

throw error if destination message cannot be matched to source state message

53a633f

test for other error and update tracking name

345c5d3

separate try catch blocks

5acd354

formatting and pmd

9b70d23

alovew force-pushed the anne/max-and-mean-time-state-message-to-commit branch from d81804d to 9b70d23 Compare August 22, 2022 22:43

alovew temporarily deployed to more-secrets August 22, 2022 22:46 Inactive

evantahler removed their request for review August 22, 2022 23:00

Merge branch 'master' into anne/max-and-mean-time-state-message-to-commit

81e7007

alovew temporarily deployed to more-secrets August 22, 2022 23:09 Inactive

Merge branch 'master' into anne/max-and-mean-time-state-message-to-commit

171b706

alovew temporarily deployed to more-secrets August 23, 2022 17:15 Inactive

gosusnp approved these changes Aug 24, 2022

alovew merged commit 6991452 into master Aug 24, 2022

alovew deleted the anne/max-and-mean-time-state-message-to-commit branch August 24, 2022 01:18

octavia-squidington-iii mentioned this pull request Aug 24, 2022

Bump Airbyte version from 0.40.1 to 0.40.2 #15931

Merged

sophia-wiley pushed a commit that referenced this pull request Aug 25, 2022

Track max and mean time between state message emitted and committed (#15702) * Add logic in AirbyteMessageTracker for calculating max and mean time between state message emit and commit

f700f6f

rodireich pushed a commit that referenced this pull request Aug 25, 2022

Track max and mean time between state message emitted and committed (#15702) * Add logic in AirbyteMessageTracker for calculating max and mean time between state message emit and commit

ff7157a
