Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: load CDK clears partial aggregates at cadence #48551

Merged
merged 17 commits into from
Nov 20, 2024

Conversation

tryangul
Copy link
Contributor

@tryangul tryangul commented Nov 19, 2024

What

Stream aggregates (spilled files) will be closed and flushed at cadence of 15 minutes, so 15-30 minutes to prevent backing up indefinitely on a large number of partial aggregates (chunks of work below the size threshold for high cardinality syncs.

How

  • Adds FlushTickTask which publishes a flush event on each streams input channel at a cadence — default is 15 minutes but this can be tuned
  • Adds TimeWindowTrigger which will respond to flush events and trigger a window publish if the aggregate has been open at least as long as the configured window width — default 15 minutes but this can be tuned
  • Renames DestinationRecordWrapped -> DestinationStreamEvent since 2/3 of it's variants do not wrap a record.
  • Lots of test refactoring to validate behavior

Not in this PR

Un-wiring the checkpoint time based flush task.
It does not currently work today and may not be relevant with this broader time based mechanism in place.

Reading Order

FlushTickTask
SpillToDiskTask
TimeWindowTrigger

@tryangul tryangul requested review from a team as code owners November 19, 2024 05:52
Copy link

vercel bot commented Nov 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Nov 20, 2024 7:11pm

@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Nov 19, 2024
@tryangul tryangul changed the title chore: load CDK flushes partial aggregates at cadence chore: load CDK clears partial aggregates at cadence Nov 19, 2024
*
* In a future where we deserialize only the info necessary for routing, this could include a dumb
* container for the serialized, and deserialization could be deferred until the spooled records
* were recovered from disk.
*/
sealed class DestinationRecordWrapped : Sized
sealed class DestinationStreamEvent : Sized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 s/wrapped/event

@@ -163,6 +165,10 @@ class DefaultDestinationTaskLauncher(
}
}

// Start flush task
log.info { "Starting flush tick task" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit can we change these log lines to call out the different types of tasks

Starting flush timer for state staged files or something
Starting flush timer for stale checkpoints

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe include $task since that will appear in the logs when the task is launched/completed/canceled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me see what I can do

@tryangul tryangul force-pushed the rbroughan/flush-task branch from cfa272d to 1a5cf93 Compare November 19, 2024 20:09
Copy link
Contributor

@johnny-schmidt johnny-schmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this works as advertised but I'm not sure that it solves the problem. Let's talk through my comment before going forward

@tryangul tryangul force-pushed the rbroughan/flush-task branch from 1a5cf93 to c8ee5c7 Compare November 19, 2024 20:35
@tryangul tryangul merged commit 7810e32 into master Nov 20, 2024
29 checks passed
@tryangul tryangul deleted the rbroughan/flush-task branch November 20, 2024 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants