Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: Support toggling >= incremental logic for SQL taps #1200

Open
pnadolny13 opened this issue Nov 16, 2022 · 6 comments
Open

[Feature]: Support toggling >= incremental logic for SQL taps #1200

pnadolny13 opened this issue Nov 16, 2022 · 6 comments

Comments

@pnadolny13
Copy link
Contributor

Feature scope

Taps (catalog, state, stream maps, etc.)

Description

The Singer norm is to use >= when doing an incremental sync with a bookmark. Its for good reason but if someone knows they arent at risk of missing records using only > then we should allow it.

I couldnt find where its documented but I know its to avoid ties that might have been missed on the last sync. For example if the bookmark is a date, and an incremental sync happens early on in the day, then more records arrive within that day, using > would skip those that arrive later in the day and the incremental sync wouldnt have worked as expected. So >= uses "at least once" semantics and requires deduping on the target end.

The use case is for reverse ETL where the target cant handle dupes. Specifically Squared's target-apprise implementation posting to singer-ecosystem-activity in slack. There will always be < 5 records so ties are almost impossible and I think I'd honestly prefer the edge case to be not sending a record than sending it twice. In this case I'd like the option to override the default >= filter to use >.

Another case is if someone uses an auto incrementing PK integer in a postgres database then theres no reason to be inclusive (although it has pretty minimal harm also), the user should have the preference if they know what theyre doing.

I suggest we add something like incremental_filter_inclusive or inclusive_filter (names are tough 😄 ) that defaults to the status quo but allows us to override.

Am I misunderstanding anything?

@pnadolny13 pnadolny13 added kind/Feature New feature or request valuestream/SDK labels Nov 16, 2022
@aaronsteers
Copy link
Contributor

aaronsteers commented Nov 16, 2022

@pnadolny13 - Thanks for this suggestion, and thanks for raising. You make really good points about the fact that some developers can confidently assert that their use case does not include cases where the > comparison is insufficient. I think we have two paths forward:

On the developer side:

I am coming around to agreeing that developers should be able to declare if their replication key should be compared using > or >= operator.

If a developer does implement a faulty or 'lossy' comparison logic, that would rightfully be reported as a bug in the tap. Even if lost records only occur in 1 out of 10 million or 100 million instances, those are guaranteed to occur for a significant number of total users of the tap. It is absolutely critical that Singer maintain the promise of 'at least once' delivery because other systems that can only promise a certain amount of 'nines' are just completely out of the question when it comes to the needs of dependable ELT data pipelines. We need users to be able to assert not just 'it's very very unlikely Meltano missed records' but rather: 'it is systematically impossible for any records to be missed'.

On the user side:

I don't think the user should not have an option to take the gamble of maybe missing out on records, but we can give them access to "opt in" to a more advanced and costly dedupe algorithm. At the cost of a larger STATE artifact, and some additional compute time for comparisons, we can make this a universal config option for all SDK-based taps. I've added a proposed spec that would deliver this in:

cc @edgarrmondragon to check my logic here 😄

@pnadolny13
Copy link
Contributor Author

@aaronsteers thanks for your thoughts! I agree that we should default to current state thats the safest most robust way of running the tap but yeah having 2 power user mechanisms available would be awesome:

  1. the option to use only >
  2. dedupe ties using key hashes in the state file i.e. Feature: duplicate-proof replication using record hashes in STATE #161

Option 2 would solve my problem also but it might also be nice to have Option 1 to skip the hashing of IDs in the state file for the case that I'm super confident > will work and accept the risks. If we allow option 1 we should document it well and potentially log a warning or something so its abundantly clear that this is a risky move and should only be used if you know what youre doing.

@stale
Copy link

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@pnadolny13
Copy link
Contributor Author

Still relevant

Copy link

stale bot commented Jul 18, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2024
@edgarrmondragon
Copy link
Collaborator

Still relevant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

3 participants