-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature]: Support toggling >=
incremental logic for SQL taps
#1200
Comments
@pnadolny13 - Thanks for this suggestion, and thanks for raising. You make really good points about the fact that some developers can confidently assert that their use case does not include cases where the On the developer side: I am coming around to agreeing that developers should be able to declare if their replication key should be compared using If a developer does implement a faulty or 'lossy' comparison logic, that would rightfully be reported as a bug in the tap. Even if lost records only occur in 1 out of 10 million or 100 million instances, those are guaranteed to occur for a significant number of total users of the tap. It is absolutely critical that Singer maintain the promise of 'at least once' delivery because other systems that can only promise a certain amount of 'nines' are just completely out of the question when it comes to the needs of dependable ELT data pipelines. We need users to be able to assert not just 'it's very very unlikely Meltano missed records' but rather: 'it is systematically impossible for any records to be missed'. On the user side: I don't think the user should not have an option to take the gamble of maybe missing out on records, but we can give them access to "opt in" to a more advanced and costly dedupe algorithm. At the cost of a larger STATE artifact, and some additional compute time for comparisons, we can make this a universal config option for all SDK-based taps. I've added a proposed spec that would deliver this in: cc @edgarrmondragon to check my logic here 😄 |
@aaronsteers thanks for your thoughts! I agree that we should default to current state thats the safest most robust way of running the tap but yeah having 2 power user mechanisms available would be awesome:
Option 2 would solve my problem also but it might also be nice to have Option 1 to skip the hashing of IDs in the state file for the case that I'm super confident |
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
Still relevant |
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the |
Still relevant |
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
The Singer norm is to use
>=
when doing an incremental sync with a bookmark. Its for good reason but if someone knows they arent at risk of missing records using only>
then we should allow it.I couldnt find where its documented but I know its to avoid ties that might have been missed on the last sync. For example if the bookmark is a date, and an incremental sync happens early on in the day, then more records arrive within that day, using
>
would skip those that arrive later in the day and the incremental sync wouldnt have worked as expected. So>=
uses "at least once" semantics and requires deduping on the target end.The use case is for reverse ETL where the target cant handle dupes. Specifically Squared's target-apprise implementation posting to singer-ecosystem-activity in slack. There will always be < 5 records so ties are almost impossible and I think I'd honestly prefer the edge case to be not sending a record than sending it twice. In this case I'd like the option to override the default
>=
filter to use>
.Another case is if someone uses an auto incrementing PK integer in a postgres database then theres no reason to be inclusive (although it has pretty minimal harm also), the user should have the preference if they know what theyre doing.
I suggest we add something like
incremental_filter_inclusive
orinclusive_filter
(names are tough 😄 ) that defaults to the status quo but allows us to override.Am I misunderstanding anything?
The text was updated successfully, but these errors were encountered: