feat: add incremental lag for datetime, int, and float cursors #1957

donotpush · 2024-10-15T13:27:08Z

Related Issues

Fixes add lag / attribution window to incremental #970

netlify · 2024-10-15T13:27:25Z

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`401587d`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/6720f76d921cf100087f8c3e
😎 Deploy Preview	https://deploy-preview-1957--dlt-hub-docs.netlify.app

rudolfix

@donotpush the way you solved it is good! if you apply lag to the last_value prop, all the code should work.

there are a few things, mostly a result of imperfect code, that you need to pay attention to:

- def parse_native_representation(self, native_value: Any) -> None: you need to transfer lag manually
- make sure merge works (lag is copied via a universal mechanism)
- we have ensure_pendulum_datetime that converts strings into datetimes (also ints and floats), so maybe it is better to use that. you should also convert the result back to a string in the same representation, or the comparison will stop working. tl;dr: _apply_lag must return exactly the same type as its input
- we should also support date. then lag is in days

one more thing which IMO will be super helpful: allow defining lag in the rest_api toolkit:

    start_param: str
    end_param: Optional[str]

you can add lag here and hopefully it will be passed to incremental
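The type-preservation rule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not dlt's implementation: `apply_lag_sketch` is a hypothetical helper, it uses the standard library rather than `ensure_pendulum_datetime`, it assumes lag is in seconds for datetimes (the comment above only fixes days for dates), and it assumes lag is subtracted from the cursor.

```python
from datetime import date, datetime, timedelta
from typing import Union

CursorValue = Union[int, float, str, date, datetime]


def apply_lag_sketch(value: CursorValue, lag: float) -> CursorValue:
    """Shift a cursor value back by `lag`, returning the same type as the input."""
    # datetime must be checked before date: datetime is a subclass of date
    if isinstance(value, datetime):
        return value - timedelta(seconds=lag)  # assumption: seconds
    if isinstance(value, date):
        return value - timedelta(days=lag)  # days, as requested in the review
    if isinstance(value, str):
        if len(value) == 10:  # date-only string such as "2023-03-03"
            return (date.fromisoformat(value) - timedelta(days=lag)).isoformat()
        # Handle a trailing "Z" by hand so the output keeps the input's format
        zulu = value.endswith("Z")
        parsed = datetime.fromisoformat(value[:-1] + "+00:00" if zulu else value)
        shifted = (parsed - timedelta(seconds=lag)).isoformat()
        return shifted[:-6] + "Z" if zulu else shifted
    if isinstance(value, int):
        return value - int(lag)  # stay an int so comparisons keep working
    return value - lag
```

A string goes in and a string in the same representation comes out, so later comparisons against stored cursor values still line up.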

rudolfix

please see my comments. on top of that:

- I'll ask @burnash to help you with adding lag to the rest api. Or maybe we'll propose a commit to speed it up. just fix the tests we have so we know it all works
- I'm not sure you test the lag on string and date fields? those should be quite simple tests, i.e. for append-only resources that detect expected duplicates due to lag. see my comments

dlt/common/time.py

rudolfix · 2024-10-22T07:15:20Z

tests/load/pipeline/test_pipelines.py

+    name = "events"
+
+    @dlt.resource(name=name, primary_key="id")
+    def r1(_=dlt.sources.incremental("created_at")):


the concept of the test is good, but IMO you should create just one resource with merge write disposition that you call with the lag you want from the very beginning. this is how people will use it IMO.

emulate returning "updated_events" on the second call, i.e. via some kind of nonlocal flag that tells it to add those events the second time.

the results of the test look good, but please test two additional things:

that you do not apply lag to "initial_value" (set it in incremental)

there's IMO an issue with internal deduplication. please test for updated_events that update id=3:

{ "id": 3, "created_at": "2023-03-03T02:00:01Z", "event": "updated", }

IMO it will not be included in the final result because we'll deduplicate it. you should IMO disable deduplication when lag is defined:

```py
@property
def deduplication_disabled(self) -> bool:
    """Skip deduplication when length of the key is 0"""
    return isinstance(self.primary_key, (list, tuple)) and len(self.primary_key) == 0
```

You are right, there is a deduplication bug. I have captured it in the tests, but surprisingly, setting deduplication_disabled to True doesn't solve the issue. Any tips on how to solve it?

I have refactored the code and tests with your suggestions.

Question, what do you mean with the following point?

that you do not apply lag to "initial_value" (set it in incremental)

In the current implementation, lag only applies to last_value, but I see that last_value is initially set to initial_value in get_state.

Do I need to skip the _apply_lag call when last_value == initial_value?

    def last_value(self) -> Optional[TCursorValue]:
        s = self.get_state()
        if self.lag is not None:
            return self._apply_lag(s["last_value"])
        return s["last_value"]  # type: ignore
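The test shape suggested in this thread (one merge resource, with updated events returned only on the second call) can be simulated without dlt. This is a rough sketch with hypothetical names (`extract`, `merge`, `run_twice`) and made-up timestamps. It models lag as a lookback window in seconds and the merge as an upsert on `id`, and it does not model dlt's internal deduplication.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def at(hour: int, minute: int = 0) -> datetime:
    """Build a UTC timestamp on an arbitrary example day."""
    return datetime(2023, 3, 3, hour, minute, tzinfo=timezone.utc)


def extract(rows: List[Row], last_value: Optional[datetime], lag: float) -> List[Row]:
    """Keep rows whose cursor is at or after the lagged last_value."""
    if last_value is None:
        return list(rows)
    start = last_value - timedelta(seconds=lag)
    return [r for r in rows if r["created_at"] >= start]


def merge(table: Dict[int, Row], rows: List[Row]) -> None:
    """Upsert on the primary key, standing in for a merge write disposition."""
    for row in rows:
        table[row["id"]] = row


def run_twice(lag: float) -> Dict[int, Row]:
    events = [
        {"id": 1, "created_at": at(1), "event": "created"},
        {"id": 2, "created_at": at(2), "event": "created"},
        {"id": 3, "created_at": at(3), "event": "created"},
    ]
    # Late update to id=2: its cursor is older than the max seen in run 1
    updated = [{"id": 2, "created_at": at(2, 30), "event": "updated"}]

    table: Dict[int, Row] = {}
    merge(table, extract(events, None, lag))  # first run loads everything
    last_value = max(r["created_at"] for r in events)  # cursor state after run 1
    merge(table, extract(events + updated, last_value, lag))  # second run
    return table
```

With a one-hour lag the second run reaches back far enough to pick up the late update; with no lag the update is older than the stored cursor and is skipped.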

rudolfix · 2024-10-22T07:19:24Z

tests/load/pipeline/test_pipelines.py

+    name = "items"
+
+    @dlt.resource(name=name, primary_key="id")
+    def r1(_=dlt.sources.incremental("id")):


I have similar comments as for the test below:

1. use just one resource with "append" write disposition. you can return the same response each time

2. call the resource several times with different lags (use apply_hints to replace incremental)

3. test if the expected elements got duplicated

IMO you'll have the deduplication bug that I describe above

Implemented your comments except for number 2:

call the resource several times with different lags (use apply_hints to replace incremental)

I have used @pytest.mark.parametrize("lag", [-1, 10]) to execute the code with different lags instead of using apply_hints.

I am also facing the deduplication bug; more info is in the other comments above.
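The "expected duplicates" check for an append-only resource can be illustrated with a small stand-alone simulation. Everything here is hypothetical (`select`, `appended_after_two_runs`, the three-row response), and selecting rows strictly above the lagged cursor is a simplification that stands in for boundary deduplication, not a description of dlt's behavior.

```python
from typing import List


def select(ids: List[int], last_value: int, lag: int) -> List[int]:
    """Rows strictly above the lagged cursor (boundary handling simplified)."""
    start = last_value - lag
    return [i for i in ids if i > start]


def appended_after_two_runs(lag: int) -> List[int]:
    response = [1, 2, 3]  # the same response on every call
    table = list(response)  # first run loads everything
    last_value = max(response)  # cursor state after the first run
    table += select(response, last_value, lag)  # second run appends again
    return sorted(table)
```

A lag of 10 reaches past the start of the data, so every row is appended twice; a lag of -1 moves the cursor forward and nothing is re-appended. The same function could be driven by a parametrized test over several lag values.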

donotpush · 2024-10-22T14:59:24Z

I've added lag in parse_native_representation and added a simple unit test for merge too.

The rest of the implementation is complete, except for the rest_api.

donotpush · 2024-10-23T06:58:00Z

I have introduced a bug in IncrementalTransform in my last commit - fixing it.

donotpush · 2024-10-23T13:50:58Z

Fixed the tests:

test_pipeline_resource_incremental_datetime_lag
test_pipeline_resource_incremental_int_lag

Please check them carefully, because I never actually managed to reproduce the deduplication bug that you mentioned. I was wrong in my previous comments.

Even without modifying the deduplication_disabled code, my tests pass, which is an indication that I should revert my code changes in deduplication_disabled.

rudolfix

the date format detection function is really cool! I think it will be useful in other places. we still need more tests. I expect this feature to be frequently used and all edge cases will be quickly explored. missing tests:

- can we test how the min last value function behaves?
- please make sure lag does not impact initial values
- please test if lag is disabled for custom value functions
- can we test lag on a date and a float?

also please move the incremental tests to tests/extract/incremental. it is sufficient to just run them on duckdb, which is available there
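Two of the missing cases listed above (the min function and custom value functions) can be sketched on plain numbers. `lagged_start` is a hypothetical helper. That lag is ignored for custom functions follows the request above; that lag is added rather than subtracted when the cursor tracks a minimum is an assumption made for this illustration.

```python
from typing import Callable, Optional


def lagged_start(
    last_value: float,
    lag: Optional[float],
    last_value_func: Callable[..., float],
) -> float:
    """Move the cursor against its direction of travel by `lag`."""
    if not lag:  # None or 0: nothing to apply
        return last_value
    if last_value_func is max:
        return last_value - lag  # cursor grows, so look further back
    if last_value_func is min:
        return last_value + lag  # assumption: cursor shrinks, so look further up
    return last_value  # custom function: direction unknown, lag is ignored
```

A test for the custom case only needs to pass any callable other than `max` or `min` and check that the value comes back unchanged.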

dlt/common/time.py

dlt/extract/incremental/__init__.py

tests/load/pipeline/test_pipelines.py

donotpush · 2024-10-28T23:56:23Z

@rudolfix implemented all the requested changes.

Please review carefully, especially the edge cases.


rudolfix

LGTM!
we still need docs before we merge

I did a few fixes, please check them out @donotpush

- I moved the lag functions to a separate module. this makes the incremental code simpler and allows writing unit tests for apply_lag for various edge cases (also done)
- if self.end_value: this also catches 0 (it is falsy), which is a valid end value. @donotpush it is OK to use those implicit casts to bool, but you must always make sure this is exactly what you want. lag==0 is OK to skip. end_value==0 is a valid value, i.e. we should also skip lag for it (it is tested)
- I refactored the code that makes sure that lagged values are within the range of initial_value by just using last_value_func.
- type: ignore - always give a reason in brackets
- added lag as a field to the dataclass, not via a property
- extended the TypedDict to support lag in rest-api
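The two points about `end_value` and `initial_value` can be put side by side in a short sketch. `effective_start` is a hypothetical function, limited to `max` and `min`; whether dlt combines these checks exactly this way is an assumption.

```python
from typing import Callable, Optional


def effective_start(
    last_value: int,
    initial_value: int,
    end_value: Optional[int],
    lag: Optional[int],
    last_value_func: Callable[..., int] = max,
) -> int:
    """Apply lag unless an end_value is set, then clamp to initial_value."""
    # `if end_value:` would treat end_value == 0 as "not set"; compare with None.
    # `not lag` is fine here because skipping lag == 0 changes nothing.
    if end_value is not None or not lag:
        return last_value
    lagged = last_value - lag if last_value_func is max else last_value + lag
    # Reusing last_value_func keeps the result on the right side of initial_value
    return last_value_func(lagged, initial_value)
```

With `max`, a lag that would reach below `initial_value` is clamped to `initial_value`; with `min`, the same call keeps the result from rising above it.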

we'll merge this once docs are ready

feat: add incremental lag for datetime, int, and float cursors

b29d75e

rudolfix requested changes Oct 15, 2024


donotpush added 3 commits October 16, 2024 17:41

chore: reverve eng datetime format

17dc69a

test: incremental lag datetime

a06d552

chore: add lag in merge function

ccc68ab

rudolfix requested changes Oct 22, 2024


VioletM mentioned this pull request Oct 22, 2024

Remove incorrect example #1932

Merged

donotpush added 2 commits October 22, 2024 14:36

chore: extended ISO compliance datetime detect format

0f44fe4

chore: changed deduplication_disabled

36bbbaf

fix: native_representation and merge

f5c860e

donotpush requested a review from rudolfix October 22, 2024 15:51

donotpush added 2 commits October 23, 2024 10:43

fix: _deduplication_disabled function

6a63cf6

test: changed expected results

56e492d

rudolfix requested changes Oct 23, 2024



donotpush added 7 commits October 24, 2024 19:43

test: test incremental lag for datetime and int

80926b7

chore: incremental lag disabled with custom last_value_func

d38ab87

test: lag incremental for min function

3cdf773

chore: lag date tests and adjustments

45ad90f

chore: edge case end_values

afba568

chore: edge case initial_values

a1e3263

test: incremental lag float

6a3c4d3

donotpush requested a review from rudolfix October 28, 2024 23:56

donotpush marked this pull request as ready for review October 29, 2024 00:12

rudolfix added 2 commits October 29, 2024 15:39

supports lag in rest-api

e883e94

moves lag to separate module, simplifies applying initial_value to la…

717acbf

…g result

rudolfix previously approved these changes Oct 29, 2024


donotpush added 2 commits October 29, 2024 15:55

fix: add test missing variable

00b01e8

docs: lag incremental loading

401587d

donotpush dismissed rudolfix’s stale review via 401587d October 29, 2024 14:55

rudolfix merged commit 3fc9fe4 into devel Oct 29, 2024
59 of 61 checks passed

rudolfix deleted the feat/970-add-incremental-lag branch October 29, 2024 18:25
