Event time processing #127

Psykopear · 2022-09-13T12:48:19Z

This PR introduces a new Clock Config: EventClockConfig.

EventClockConfig is a clock based on datetimes retrieved from events.
Whenever an event is received, its datetime is extracted through a user-specified function.
Events are awaited until a user-specified duration has passed; an event that arrives later than that is dropped.

This configuration requires the user to specify 3 parameters:

dt_getter: a function that receives an event as input and should return a datetime extracted from the event
late: a duration after which (in event-time, not realtime) the window for waiting late events is closed
system_clock: by default the system clock is used internally, but a TestingClock can be passed here
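To make the role of the first two parameters concrete, here is a minimal, self-contained sketch in plain Python. It does not use the bytewax API: `SketchEventClock` and `accept` are hypothetical names, and the rule "watermark = latest event time seen minus `late`" is a simplifying assumption that ignores the system clock entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional


@dataclass
class SketchEventClock:
    """Hypothetical stand-in for the clock described above; not the bytewax API."""

    dt_getter: Callable[[Any], datetime]
    late: timedelta
    # Latest event time seen so far; None until the first event arrives.
    max_event_time: Optional[datetime] = None

    def accept(self, event: Any) -> bool:
        """Return True if the event is on time, False if it should be dropped."""
        event_time = self.dt_getter(event)
        if self.max_event_time is not None:
            # Assumed rule: anything older than this watermark is too late.
            watermark = self.max_event_time - self.late
            if event_time < watermark:
                return False
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


start = datetime(2022, 9, 13, 12, 0, tzinfo=timezone.utc)
clock = SketchEventClock(dt_getter=lambda e: e["time"], late=timedelta(seconds=5))

on_time = clock.accept({"time": start + timedelta(seconds=10)})
slightly_late = clock.accept({"time": start + timedelta(seconds=6)})
too_late = clock.accept({"time": start + timedelta(seconds=2)})
```

With a 5 second `late` value, the event 4 seconds behind the newest one is still accepted, while the one 8 seconds behind is dropped.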

~~This PR is ready for review, but is blocked by~~:

TODO

Do not rely on time.sleep in tests
Add documentation
Add an example
More tests

davidselassie

Good work working on this. I know you have a bit of stuff that is blocking this, but wanted to give you some comments on what I was thinking for event time.

src/window/event_time_clock.rs

pytests/test_event_time.py

Psykopear · 2022-09-15T13:03:14Z

@davidselassie I updated the PR following yesterday's conversation, using SystemTime to determine events' lateness.
I'm going to work on the possibility of using a TestingClock in tests for the EventTimeConfig, so that we don't need to rely on time.sleep in CI.

Edit: Added an EventTimeTestingClock. I don't particularly like this solution, since it requires keeping the implementations of the clock and the testing clock up to date with one another, but for the moment it will do. I'll try to come up with something better before it's time to merge this.
Ok, I made the EventTimeClock able to have an internal testing clock set, and also refactored the building process of Clocks from ClockConfigs into a trait, to avoid some repetition. Let me know what you think

src/window/event_time_clock.rs

davidselassie · 2022-09-19T01:08:31Z

src/window/event_time_clock.rs

+    System(SystemClock),
+}
+
+impl InternalClock {


What's the thinking behind having a wrapping class here? Rather than using Box<dyn Clock<V>> and the methods on that directly?

The reason is that I only wanted to explicitly allow a subset of the clocks we have here.
It doesn't make sense to allow an EventTimeClock as the internal one for example.
We could still control this when building the clock from the config, but the generic type would not be representing what I wanted.
Using Box<dyn Clock<V>> also makes things a bit more complicated with the type system, but I think that can be solved.

There is another problem, though: with the fix needed in the EventTimeClock::watermark function, we ask for the TestingClock's time even if no event has arrived. This makes the internal counter of the TestingClock advance more than once per event.

I'm tempted to go back to just having a TestingEventTimeClock, duplicating some code, but having full control of what happens there, and not polluting the EventTimeClock with things that (right now at least) are only needed for tests. But if you think it's worth it for other possible use cases, I'll look for a way to make this work.
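The counter problem described in this reply can be reproduced in isolation. The sketch below is plain Python and only assumes what the thread states: a testing clock whose time advances each time it is read. `CountingTestClock` is a hypothetical name, and the 1 second step is borrowed from the test comment "Each yield advances the clock by 1 second".

```python
from datetime import datetime, timedelta, timezone


class CountingTestClock:
    """Hypothetical testing clock: every call to now() advances time by one second."""

    def __init__(self, start: datetime) -> None:
        self._current = start

    def now(self) -> datetime:
        value = self._current
        self._current += timedelta(seconds=1)
        return value


start = datetime(2022, 9, 19, tzinfo=timezone.utc)

# One read per event: after three events the clock has moved three seconds.
clock = CountingTestClock(start)
for _ in range(3):
    clock.now()  # read when the event arrives
single_read_elapsed = clock.now() - start

# One extra read per event (for example from a watermark calculation):
# the same three events now move the clock six seconds.
clock = CountingTestClock(start)
for _ in range(3):
    clock.now()  # read when the event arrives
    clock.now()  # second read while computing the watermark
double_read_elapsed = clock.now() - start
```

A test written with "one tick per event" in mind sees the clock drift twice as fast in the second case, which is consistent with the broken tests mentioned in the thread.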

pytests/test_event_time.py

src/window/event_time_clock.rs

src/window/mod.rs

Psykopear · 2022-09-19T13:06:31Z

Thanks for the thorough review @davidselassie, this was needed.
I addressed most of the comments.

Right now the tests are broken due to the use of the TestingClock as the internal clock in the test, since we increment its internal counter more than once per event after the fix in EventTimeClock::watermark.

As I said there, I'm tempted to go back to having a TestingEventTimeClock for the sake of simplicity, but let me know if you prefer me to find a way to make it work as it is.

Do we have a use case outside of testing for having different kinds of clocks inside the EventTimeClock?

blakestier · 2022-09-19T17:40:52Z

pytests/test_event_time.py

+    def input_builder(worker_index, worker_count, state):
+        # Each yield advances the clock by 1 second
+        # This should be processed in the first window
+        yield temp(1, 1)


Appreciate the comments! At the risk of having a less legit, real-life flow, you could modify the events to be strings instead of temp values, so their values could be more revealing, i.e. "window 1", "drop since late", "window 2".
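As a rough illustration of this suggestion, the sketch below uses string labels as event payloads so the expected outcome is visible in the test data itself. It is plain Python, not the actual test in pytests/test_event_time.py: `labelled_events`, `assign`, the 10 second window length, and the 5 second lateness allowance are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

START = datetime(2022, 9, 19, tzinfo=timezone.utc)
WINDOW = timedelta(seconds=10)  # assumed tumbling window length
LATE = timedelta(seconds=5)     # assumed lateness allowance


def labelled_events():
    # (offset in seconds from START, label describing the expected outcome)
    yield 1, "window 1"
    yield 4, "window 1"
    yield 12, "window 2"
    yield 2, "drop since late"
    yield 15, "window 2"


def assign(events):
    """Group labels by window index, dropping events older than the watermark."""
    windows, dropped, max_seen = {}, [], None
    for offset, label in events:
        event_time = START + timedelta(seconds=offset)
        if max_seen is not None and event_time < max_seen - LATE:
            dropped.append(label)
            continue
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        index = (event_time - START) // WINDOW  # timedelta // timedelta is an int
        windows.setdefault(index, []).append(label)
    return windows, dropped


windows, dropped = assign(labelled_events())
```

When an assertion fails, the labels in `windows` and `dropped` show directly which event ended up in the wrong place.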

src/window/event_time_clock.rs

blakestier · 2022-09-19T18:02:28Z

src/window/event_time_clock.rs

+    #[pyo3(get)]
+    pub(crate) late_after_system_duration: chrono::Duration,
+    #[pyo3(get)]
+    pub(crate) system_clock_config: Option<TestingClockConfig>,


We have an optional TestingClockConfig here to allow for testing windowing with event time?

It's now slightly different, but yes, we still accept a parameter to use a testing clock rather than the default system clock in the event time clock. As I said in another comment, initially I implemented two separate clocks, EventTimeClock and TestingEventTimeClock, with the only difference between the two that the testing clock was using an auto incremented counter rather than system's now.
This meant keeping the two implementations in sync by hand, and could lead to problems if they drifted apart (a test passing when it shouldn't because test and production use two different implementations of the event time clock), so I decided to find a way to reuse the same EventClock, changing only the meaning of now.
I'm still not convinced it's the best approach, since there is a parameter needed only for testing inside a "production" object, but I felt this was better than the way I did it previously.

Open to suggestions though
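The idea of "changing only the meaning of now" can be sketched as a single clock class that takes its time source as a parameter. This is plain Python under stated assumptions, not the Rust implementation in this PR: `InjectableEventClock`, `observe`, and the `now` parameter are hypothetical names, and the watermark formula (latest event time, minus `late`, plus the system time elapsed since that event arrived) is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional


def system_now() -> datetime:
    """Default time source: the real system clock in UTC."""
    return datetime.now(timezone.utc)


class InjectableEventClock:
    """Hypothetical: one event clock whose 'now' source can be swapped in tests."""

    def __init__(
        self,
        dt_getter: Callable[[Any], datetime],
        late: timedelta,
        now: Callable[[], datetime] = system_now,
    ) -> None:
        self.dt_getter = dt_getter
        self.late = late
        self.now = now
        self.max_event_time: Optional[datetime] = None
        self.system_time_of_max: Optional[datetime] = None

    def observe(self, event: Any) -> None:
        """Record the event time and when (in system time) it was seen."""
        event_time = self.dt_getter(event)
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
            self.system_time_of_max = self.now()

    def watermark(self) -> Optional[datetime]:
        """Assumed formula: the watermark keeps advancing as system time passes."""
        if self.max_event_time is None:
            return None
        elapsed = self.now() - self.system_time_of_max
        return self.max_event_time - self.late + elapsed


# In a test, "now" is a value the test controls instead of the wall clock.
fake_time = [datetime(2022, 9, 19, tzinfo=timezone.utc)]
clock = InjectableEventClock(
    dt_getter=lambda e: e["time"],
    late=timedelta(seconds=5),
    now=lambda: fake_time[0],
)
event_time = datetime(2022, 1, 1, tzinfo=timezone.utc)
clock.observe({"time": event_time})
watermark_at_arrival = clock.watermark()
fake_time[0] += timedelta(seconds=3)  # three seconds of "system" time pass
watermark_later = clock.watermark()
```

Production and tests then share one implementation; the trade-off raised in the thread remains, since the production object carries a parameter that is mainly useful for testing.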

davidselassie

Thank you for working so hard on this. I try not to nitpick people's algorithms, but I wanted to really double check that this produced the correct behavior and happened to think of some maybe clearer phrasing? Feel free to discard this advice at your discretion.

src/window/mod.rs

src/window/event_time_clock.rs

davidselassie · 2022-09-21T18:26:37Z

src/window/system_clock.rs

@@ -18,8 +18,15 @@ use super::{Clock, ClockConfig};
 ///   Config object. Pass this as the `clock_config` parameter to
 ///   your windowing operator.
 #[pyclass(module="bytewax.window", extends=ClockConfig)]
+#[derive(Clone, Copy)]


Why are these suddenly needed on our config types?

So, this is due to the new ClockBuilder trait, which consumes self.

Previously in the build_clock_builder function we were manually cloning all the fields in the config, and passing them to the builder. With the new trait we clone the whole config instead (required by the FromPyObject trait, used in extract). So instead of implicitly requiring all the fields in the config to be Clone, we require the config itself to be cloneable.
Since we end up cloning all the fields anyway, I thought having this requirement here wouldn't make much of a difference.

This change is not needed for this PR, I can easily revert it, remove the ClockBuilder trait, and implement the builder function on the struct like it was previously. It's a refactoring I did along the way to make the pattern we use more explicit, and I feel it makes the code more readable, but it works the other way too.

src/window/event_time_clock.rs

pytests/test_event_time.py

blakestier · 2022-09-23T20:55:50Z

src/window/event_time_clock.rs

+///   Config object. Pass this as the `clock_config` parameter to your
+///   windowing operator.
+#[pyclass(module="bytewax.window", extends=ClockConfig)]
+#[pyo3(text_signature = "(dt_getter, wait_for_system_duration, system_clock)")]


I think wait_for_system_duration is explicit but perhaps leaking implementation. What about something like lateness_buffer or late_after_duration or something?

Yes, it's "internal implementation" but it's also the correct interpretation of this duration, as there's sort of multiple types of time going on.

Psykopear requested review from whoahbot, davidselassie, awmatheson and blakestier September 13, 2022 12:48

davidselassie reviewed Sep 14, 2022

Psykopear requested a review from davidselassie September 15, 2022 15:29

davidselassie reviewed Sep 19, 2022

blakestier reviewed Sep 19, 2022

blakestier reviewed Sep 19, 2022

davidselassie mentioned this pull request Sep 20, 2022

Introduce fate #137

Merged

Psykopear force-pushed the explore-datetimes-issues branch from 0cf8971 to bb629e2 Compare September 21, 2022 14:19

Psykopear force-pushed the event-time-processing branch from 3055da3 to 7d522e9 Compare September 21, 2022 14:19

davidselassie reviewed Sep 21, 2022

Psykopear force-pushed the explore-datetimes-issues branch from bb629e2 to 3434e3e Compare September 23, 2022 08:11

Psykopear force-pushed the event-time-processing branch 2 times, most recently from 023a8af to 423ba33 Compare September 23, 2022 09:08

Psykopear marked this pull request as ready for review September 23, 2022 13:00

Psykopear requested review from davidselassie and blakestier September 23, 2022 13:02

whoahbot approved these changes Sep 23, 2022

blakestier reviewed Sep 23, 2022

Psykopear force-pushed the explore-datetimes-issues branch from de7ffb3 to 684ee4a Compare September 27, 2022 08:00

Psykopear force-pushed the event-time-processing branch 5 times, most recently from f8139d7 to 1bcfc7d Compare September 27, 2022 13:57

Base automatically changed from explore-datetimes-issues to main September 27, 2022 15:33

Psykopear added 9 commits September 27, 2022 17:57

Event time processing

4e9c72a

PR fixes

dd9fb0d

Using new PyTestingClock, fixes

25d8aa8

Fixed and tested (de)serialization of EventClock

0788213

Minor fixes

aa1934d

Changed test

4433f80

Removed print

cae6a3b

Refactoring

47e88c6

Added an example

3450c6c

Psykopear force-pushed the event-time-processing branch from 1bcfc7d to 3450c6c Compare September 27, 2022 15:57

davidselassie approved these changes Sep 27, 2022

Psykopear merged commit 1e3a193 into main Sep 28, 2022

davidselassie deleted the event-time-processing branch September 28, 2022 20:34
