Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event time processing #127

Merged
merged 9 commits into from
Sep 28, 2022
Merged

Event time processing #127

merged 9 commits into from
Sep 28, 2022

Conversation

Psykopear
Copy link
Contributor

@Psykopear Psykopear commented Sep 13, 2022

This PR introduces a new Clock Config: EventClockConfig.

EventClockConfig is a clock based on datetimes retrieved from events.
Whenever an event is received, its datetime is extracted through a user specified function.
Events are awaited until a user specified duration is passed reached, if an event comes late it's dropped.

This configuration requires the user to specify 3 parameters:

  • dt_getter: a function that receives an event as input and should return a datetime extracted from the event
  • late: a duration after which (in event-time, not realtime) the window for waiting late events is closed
  • system_clock: by default the system clock is used internally, but a TestingClock can be passed here

This PR is ready for review, but is blocked by:

TODO

  • Do not rely on time.sleep in tests
  • Add documentation
  • Add an example
  • More tests

Copy link
Contributor

@davidselassie davidselassie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work working on this. I know you have a bit of stuff that is blocking this, but wanted to give you some comments on what I was thinking for event time.

src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
pytests/test_event_time.py Outdated Show resolved Hide resolved
@Psykopear
Copy link
Contributor Author

Psykopear commented Sep 15, 2022

@davidselassie I updated the PR following yesterday's conversation, using SystemTime do determine events' lateness.
I'm going to work on the possibility of using a TestingClock in tests for the EventTimeConfig, so that we don't need to rely on time.sleep in CI.

edit Added an EventTimeTestingClock. I don't particularly like this solution, since it requires to keep the implementation of the clock and the testing clock up to date with one another, but for the moment it will do. I'll try to come up with something better before it's time to merge this
Ok, I made the EventTimeClock able to have an internal testing clock set, and also refactored the building process of Clocks from ClockConfigs into a trait, to avoid some repetition. Let me know what you think

src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
System(SystemClock),
}

impl InternalClock {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the thinking behind having a wrapping class here? Rather than using Box<dyn Clock<V>> and the methods on that directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that I only wanted to explicitly allow a subset of the clocks we have here.
It doesn't make sense to allow an EventTimeClock as the internal one for example.
We could still control this when building the clock from the config, but the generic type would not be representing what I wanted.
Using Box<dyn Clock<V>> also makes things a bit more complicated with the type system, but I think that can be solved.

There is another problem though, with the fix needed in the EventTimeClock::watermark function, we ask for TestingClock's time even if an event didn't arrive. This makes the internal counter of the TestingClock advance more than once per event.

I'm tempted to go back to just having an TestingEventTimeClock, duplicating some code, but having full control of what happens there, and not polluting the EventTimeClock with things that (right now at least) are only needed for tests. But if you think it's worth it for other possible use cases I'll look for a way to make this work.

pytests/test_event_time.py Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/mod.rs Outdated Show resolved Hide resolved
@Psykopear
Copy link
Contributor Author

Thanks for the thorough review @davidselassie , this was needed.
I addressed most of the comments.

Right now the tests are broken due to the use of the TestingClock as the internal clock in the test, since we increment its internal counter more than once per event after the fix in EventTimeClock::watermark.

As I said there, I'm tempted to go back to having a TestingEventTimeClock for the sake of simplicity, but let me know if you prefer me to find a way to make it work as it is.

Do we have a use case outside of testing to have different kind of clocks for the EventTimeClock?

def input_builder(worker_index, worker_count, state):
# Each yield advances the clock by 1 second
# This should be processed in the first window
yield temp(1, 1)
Copy link
Contributor

@blakestier blakestier Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the comments! At the risk of having a less legit, real-life flow, you could modify the events to be strings instead of temp values, so their values could be more revealing, i.e. "window 1", "drop since late", "window 2".

#[pyo3(get)]
pub(crate) late_after_system_duration: chrono::Duration,
#[pyo3(get)]
pub(crate) system_clock_config: Option<TestingClockConfig>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an optional TestingClockConfig here to allow for testing windowing with event time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now slightly different, but yes, we still accept a parameter to use a testing clock rather than the default system clock in the event time clock. As I said in another comment, initially I implemented two separate clocks, EventTimeClock and TestingEventTimeClock, with the only difference between the two that the testing clock was using an auto incremented counter rather than system's now.
This meant keeping the two implementations in sync by hand, and could lead to problems if not (a test passing when it shouldn't because test and production use two different implementation of the event time clock), so I decided to find a way to reuse the same EventClock changing only the meaning of now.
I'm still not convinced it's the best approach, since there is a parameter needed only for testing inside a "production" object, but I felt this was better than the way I did it previously.

Open to suggestions though

Copy link
Contributor

@davidselassie davidselassie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working so hard on this. I try not to nitpick people's algorithms, but I wanted to really double check that this produced the correct behavior and happened to think of some maybe clearer phrasing? Feel free to discard this advice on your discretion.

src/window/mod.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
@@ -18,8 +18,15 @@ use super::{Clock, ClockConfig};
/// Config object. Pass this as the `clock_config` parameter to
/// your windowing operator.
#[pyclass(module="bytewax.window", extends=ClockConfig)]
#[derive(Clone, Copy)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these suddenly needed on our config types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is due to the new ClockBuilder trait, which consumes self.

Previously in the build_clock_builder function we were manually cloning all the fields in the config, and passing them to the builder. With the new trait we clone the whole config instead (required by the FromPyObject trait, used in extract). So instead of implicitly requiring all the fields in the config to be Clone, we require the config itself to be cloneable.
Since we end up cloning all the fields in the end, I thought having this requirement here wouldn't do much of a difference.

This change is not needed for this PR, I can easily revert it, remove the ClockBuilder trait, and implement the builder function on the struct like it was previously. It's a refactoring I did along the way to make the pattern we use more explicit, and I feel it makes the code more readable, but it works the other way too.

src/window/event_time_clock.rs Outdated Show resolved Hide resolved
src/window/event_time_clock.rs Show resolved Hide resolved
src/window/event_time_clock.rs Outdated Show resolved Hide resolved
pytests/test_event_time.py Outdated Show resolved Hide resolved
@Psykopear Psykopear force-pushed the event-time-processing branch 2 times, most recently from 023a8af to 423ba33 Compare September 23, 2022 09:08
@Psykopear Psykopear marked this pull request as ready for review September 23, 2022 13:00
/// Config object. Pass this as the `clock_config` parameter to your
/// windowing operator.
#[pyclass(module="bytewax.window", extends=ClockConfig)]
#[pyo3(text_signature = "(dt_getter, wait_for_system_duration, system_clock)")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think wait_for_system_duration is explicit but perhaps leaking implementation. What about something like lateness_buffer or late_after_duration or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's "internal implementation" but it's also the correct interpretation of this duration, as there's sort of multiple types of time going on.

@Psykopear Psykopear force-pushed the event-time-processing branch 5 times, most recently from f8139d7 to 1bcfc7d Compare September 27, 2022 13:57
Base automatically changed from explore-datetimes-issues to main September 27, 2022 15:33
@Psykopear Psykopear merged commit 1e3a193 into main Sep 28, 2022
@davidselassie davidselassie deleted the event-time-processing branch September 28, 2022 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants