Flow.chunked with specified time period #1302
Could you please explain your use-case a bit? E.g., what kind of domain-specific problem would you like to solve with such an operator?
I need to batch live events to reduce synchronization cost and limit the refresh rate. I currently implement this feature in my project: events are sent to a channel and batched to 10 Hz in an event bus.
I have a flow of trades and I need to collect trades within a specified timespan (there is a high probability of many trades occurring within 1-5 seconds) and then do some processing of the collected trades.
Thanks. This kind of operator would be called chunked.
I'm in need of this too. My use case is as follows: I implemented a barcode scanner which scans the camera input live. That means that I get barcode results every few milliseconds.
@PaulWoitaschek I had a requirement for a similar (though not exactly the same) operator and ended up implementing my own (final implementation at the end), if it helps.
I have a need for a time-based chunked too. I have a stream of analytics events originating in a mobile app. I would like to chunk them into time-based windows, so that I can send them to the backend periodically without initiating too many network connections. Say, chunk events for a minute, send them over the network, and repeat for the next package. For this task I care for:
I do not care for:
I had a discussion earlier with @elizarov about a size-based chunk operator: a single operator for time- and size-based chunks is desired, so that it is possible to chunk by size, by time, and by both. I would like to propose a base operator with three parameters:
An important consideration is how exactly time-based chunking should be implemented. Should "chunking" start right at the moment of collection, or only after receiving the first value? Should a new "chunking window" start right after the previous window, or only after receiving the first value for a new chunk?

Based on my needs, if I want chunks to not be empty, that means I am OK with emissions not happening at regular intervals. In fact, I want chunking to gather as many values as possible in a time frame. So I would propose that if minSize > 0, then chunking should start only after receiving the first value for a new chunk. That way, we won't be "chunking air". Only minSize = 0 would mean that we want regular chunk emissions, no matter the size -> chunking should start right at the moment of collection and a new chunking window should follow right after the previous one.

The above may not be very intuitive, however. Maybe we should always start time-based chunking right after collection and let one chunking window follow another. Chunks not fulfilling minSize would just be added to the following chunks, and so on until the subsequent chunk fulfills the minSize param.

As for the exact API shape, I think we could have one operator with an @experimental Duration param and another with just chunkDurationMs: Long -> to not tie this operator's experimentality to kotlin.time. I also thought that for purely size-based chunking it would still be convenient to have a separate operator without a Duration param. I have also decided to omit a trailing lambda.

To sum up:

public fun <T> Flow<T>.chunked(maxSize: Int, minSize: Int = 1): Flow<List<T>>
public fun <T> Flow<T>.chunked(chunkDuration: Duration, minSize: Int = 1, maxSize: Int = NO_MAXIMUM): Flow<List<T>>
public fun <T> Flow<T>.chunked(chunkDurationMs: Long, minSize: Int = 1, maxSize: Int = NO_MAXIMUM): Flow<List<T>>
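(A sketch of how these proposed overloads would read at the call site, assuming NO_MAXIMUM is a library sentinel constant and kotlin.time's .seconds extension:)

// size-based: emit every 500 elements (minSize defaults to 1)
flow.chunked(maxSize = 500)
// time-based: emit whatever accumulated during each 1-second window
flow.chunked(chunkDuration = 1.seconds)
// time-based with a size cap: emit early once 100 elements are collected
flow.chunked(chunkDuration = 1.seconds, maxSize = 100)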
Please remember to consider the natural batching use-case (#902) in the design.
Good point @pacher. To express this yet another use case in a single operator, I think something more expressive is needed: a way to specify what needs to happen for a chunk to be emitted. How about:
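(The snippet itself was lost in extraction; based on the discussion that follows, it presumably resembled a hypothetical shape like this, with kotlin.time.Duration assumed:)

sealed class ChunkingMethod {
    data class BySize(val size: Int) : ChunkingMethod()
    data class ByTime(val duration: Duration, val maxSize: Int = Int.MAX_VALUE) : ChunkingMethod()
    object Natural : ChunkingMethod()
}

fun <T> Flow<T>.chunked(method: ChunkingMethod): Flow<List<T>> = TODO("shape sketch only")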
...expressing the three most popular use cases? Do you think that last case is needed? One thing I regret is that a casual user will no longer be able to type a simple chunked(...) call.
I was more thinking along the lines of what @elizarov suggested in this comment: duration is set to zero and the size limit is set to a very big value. P.S. Personally, I would add an optional parameter for this.
Not really. All in all, duration 0 for natural batching is a clear winner for API design in my opinion.
Trying to fit natural batching into some magic combination of duration and maxSize of a chunk is IMHO not going to be intuitive at all.
That is why I tried to come up with a sealed class, clearly telegraphing the possible chunking behaviours. SizeLimit chunking does not need minSize or maxSize, just a single size. All in all, I am seeing that it is not easy to fit natural batching in here. Maybe instead of trying to do it, we could come up with something more general / abstract, kind of like a select expression. You could freely omit / start / stop a timer and use / ignore an onCollectorReady clause. And, most importantly, emit whenever and whatever you wish. Edit: Simplified signature. But boy, that is much more complicated than a simple chunked.
@circusmagnus I am waiting for clarification here before discussing it further.
Before we delve into the design we need to gather and discuss use-cases. @circusmagnus has great use-cases which call for it.
Well, I think all three use cases:
... can live without minSize. It had two motivations:
After a few days I think @pacher may be right. Let it be so. Come to think of it, maybe we could somehow insert AND or OR between the size and duration parameters for some more customization, as sketched below. That would solve all minSize issues without an additional parameter.
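(The usage snippets were lost in extraction. A hypothetical sketch of the AND/OR idea, with all names invented for illustration and kotlin.time assumed:)

sealed interface ChunkConstraint
data class SizeConstraint(val max: Int) : ChunkConstraint
data class IntervalConstraint(val duration: Duration) : ChunkConstraint
data class OrConstraint(val first: ChunkConstraint, val second: ChunkConstraint) : ChunkConstraint

infix fun ChunkConstraint.OR(other: ChunkConstraint): ChunkConstraint = OrConstraint(this, other)

// usage: emit when either limit is hit, whichever comes first
// flow.chunked(SizeConstraint(100) OR IntervalConstraint(1.seconds))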
Discussion seems to have stalled. Perhaps we should take the issue to the coroutines Slack channel? Or do something else to attract more people, gather use-cases, and discuss the design?
My use case is also batching (not sure if natural or not): specifically, emit if either the buffer reaches the size limit, or the time limit is reached and the buffer contains any elements (although I could also easily filter out empty lists in the flow later). Crucial is that after emitting it resets the timer; without that, the next run might emit too early because the previous timer triggers. Backpressure and such should be maintained too, of course.
OK, I will have another go at this issue. @elizarov mentioned that there is no obvious case for minSize and it complicates things. So let's assume that the minimum chunk size is always one. Seems to fit all use cases:
No minSize param then. That leaves us with duration and size params. How to express our use cases with them?

Since our 'size' parameter is either a maximumSize for time-based chunking or just the desired size, it means we should try to emit immediately when it is reached, suspending upstream if need be.

In other words: on reaching the size limit, we do suspend upstream until emission happens; a chunk cannot grow bigger than specified. On reaching the time limit, we do not suspend upstream, no matter whether we did emit or are still waiting for downstream to get ready; our time limit may be breached due to a busy downstream, and we cannot prevent it. That shapes our design into `chunked(intervalConstraint OR sizeConstraint)` consistently across all use cases. So the proposal boils down to that. Proposed impl (give or take; no sanity checks, etc.):
Helper functions:
Plus optimized, non-concurrent, impl for purely size-based chunking:
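(The implementation snippets referenced above were lost in extraction. As a stand-in, here is a minimal sketch of just the non-concurrent, purely size-based variant; my reconstruction, not the author's actual code:)

fun <T> Flow<T>.chunked(maxSize: Int): Flow<List<T>> = flow {
    var chunk = ArrayList<T>(maxSize)
    collect { value ->
        chunk.add(value)
        if (chunk.size == maxSize) {
            emit(chunk)                // downstream suspension backpressures upstream here
            chunk = ArrayList(maxSize)
        }
    }
    if (chunk.isNotEmpty()) emit(chunk) // trailing partial chunk on completion
}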
I'm using my own rough implementation for this to batch database updates.
upstreamFlowOfManyFrequentUpdates
    .chunked(sizeLimit = 100_000, timeLimit = 2.seconds)
    .collect { updates ->
        // write updates to DB
    }
I like the signature above.
Come to think of it, this is not a good idea. Chunking should always happen in a different coroutine than the downstream collecting. I think having your upstream blocked by a slow consumer (like sending chunk data over a network) is not desirable even in purely size-based chunking. It is also more consistent this way.
Backpressure is a desirable property in some systems, at least for our use case.
I agree. We will have backpressure working all right in both cases. The main difference is that backpressure in a concurrent solution will kick in a bit later.

For one, I would not like to have my analytics-event "production" suspended for the time of making a network request with the previous chunk. I want my analytics-event "production" suspended only if a chunk hits the size limit. For two, I think time- and size-based chunking should behave consistently / in the same way when it comes to backpressure due to the size-limit constraint.
I just want to note that there's an impl here that solves at least 50% of the use case: #2193 (comment)
You can use built-in functions to achieve the time-based chunk:
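(The snippet itself was lost in extraction. A reconstruction of the idea using only built-in kotlinx.coroutines functions, where produceIn is a @FlowPreview API and chunkedByTime is an invented name:)

@OptIn(kotlinx.coroutines.FlowPreview::class)
fun <T> Flow<T>.chunkedByTime(periodMs: Long): Flow<List<T>> = flow {
    coroutineScope {
        val upstream = produceIn(this)
        while (true) {
            // suspend until the first element of the next chunk
            val first = upstream.receiveCatching().getOrNull() ?: break
            val chunk = mutableListOf(first)
            // then drain for at most periodMs
            withTimeoutOrNull(periodMs) {
                while (true) chunk.add(upstream.receiveCatching().getOrNull() ?: break)
            }
            // caveat: a timeout racing a resuming receive can, in principle, drop an
            // element; a select-based implementation avoids that window
            emit(chunk)
        }
    }
}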
Hi! My implementation transforms a channel into a flow of chunks.
For example:

channel.asChunkedFlow(4, 130).collect { items ->
    println("2: processing $items")
    delay(100)
}

It makes a receiver flow from that channel.
You can connect many workers that will consume chunks from that channel.
Of course multiple producers can send to that channel.
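(The asChunkedFlow implementation was also lost in extraction. A minimal sketch of what such an operator could look like; my reconstruction, assuming the two parameters are maximum chunk size and a timeout in milliseconds, matching the (4, 130) call above:)

fun <T> ReceiveChannel<T>.asChunkedFlow(maxSize: Int, timeoutMs: Long): Flow<List<T>> = flow {
    while (true) {
        // each collector competes for elements, so many workers can share the channel
        val first = receiveCatching().getOrNull() ?: break
        val chunk = mutableListOf(first)
        withTimeoutOrNull(timeoutMs) {
            while (chunk.size < maxSize) chunk.add(receiveCatching().getOrNull() ?: break)
        }
        emit(chunk)
    }
}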
Based on the discussion here and some of my own experiences, I have updated PR #2378. Current API proposal and usage:
Implementation details can be seen fully in the PR: #2378. Time-based methods use a Channel as an accumulator and some form of signaller (a Job, or a Channel in the case of the TimeOrSize method) to eventually indicate that it should be emptied before the interval passes. Natural chunking uses just two concurrent coroutines: one sends to the channel, the other uses tryReceive in a loop (suspending while waiting for the first element), as sketched below.
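(A sketch of that natural-batching scheme as described; my paraphrase, not the PR's actual code. One coroutine forwards upstream into a channel, the other suspends only for the first element of each chunk and then drains whatever is immediately available with tryReceive:)

@OptIn(kotlinx.coroutines.FlowPreview::class)
fun <T> Flow<T>.chunkedNaturally(): Flow<List<T>> = flow {
    coroutineScope {
        val channel = produceIn(this)    // coroutine 1: upstream -> channel
        while (true) {
            val first = channel.receiveCatching().getOrNull() ?: break
            val chunk = mutableListOf(first)
            // drain everything already buffered, without suspending
            while (true) chunk.add(channel.tryReceive().getOrNull() ?: break)
            emit(chunk)                  // a slower collector naturally yields bigger batches
        }
    }
}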
@circusmagnus thanks a lot for this proposal and PR. I just have a couple of considerations regarding the public API.
Maybe a simple solution to 2) would be to additionally define related high-level functions. An alternative could be to simply define only those high-level functions.
That is true, and I like your idea very much. But I would be equally happy with merging a PR with such a (simpler) API.
You're right. I guess as soon as we switch to simple params, the dilemma will be around default values. The defaults I was initially thinking about were:
So the use cases would look this way on the call site:
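(The call-site examples were lost in extraction. Using the sizeLimit/timeLimit names from earlier in this thread, a guess at the intended shape, they would look roughly like:)

flow.chunked(sizeLimit = 100)                          // purely size-based
flow.chunked(timeLimit = 1.seconds)                    // purely time-based
flow.chunked(sizeLimit = 100, timeLimit = 1.seconds)   // whichever limit is hit first
flow.chunked(timeLimit = Duration.ZERO)                // natural batching, as discussed above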
I'm not a fan of the unbounded default for time-based chunking though... :/ Other defaults could be considered, but there may never be a sweet spot, unfortunately. All in all, I sort of agree that the interface (or potential sealed class) would make it more discoverable (I would never have known about natural batching by just seeing this API, without this discussion). I'd like to hear the Kotlin team's opinion about this.
As I said, I would go with two/three operators. That way I am sticking to the original param names, as timeLimit sounds unobvious to me: "Chunked by timeLimit? TimeLimit = 0 (for natural batching)?" But maybe that's my poor English knowledge. :)
Half a year later... What about such a signature?
example usage: Pros:
Cons:
I don't think it's a problem; it's probably worth adding an overload that under the hood uses the proposed function. There are already such simplified APIs in kotlinx.coroutines which use more generic versions of the API to provide a more case-specific version of a function.
Adding a variant on the "size" use case: where the downstream consumer is an API call, it may have constraints on the set of data, such as a maximum count of items, or a maximum total "weight" of the chunk.
Both involve maintaining and checking state to determine if a chunk should be emitted. It would be helpful if the operator supported this. From existing Java implementations of this (closed source), "weight" can default to Long.MAX_VALUE and the weighing function can default to return 0. This assumes an implementation that tracks both, which may not be the case for a Flow-based variant.
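(A hypothetical shape for such a weighted variant, with names invented for illustration and the defaults described above:)

fun <T> Flow<T>.chunkedWeighted(
    maxCount: Int,
    maxWeight: Long = Long.MAX_VALUE, // total-weight cap per chunk
    weigh: (T) -> Long = { 0L },      // default weight 0 disables weighing
): Flow<List<T>> = flow {
    var chunk = mutableListOf<T>()
    var weight = 0L
    collect { value ->
        val w = weigh(value)
        // flush first if adding this value would breach the weight cap
        if (chunk.isNotEmpty() && weight + w > maxWeight) {
            emit(chunk); chunk = mutableListOf(); weight = 0L
        }
        chunk.add(value); weight += w
        if (chunk.size >= maxCount) {
            emit(chunk); chunk = mutableListOf(); weight = 0L
        }
    }
    if (chunk.isNotEmpty()) emit(chunk)
}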
I'm using this implementation, which uses standard functions. I'm curious to know if there's anything to improve. It is a bit wasteful in that it uses launch for every added value, but the implementation is clean IMO since there's no polling.

EDIT: updated from circusmagnus' comments.
Otherwise it indeed looks clean and simple. :)
We also need this in IJ.
Can I implement this?
I guess that won't work, because it won't emit if the flow terminates but the buffer is not exactly full.
I just want to echo this. I'm currently using a custom implementation.
However, timeliness is not important in my use case. If emissions are buffered, or suspend, to satisfy the above requirements, that is fine.
Would love this type of operator in the stdlib. My use case is chunking Bluetooth advertisements on Android so that we can determine when a device hasn't been seen before, and when a device has been turned off or is out of range. See JuulLabs/kable#521.
Copied from #4069

Use case

I have several cases where I need to group flow events by time and then combine them into something. The simplest example is averaging. If I have real-time data, I frequently want to take all points in a given time window and return only the averaged value for this time frame (or a default value if no events happened in this time window). I have several such cases in VisionForge (visualization library) and Controls-kt (device data acquisition).

The Shape of the API

I would call it sample, but that name is already taken, so it could be something like chunkedByTime, to be aligned with the stdlib method for collections. So it could look like this: fun <T> Flow<T>.chunkedByTime(duration: Duration): Flow<List<T>>. To make the API more universal, one could provide an external signal trigger, so that the collection is triggered by obtaining a signal from a flow or channel. Another API consideration is usage of the API on numbers. For my purposes List is OK; it could be grouped later, which gives flexibility. But when we use a List of numbers, it impacts performance. So one could provide an inline method that transforms an array of numbers internally in the flow evaluation loop, like fun <R> Flow<Double>.chunkedDoubleByTime(duration: Duration, collector: (DoubleArray) -> R): Flow<R>. Such an API is opinionated, so I am not sure it should be in the library.

Prior Art

sample and debounce have similar functionality, but they return only one value, not all of them. I think the implementation could share some code.
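(For instance, the averaging case could then look like this, where sensorReadings is a hypothetical Flow<Double> and chunkedByTime is the operator proposed above:)

val averaged: Flow<Double> = sensorReadings
    .chunkedByTime(1.seconds)  // one List<Double> per second
    .map { window -> if (window.isEmpty()) 0.0 else window.average() } // default for empty windows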
Hey guys from 2024!

Option A)

import java.time.Duration
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.*
import kotlinx.coroutines.flow.*

@ExperimentalCoroutinesApi
@ObsoleteCoroutinesApi
fun <T> Flow<T>.bufferTimeout(size: Int, duration: Duration): Flow<List<T>> {
require(size > 0) { "Window size should be greater than 0" }
require(duration.toMillis() > 0) { "Duration should be greater than 0" }
return flow {
coroutineScope {
val events = ArrayList<T>(size)
val tickerChannel = ticker(duration.toMillis())
try {
var hasTimedOut = false
val upstreamValues = produce { collect { send(it) } }
            while (isActive) {
                // (re)arm a one-shot watcher that flips hasTimedOut when the ticker fires
                val tickerJob = launch {
tickerChannel.receive()
hasTimedOut = true
}
                // poll upstream for at most 10 ms so the timeout flag is re-checked promptly;
                // note: when upstream completes, receive() throws ClosedReceiveChannelException,
                // so any final partial batch is dropped
                withTimeoutOrNull(10) { upstreamValues.receive() }
                    ?.let { events.add(it) }
if (events.size == size || (hasTimedOut && events.isNotEmpty())) {
emit(events.toList())
events.clear()
tickerJob.cancel()
hasTimedOut = false
}
}
} finally {
tickerChannel.cancel()
}
}
}
}

Option B

import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock
import kotlinx.coroutines.flow.*

@OptIn(FlowPreview::class)
private class TimedChunkFlow<T>(sourceFlow: Flow<T>, periodMs: Long) {
private class TimedChunkFlow<T>(sourceFlow: Flow<T>, periodMs: Long) {
private val chunkLock = ReentrantLock()
private var chunk = mutableListOf<T>()
val resultFlow = flow {
sourceFlow.collect {
// the chunk is reused before it's collected by "sample()"
val localChunk = chunkLock.withLock {
chunk.add(it)
chunk
}
emit(localChunk)
}
}.sample(periodMs).onEach {
chunkLock.withLock {
chunk = mutableListOf()
}
}
}

Option C

import kotlin.time.Duration
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun <T> Flow<T>.chunked(maxSize: Int, interval: Duration) = channelFlow {
val buffer = mutableListOf<T>()
var flushJob: Job? = null
collect { value ->
flushJob?.cancelAndJoin()
buffer.add(value)
if (buffer.size >= maxSize) {
send(buffer.toList())
buffer.clear()
        } else {
            // note: this flush coroutine mutates `buffer` concurrently with collect {};
            // this is only safe when both run on the same single-threaded dispatcher
            flushJob = launch {
delay(interval)
if (buffer.isNotEmpty()) {
send(buffer.toList())
buffer.clear()
}
}
}
}
flushJob?.cancelAndJoin()
if (buffer.isNotEmpty()) {
send(buffer.toList())
buffer.clear()
}
}

I also checked the proposal https://github.com/Kotlin/kotlinx.coroutines/pull/2378/files
Currently, Flow supports only the buffer operator with a capacity. It would be useful to buffer elements within a specified time range:
flow.buffer(Duration.ofSeconds(5)).collect {...}