chore: improve epoch monitoring #3197
base: master
Conversation
You can find the image built from this PR at
Built from 67558ef
Thanks for it! I added a couple of minor enhancements:)
On one hand, I'm not a super big fan of having a procedure running in a loop every 500 ms just to reset one counter. On the other hand, it is indeed a bug and I have nothing better to propose right now lol.
So I think it's ok, but it would be cool to eventually design a solution that relies a bit less on brute force. I don't want to overcomplicate things either, though.
Thanks so much for catching and fixing this issue!
Force-pushed from aa23154 to 5becb37
Hi @Ivansete-status thanks for your review! I've updated the PR based on your suggestions. Could you please check it again?
Hi @gabrielmer, thank you so much for your comment! I found a better approach and have updated the PR. Could you please take a look?
It looks much better, thanks so much!
Added a couple of comments to make sure everything works as intended :)
waku/waku_rln_relay/rln_relay.nim
Outdated
@@ -392,6 +393,31 @@ proc generateRlnValidator*(
  return validator

proc monitorEpochs(wakuRlnRelay: WakuRLNRelay): Future[void] {.async.} =
  var nextEpochTime = epochTime()
Doesn't `epochTime()` return the current UNIX epoch? If so, why does `nextEpochTime` equal the current time?
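For context, `epochTime()` from Nim's `std/times` returns the current UNIX time as a float number of seconds, so the line above sets `nextEpochTime` to "now" rather than to a future epoch boundary. A minimal standalone check:

import std/times

echo epochTime()  # prints the current UNIX time in seconds, e.g. 1735689600.123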
waku/waku_rln_relay/rln_relay.nim
Outdated
let currentEpoch = wakuRlnRelay.calcEpoch(epochTime())

if currentEpoch != lastEpoch:
  nextEpochTime = epochTime() + float64(wakuRlnRelay.rlnEpochSizeSec)
I think that if we use the current time and add the size of the RLN epoch, there will be small inaccuracies that will keep adding up, and `nextEpochTime` will drift away from its intended value.

Instead of calculating the current time and adding `rlnEpochSizeSec`, we should calculate the time at which the current RLN epoch started and add `rlnEpochSizeSec` to it (or just calculate directly at what exact time the next RLN epoch is supposed to start).
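As a rough illustration of this suggestion (a minimal sketch under assumed names, not the PR's code), the next epoch boundary can be derived from `rlnEpochSizeSec` itself, so every sleep re-anchors to an exact boundary instead of accumulating drift:

import std/[math, times]

proc nextEpochStart(rlnEpochSizeSec: uint64, now: float64): float64 =
  ## `now` is a UNIX timestamp in seconds, e.g. from epochTime().
  let epochSize = rlnEpochSizeSec.float64
  let currentEpochIndex = floor(now / epochSize)  # epochs elapsed since UNIX time 0
  result = (currentEpochIndex + 1.0) * epochSize  # exact start of the next epoch

when isMainModule:
  # With a 600 s epoch the result is always a multiple of 600, so repeated sleeps
  # cannot drift away from the boundary the way "now + rlnEpochSizeSec" can.
  echo nextEpochStart(600, epochTime())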
Force-pushed from 37b5204 to 8f2bd39
Thanks for the enhancements!
Nevertheless, I have the impression that we are overcomplicating it :)
Let me elaborate a bit more.
We are looking to show the generated proofs, and the remaining available proofs, within an RLN epoch when the node acts as an RLNaaS.
In this case, I think we have the epoch time window very well defined by the `rlnEpochSizeSec` config param (10 min in the case of The Waku Network). Then, I believe we only need a while loop that resets the `waku_rln_remaining_proofs_per_epoch` counter once every `rlnEpochSizeSec` seconds.
Then, I think we only need:
while true:
  ...
  await sleepAsync(rlnEpochSizeSec * 1000)
On the other hand, it is still pending to `cancelAndWait` the `epochMonitorFuture` from within the `stop` proc.
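A minimal sketch of that cancellation, assuming the monitoring future is stored on the relay object as `epochMonitorFuture` (the stripped-down type below exists only for the example; `cancelAndWait` is the chronos call referred to above):

import chronos

type RlnRelaySketch = ref object
  epochMonitorFuture: Future[void]  # background epoch-monitoring task

proc stop(rlnRelay: RlnRelaySketch) {.async.} =
  # Cancel the epoch monitor and wait for it to finish before the rest of shutdown.
  if not rlnRelay.epochMonitorFuture.isNil():
    await rlnRelay.epochMonitorFuture.cancelAndWait()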
waku/waku_rln_relay/rln_relay.nim
Outdated
if firstChangeDetected:
  await sleepAsync(int((nextEpochTime - epochTime()) * 1000))
else:
  await sleepAsync(500) # 1 second
Suggested change:
- await sleepAsync(500) # 1 second
+ await sleepAsync(500)
waku/waku_rln_relay/rln_relay.nim
Outdated
proc calcEpoch*(rlnPeer: WakuRLNRelay, t: float64): Epoch =
  ## gets time `t` as `float64` with subseconds resolution in the fractional part
  ## and returns its corresponding rln `Epoch` value
  let e = uint64(t / rlnPeer.rlnEpochSizeSec.float64)
  return toEpoch(e)

proc nextEpochTime*(rlnPeer: WakuRLNRelay, t: float64): float64 =
I think we need to give information about the expected units in the parameters.

Suggested change:
- proc nextEpochTime*(rlnPeer: WakuRLNRelay, t: float64): float64 =
+ proc nextEpochTime*(rlnPeer: WakuRLNRelay, timeMillis: float64): float64 =
waku/waku_rln_relay/rln_relay.nim
Outdated
    wakuRlnRelay.groupManager.userMessageLimit.get().float64
  )
  let nextEpochTime = wakuRlnRelay.nextEpochTime(epochTime())
  await sleepAsync(int((nextEpochTime - epochTime()) * 1000))
The `await` would better go outside the exception tracking, to avoid possible infinite blocking.
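A sketch of that restructuring (simplified, assumed names, not the PR's exact code): the boundary calculation stays inside the exception tracking, while the `await sleepAsync(...)` sits outside it, so a failure only falls back to the short polling delay instead of potentially blocking forever:

import chronos, std/[math, times]

proc monitorIterationSketch(rlnEpochSizeSec: uint64) {.async.} =
  var sleepMillis = 500  # short fallback delay, mirroring the existing 500 ms poll
  try:
    # Anything that can raise stays inside the try block.
    let epochSize = rlnEpochSizeSec.float64
    let now = epochTime()
    let nextBoundary = (floor(now / epochSize) + 1.0) * epochSize
    sleepMillis = int((nextBoundary - now) * 1000)
  except CatchableError:
    discard  # keep the fallback delay if the calculation fails
  # The await happens outside the exception tracking, as suggested above.
  await sleepAsync(sleepMillis)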
@@ -392,6 +415,17 @@ proc generateRlnValidator*(
  return validator

proc monitorEpochs(wakuRlnRelay: WakuRLNRelay): Future[void] {.async.} =
Sorry if this was already explained somewhere else, but what are we trying to achieve? That sounds overcomplicated :)
Force-pushed from 25c4040 to f2832aa
The issue is that a sleep of 10 minutes does not take exactly 10 minutes; there will be a small error, and in every iteration of the loop the error will keep increasing, until at some point the timing is completely out of sync. Maybe it won't be immediately noticeable, but if you have a long-running node it does become an issue, I believe.
Exactly! But recently I noticed that the epoch time always syncs across the Waku network and triggers on the 10th minute.
Hi @gabrielmer @Ivansete-status @NagyZoltanPeter @SionoiS I recently noticed something interesting while testing this PR. As we all know, with RLN we can send up to 100 messages per epoch (10 minutes). Here's what I observed, and I'm unsure if it's expected behavior or a bug: For example, if an epoch starts at 4:00, we can send 100 messages until 4:10. If I try to send more than 100 messages, they're blocked, which makes sense. But in another scenario, if the epoch starts at 4:00 and I don't send any messages until 4:06, then send all 100 messages between 4:06 and 4:09, when the new epoch starts at 4:10 I still can't send any messages and get a 'message limit exceeded' error. Interestingly, I'm only able to send messages again at 4:16. Is this behavior expected, or could it be a bug? If it's intentional, I'll need to adjust the logic to handle this scenario.
Yes! My understanding is that the epoch is something global for the whole network and does not depend on one node's start time. In this comment I explained how I found it is calculated: #3197 (comment). We can't assume that the epoch is 10 minutes though; I think we have to use the general `rlnEpochSizeSec`.
Ah wow, that's super interesting. I would think that it is a bug; my understanding is that the limit should be according to the global epoch and not depend on when our node starts sending messages. But I'm not sure though, @alrevuelta should know better. How are you sending the messages? Via the REST API? If so, I suspect it might be a bug in our REST layer. I might be wrong, but that's the first thing I would check.
Thanks! I agree, there's a possibility that the sync times could change (currently the epoch starts at the 10th minute, but this could change) or even the `rlnEpochSizeSec` could change. In this scenario, the idea would fail. On further thought, I believe the current implementation is much more reliable.
I’m using the chat interface to send messages. There is a possibility of an issue with the chat interface, but let’s confirm with @alrevuelta.
Force-pushed from b876745 to 2aaf95e
Force-pushed from 2aaf95e to 049fbea
I discovered a bug while monitoring Grafana metrics in PR #3181. Currently, the `waku_rln_remaining_proofs_per_epoch` counter is reset inside the `generateProof` function. This approach fails in the following scenario:
If a new epoch starts but no proofs are generated (i.e., no messages are sent), the counter doesn't reset. Ideally, the counter should reset immediately when a new epoch begins, regardless of whether any messages are sent.
This PR addresses the issue by implementing an independent epoch monitoring task that ensures the counter is reset as soon as a new epoch starts.
Note: The spike should occur around 23:50, but it happens around 23:53 because the user does not generate any RLN proof (msg) within that 3-minute timeframe.
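For completeness, here is a condensed sketch of the approach the PR describes: an independent task that resets the metric on every epoch boundary. The gauge declaration, proc name, and parameters below are illustrative stand-ins rather than the PR's actual code:

import chronos, metrics, std/[math, times]

declareGauge sketch_remaining_proofs_per_epoch,
  "Illustrative stand-in for the remaining-proofs-per-epoch metric"

proc monitorEpochsSketch(rlnEpochSizeSec: uint64, userMessageLimit: float64) {.async.} =
  let epochSize = rlnEpochSizeSec.float64
  while true:
    # Reset the gauge at the start of every epoch, even if no message was sent.
    sketch_remaining_proofs_per_epoch.set(userMessageLimit)
    # Sleep until the exact start of the next epoch to avoid cumulative drift.
    let now = epochTime()
    let nextBoundary = (floor(now / epochSize) + 1.0) * epochSize
    await sleepAsync(int((nextBoundary - now) * 1000))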