Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: improve epoch monitoring #3197

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

chore: improve epoch monitoring #3197

wants to merge 2 commits into from

Conversation

darshankabariya
Copy link
Contributor

@darshankabariya darshankabariya commented Dec 4, 2024

I discovered a bug while monitoring Grafana metrics in PR #3181 Currently, the waku_rln_proof_remining counter is reset inside the generateProof function. This approach fails in the following scenario:

If a new epoch starts but no proofs are generated (i.e., no messages are sent), the counter doesn't reset. Ideally, the counter should reset immediately when a new epoch begins, regardless of whether any messages are sent.

This PR addresses the issue by implementing an independent epoch monitoring task that ensures the counter is reset as soon as a new epoch starts.

Note: The spike should occur around 23:50, but it happens around 23:53 because the user does not generate any RLN proof ( msg ) within the 3-minute timeframe.

image

Copy link

github-actions bot commented Dec 4, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:3197

Built from 67558ef

Copy link
Collaborator

@Ivansete-status Ivansete-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for it! I added a couple of minor enhancements:)

waku/waku_rln_relay/rln_relay.nim Outdated Show resolved Hide resolved
waku/waku_rln_relay/rln_relay.nim Outdated Show resolved Hide resolved
Copy link
Contributor

@gabrielmer gabrielmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On one hand, I'm not a super big fan of having a procedure running in a loop every 500 ms just to reset one counter. On the other hand, it is indeed a bug and have nothing better to propose right now lol.

So I think it's ok, but it will be cool to eventually design a solution that implies a bit less of brute force. Don't want to overcomplicate things either though.

Thanks so much for catching and fixing this issue!

@darshankabariya
Copy link
Contributor Author

Thanks for it! I added a couple of minor enhancements:)

Hi @Ivansete-status thanks for your review! I've updated the PR based on your suggestions. Could you please check it again?

@darshankabariya
Copy link
Contributor Author

On one hand, I'm not a super big fan of having a procedure running in a loop every 500 ms just to reset one counter. On the other hand, it is indeed a bug and have nothing better to propose right now lol.

So I think it's ok, but it will be cool to eventually design a solution that implies a bit less of brute force. Don't want to overcomplicate things either though.

Thanks so much for catching and fixing this issue!

Hi @gabrielmer , thank you so much for your comment! I found a better approach and have updated the PR. Could you please take a look?

Copy link
Contributor

@gabrielmer gabrielmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much better, thanks so much!

Added a couple comments to make sure everything works as intended :)

@@ -392,6 +393,31 @@ proc generateRlnValidator*(

return validator

proc monitorEpochs(wakuRlnRelay: WakuRLNRelay): Future[void] {.async.} =
var nextEpochTime = epochTime()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't epochTime() return the current UNIX epoch? If so, why does nextEpochTime equal the current time?

let currentEpoch = wakuRlnRelay.calcEpoch(epochTime())

if currentEpoch != lastEpoch:
nextEpochTime = epochTime() + float64(wakuRlnRelay.rlnEpochSizeSec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if we use the current time and add the size of the RLN epoch, there will be small inaccuracies that will keep adding up and nextEpochTime will drift away from its intended value

Instead of calculating the current time and adding rlnEpochSizeSec, we should calculate the time at which the current RLN epoch started and add rlnEpochSizeSec to it (or just calculate directly at what exact time the next RLN epoch is supposed to start)

Copy link
Collaborator

@Ivansete-status Ivansete-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the enhancements!
Nevertheless, I have the impression that we are over complicating it :)
Let me elaborate a bit more.
We are looking for showing the generated proofs , and the remaining-available proofs within an RLN epoch when the node acts as a RLNaaS.

In this case, I think we have the epoch time window very well defined by the rlnEpochSizeSec config param (10min in the case of The Waku Network.) Then, I believe we only need a while loop that resets the waku_rln_remaining_proofs_per_epoch once every ``rlnEpochSizeSec` seconds.

Then, I think we only need:

while true:
  ...
  await sleepAsync(rlnEpochSizeSec * 1000)

On the other hand, it is pending to cancelAndWait the epochMonitorFuture from within the stop proc.

if firstChangeDetected:
await sleepAsync(int((nextEpochTime - epochTime()) * 1000))
else:
await sleepAsync(500) # 1 second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
await sleepAsync(500) # 1 second
await sleepAsync(500)


proc calcEpoch*(rlnPeer: WakuRLNRelay, t: float64): Epoch =
## gets time `t` as `flaot64` with subseconds resolution in the fractional part
## and returns its corresponding rln `Epoch` value
let e = uint64(t / rlnPeer.rlnEpochSizeSec.float64)
return toEpoch(e)

proc nextEpochTime*(rlnPeer: WakuRLNRelay, t: float64): float64 =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to give information about the expected units in the parameters

Suggested change
proc nextEpochTime*(rlnPeer: WakuRLNRelay, t: float64): float64 =
proc nextEpochTime*(rlnPeer: WakuRLNRelay, timeMillis: float64): float64 =

wakuRlnRelay.groupManager.userMessageLimit.get().float64
)
let nextEpochTime = wakuRlnRelay.nextEpochTime(epochTime())
await sleepAsync(int((nextEpochTime - epochTime()) * 1000))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The await should better go outside the exception tracking to avoid possible infinite blocking

@@ -392,6 +415,17 @@ proc generateRlnValidator*(

return validator

proc monitorEpochs(wakuRlnRelay: WakuRLNRelay): Future[void] {.async.} =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if already explained somewhere else but, what are we willing to achieve? That sounds overcomplicated :)

@gabrielmer
Copy link
Contributor

In this case, I think we have the epoch time window very well defined by the rlnEpochSizeSec config param (10min in the case of The Waku Network.) Then, I believe we only need a while loop that resets the waku_rln_remaining_proofs_per_epoch once every ``rlnEpochSizeSec` seconds.

Then, I think we only need:

while true:
  ...
  await sleepAsync(rlnEpochSizeSec * 1000)

On the other hand, it is pending to cancelAndWait the epochMonitorFuture from within the stop proc.

The issue is that a sleep of 10 minutes does not take exactly 10 minutes, there will be a small error and in every iteration of the loop the error will increase and increase, until at some point it will be completely out of shift.

Maybe it won't be immediately noticeable, but if you have a long-running node it does become an issue I believe

@darshankabariya
Copy link
Contributor Author

darshankabariya commented Dec 14, 2024

The issue is that a sleep of 10 minutes does not take exactly 10 minutes, there will be a small error and in every iteration of the loop the error will increase and increase, until at some point it will be completely out of shift.

Maybe it won't be immediately noticeable, but if you have a long-running node it does become an issue I believe

excatly !

but recently i noticed that epoch time always sync across the waku network and triggers on the 10th minute (e.g., XX:X0). It never starts at random times like 13 or 17 (It means it's not just tied to the node's start time). I think we can leverage this to simplify the logic, but just wanted to confirm if I’m missing something. I observed this after several tests.

i try to raplicate to give a idea.
image

image

@gabrielmer @Ivansete-status

@darshankabariya
Copy link
Contributor Author

darshankabariya commented Dec 14, 2024

Hi @gabrielmer @Ivansete-status @NagyZoltanPeter @SionoiS

I recently noticed something interesting during test this PR. we all know, with RLN, we can send up to 100 messages per epoch (10 minutes). Here’s what I observed, and I’m unsure if it’s expected behavior or a bug:

For example, if an epoch starts at 4:00, we can send 100 messages until 4:10. If I try to send more than 100 messages, they’re blocked, which makes sense. But in another scenario, if the epoch starts at 4:00 and I don’t send any messages until 4:06, then send all 100 messages between 4:06 and 4:09, when the new epoch starts at 4:10, I still can’t send any messages and get a ‘message limit exceeded’ error. Interestingly, I’m only able to send messages again at 4:16.

Is this behavior expected, or could it be a bug? If it’s intentional, I’ll need to adjust the logic to handle this scenario.

@gabrielmer
Copy link
Contributor

but recently i noticed that epoch time always sync across the waku network and triggers on the 10th minute (e.g., XX:X0). It never starts at random times like 13 or 17 (It means it's not just tied to the node's start time). I think we can leverage this to simplify the logic, but just wanted to confirm if I’m missing something. I observed this after several tests.

Yes! My understanding is that the epoch is something global for the whole network and does not depend on one node's start time.

In this comment I explained how I found it is calculated #3197 (comment)

We can't assume that the epoch is 10 minutes though, I think we have to use the general rlnEpochSizeSec parameter to compute when the next epoch is supposed to start

@gabrielmer
Copy link
Contributor

Hi @gabrielmer @Ivansete-status @NagyZoltanPeter @SionoiS

I recently noticed something interesting during test this PR. we all know, with RLN, we can send up to 100 messages per epoch (10 minutes). Here’s what I observed, and I’m unsure if it’s expected behavior or a bug:

For example, if an epoch starts at 4:00, we can send 100 messages until 4:10. If I try to send more than 100 messages, they’re blocked, which makes sense. But in another scenario, if the epoch starts at 4:00 and I don’t send any messages until 4:06, then send all 100 messages between 4:06 and 4:09, when the new epoch starts at 4:10, I still can’t send any messages and get a ‘message limit exceeded’ error. Interestingly, I’m only able to send messages again at 4:16.

Is this behavior expected, or could it be a bug? If it’s intentional, I’ll need to adjust the logic to handle this scenario.

Ah wow that's super interesting. I would think that it is a bug, my understanding is that the limit should be according to the global epoch and not depending on when our node starts sending messages. But I'm not sure though, @alrevuelta should know better

How are you sending the messages? via REST API? If so, I suspect it might be a bug in our REST layer. I might be wrong, but that's the first thing I would check

@darshankabariya
Copy link
Contributor Author

We can't assume that the epoch is 10 minutes though, I think we have to use the general rlnEpochSizeSec parameter to compute when the next epoch is supposed to start

Thanks! I agree, there's a possibility that sync times could change (currently it starts at the 10th, but this could change) or even the rlnEpochSizeSec could change. In this scenario, the idea would fail. On further thought, I believe the current implementation is much more reliable.

@darshankabariya
Copy link
Contributor Author

Ah wow that's super interesting. I would think that it is a bug, my understanding is that the limit should be according to the global epoch and not depending on when our node starts sending messages. But I'm not sure though, @alrevuelta should know better

How are you sending the messages? via REST API? If so, I suspect it might be a bug in our REST layer. I might be wrong, but that's the first thing I would check

I’m using the chat interface to send messages. There is a possibility of an issue with the chat interface, but let’s confirm with @alrevuelta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants