panprog - Invalid oracle versions can cause desync of global and local positions making protocol lose funds and being unable to pay back all users #49
Comments
1 comment(s) were left on this issue during the judging contest. 141345 commented:
|
Escalate
This should be high, because:
I don't know why it's judged medium, but this issue is very likely to happen and will cause a lot of damage to the market, thus it should be high. |
Escalate
This is not HIGH because there is a limitation: "When oracle version is skipped for any reason" |
While it is supposed to be a rare event, in the current state of the code, this is a VERY LIKELY event, see #42 |
Skipping of oracle versions is an unlikely event. This makes this issue fall under MEDIUM severity according to Sherlock's classification rules.
It should be treated like different submissions describing different impacts with the same root cause: now, clearly your submission is not a duplicate, but it builds on that same root cause. If this root cause were fixed, your impact would still hold, but with a much lower likelihood |
Yes, I chain 2 issues to demonstrate high impact. In this case both issues should be high. We can't start predicting the future ("what happens if that one is fixed..."). The way it is now, the existence of either issue creates a high impact for the protocol, and each issue is a separate one.
While it doesn't provide the same clear rule for the opposite, I believe it's a logical continuation of that rule that impact shouldn't be decreased because of a future change in the code (such as a fix to another issue). |
The judging rule explicitly talks about unintended future implementation. So your logical continuation would also only be applicable to unintended future implementations (which I would absolutely agree with). Your submission, however, even shows that you were aware of the correct intended future implementation and the change in likelihood that it would bring.
It says OR: so it's either an unintended future implementation OR a future fix of another issue. I still think this is high because:
|
Medium severity seems more appropriate. Because:
Based on Sherlock's H/M criteria
|
In the current state of the code it is a very likely scenario (the expected behavior is for this not to be a common scenario, but currently it is common)
This depends. If a malicious user wants to cause damage (or comes up with a profitable scenario), the loss can be very significant. In the current state of the code, since it will happen often by itself, each instance will not be very significant (0.01-0.2% of the protocol funds depending on price volatility), but it will add up to large amounts over time. |
Although #42 points out the possibility of skipped versions, it does not look like a common event. 0.01-0.2% each time won't be considered a material loss. Some normal operations can have even more profit than that, such as spot-futures arbitrage or cross-exchange arbitrage. And the attacker needs to control the conditions to perform the action. As such, the loss amount and requirements fall into Medium. |
I've added a detailed response previously, but I don't see it now; maybe it got deleted or wasn't sent properly. Here it is again.
No, it's very easy to have skipped versions in the current implementation. For example:
I argue that this is actually a material loss - it's a percentage of the protocol funds. So if $100M is deposited into the protocol, the loss can be around $100K per instance. And since it can happen almost every granularity, this will add up very quickly.
Even if there is no attacker, the loss will be smaller, but it will be continuous in time, so even if it's, say, $10K per instance (with $100M deposited), it will add up to tens of millions over less than a day. |
Regarding the impact, I want to point out that #62 is high and has a similar impact (messed up internal accounting); however, the real damage from it is:
So the real damage from #62 is a reduction of deposited assets by (keeper fees / count) instead of (keeper fees). Since keeper fees are low (comparable to gas fees), the real damage is not that large compared to the funds deposited (likely less than 0.001%). However, the issue from this report also causes messed up internal accounting, but the real damage depends on position size and price volatility and will be much higher than in #62 on average, even when happening by itself. If coming from malicious parties, this can be a very large amount. Even though the damage in #62 is much easier to inflict, I believe that due to the higher damage per instance of this issue, the overall damage over time from this issue will be higher than from #62. Something like: So based on the possible real overall damage caused, this issue should be high. |
This should be MEDIUM because skipped oracle versions are not a common event.
Skipped oracle versions are unlikely, and the reason why it is a likely scenario in the current code is a bug which has already been reported in #42. Since different attack scenarios with the same fix are considered duplicates, this issue should be a MEDIUM, because issue #42 (which allows this bug to be a likely scenario), when fixed, will make this issue unlikely |
1st, the scenario is conditional, not the kind that happens on a daily basis. As such, the conditional loss and capped loss amount suggest medium severity. |
I don't think high severity means unconditional. My understanding is that high severity means high impact that can happen with rather high probability. And high probability doesn't mean it can happen every transaction; it just means that it can reasonably happen within a week or a month. And this issue's probability of happening is high.
Why do you only consider it a one-time event? The way it is now, that will be something like 0.1% per 1-10 minutes, which will add up to 10%+ in less than a day.
I think there are no new arguments presented here, so I leave it up to Sherlock to decide. I think ultimately it comes down to:
So my argument is that the way it is now, it's high, and severity shouldn't be downgraded as if the other bug were already fixed. |
I don't think it's one-time; it's something intermittent that happens every once in a while, but the frequency might not be as high as once per 1-10 minutes. And even when it happens, there could be a loss, but sometimes no loss. That's why, based on the probability of happening and the loss, it is more suitable to treat it as a conditional and capped loss. |
I still think this is High. The way it is now, it can happen as often as once per granularity if a malicious user abuses it, or maybe once per 10 granularities with semi-active trading by itself. Either way it's very possible and causes loss of funds.
What about the cross-exchange price difference? The magnitude could be at the same level. Also, the spot and perpetual contract prices could deviate; those cases are not considered the exchange's loss. |
It's different: those are normal protocol operations where funds just change hands. This issue leads to a mismatch between the funds the protocol has and the funds the protocol owes (the protocol has 100, but the total collateral users can withdraw can be 200, so not all users can withdraw, which can trigger a bank run with the last users unable to withdraw)
To chime in here from our perspective -- this issue identified a pretty fundamental flaw in our accounting system that would have caused the markets to be completely out of sync (both w.r.t. positions as well as balances) in the event of an unfortunately timed invalid version. While ideally rare, occasional invalid versions are expected behavior. Invalid versions can occur for a number of reasons during the course of normal operation:
Given that this bug would have likely surfaced in normal operation plus its noted effect, our opinion is that this should be marked as High. Fixed in: equilibria-xyz/perennial-v2#82 and equilibria-xyz/perennial-v2#94. |
Result: |
From WatchPug: L590 will revert due to underflow in the following case:
- context.latestPosition.local.magnitude() is equal to 0
- After the first run (i: 0): closableAmount is equal to 0
Under the new delta invalidation system, the aggregate amount of your pending closes must be less than your currently settled (latest) position as specified here. This underflow revert is actually intentional to enforce this invariant. See this test to cross-reference expected behavior. |
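To illustrate the invariant described above, here is a rough sketch (a hypothetical TypeScript helper, not Perennial's actual code): the amount still closable starts at the latest settled magnitude and is reduced by each pending close, so attempting to close more than that underflows and reverts.

```typescript
// Hypothetical sketch of the invariant above, not the actual Market.sol logic:
// the aggregate of pending closes may not exceed the latest settled position.
function remainingClosable(latestMagnitude: bigint, pendingCloses: bigint[]): bigint {
  let closable = latestMagnitude;
  for (const amount of pendingCloses) {
    if (amount > closable) {
      // corresponds to the intentional underflow revert referenced above
      throw new Error("cannot close more than the currently settled position");
    }
    closable -= amount;
  }
  return closable;
}
```

With latestMagnitude equal to 0, any non-zero pending close reverts immediately, which matches the case WatchPug describes.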
panprog
high
Invalid oracle versions can cause desync of global and local positions making protocol lose funds and being unable to pay back all users
Summary
When an oracle version is skipped for any reason (marked as invalid), pending positions are invalidated (reset to the previous latest position):
This invalidation is only temporary, until the next valid oracle version. The problem is that global and local positions can be settled at different next valid oracle versions, leading to a temporary desync of global and local positions. This in turn leads to incorrect accumulation of protocol values, mostly in the profit and loss accumulation, breaking internal accounting: the total collateral of all users can increase or decrease while the funds deposited remain the same, possibly triggering a bank run (since the last user to withdraw will be unable to do so), or some users might get their collateral reduced when it shouldn't be (a loss of funds for them).
Vulnerability Detail
In more detail, suppose there are 2 pending positions (position 1 and position 2) whose timestamps differ by 2 oracle versions, and the oracle version at position 1's timestamp is invalid, while the next version (version 2) and the version at position 2's timestamp (version 3) are valid. Then 2 different position flows are possible, depending on when the position is settled (when the update transaction is called):
- If settlement happens between oracle versions 2 and 3, position 1 is re-validated during the sync step and becomes the latest position at version 2's timestamp, and is later replaced by position 2 at version 3.
- If settlement happens only after oracle version 3, position 1 is skipped entirely (invalidated in the settlement loop) and the previous latest position is carried forward until position 2 becomes the latest.
While the end result (position 2) is the same, it's possible that the pending global position is updated earlier (taking the 1st path), while the local position is updated later (taking the 2nd path). For a short time (between oracle versions 2 and 3), the global position will accumulate everything (including profit and loss) using pending position 1's long/short/maker values, but the local position will accumulate everything using the previous position, which has different values.
Consider the following scenario:
Oracle uses granularity = 100. Initially user B opens position maker = 2 with collateral = 100.
T=99: User A opens long = 1 with collateral = 100 (pending position long=1 timestamp=100)
T=100: Oracle fails to commit this version, thus it becomes invalid
T=201: At this point the oracle version at timestamp 200 is not yet committed, but new positions are added with the next timestamp = 300:
User A closes his long position (update(0,0,0,0)) (pending positions: long=1 timestamp=100; long=0 timestamp=300)
At this point, the current global long position is still 0 (the global pending positions are the same as user A's local pending positions)
T=215: Oracle commits version with timestamp = 200, price = $100
T=220: User B settles (update(2,0,0,0) - keeping the same position).
At this point the latest oracle version is the one at timestamp = 200, so this update triggers the processing of global pending positions, and the current latest global position is now long = 1.0 at timestamp = 200.
T=315: Oracle commits version with timestamp = 300, price = $90
After settlement of both User A and User B, we have the following:
longPnl = 1*($90-$100) = -$10
makerPnl = -longPnl = +$10
Result:
User A deposited $100, User B deposited $100 (total $200 deposited)
After the scenario above:
User A has collateral $100, User B has collateral $110 (total $210 collateral withdrawable)
However, the protocol only has $200 deposited. This means that the last user to withdraw will be unable to withdraw the last $10, since the protocol doesn't have it, leading to a loss of funds for users.
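To make the accounting explicit, here is a minimal recomputation of the numbers in this scenario (illustrative TypeScript only, not the PoC test below; funding and interest are ignored):

```typescript
// Prices at the two valid oracle versions from the scenario above.
const priceAt200 = 100;
const priceAt300 = 90;

// Global settlement happened early (at T=220), so globally the latest position
// is long = 1 between versions 200 and 300, and the maker is credited the PnL.
const globalLong = 1;
const longPnl = globalLong * (priceAt300 - priceAt200); // -10: longs lose $10
const makerPnl = -longPnl;                              // +10: makers gain $10

// User A settles late, so his invalid pending long is skipped locally and his
// local position stays 0 for the whole period: he never pays the -$10.
const userACollateral = 100 + 0;        // $100
const userBCollateral = 100 + makerPnl; // $110 (User B is the maker)

const totalOwed = userACollateral + userBCollateral; // $210 owed to users
const totalHeld = 200;                               // $200 actually deposited
console.log(totalOwed - totalHeld); // $10 that the protocol cannot cover
```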
Impact
Any time the oracle skips a version (an invalid version), it's likely that global and local positions for different users who try to trade during this time will desync, leading to messed up accounting and loss of funds for users or the protocol, potentially triggering a bank run with the last user being unable to withdraw all funds.
The severity of this issue is high because, while invalid versions are normally a rare event, in the current state of the codebase there is a bug where Pyth oracle requests are made using the current block timestamp instead of the granularity-rounded future time (as positions do), which leads to invalid oracle versions for almost all updates (that bug is reported separately). Due to this other bug, the situation described in this issue will arise very often by itself in the normal flow of user requests, so it's almost 100% certain that the internal accounting of any semi-active market will be broken and total user collateral will deviate from the real deposited funds, meaning a loss of user funds.
But even with that other bug fixed, an invalid oracle version is a normal protocol event, and even 1 such event might be enough to break internal market accounting.
Proof of concept
The scenario above is demonstrated in the following test; add this to test/unit/market/Market.test.ts:
Console output for the code:
The maker has a bit more than $110 in the end, because he also earns funding and interest during the short time when the ephemeral long position is active (but user A doesn't pay these fees).
Code Snippet
`_processPositionGlobal` invalidates the position if the oracle version at its timestamp is invalid: https://github.com/sherlock-audit/2023-07-perennial/blob/main/perennial-v2/packages/perennial/contracts/Market.sol#L390-L393
`_processPositionLocal` does the same: https://github.com/sherlock-audit/2023-07-perennial/blob/main/perennial-v2/packages/perennial/contracts/Market.sol#L430-L437
`_settle` loops over global and local pending positions up to the latest oracle version timestamp. In this loop, each position is invalidated to the previous latest if it has an invalid oracle timestamp. So if `_settle` is called after the invalid timestamp, the previous latest is accumulated for it: https://github.com/sherlock-audit/2023-07-perennial/blob/main/perennial-v2/packages/perennial/contracts/Market.sol#L333-L347
Later in `_settle`, the latest global and local positions are advanced to the latestVersion timestamp. The difference from the loop is that, since the position timestamp is set to a valid oracle version, `_processPositionGlobal` and `_processPositionLocal` are called here with a valid oracle version, and thus the position (which is otherwise invalidated in the loop) is valid and is set as the latest position: https://github.com/sherlock-audit/2023-07-perennial/blob/main/perennial-v2/packages/perennial/contracts/Market.sol#L349-L360
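To visualize the two resulting settlement paths, here is an illustrative trace (TypeScript for illustration only; it paraphrases the behavior described above and is not Market.sol code), using the timestamps from the scenario:

```typescript
// Illustration only: the two possible settlement paths for the same pending
// positions, with the timestamps from the scenario above.
interface Position { timestamp: number; long: number }

const previous: Position = { timestamp: 0, long: 0 };
const pending1: Position = { timestamp: 100, long: 1 }; // oracle version at t=100 is invalid
const pending2: Position = { timestamp: 300, long: 0 }; // oracle version at t=300 is valid

// Early path (global side, settled at T=220 while version 200 is the latest):
// pending1 is re-stamped to the valid version 200 in the sync step and becomes
// the latest position, so long = 1 is accumulated between versions 200 and 300.
const globalPath: Position[] = [previous, { ...pending1, timestamp: 200 }, pending2];

// Late path (local side, settled only after version 300 is committed): pending1
// is invalidated in the settlement loop and skipped, so long stays 0 throughout.
const localPath: Position[] = [previous, pending2];

console.log(globalPath, localPath); // same end state, different accumulation history
```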
This means that for early timestamps, invalid version positions will become valid in the `sync` part of `_settle`, but for late timestamps, invalid version positions will be skipped completely in the loop before `sync`. This is the core reason for the desync between local and global positions.
Tool used
Manual Review
Recommendation
The issue is that positions with invalid oracle versions are ignored until the first valid oracle version; however, the first valid version can be different for global and local positions. One of the solutions I see is to introduce a map of position timestamp -> oracle version to settle at, which is filled in by global position processing. Local position processing would then follow the same path as the global one using this map, which should eliminate the possibility of different paths for global and local positions (a rough sketch of this idea follows below).
It might seem that the issue can only happen with exactly 1 oracle version between the invalid and valid positions. However, it's also possible that some non-requested oracle versions are committed (at some random timestamps between normal oracle versions), and the global position will go via a route like t100[pos0]->t125[pos1]->t144[pos1]->t200[pos2] while the local one will go t100[pos0]->t200[pos2], OR it can also go straight to t300 instead of t200, etc. So the exact route can be anything, and the local position will have to follow it; that's why I suggest a path map.
There might be some other solutions possible.
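As a rough sketch of the shape such a path map could take (hypothetical names, TypeScript for illustration only; an actual fix would live in the Solidity settlement logic):

```typescript
// Hypothetical illustration of the suggested "path map" approach, not an actual
// patch: global settlement records which oracle version each pending position
// timestamp was settled against, and local settlement replays the same mapping.
const settledVersionByTimestamp = new Map<number, number>();

// During global settlement: remember the oracle version actually used for each
// pending position timestamp (including positions re-stamped in the sync step).
function recordGlobalSettlement(positionTimestamp: number, oracleTimestamp: number): void {
  settledVersionByTimestamp.set(positionTimestamp, oracleTimestamp);
}

// During local settlement: follow the exact route the global side already took,
// so local and global positions can never accumulate over different versions.
function oracleTimestampForLocal(positionTimestamp: number, fallback: number): number {
  return settledVersionByTimestamp.get(positionTimestamp) ?? fallback;
}

// Example from the scenario: global settled the t=100 pending position at the
// valid version t=200, so local settlement of that position must also use t=200.
recordGlobalSettlement(100, 200);
console.log(oracleTimestampForLocal(100, 300)); // 200
```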