Not enough public IPs in farm despite having free public IPs #970
Comments
Further investigation: these two IPs on farm ID 1 look free on GraphQL. However, they are reserved on the chain.
But @sameh-farouk found that these contracts were deleted, yet they are still on the farm object and keep these IPs reserved.
Update:
https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Ftfchain.grid.tf%2Fws#/explorer/query/5844419

What could be causing these incorrect states? It's still not clear to me, but it's probably either:
Now, how to fix this:
It's important to run checks to ensure there are no similar cases with other IPs on other farms as well. I'll follow up.
I'm personally a bit uneasy with the force reset if the root cause is not identified. Since the working theory seems to be that something messed with the state in an unexpected way, removing the IPs without knowing the full extent could lead to some more dangling state.
@LeeSmet I am currently in the process of debugging to identify the root cause of the issue, and I agree with you that understanding it is crucial. However, I would like to present a counter-argument to the idea that resetting these IPs could lead to more dangling state: these contracts have been removed from the chain, and it doesn't seem logical to keep IPs reserved for contracts that no longer exist. Note that the effect of the call is limited to the provided IP; it only resets its contract id to 0. The code is straightforward, and I don't foresee any harm in proceeding this way.

We can certainly wait until the root cause is identified, but regardless, we will eventually need to execute these calls to free up the IPs, in addition to any other fixes. Lastly, freeing these IPs will not hinder any investigation, since all activity is recorded on the blockchain. I therefore propose that we work in parallel: free these IPs while continuing to investigate this incident. This is particularly important because the issue partially affects deploying on mainnet.
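For what it's worth, a minimal sketch of what such a reset extrinsic could look like in a FRAME pallet. All names here (`force_reset_farm_ip`, `Farms`, the error variants) are assumptions for illustration, not the actual tfchain code:

```rust
// Sketch only: storage, error, and call names are assumed, not taken
// from the actual tfchain pallets.
#[pallet::weight(10_000)]
pub fn force_reset_farm_ip(
    origin: OriginFor<T>,
    farm_id: u32,
    ip: Vec<u8>,
) -> DispatchResult {
    // Privileged call, e.g. dispatched via root by the council.
    ensure_root(origin)?;

    Farms::<T>::try_mutate(farm_id, |maybe_farm| {
        let farm = maybe_farm.as_mut().ok_or(Error::<T>::FarmNotExists)?;
        let public_ip = farm
            .public_ips
            .iter_mut()
            .find(|pip| pip.ip == ip)
            .ok_or(Error::<T>::IpNotExists)?;

        // The effect is limited to the provided IP: its contract id is
        // reset to 0, i.e. marked free. Nothing else is touched.
        public_ip.contract_id = 0;
        Ok(())
    })
}
```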
Alright, I’ve managed to find the root cause of this erroneous state.

The creation of contract 16178 occurred at block 5844419. As previously mentioned, it was created with 0 public IPs, so it shouldn’t have reserved any IP. However, examining the state of farm 1 at the parent block shows that the IP was already reserved for contract 16178 before its creation.

I conducted a backward search to determine exactly when the IP was reserved for this specific contract and found that this occurred at block 5844377. I then began to scrutinize the events and calls in this block to understand how we arrived at this state after the block was finalized. In the state at this block, the last contract ID was 16177. However, there was a failed transaction: it attempted to create a node contract which, had it succeeded, would have been assigned the next available ID, 16178. Upon examining the call arguments and decoding the error details, we can ascertain two things: the optional solution provider provided by the user was
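This kind of search over historical state can be mechanized: given a helper that reports whether the IP is reserved in the state at a given block (queried via polkadot.js or an archive node; the helper itself is assumed here), a binary search pins down the first block where the reservation appears. A runnable sketch:

```rust
/// Returns the first block number in [lo, hi] for which `reserved_at`
/// reports the IP as reserved. Assumes the predicate flips from false
/// to true exactly once in the range (here: false before 5844377,
/// true from 5844377 onward).
fn first_reserved_block(mut lo: u32, mut hi: u32, reserved_at: impl Fn(u32) -> bool) -> u32 {
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if reserved_at(mid) {
            hi = mid; // reservation already present: look earlier
        } else {
            lo = mid + 1; // still free: look later
        }
    }
    lo
}

fn main() {
    // Toy stand-in for a real state query against an archive node.
    let reserved_at = |block: u32| block >= 5_844_377;
    assert_eq!(first_reserved_block(5_844_000, 5_844_419, reserved_at), 5_844_377);
}
```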
Now, let’s piece together the final part of the puzzle: why did this failed transaction alter the state? In older versions of Substrate, if a runtime function modified storage during an extrinsic, those specific changes could not be undone when the extrinsic later failed. This led to a crucial best practice in runtime development: “verify first, write last”. The principle states that you should never return an Err from an extrinsic after you have modified storage, as those changes will persist, potentially leaving the state inconsistent. Looking at tfchain/substrate-node/pallets/pallet-smart-contract/src/lib.rs, lines 648 to 652 at commit c242557, this is the pattern at play here.
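In rough code, the anti-pattern looks like this (a sketch with invented helper names, not the actual lines linked above):

```rust
// Sketch of the "write before verify" anti-pattern under old,
// non-transactional Substrate semantics. Helper names are made up.
#[pallet::weight(10_000)]
pub fn create_node_contract(
    origin: OriginFor<T>,
    farm_id: u32,
    solution_provider_id: Option<u64>,
) -> DispatchResult {
    let who = ensure_signed(origin)?;

    let contract_id = Self::next_contract_id();

    // (1) WRITE: the public IP is reserved on the farm object here,
    //     i.e. storage is already mutated.
    Self::reserve_public_ip(farm_id, contract_id)?;

    // (2) VERIFY: the solution provider is validated only afterwards.
    //     On an old runtime, if this returns Err the dispatch fails,
    //     but the reservation above is NOT rolled back: the IP stays
    //     bound to a contract id that is never created.
    Self::validate_solution_provider(solution_provider_id)?;

    // (3) Only now would the contract itself be written.
    Contracts::<T>::insert(contract_id, Self::build_contract(&who, contract_id, farm_id));
    Ok(())
}
```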
To the best of my knowledge, this issue is no longer pertinent, as the behavior was changed in more recent releases of Substrate: the extrinsic execution logic now discards any changes if an error occurs. So in the most recent version of the TFChain runtime this issue can’t occur. It’s also straightforward to try to reproduce and verify.
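For reference, older runtimes could opt into this rollback-on-error behavior explicitly with `frame_support`’s `#[transactional]` attribute, while recent Substrate applies a storage transaction to every dispatchable by default. A sketch, again with invented helper names:

```rust
use frame_support::transactional;

// With this attribute (or by default on recent Substrate), every write
// made during the call is discarded if the call returns Err.
#[transactional]
pub fn create_node_contract(origin: OriginFor<T>, farm_id: u32) -> DispatchResult {
    let _who = ensure_signed(origin)?;

    Self::reserve_public_ip(farm_id)?;       // storage written here...
    Self::validate_solution_provider(None)?; // ...but rolled back if this errs
    Ok(())
}
```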
Great investigation! Thank you. I am OK with the solution, since the code is already in place (probably for the same reason) and has probably already been used to restore coherency of the storage state. To evaluate the number of IPs which are reserved but should not be, an additional check could be added to the pallet-tfgrid live checking (#817). Then we would have a list of IPs to "free".
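A hedged sketch of what that check might look like (the storage names `Farms`/`Contracts` and the field layout are assumptions): walk every farm's public IPs and flag any IP whose reserving contract no longer exists.

```rust
// Illustrative only: Farms, Contracts, and the PublicIP layout are
// assumed, not taken from the actual pallets.
fn find_dangling_public_ips<T: Config>() -> Vec<(u32, Vec<u8>, u64)> {
    let mut dangling = Vec::new();
    for (farm_id, farm) in Farms::<T>::iter() {
        for pip in &farm.public_ips {
            // contract_id == 0 means "free"; anything else must point
            // at a contract that still exists on chain.
            if pip.contract_id != 0 && !Contracts::<T>::contains_key(pip.contract_id) {
                dangling.push((farm_id, pip.ip.clone(), pip.contract_id));
            }
        }
    }
    dangling
}
```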
Update:
Describe the bug
Mainnet, 2.4.2.
The Dashboard gives the error "Not enough public IPs in farm" despite there being free public IPs on the farm.
To Reproduce
Steps to reproduce the behavior:
Just deploy any instance on the dashboard using IPv4.
Screenshots