Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Operation is not valid due to the current state of the object" error when consensus node handle 512 txs #1410

Closed
cloud8little opened this issue Jan 14, 2020 · 18 comments · Fixed by #1422
Assignees
Labels
Bug Used to tag confirmed bugs Critical Issues (bugs) that need to be fixed ASAP

Comments

@cloud8little
Copy link
Contributor

cloud8little commented Jan 14, 2020

Describe the bug
when handling 512 transactions to generating a new block, consensus service will stuck and throw "Operation is not valid due to the current state of the object" error.

To Reproduce
Steps to reproduce the behavior:
neo-cli: master 51cd29fbe21abb9e1f17f64e5c6d21bc7decbbb9
neo: master ab4830c
neo-vm: master be2ac36bf35a3033d828e0ba0630d390599c487d
MillisecondsPerBlock is 15000.

  1. build latest neo-cli with neo/neo-vm project.
  2. deploy 4 consensus node.
  3. deploy 1 external node
  4. send 50 tx/sec and last about40secs to external node.
{
  "jsonrpc": "2.0",
  "method": "sendtoaddress",
  "params": ["0x8c23f196d8a1bfd103a9dcb1f9ccf0c611377d3b", "NNa1Fgc82Qoh65TJfViqC92W8Z8AWL9YKA",0.00000001],
  "id": 1
}

Expected behavior
consensus nodes are expected to handle massive txs.

Screenshots

It can be seen from the screenshot that when number of txs is 216, consensus will continue, and when number of txs is 512, then error shows.
fa03d6d4efcea75043d0c4781cb51f5

[18:17:18.052] initialize: height=35 view=0 index=3 role=Primary
[18:17:33.058] timeout: height=35 view=0
[18:17:33.059] send prepare request: height=35 view=0
[18:17:33.189] OnPrepareResponseReceived: height=35 view=0 index=0
[18:17:33.299] OnPrepareResponseReceived: height=35 view=0 index=1
[18:17:33.302] send commit
[18:17:33.880] OnCommitReceived: height=35 view=0 index=1 nc=1 nf=0
[18:17:34.118] OnPrepareResponseReceived: height=35 view=0 index=2
[18:17:34.124] OnCommitReceived: height=35 view=0 index=2 nc=2 nf=0
[18:17:34.137] relay block: height=35 hash=0xa5d4b10cfaf0c6b24f598c975235b15d6399176fb4b0fb1fd41224d43a250103 tx=216
[18:17:34.164] persist block: height=35 hash=0xa5d4b10cfaf0c6b24f598c975235b15d6399176fb4b0fb1fd41224d43a250103 tx=216
[18:17:34.165] initialize: height=36 view=0 index=3 role=Backup
[18:17:49.159] OnPrepareRequestReceived: height=36 view=0 index=0 tx=512
[18:17:49.164] send prepare response
[18:17:49.890] OnPrepareResponseReceived: height=36 view=0 index=1
[18:17:49.892] send commit
[18:17:49.910] OnCommitReceived: height=36 view=0 index=1 nc=1 nf=0
[18:17:50.145] OnCommitReceived: height=36 view=0 index=0 nc=2 nf=0
[18:17:50.150] relay block: height=36 hash=0xb3be2ffab0fd2ac9f46dd8d14e5727d9d6c3f7e739cf72cf2c034d2100cc40cb tx=512
[ERROR][1/14/2020 10:17:50 AM][Thread 0005][akka://NeoSystem/user/$a] Operation is not valid due to the current state of the object.
Cause: System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at Neo.Ledger.Blockchain.Persist(Block block) in D:\NEO\Github\neo\src\neo\Ledger\Blockchain.cs:line 503
   at Neo.Ledger.Blockchain.OnNewBlock(Block block) in D:\NEO\Github\neo\src\neo\Ledger\Blockchain.cs:line 333
   at Neo.Ledger.Blockchain.OnReceive(Object message) in D:\NEO\Github\neo\src\neo\Ledger\Blockchain.cs:line 477
   at Akka.Actor.UntypedActor.Receive(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)
[ERROR][1/14/2020 10:17:50 AM][Thread 0027][akka://NeoSystem/user/$a] Error while creating actor instance of type Neo.Ledger.Blockchain with 2 args: (Neo.NeoSystem,Neo.Plugins.Storage.Store)
Cause: [akka://NeoSystem/user/$a#456841214]: Akka.Actor.PostRestartException: Exception post restart (System.InvalidOperationException)
 ---> System.TypeLoadException: Error while creating actor instance of type Neo.Ledger.Blockchain with 2 args: (Neo.NeoSystem,Neo.Plugins.Storage.Store)
 ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
 ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at Neo.Ledger.Blockchain..ctor(NeoSystem system, IStore store) in D:\NEO\Github\neo\src\neo\Ledger\Blockchain.cs:line 107
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor, Boolean wrapExceptions)
   at System.Reflection.RuntimeConstructorInfo.Invoke(BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at System.RuntimeType.CreateInstanceImpl(BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture)
   at Akka.Actor.Props.ActivatorProducer.Produce()
   at Akka.Actor.Props.NewActor()
   --- End of inner exception stack trace ---
   at Akka.Actor.Props.NewActor()
   at Akka.Actor.ActorCell.CreateNewActorInstance()
   at Akka.Actor.ActorCell.<>c__DisplayClass109_0.<NewActor>b__0()
   at Akka.Actor.ActorCell.UseThreadContext(Action action)
   at Akka.Actor.ActorCell.NewActor()
   at Akka.Actor.ActorCell.FinishRecreate(Exception cause, ActorBase failedActor)
   --- End of inner exception stack trace ---

f3597892b8b49a5af8c53e45d6a41b4

Platform:

  • 4 consensus nodes:
    n1 - Ubuntu 18.04
    n2 - CentOS 7.4
    n3 - Winserver 2016
    n4 - Ubuntu 18.04
    external node - Ubuntu 18.04
@shargon
Copy link
Member

shargon commented Jan 14, 2020

The error it's located here

image

I will review it

@cloud8little the error appear with only one CN ?

@shargon shargon self-assigned this Jan 14, 2020
@cloud8little
Copy link
Contributor Author

@shargon It happens on all consensus nodes

@shargon
Copy link
Member

shargon commented Jan 15, 2020

Error replicated successful, I will try to solve it soon. Thanks for your detailed explanation @cloud8little !

@shargon
Copy link
Member

shargon commented Jan 15, 2020

For replicate the error you must have 2 address in the same wallet. One with all the GAS (30m), and the other with 0.
Sometimes the rpc starting to choose the second address as From. We have a problem in the memory pool that produce a negative balance in persist.
Still investigating the fix.

@shargon shargon added the Under Investigation The problem mentioned, or possible improvement, is under investigation label Jan 15, 2020
@shargon
Copy link
Member

shargon commented Jan 15, 2020

Without Parallel Verification, this doesn't happen @erikzhang.

@shargon shargon removed the Under Investigation The problem mentioned, or possible improvement, is under investigation label Jan 15, 2020
@eryeer
Copy link
Contributor

eryeer commented Jan 16, 2020

@shargon As I see from this report, error occurs on at
Neo.Ledger.Blockchain.Persist(Block block) in D:\NEO\Github\neo\src\neo\Ledger\Blockchain.cs:line 503, where the code is

all_application_executed.Add(application_executed);

Is there any relationship with VM execution?
engine.LoadScript(onPersistNativeContractScript);
if (engine.Execute() != VMState.HALT) throw new InvalidOperationException();

I'm confused with the error location and the cause.

@Tommo-L
Copy link
Contributor

Tommo-L commented Jan 16, 2020

No problems with VM, the main reason is the mempool balance check, which lead to package txs without enough balance. This has happened before.

@shargon
Copy link
Member

shargon commented Jan 16, 2020

yes it's while OnPersist>GasToken, the balance becomes negative and throw the exception.

@eryeer
Copy link
Contributor

eryeer commented Jan 16, 2020

@shargon I reproduced this problem, and found that the problem was on PostExecuteInstruction() in ExecutionEngine. When the number of tx reached 512, the result of ReferenceCounter.CheckZeroReferred () was 2053, which was greater than MaxStackSize: 2048, which caused the check to fail, and the VM returned Fault.
After enlarging MaxStackSize to a larger number, such as 4096, this problem disappeared in my machine.

@cloud8little
Copy link
Contributor Author

@shargon @eryeer was right, the root cause is ReferenceCounter.CheckZeroReferred() <= MaxStackSize return false, and leads to VM execute state as FAULT.

https://github.com/neo-project/neo-vm/blob/be2ac36bf35a3033d828e0ba0630d390599c487d/src/neo-vm/ExecutionEngine.cs#L1237-L1241

9dbe4d48b0d58f89eb721f6b74cf96a

After I changed MaxStackSize as 2 * 2 * 1024, the CN service will recover. and no exception throws.
Initial balance of the sender: 100,000 gas
Here is the result of changing MaxStackSize to 2 * 2 * 1024.

362b313a3df1f1e37e07a910baf4fbd

22749c0607e44db7e2498d15b314816

@shargon
Copy link
Member

shargon commented Jan 16, 2020

Great! I will review it. But maybe we have two errors...

@shargon
Copy link
Member

shargon commented Jan 16, 2020

@eryeer this dosen't solve the main issue, where are more than 2k items?

image

image


Also there are another issue... related to this one I will reopen my PR
@eryeer @cloud8little if you try to replicate this issue with multiple threads (I used 5), you will get another error produced by parallel verification

@shargon
Copy link
Member

shargon commented Jan 16, 2020

We have two problems here:

1- With 512 Tx, Gas contract produce 4 items per notification, and we raise the maximum numbers of stack items (FAULT), this is the initial problem described by @cloud8little
2- With multiple threads, we have a double spend in GAS native contract, this is the second problem find by me while testing the same issue with multiple threads. We should reconsider the paralell verification.

image

@shargon
Copy link
Member

shargon commented Jan 16, 2020

Maybe we should consider to rollback the snapshot, and FAULT all txs if in OnPersist in native contracts was FAULT, otherwise we open a door for Denial Of Service. @erikzhang

image

@shargon
Copy link
Member

shargon commented Jan 16, 2020

According to the issue 1, I prefer to skip ReferenceCounter if the Trigger is System

protected override bool PostExecuteInstruction(Instruction instruction)
        {
            if (Trigger == TriggerType.System) return true;
            return base.PostExecuteInstruction(instruction);
        }

@shargon shargon added Bug Used to tag confirmed bugs Critical Issues (bugs) that need to be fixed ASAP labels Jan 16, 2020
@shargon
Copy link
Member

shargon commented Jan 16, 2020

I created new issue for the second problem #1420

@eryeer
Copy link
Contributor

eryeer commented Jan 17, 2020

According to the issue 1, I prefer to skip ReferenceCounter if the Trigger is System

It's a good idea to skip ReferenceCounter for System Trigger.

@Tommo-L
Copy link
Contributor

Tommo-L commented Jan 17, 2020

Maybe we should consider to rollback the snapshot, and FAULT all txs if in OnPersist in native contracts was FAULT, otherwise we open a door for Denial Of Service. @erikzhang

image

Agree, it's better to close the door which we may not know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Used to tag confirmed bugs Critical Issues (bugs) that need to be fixed ASAP
Projects
None yet
4 participants