A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

dimxy · 2021-05-06T13:21:12Z

There is a problem with addrman locks in komodod.

A delay exists in addrman.Select() function (

Line 361 in 9df2abc

const int kRetrySleepInterval = 100; // milliseconds

), which locks the addrman object for a long time (up to 2-3 min). So when the addrman is locked by Select() and if getaddr p2p cmd tries to lock addrman then p2p processing stops for a long time.

This is most noticeable for nspv clients which might timeout waiting for response from a nspv server.

The reason for this delay is explained in this zcash fix zcash#929.
However I checked the current bitcoin code, they do not seem to have such delays in addrman. Suggest removing it.
The addman.Select() is called quite regularly when the node is trying to maintain connections to peers.

jmjatlanta · 2021-05-14T19:33:10Z

A possible solution is to move the max retries loop outside the lock. This will slow down the P2P stuff, but not stop it. Note that link is totally untested code.

A note here is that it appears a mock RNG is necessary for testing. A possible option is to make this class a template, and apply the appropriate RNG at compile time. This would move the code for testing into the test and out of the implementation.

dimxy · 2021-05-15T12:04:57Z

I thought we could simply use the Select_() from bitcoin source, it is actually the same but has no any delays

dimxy · 2022-01-11T08:18:00Z

more findings on this issue:
basically it may affect connectivity: because of the long delay in the CAddrMan::Select() the komodod makes reconnect tries also with long delays and this may cause network partitioning.
However, the backward effect is when the delay removed (like in other bitcoin forks) there may be extra cpu load if there are unreachable seeds (like unused kmd seeds in assets chain). That is why this fix should be accompanied with disabling kmd seeds for assets chains

dimxy · 2022-01-19T12:49:10Z

yet more info about connection management:
actually trying connections occurs until outbound connections number becomes 16, after that OpenConnectionsThread starts waiting on lock (and does not waste resources) until the connection number reduces.
So some extra cpu spending on connection open can be only on small networks (<16 conns) and zcash tried to avoid it by the delay in Select, but I still believe we should not lock the address manager for so long and should remove the delay.
Also for assets chain we should disable kmd chain's dns seeds and fixed seeds, to avoid unneeded connection tries

dimxy · 2022-02-03T14:26:54Z

The issue is even worse though:
if a net is small so the number of connections < 16 the address manager object is regularly locked up to several minutes (because of the delay in Select()). When this happens an incoming "getaddr" cmd will be waiting for this lock and the whole received message processing loop stops until it is unlocked!

DeckerSU · 2022-02-03T17:48:46Z

Just some info from where it comes:

one CPU maxed out to 100%, strace shows endless nanosleep 0 zcash/zcash#717
Fix issue #717 t zcash/zcash#929
In Bitcoin code CAddrMan::Select_ initially was like this:

CAddrInfo CAddrMan::Select_(bool newOnly)
{
    if (size() == 0)
        return CAddrInfo();

    if (newOnly && nNew == 0)
        return CAddrInfo();

    // Use a 50% chance for choosing between tried and new table entries.
    if (!newOnly &&
       (nTried > 0 && (nNew == 0 || RandomInt(2) == 0))) { 
        // use a tried node
        double fChanceFactor = 1.0;
        while (1) {
            int nKBucket = RandomInt(ADDRMAN_TRIED_BUCKET_COUNT);
            int nKBucketPos = RandomInt(ADDRMAN_BUCKET_SIZE);
            while (vvTried[nKBucket][nKBucketPos] == -1) {
                nKBucket = (nKBucket + insecure_rand()) % ADDRMAN_TRIED_BUCKET_COUNT;
                nKBucketPos = (nKBucketPos + insecure_rand()) % ADDRMAN_BUCKET_SIZE;
            }
            int nId = vvTried[nKBucket][nKBucketPos];
            assert(mapInfo.count(nId) == 1);
            CAddrInfo& info = mapInfo[nId];
            if (RandomInt(1 << 30) < fChanceFactor * info.GetChance() * (1 << 30))
                return info;
            fChanceFactor *= 1.2;
        }
    } else {
        // use a new node
        double fChanceFactor = 1.0;
        while (1) {
            int nUBucket = RandomInt(ADDRMAN_NEW_BUCKET_COUNT);
            int nUBucketPos = RandomInt(ADDRMAN_BUCKET_SIZE);
            while (vvNew[nUBucket][nUBucketPos] == -1) {
                nUBucket = (nUBucket + insecure_rand()) % ADDRMAN_NEW_BUCKET_COUNT;
                nUBucketPos = (nUBucketPos + insecure_rand()) % ADDRMAN_BUCKET_SIZE;
            }
            int nId = vvNew[nUBucket][nUBucketPos];
            assert(mapInfo.count(nId) == 1);
            CAddrInfo& info = mapInfo[nId];
            if (RandomInt(1 << 30) < fChanceFactor * info.GetChance() * (1 << 30))
                return info;
            fChanceFactor *= 1.2;
        }
    }
}

I guess we should revert ZCash jl777#929 changes and try to reproduce 100% CPU load when offline and investigate what Bitcoin had to do with that issue.

dimxy · 2022-02-03T18:11:58Z

A higher cpu load which described by zcash, I believe, occurs only on small chains. On big chains the connection number quickly reaches the limit and the node stops trying connecting, so I think for bitcoin (or kmd) this is not an issue
I ran this fix on a small asset chain and saw some extra load but I think it happens because for assets chains we do not remove the default seeds (so nodes try and try them without success) - so I suggest removing the seeds for assets chains too

Dev

dimxy · 2023-04-09T09:47:08Z

I return back to this issue as it is a bit annoying on test chains, preventing nodes from connection.

As it was suggested above I tried to check if the same issue is addressed somehow in bitcoin (namely CConnman::ThreadOpenConnections):
I would say they did not bother about it: bitcoin has a similar loop, with more complicated code for selection of the conn type, but generally the code looks the same as it is in zcash/komodo, with the same 500 ms sleep before starting a new address selection, but with no delay in CAddrMan::Select_().

I also investigated what happens on a small chain with the delay in Select_() removed: I would not say that any extra heady load was seen. BTW I saw an extra load and frequent connections on test asset chains, in past, when we had invalid seeds left in our code. Obviously, we should not any unneeded seeds for komodo or asset chain, to avoid extra conn tries.

I still think we should remove this delay in Select_ anyway, because I think it is in fact dangerous code when p2p pump is stopped for minutes, which probably may be exploited.

dimxy changed the title ~~A long delay in CAddrMan::Select() causes p2p messages blocked~~ A long delay in CAddrMan::Select() causes p2p messages to be blocked May 6, 2021

dimxy mentioned this issue Jan 19, 2022

[wip] Remove delay in addrman.Select() and disable kmd seeds for asset chains #528

Closed

who-biz pushed a commit to who-biz/komodo that referenced this issue Jun 14, 2022

Merge pull request KomodoPlatform#423 from VerusCoin/dev

de62741

Dev

dimxy mentioned this issue Apr 9, 2023

Zebra node may fail to connect to a komodod node KomodoPlatform/zebra#50

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

dimxy commented May 6, 2021 •

edited

Loading

jmjatlanta commented May 14, 2021 •

edited

Loading

dimxy commented May 15, 2021

dimxy commented Jan 11, 2022

dimxy commented Jan 19, 2022

dimxy commented Feb 3, 2022

DeckerSU commented Feb 3, 2022

dimxy commented Feb 3, 2022

dimxy commented Apr 9, 2023

A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

Comments

dimxy commented May 6, 2021 • edited Loading

jmjatlanta commented May 14, 2021 • edited Loading

dimxy commented May 15, 2021

dimxy commented Jan 11, 2022

dimxy commented Jan 19, 2022

dimxy commented Feb 3, 2022

DeckerSU commented Feb 3, 2022

dimxy commented Feb 3, 2022

dimxy commented Apr 9, 2023

dimxy commented May 6, 2021 •

edited

Loading

jmjatlanta commented May 14, 2021 •

edited

Loading