Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend peer shard info #3726

Closed

Conversation

miguelportilla
Copy link
Contributor

@miguelportilla miguelportilla commented Jan 14, 2021

This pull request is a refactoring and improvement of the shard information shared between peers with messages. Notably, the data shared now includes shard states, their progress, and a digital signature of the message data. The information is also visible via the crawl_shards command. Please note that the protocol message for the shard info has changed and the new message will only be visible to rippled nodes running 1.7.0-b10 or higher.

The introduction of the ShardInfo class disrupts levelization. The class resides in /nodestore, and is included from /overlay/impl/Peerimp.h and /nodestore/DataBaseShard.h. Since this cycle already exists, the levelization files will be updated with the change.

Builds/CMake/RippledCore.cmake Outdated Show resolved Hide resolved
src/ripple/nodestore/Types.h Outdated Show resolved Hide resolved
src/ripple/nodestore/Types.h Outdated Show resolved Hide resolved
src/ripple/nodestore/ShardInfo.h Outdated Show resolved Hide resolved
src/ripple/nodestore/ShardInfo.h Outdated Show resolved Hide resolved
src/ripple/nodestore/Types.h Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/Shard.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/Shard.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/ShardInfo.cpp Outdated Show resolved Hide resolved
src/ripple/overlay/impl/PeerImp.cpp Show resolved Hide resolved
Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the first pass and got a couple questions about the protobuf message changes. 

Let me confirm my understanding. Three protobuf message pairs were changed. 

  • TMGetShardInfo and TMShardInfo that were marked as deprecated previously are removed now.
  • TMGetPeerShardInfo and TMPeerShardInfo that were used previously are marked as deprecated now. Their handler functions are empty.
  • TMGetPeerShardInfoV2 and TMPeerShardInfoV2 are the new messages.

TMGetPeerShardInfoV2 is only used to answer crawl_shards RPC requests when the relays > 0
TMPeerShardInfoV2 is used to reply to TMGetPeerShardInfoV2. It is also used in DatabaseShardImp::updatePeers() to update peers when the local node's shard information changes. 

It seems nodes with different versions of rippled won't be able to exchange shard information via these messages. Maybe not a big concern if the messages were only for RPCs, and I guess the message version update happened once already when updating from TMGetShardInfo to TMGetPeerShardInfo

To reduce unnecessary messages, maybe only send TMGetPeerShardInfoV2 and TMPeerShardInfoV2 to peers that understand them? I am not sure if the optimization is needed since I don't know how often these messages are sent. If we need the optimization, then ProtocolFeature might be helpful. https://github.com/ripple/rippled/blob/c11037fd2764f6fea8e0d184b0b7aa0daa81136f/src/ripple/overlay/Peer.h#L38

@miguelportilla
Copy link
Contributor Author

miguelportilla commented Jan 28, 2021

@pwang200 Yes, the information is used to answer the crawl RPC but its primary objective is to automatically share the status through the peer network via the protocol message. It's a building block and it allows us to expand the use cases. For instance, clusters could potentially reduce shard redundancy when selecting new shards to acquire. The info can also be utilized to know which peers have the shards you are acquiring, allowing a direct download. Because the messages are signed, one could potentially build a verified state of shard history on the network. The older protocol messages were deprecated some time ago and deal with an older shard version. You are correct that some older rippled versions will not be able to understand but at this point, I am not sure it warrants optimization.

@miguelportilla miguelportilla force-pushed the shard_info_signature branch 3 times, most recently from 158ac4a to 2a0b097 Compare January 29, 2021 15:11
src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/ShardInfo.cpp Show resolved Hide resolved
src/ripple/overlay/impl/PeerImp.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I enjoyed reviewing this PR, learned a bit more about shard, and got to think about signatures and p2p information propagation. Thanks for the explanation of the potential use cases. Regarding p2p information propagation, I think "pull + last hop hush" make sense for our usages.

src/ripple/overlay/impl/PeerImp.cpp Show resolved Hide resolved
Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly found nits, but there are a couple of issues that I might be confused about or else really need to be addressed. Happy to talk through any of the stuff I've pointed out.

Regarding unit test coverage, you might consider running codecov locally (if you can) to see what your coverage looks like. On macOS to run coverage on all non-manual unit tests I use...

cmake -DOPENSSL_ROOT_DIR=$(brew --prefix openssl@1.1) -DCMAKE_BUILD_TYPE=Debug -Dunity=OFF -Dassert=ON -Dwerr=ON -Dcoverage=ON -Dcoverage_core_only =OFF -Ucoverage_test ../../..

To run a specific test ($1 on the command line) I use...

cmake -DOPENSSL_ROOT_DIR=$(brew --prefix openssl@1.1) -DCMAKE_BUILD_TYPE=Debug -Dunity=OFF -Dassert=ON -Dwerr=ON -Dcoverage_core_only=OFF -Dcoverage_test=$1 ../../..

Either of those is followed by...

make -j3 coverage_report

I have no idea whether that works on other platforms. But I'm happy to try and help if I can.

src/ripple/nodestore/impl/DatabaseShardImp.h Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/Shard.h Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/Database.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/DatabaseShard.h Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.h Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/DatabaseShardImp.cpp Outdated Show resolved Hide resolved
src/test/nodestore/DatabaseShard_test.cpp Outdated Show resolved Hide resolved
@rasloun
Copy link

rasloun commented Feb 9, 2021

Get me out of this list !!!b

@scottschurr
Copy link
Collaborator

FWIW, the latest commit is not building for me on macOS with clang. The problem appears to result from ShardInfo::Incomplete having const data members and being put into a std::map. The const data members keep the Incomplete assignment operator from working.

One fix would be to make the two data members non-const, but make them private. Then provide const accessor methods. Maybe like this...

    class Incomplete
    {
    public:
        Incomplete() = delete;
        Incomplete(ShardState state, std::uint32_t percentProgress)
            : state_(state), percentProgress_(percentProgress)
        {
        }

        [[nodiscard]] ShardState
        state() const
        {
            return state_;
        }

        [[nodiscard]] std::uint32_t
        percentProgress() const
        {
            return percentProgress_;
        }

    private:
        ShardState state_;
        std::uint32_t percentProgress_;
    };

I'm going to assume you make a change similar to that and continue reviewing the code..

@scottschurr
Copy link
Collaborator

Oh! interesting. I'm unsure why clang was okay with the version of Incomplete that I proposed, but is unhappy with yours. Maybe it has to do with which methods are inlined? I dunno...

Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once the problem with Incomplete is patched up this looks good to go. I spotted a couple of other things I think might be minor improvements, but they are not required.

src/ripple/basics/RangeSet.h Show resolved Hide resolved
src/ripple/nodestore/impl/ShardInfo.cpp Outdated Show resolved Hide resolved
src/ripple/nodestore/impl/Database.cpp Outdated Show resolved Hide resolved
@miguelportilla miguelportilla force-pushed the shard_info_signature branch 2 times, most recently from 6c6f7a9 to 1ec44a2 Compare February 23, 2021 18:41
@scottschurr
Copy link
Collaborator

FWIW, the current version of this branch usually (but not always) fails the DatabaseShard manual unit test on macOS. Sometimes the test passes. I usually get this:

#424649 failed: DatabaseShard_test.cpp(1292)
#424650 failed: DatabaseShard_test.cpp(1302)
#424651 failed: DatabaseShard_test.cpp(1314)
#424652 failed: DatabaseShard_test.cpp(1329)

@scottschurr
Copy link
Collaborator

The DatabaseShard manual unit test should probably pass reliably before the pull request is marked as passed. But everything else looks good as far as I can see.

@scottschurr
Copy link
Collaborator

Hi @miguelportilla, thanks for the work on the unit tests. I think things are better, but we're not quite there, at least for me on macOS. I ran the DatabaseShard unit test from your most recent work (4ed1a55) in a loop. In the 5th iteration I got the following failures:

...
ripple.NodeStore.DatabaseShard Prepare with historical paths
#444789 failed: DatabaseShard_test.cpp(1263)
#444790 failed: DatabaseShard_test.cpp(1264)
ripple.NodeStore.DatabaseShard Open shard management
ripple.NodeStore.DatabaseShard Shard info
Longest suite times:
  509.5s ripple.NodeStore.DatabaseShard
509.6s, 1 suite, 13 cases, 3847501 tests total, 2 failures

So I think that needs to be looked at and addressed.

@miguelportilla miguelportilla force-pushed the shard_info_signature branch 2 times, most recently from 58bffc7 to 7e4e26f Compare March 10, 2021 22:17
Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good. Left one minor comment.

Also, the "crawling" in p2p is interesting. No action needed now. But if we have more use cases later, we could abstract the crawling logic out.

auto const timestamp{
NetClock::time_point{std::chrono::seconds{m->timestamp()}}};
auto const now{app_.timeKeeper().now()};
if (timestamp > now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (timestamp > (now + 5s)) to allow some peer clocks that are not synchronized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwang200 Good idea, thank you.

Copy link
Collaborator

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! I ran the DatabaseShard test 35 times in a loop with no failures, so that's looking reliable now. And you got 100% code coverage on ShardInfo! Nice work! 👍

@miguelportilla miguelportilla force-pushed the shard_info_signature branch 3 times, most recently from a989eae to 05b0098 Compare March 23, 2021 13:43
@miguelportilla miguelportilla added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Apr 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants