
Revise the PopularityCommunity metadata retrieval protocol. #7632

Closed
drew2a opened this issue Oct 16, 2023 · 6 comments

@drew2a (Contributor)

drew2a commented Oct 16, 2023

Despite the protocol's apparent simplicity, PopularityCommunity is quite complex as it derives logic from RemoteQueryCommunity:

```python
class PopularityCommunity(RemoteQueryCommunity, VersionCommunityMixin):
```

This inheritance was implemented in #5736

The current algorithm for metadata retrieval is as follows:
#7398 (comment)

If a peer receives torrent health info for a torrent whose metadata is missing, the Popularity Community subsequently requests the missing metadata.

```python
async def on_torrents_health(self, peer, payload):
    self.logger.debug(f"Received torrent health information for "
                      f"{len(payload.torrents_checked)} popular torrents and"
                      f" {len(payload.random_torrents)} random torrents")
    health_tuples = payload.random_torrents + payload.torrents_checked
    health_list = [HealthInfo(infohash, last_check=last_check, seeders=seeders, leechers=leechers)
                   for infohash, seeders, leechers, last_check in health_tuples]
    for infohash in await run_threaded(self.mds.db, self.process_torrents_health, health_list):
        # Get a single result per infohash to avoid duplicates
        self.send_remote_select(peer=peer, infohash=infohash, last=1)
```

As RemoteQueryCommunity is going to be removed in 8.0.0, we have to replace the algorithm for metadata retrieval.

@drew2a drew2a added this to the 8.0.0 milestone Oct 16, 2023
@drew2a (Contributor Author)

drew2a commented Oct 16, 2023

As a starting point for discussion, I propose the following algorithm:

Popularity Community operates in this manner:

  1. Upon introduction requests, send information about popular torrents.
  2. Every 5 seconds, choose a random peer and send a torrent health request.
  3. The chosen peer responds with a list of health information.

(Note: Steps 1, 2, and 3 remain unchanged from the current algorithm)
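As a rough illustration of the unchanged steps 1-3, the periodic health-request loop could look like the sketch below. The class name, the peer representation, and the recording of sent requests are illustrative assumptions, not the actual Tribler API; the 5-second interval is shortened here so the example runs quickly.

```python
import asyncio
import random


class HealthGossipLoop:
    """Minimal sketch of the periodic torrent-health request loop (step 2)."""

    def __init__(self, peers, request_interval=5.0):
        self.peers = peers                  # known peers (step 1 fills this)
        self.request_interval = request_interval
        self.sent_requests = []             # recorded for illustration only

    def send_health_request(self, peer):
        # In the real community this would serialize and send an IPv8 message;
        # the chosen peer then responds with a list of health info (step 3).
        self.sent_requests.append(peer)

    async def run(self, rounds):
        for _ in range(rounds):
            if self.peers:
                # Step 2: pick a random peer and ask for torrent health info.
                self.send_health_request(random.choice(self.peers))
            await asyncio.sleep(self.request_interval)


loop = HealthGossipLoop(peers=["peer_a", "peer_b"], request_interval=0.01)
asyncio.run(loop.run(rounds=3))
print(len(loop.sent_requests))  # 3, one request per round
```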

  4. The requester identifies torrents for which knowledge is missing and sends a series of messages requesting this knowledge:

```python
@dataclass
class RequestKnowledgeMessage:
    infohash: str
```

  5. The chosen peer responds with a series of messages containing the required knowledge:

```python
@dataclass(msg_id=STATEMENT_OPERATION_MESSAGE_ID)
class StatementOperationMessage:
    operation: StatementOperation
    signature: StatementOperationSignature
```
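A self-contained sketch of the proposed request/response exchange (steps 4 and 5). The message classes and the in-memory dictionaries below are simplified stand-ins for the real IPv8 payloads and the knowledge database, introduced only for illustration:

```python
from dataclasses import dataclass


@dataclass
class RequestKnowledgeMessage:
    infohash: str


@dataclass
class StatementOperationMessage:
    infohash: str
    statement: str  # stand-in for the StatementOperation + signature pair


def missing_infohashes(received_health, known_metadata):
    # Step 4: torrents we received health info for but have no knowledge about.
    return [ih for ih in received_health if ih not in known_metadata]


def answer_requests(requests, knowledge_db):
    # Step 5: the chosen peer answers each request it can satisfy.
    return [StatementOperationMessage(r.infohash, knowledge_db[r.infohash])
            for r in requests if r.infohash in knowledge_db]


received_health = ["aa", "bb", "cc"]
known_metadata = {"bb"}
requests = [RequestKnowledgeMessage(ih)
            for ih in missing_infohashes(received_health, known_metadata)]
knowledge_db = {"aa": "tag:ubuntu", "cc": "tag:debian"}
responses = answer_requests(requests, knowledge_db)
print([(m.infohash, m.statement) for m in responses])
# [('aa', 'tag:ubuntu'), ('cc', 'tag:debian')]
```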

@synctext (Member)

synctext commented Oct 25, 2023

My idea is to first focus on stability by removing Gigachannels and keeping tags, and only then radically alter the architecture. Let's not try to fix things that aren't currently broken 🤔

  • Remove 2 out of 3 methods for content discovery
    • promote PopularityCommunity as the only way to discover novel hashes
    • remove channel sampling/pre-view and free-for-all channel mechanism
    • Remove (or hide) these in the GUI and core at some point
  • Release a stable release with this code
  • Add a new message inside the PopularityCommunity
    • backwards compatible with older peers
    • new feature of shadow keys and Libtorrent ground truth on swarm size
    • Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature
  • crawl new info
    • Web-of-trust: the rendezvous peers will also start producing limited crawl data
    • New privacy-protected ClickLog-based discovery
  • New release which starts to utilise the new "ContentDiscovery" community and one-struct-to-rule-them-all
  • Further releases (mixing 4 things all into 1 Tribler hopefully 🙏 )
    • collecting further data for the Machine Learning Science part
    • collecting further data for web-of-trust
    • collecting further data for tag-based metadata enrichment (content,trust, and queries)
    • end of gigachannels
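The `Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature` tuple above could be modelled roughly as below. The field types and the HMAC-based "shadow signature" are illustrative assumptions, not a specification; a real shadow identity would presumably use asymmetric keys.

```python
import hashlib
import hmac
from dataclasses import dataclass, field


@dataclass
class ClickLogRecord:
    query: str
    swarm_clicked: str              # infohash the user chose
    swarm_not_clicked: list = field(default_factory=list)
    swarm_size_libtorrent: int = 0  # swarm size ground truth from libtorrent
    date: str = ""
    shadow_signature: bytes = b""

    def _payload(self) -> bytes:
        return f"{self.query}|{self.swarm_clicked}|{self.date}".encode()

    def sign(self, shadow_key: bytes):
        # Illustrative only: HMAC stands in for a shadow-key signature.
        self.shadow_signature = hmac.new(shadow_key, self._payload(),
                                         hashlib.sha256).digest()
        return self

    def verify(self, shadow_key: bytes) -> bool:
        expected = hmac.new(shadow_key, self._payload(),
                            hashlib.sha256).digest()
        return hmac.compare_digest(self.shadow_signature, expected)


record = ClickLogRecord("ubuntu iso", "aa" * 20, ["bb" * 20],
                        1500, "2023-10-25").sign(b"shadow-key")
print(record.verify(b"shadow-key"))  # True
print(record.verify(b"wrong-key"))   # False
```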

@drew2a (Contributor Author)

drew2a commented Oct 30, 2023

Removing 2 out of 3 Content Discovery Methods

During my effort on Friday to eliminate channel sampling/pre-view and the free-for-all channel mechanism, I encountered some obstacles. Even though my initial attempt wasn't successful, I've gained insights into how this can be achieved and can now provide more detailed estimations.

The removal process should begin on the GUI side. This involves:

  • Removing visual elements associated with these features.
  • Deleting the corresponding models and widgets.

Once the GUI components are addressed:

  • The Gigachannel Manager can be deleted.
  • The Remote Query Community should be detached from the Metadata Store, along with its metadata.db part.
  • Any parts not related to the search can then be stripped from the Gigachannel Community.
  • Finally, the Gigachannel Community can be renamed to Search Community, reflecting its sole focus on search.

Following these changes, most of the Channels code will be eliminated. Any remnants can either be adapted or removed in future refactoring stages.

From my current understanding, these steps could take about one week.

@qstokkink (Contributor)

> * Add a new message inside the PopularityCommunity
>   * backwards compatible with older peers
>   * new feature of shadow keys and Libtorrent ground truth on swarm size
>   * `Query, swarm-clicked, swarm-not-clicked, swarm-clicked-size-as-seen-by-Libtorrent, date, shadow-signature`

For this little part of the master plan I have the following implementation in mind:

  • [GUI] Store the last search query in the search widget and its associated top-X (top-10? -> needs to fit in a UDP packet) results by infohash.
  • [GUI] Whenever a user starts downloading a new torrent with an infohash in the previously-stored results, wait until the download is at least 50% completed and send the info tuple (see quoted reply above) to the core.
  • [CORE] Store the tuples in a/the database and also store a reverse mapping for each torrent (i.e., for a given infohash X, store the infohash Y that is likely preferable, or X itself).
  • [CORE] Instead of gossiping random torrents, gossip the more preferable torrent Y for a randomly sampled torrent X (using the O(1) reverse mapping in the db). Here we use the "shadow identity" instead of a user's real identity, and/or pick a received record signed by another shadow identity and gossip that.

Once all that works, the last remaining step is to update the search results to also make use of the preference relation instead of pure db-based text search.
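A minimal sketch of the O(1) reverse mapping described in the [CORE] bullets above; the class name, storage layer, and tuple shape are assumptions for illustration:

```python
class PreferenceStore:
    """Maps each seen infohash X to the infohash Y likely preferable to it."""

    def __init__(self):
        self.preferred_for = {}  # infohash -> preferable infohash (or itself)

    def record_click(self, clicked, not_clicked):
        # The clicked torrent is preferable to every shown-but-ignored torrent.
        self.preferred_for.setdefault(clicked, clicked)
        for infohash in not_clicked:
            self.preferred_for[infohash] = clicked

    def preferable(self, infohash):
        # O(1) lookup; unknown torrents map to themselves, so gossip can
        # always fall back to the sampled torrent itself.
        return self.preferred_for.get(infohash, infohash)


store = PreferenceStore()
store.record_click("Y", not_clicked=["X1", "X2"])
print(store.preferable("X1"))  # 'Y'
print(store.preferable("Z"))   # 'Z' (unknown, maps to itself)
```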

@qstokkink qstokkink mentioned this issue Dec 11, 2023
13 tasks
@drew2a drew2a removed their assignment Dec 11, 2023
@drew2a drew2a removed this from the 8.0.0 milestone Dec 11, 2023
@qstokkink (Contributor)

A more detailed design (green blocks include the code to add), capturing some insights since my last post:

[architecture diagram]

Changes:

  • Because of our endpoint structure, we don't need to touch GUI code, just the endpoints.
  • Because this new functionality interfaces with different components, it needs to be in a component itself (named UserActivityComponent above).
  • It is easier to start with the torrent_finished_alert in the first version, instead of waiting for a 50% threshold.

Disclaimer: not a single line of code has been written yet; the design may change as I implement it.
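As a rough sketch of the first-version trigger described above, reacting to a finished-download notification instead of waiting for a 50% threshold; the notifier interface, topic name, and component shape are assumptions, not the actual Tribler code:

```python
class Notifier:
    """Tiny stand-in for a core notifier with topic-based callbacks."""

    def __init__(self):
        self.observers = {}

    def add_observer(self, topic, callback):
        self.observers.setdefault(topic, []).append(callback)

    def notify(self, topic, **kwargs):
        for callback in self.observers.get(topic, []):
            callback(**kwargs)


class UserActivityComponent:
    """Reacts to finished downloads (torrent_finished) in the first version."""

    def __init__(self, notifier):
        self.finished = []
        notifier.add_observer("torrent_finished", self.on_torrent_finished)

    def on_torrent_finished(self, infohash, name):
        # First version: record the finished download so it can later be
        # matched against stored search results and written to the database.
        self.finished.append((infohash, name))


notifier = Notifier()
component = UserActivityComponent(notifier)
notifier.notify("torrent_finished", infohash="aa" * 20, name="ubuntu.iso")
print(component.finished)
```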

@qstokkink (Contributor)

This has now been implemented.
