content popularity community: performance evaluation #3868

Open
synctext opened this issue Sep 11, 2018 · 58 comments
@synctext
Member

synctext commented Sep 11, 2018

For context, the long-term megalomaniac objectives (update Sep 2022):

Layer | Description
User experience | perfect search in 500 ms and asynchronously updated ✔️
Relevance ranking | balance keyword matching and swarm health
Remote search | trustworthy peer which has the swarm info by random probability
Popularity community | distribute the swarm sizes
Torrent checking | (screenshot)
  1. After completing the above, next item: Add tagging and update relevance ranking. Towards perfect metadata.
  2. De-duplication of search results.
  3. Also find non-matching info: search for "Linux", find items tagged Linux; the biggest Ubuntu swarm is shown first.
  4. Added to that is adversarial information retrieval for our Web3 search science, after the above is deployed and tagging is added: cryptographic protection of the above info. Signed data needs to overlap with your web-of-trust; an unsolved hard problem.
  5. personalised search
  6. 3+ years ahead: row bundling

@arvidn indicated: tracking popularity is known to be a hard problem.

I spent some time on this (or a similar) problem at BitTorrent many years ago. We eventually gave
up once we realized how hard the problem was. (specifically, we tried to pass around, via gossip,
which swarms are the most popular. Since the full set of torrents is too large to pass around,
we ended up with feedback loops because the ones that were considered popular early on got
disproportional reach).

Anyway, one interesting aspect that we were aiming for was to create a "weighted" popularity,
based on what your peers in the swarms you participated in thought was popular. in a sense,
"what is popular in your cohort".

We deployed the first version into Tribler in #3649, after prior Master's thesis research in #2783. However, we lack documentation or a specification of the deployed protocol.

Key research questions:

  • What is the real deployed system behavior?
  • What is the resource consumption?
  • What is the accuracy and quality in general of the information?
  • How can we attack or defend this IPv8 community?

Concrete graphs from a single crawl:

  • messages and bandwidth in time
  • hashes discovery and duplicates
  • distribution of discovered popularity of swarms
  • conduct swarm popularity check and compare results in real-time
  • behavior of pub/sub mechanism for popularity feed
  • dynamics of trust-based pub/sub auto-subscribe

Implementation of on_torrent_health_response(self, source_address, data)
ToDo @xoriole: document the deployed algorithm in 20+ lines (swarm check algorithm, pub/sub, hash selection algorithm, handshakes, search integration, etc.).

@xoriole
Contributor

xoriole commented Oct 4, 2018

Popularity Community
Introduction
The popularity community is a dedicated community to disseminate popular/live content across the network. The content could be anything, e.g. the health of a torrent, a list of popular torrents, or even search results. Dissemination of the content follows the publish-subscribe model. Each peer in the community is both a publisher and a subscriber. A peer subscribes to a set of neighboring peers to receive their content updates, while it publishes its own content updates to the peers subscribing to it.
pub-sub

Every peer maintains a list of subscribing and publishing peers with whom it exchanges content. All content from non-subscribed publishers is refused. The selection of peers to subscribe to, or to publish to, greatly influences the dissemination of content, both genuine and spam. Therefore, we select peers based on a simple trust score. The trust score is the number of times we have interacted with the node, as indicated by the number of mutual Trustchain blocks. The higher the trust score, the better the chance of being selected (as publisher or subscriber).
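For illustration, a minimal sketch of how such trust-weighted peer selection could look. The get_mutual_block_count() helper is hypothetical and the weighting is illustrative; this is not the deployed code.

import random

def trust_score(peer):
    # Trust score = number of mutual Trustchain blocks with this peer (hypothetical helper).
    return get_mutual_block_count(peer)

def select_peers(candidates, count):
    # Weighted sampling without replacement (Efraimidis-Spirakis): a higher trust score
    # gives a better chance of being selected, while unknown peers (score 0) still get
    # a small chance through the +1 smoothing.
    weights = {peer: trust_score(peer) + 1 for peer in candidates}
    keyed = sorted(candidates, key=lambda p: random.random() ** (1.0 / weights[p]), reverse=True)
    return keyed[:count]

# subscribers = select_peers(known_peers, count=10)
# publishers = select_peers(known_peers, count=10)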

Research questions
...

@synctext
Member Author

synctext commented Jul 2, 2019

ToDo:
describe the simplified top-N algorithm that is more lightweight (no pub/sub):
as-simple-as-possible gossip. Measure and plot the 4 graphs listed above:

  • messages and bandwidth in time
  • hashes discovery and duplicates
  • distribution of discovered popularity of swarms
  • conduct swarm popularity check and compare results in real-time

@synctext
Member Author

synctext commented Jun 28, 2020

Bumping this issue. The key selling point of Tribler 7.6 is a maturing popularity community (good enough for the coming 2 years) and superior keyword search using relevance ranking. Goal: 100k swarm tracking.

This has priority over channel improvements. Our process is to bump each critical feature to a superior design and move to the next. The key lesson within distributed systems: you can't get it perfect the first time (unless you have 20 years of failure experience). Iteration and relentless improvement of deployed code is key.

After we close this performance evaluation issue we can build upon it. We need to know how well it performs and tweak it for 100k swarm tracking. Then we can do a 1st version of real-time relevance ranking. Read our 2010 work for background: Improving P2P keyword search by combining .torrent metadata and user preference in a semantic overlay

Repeating key research questions from above (@ichorid):

  • What is the real deployed system behavior?
  • What is the resource consumption?
  • What is the accuracy and quality in general of the information?
  • How can we attack or defend this IPv8 community?

Concrete graphs from a single crawl:

  • messages and bandwidth in time
  • hashes discovery and duplicates
  • distribution of discovered popularity of swarms
  • conduct swarm popularity check and compare results in real-time
  • behavior of pub/sub mechanism for popularity feed
  • dynamics of trust-based pub/sub auto-subscribe

@synctext
Member Author

synctext commented Jun 28, 2020

See also #4256 for BEP33 measurements & discussion.

@synctext
Member Author

Please check out @grimadas' tool for crawling + analysing Trustchain and enhance it for the popularity community:
https://github.com/Tribler/trustchain_etl

@synctext
Member Author

Hopefully we can soon add the health of the ContentPopularity Community to our overall dashboard.

@xoriole
Contributor

xoriole commented Sep 13, 2020

Screenshot from 2020-09-13 19-11-03

  • PEERS_CONNECTED: Number of currently connected peers
  • PEERS_UNIQUE: Number of unique peers encountered during the measurement period (10 mins)
  • TORRENTS: Number of torrents received
  • SEEDERS_MAX: Seeder count of the most popular torrent received
  • SEEDERS_AVG: Average seeder count of the received torrents (the higher the better; how do we increase this?)
  • SEEDERS_ZERO: Number of torrents received with zero seeders

Currently, a peer shares the 5 most popular and 5 random torrents it has checked with its connected neighbors. Since a peer starts sharing them from the beginning, it is not always the case that popular torrents are shared. This results in sharing torrents that don't have enough seeders (see the SEEDERS_ZERO count), which does not contribute much to the sharing of popular torrents. So, two things that could improve the sharing of popular torrents (a sketch of the current selection policy follows below):

  1. not sharing zero-seeder torrents
  2. increasing the initial buffer time before sharing is started

https://jenkins-ci.tribler.org/job/Test_tribler_popularity/plot/
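For reference, a rough sketch of the selection policy described above (the 5 most popular plus 5 random checked torrents), assuming each checked torrent is an (infohash, seeders, leechers, last_checked) tuple; names are illustrative, not the deployed code.

import random

def select_torrents_to_gossip(checked_torrents, popular_count=5, random_count=5):
    # Sort by seeder count and take the most popular entries first.
    by_seeders = sorted(checked_torrents, key=lambda t: t[1], reverse=True)
    popular = by_seeders[:popular_count]
    rest = by_seeders[popular_count:]
    # Suggestion 1 above would add: rest = [t for t in rest if t[1] > 0]
    random_pick = random.sample(rest, min(random_count, len(rest)))
    return popular + random_pick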

@devos50
Contributor

devos50 commented Sep 14, 2020

Nice work! I assume that this experiment is using the live overlay?

As a piece of advice, I would first try to keep the mechanism simple for now, while analyzing the data from the raw network (as you did right now). Extending the mechanism with (arbitrary) rules might lead to biased results, which I learned the hard way when designing the matchmaking mechanism in our decentralized market. Sharing the 5 popular and 5 random torrents might look like a naive sharing policy, but it might be a solid starting point to get at least a basic popularity gossip system up and running.

Also, we have a DAS5 experiment where popularity scores are gossiped around (which might actually be broken after some channel changes). This might be helpful to test specific changes to the algorithm before deploying them 👍 .

@xoriole
Contributor

xoriole commented Sep 14, 2020

@devos50 Yes, it is using live overlay.

Also, we have a DAS5 experiment where popularity scores are gossiped around (which might actually be broken after some channel changes). This might be helpful to test specific changes to the algorithm before deploying them.

Yes, good point. I'll create experiments to test the specific changes.

@synctext
Member Author

synctext commented Sep 14, 2020

Thnx @xoriole! We now have our first deployment measurement infrastructure, impressive.

  • What is the real deployed system behaviour?

Can we (@kozlovsky @drew2a @xoriole) come up with a dashboard graph to quantify how far we are to our Key Performance Indicator: the goal of tracking 100k swarms? To kickstart the brainstorm:

  • Unique hash discovery after joining community for 1 hour (plus duplicates)?
  • messages and bandwidth in time
    • For 1 hour measure the amount of messages and type
    • Indicate their contents (identify ZERO_SEEDERS)
    • plot bandwidth usage
  • What is the accuracy and quality in general of the information?
    • validate the popularity check results
    • sample slowly the discovered swarms (DHT hammering will get server blocked)
    • compare client and our Jenkins re-check result
    • plot the delta for each swarm and sort by largest difference (Y-axis: delta of popularity; X-axis: swarms sorted by delta); see the sketch after this list
  • distribution of discovered popularity of swarms (dead swarms, suspicious large swarms)
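As a starting point for the delta plot above, a small matplotlib sketch; the (infohash, reported_seeders, checked_seeders) input format is an assumption, not the crawler's actual output.

import matplotlib.pyplot as plt

def plot_popularity_delta(swarms):
    # `swarms` is a list of (infohash, reported_seeders, checked_seeders) tuples from one crawl.
    deltas = sorted((reported - checked for _, reported, checked in swarms), reverse=True)
    plt.plot(range(len(deltas)), deltas)
    plt.xlabel("swarms sorted by delta")
    plt.ylabel("delta of popularity (reported - checked seeders)")
    plt.title("Reported vs checked swarm popularity")
    plt.show()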

increasing the initial buffer time before sharing is started

As @devos50 indicated, this sort of tuning is best saved for last. You want to have an unbiased view of your raw data for as long as possible; viewing raw data improves accurate understanding. {Very unscientific: we design this gossip stuff with intuition. If we had 100+ million users, people would be interested in our design principles.}

Repeating long-term key research questions from above (@ichorid):

  • What is the resource consumption?
  • How can we attack or defend this IPv8 community?
  • DHT node spam from exit nodes #3065 Fix for DHT spam using additional deployed service infrastructure

@ichorid
Contributor

ichorid commented Sep 14, 2020

  1. not sharing zero seeder torrents

For every popular torrent, there are a thousand dead ones. Therefore, information about what is alive is much more precious and scarce than information about what is dead. It would be much more efficient to only share torrents that are well seeded.

Though, the biggest questions are:

  • should we use the information received from other peers when sending our own gossip packets? (probably not, or only to some limited extent)
  • should we recheck the torrent health eventually? (probably not, as popular torrents tend to rise quickly and fall slowly)

@ichorid
Contributor

ichorid commented Sep 14, 2020

It would be very nice if we could find (or develop) a Python-based Mainline DHT implementation, to precisely control the DHT packet parameters.

  • How can we attack or defend this IPv8 community?
⚔️ attack | 🛡️ defence
spam stuff around | pull-based gossip
fake data | cross-check data with others
biased torrent selection | pseudo-random infohash selection (e.g. only send infohashes sharing some number of last bytes; sketched below)
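To make the "pseudo-random infohash selection" defence concrete, a hypothetical sketch: a peer only accepts (or sends) infohashes whose last byte matches a value derived from the current time window, so a sender cannot freely choose which torrents it pushes. Purely illustrative, not a deployed or specified mechanism.

import time

def infohash_allowed(infohash: bytes, window_seconds: int = 3600, mask_bits: int = 4) -> bool:
    # Accept only infohashes whose low bits of the last byte match a value derived
    # from the current time window; the accepted subset rotates every `window_seconds`,
    # which limits biased torrent selection by a sender.
    window = int(time.time()) // window_seconds
    expected = window % (1 << mask_bits)
    return (infohash[-1] & ((1 << mask_bits) - 1)) == expected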

@xoriole
Contributor

xoriole commented Sep 13, 2022

Popularity community experiment
The purpose of the experiment is to see how the torrent health information received via the popularity community differs from the health observed when checking locally by joining the swarm.

From the popularity community, we constantly receive a set of tuples (infohash, seeders, leechers, last_checked) representing popular torrents with their health (seeders, leechers) information. This health information is supposed to have been obtained by the sender by checking the torrent themselves, so the expectation is that the information is relatively accurate and fresh.

In the graphs below, we show how the reported (or received) health info and the checked health info differ for the 24 popular torrents received via the community.

First, consider the seeders. Since the variation in the number of seeders across torrents is high, a logarithmic scale is used.
Sept - Seeders (reported and checked)

Similarly for the leechers; again a logarithmic scale is used.
Sept - Leechers (reported and checked)

Each torrent is unrelated to the others and can be more or less popular depending on the content it represents, so seeders, leechers, and peers (= seeders + leechers) are expressed as a percentage of their reported value in an attempt to normalize them.

Seeders % = ( checked seeders / reported seeders ) x 100 %
Leechers % = ( checked leechers / reported leechers ) x 100 %
Peers % = ( ( checked seeders + checked leechers) / ( reported seeders + reported leechers ) ) x 100 %
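The same normalization in code form (a straight restatement of the formulas above; the example numbers are made up):

def normalized_percentages(checked_seeders, checked_leechers, reported_seeders, reported_leechers):
    seeders_pct = checked_seeders / reported_seeders * 100
    leechers_pct = checked_leechers / reported_leechers * 100
    peers_pct = (checked_seeders + checked_leechers) / (reported_seeders + reported_leechers) * 100
    return seeders_pct, leechers_pct, peers_pct

# Example: reported (100 seeders, 50 leechers), checked (14 seeders, 148 leechers)
# -> (14.0, 296.0, 108.0)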

Peers and seeders

Peers and leechers

Observations

  • Similar to the frozen experiment, the checked seeder values are much lower than the reported seeder values. However, the average seeders % is 13.60%, which is a bit higher than the frozen experiment's average (9.42%). This makes sense: since these are popular torrents, the seeders are expected to be higher than in normal search experiments like frozen.
  • Checked leecher values are normally higher than the reported leecher values. This also resembles the frozen experiment, with similar average values: 296.44% (this experiment) and 225.17% (frozen experiment).
  • Comparing the peer values, the average peers % is 28.27%, compared to 35.26% in the frozen experiment. This is likely because the total peers reported for popular torrents is higher than for torrents returned from search results.
  • Overall, the percentages do not differ much between the two experiments.

@synctext
Member Author

synctext commented Sep 30, 2022

Writing down our objectives here:

Layer | Description
Relevance ranking | shown to the user within 500 ms and asynchronously updated
Remote search | trustworthy peer which has the swarm info by random probability
Popularity community | distribute the swarm sizes
Torrent checking | (screenshot)
  1. Add tagging and update relevance ranking. Towards perfect metadata.
  2. Added to that is adversarial information retrieval for our Web3 search science, after the above is deployed and tagging is added: cryptographic protection of the above info. Signed data needs to overlap with your web-of-trust; an unsolved hard problem.

background
Getting this all to work is similar to building a distributed Google: everything needs to work, and it needs to work together. Already in 2017 we tried to find the ground truth on the perfectly matching swarm for a query. We have a minimal swarm crawler (2017): "Roughly 15-KByte-ish of cost for sampling a swarm (also receive bytes?). Uses magnet links only. 160 Ubuntu swarms crawled":
(screenshot)
Documented torrent checking algorithm?
Documented popularity community torrent selection and UDP/IPv8 packet format?
Readthedocs example: "latest/search_architecture.html"

@synctext
Member Author

synctext commented Dec 2, 2022

Initial documentation of deployed Tribler 7.12 algorithms

  • Random Torrentchecking: every 2 minutes, check the popularity of one random swarm (see the sketch after this list). Critical decision: which swarm to check (e.g. random); no bias towards dead swarms or fresh swarms in any way.
  • Popular Torrentchecking
  • Unknown quality of dead swarms 💀. The cause could be channels: dead swarms inside subscribed channels. The concept of pre-view torrents might be the cause. Remote search results are also checked.
  • No algorithm to purge (remove) dead swarms in Tribler.
  • {repeating} Redo experiments with newer Tribler code: "The purpose of the experiment is to see how the torrent health information received via the popularity community differs when checked locally by joining the swarm."
    • "Naked Libtorrent" is operational as a minimal codebase to join swarms through exit node, measure connected peers, and estimate total swarm size. Strictly limited to 60 seconds per swarm.
      • Bep-33 swarm count (not used)
      • Full DHT lookup peer identities (not used)
      • Tracker peer identities (not used, infohash only)
      • PEX-gossip peer identities
        • failed to connect peers
        • Connected peer identities (e.g. responsive)
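A minimal sketch of the "Random Torrentchecking" bullet above, assuming a hypothetical database helper and the torrent checker's check_torrent_health() entry point; the actual scheduling in Tribler may differ.

from random import choice

RANDOM_CHECK_INTERVAL = 2 * 60  # seconds: check one random swarm every 2 minutes

async def check_random_swarm(torrent_checker, db):
    # Pick one known infohash uniformly at random (no bias towards dead or fresh
    # swarms) and check its swarm popularity.
    infohashes = db.get_all_infohashes()  # hypothetical helper
    if infohashes:
        await torrent_checker.check_torrent_health(choice(infohashes), timeout=20)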

@xoriole
Contributor

xoriole commented Dec 9, 2022

Repeating the Popularity community experiment here.

Similar to the experiment done in September, here we show how the reported (or received) health info and checked health info differ for the 24 popular torrents received via the community.

The numbers in the graphs are counts, and a logarithmic scale is used for better comparison since the variation in the values is large.

A. Based on count

Dec-Seeders (reported and checked)

Dec-Leechers (reported and checked)

B. Normalized in percentages

Seeders % = ( checked seeders / reported seeders ) x 100 %
Leechers % = ( checked leechers / reported leechers ) x 100 %
Peers % = ( ( checked seeders + checked leechers) / ( reported seeders + reported leechers ) ) x 100 %

Dec - Peers and seeders
Dec - Peers and leechers


Observations

  • Seeders
    The average number of checked seeders per torrent is similar for both measurements.

    Measurement | Avg Seeders count | Avg Seeders %
    September | 104 | 13.6
    December | 108 | 2.49

  • Leechers
    The average number of checked leechers is lower than found in September.

    Measurement | Avg Leechers count | Avg Leechers %
    September | 143.58 | 296.44
    December | 105.91 | 40.02

    However, this is less significant since we are mostly interested in seeders.

  • Peers
    The average number of checked peers is lower than found in September.

    Measurement | Avg Peers count | Avg Peers %
    September | 244.04 | 28.27
    December | 214.45 | 4.35

  • Overall, the seeders, leechers, and peers percentages have decreased significantly compared to the September measurement. One likely explanation is that the popular torrents recorded in this experiment have a lower standard deviation in their reported health values (new version of Tribler) than the measurements taken in September. That is, more diverse popular torrents are being distributed in the new Tribler 7.12.1, whereas in the earlier Tribler version we observed a few popular torrents being distributed multiple times.

  • Lessons learned: even though a torrent is reported to be alive and popular, it could still be dead, as we found out by checking. This gap between reported and checked requires fixing the checking mechanism within Tribler.

  • Looking at the results, the dissemination of popular torrents via the popularity community is, in my opinion, satisfactory.

@absolutep

Overall, the seeders, leechers, and peers percentages have decreased significantly compared to the September measurement.

I would point out another reason: a lower number of users using Tribler might skew the results or at least give an erratic response.

I do not know why, but it seems that the userbase has decreased quite a lot.

For newer torrents, I get download/upload speeds of around 20 MBps in qBittorrent (without VPN), but on Tribler I hardly cross a maximum of 4 MBps (without hops).

Is this because of a low number of users, an inability to connect to peers, or cooperative downloading - something I have no technical knowledge of?

@synctext
Member Author

synctext commented Dec 9, 2022

@absolutep Interesting thought, thx! We need to measure that and compensate for it.

@xoriole The final goal of this work is to either write or contribute the technical content to a (technical/scientific) paper, like: https://github.com/Tribler/tribler/files/10186800/LTR_Thesis_v1.1.pdf
We're very much not ready for machine learning, but for publication results it is strangely easy to mix measurements of a 17-years-deployed system with simple Python Jupyter notebooks and machine learning. Key performance indicator: zombies in the top-N (1000). I agree with the key point you raised: stepping out of the engineering mindset. Basically we're spreading data nicely and fast, it's only a bit wrong (e.g. 296.44% 😂).
Lesson learned: we started simple, working, and inaccurate. Evolved complexity: we need a filter step and a later re-measurement (e.g. re-measure, re-confirm popularity). Reactive, pro-active, or emergent design? Zero-trust architecture: trust nobody but yourself. We have no idea actually, so we just build, deploy, and watch what happens. But we do need to know the root cause of failure: without understanding the reason for the wrong statistics, we're getting nowhere. Can we reproduce the BEP33 error, for instance? Therefore: an analysis of 1 month of system dynamics and faults. Scientific related work (a small sample from this blog on Google YouTube):
(screenshot)
The scientific problem is item ranking. What would be interesting to know: how fast does the front page of YouTube change with the most-popular videos? Scientific article by Google: Deep Neural Networks for YouTube Recommendations.

@synctext
Member Author

synctext commented Jan 9, 2023

Discussed progress. Next sprint: how good are the popularity statistics with the latest 7.12.1 Tribler (filtered results, compared to ground truth)? DHT self-attack issue to investigate next?

@xoriole
Contributor

xoriole commented Jan 19, 2023

Comparing the results from naked libtorrent and Tribler, I found that the popular torrents received via the popularity community, when checked locally, often show up as dead torrents, which is likely not the case. This is because of an issue in the torrent checker (DHT session checker). After the removal of BEP33, the earlier way of getting the health response mostly returns zero seeders and zero or a few leechers, which in the UI shows up as: (screenshot)

@drew2a
Contributor

drew2a commented Jan 19, 2023

Could this bug (#6131) relate to the described issues?

@xoriole
Contributor

xoriole commented Jan 19, 2023

Could this bug (#6131) relate to the described issues?

Yes, it is the same bug.

@drew2a
Contributor

drew2a commented Jan 27, 2023

While working on #7286 I've found a strange behavior that may shed light on some of the other oddities.

If TorrentChecker performs a check via a tracker, then returned values always look ok-ish (like 'seeders': 10, 'leechers': 77).

If TorrentChecker performs a check via DHT, then returned seeders are always equal to 0 (like 'seeders': 0, 'leechers': 56)

Maybe this is the bug that @xoriole describes above.


UPDATED 03.02.22 after verification from @kozlovsky

I also have found that literally all automatic checks in TorrentChecker were broken.

There are three automatic checks:

self.register_task("tracker_check", self.check_random_tracker, interval=TRACKER_SELECTION_INTERVAL)
self.register_task("torrent_check", self.check_local_torrents, interval=TORRENT_SELECTION_INTERVAL)
self.register_task("user_channel_torrent_check", self.check_torrents_in_user_channel,
interval=USER_CHANNEL_TORRENT_SELECTION_INTERVAL)

The first (check_random_tracker) is broken because it performs the check but doesn't save the results into the DB:

try:
    await self.connect_to_tracker(session)
    return True
except:
    return False

The second (check_local_torrents) is broken because it calls an async function in a sync way (which doesn't lead to the execution of the called function).

The third (check_torrents_in_user_channel) is also broken because it calls an async function in a sync way (which doesn't lead to the execution of the called function).

CC: @kozlovsky

@drew2a
Contributor

drew2a commented Jan 30, 2023

Also, I'm posting an example algorithm for getting seeders and leechers when more than one source of information is available.

  1. TorrentChecker checks the seeders and leechers for an infohash.
  2. TorrentChecker sends a DHT request and a request to a tracker.
  3. TorrentChecker receives two answers, one from DHT and one from the tracker:
    • DHT_response = {"seeders": 10, "leechers": 23}
    • tracker_response = {"seeders": 4, "leechers": 37}
  4. TorrentChecker picks the answer with the maximum seeders value. Therefore the result is:
    • result = {"seeders": 10, "leechers": 23}
  5. TorrentChecker saves this information to the DB (and propagates it through PopularityCommunity later).

Proof:

# More leeches is better, because undefined peers are marked as leeches in DHT
if s > torrent_update_dict['seeders'] or \
        (s == torrent_update_dict['seeders'] and l > torrent_update_dict['leechers']):
    torrent_update_dict['seeders'] = s
    torrent_update_dict['leechers'] = l

Intuitively, this is not the correct algorithm. Maybe we should use the mean instead of the max.

Something like:

from statistics import mean

DHT_response = {'seeders': 10, 'leechers': 23}
tracker_response = {'seeders': 4, 'leechers': 37}

result = {'seeders': None, 'leechers': None}
for key in result.keys():
    # Use a list, not a set, so equal values from both sources are still averaged
    result[key] = mean([DHT_response[key], tracker_response[key]])

print(result)  # {'seeders': 7, 'leechers': 30}

Or we might prioritize the sources (see the sketch below). Let's say:

  1. Tracker (more important)
  2. DHT (less important)
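A trivial sketch of that prioritization (illustrative only):

def combine_health_responses(tracker_response=None, dht_response=None):
    # Prefer the tracker result when available; fall back to DHT otherwise.
    return tracker_response if tracker_response is not None else dht_response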

@kozlovsky
Contributor

I also have found that literally all automatic checks in TorrentChecker were broken.
The first (check_random_tracker) is broken because it performs the check but doesn't save the results into the DB:

I suspect you are right, and this check does not store the received results in the DB.

The second (check_local_torrents) is broken because it calls an async function in a sync way (which doesn't lead to the execution of the called function):
The third (check_torrents_in_user_channel) is also broken because it calls an async function in a sync way (which doesn't lead to the execution of the called function):

I think these checks work properly. The function they call is not actually async:

@task
async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
    ...

This function appears to be async, but it is actually synchronous. The @task decorator converts an async function to a sync function that starts an async task in the background. This is non-intuitive, and PyCharm IDE does not understand this.

We should probably replace the @task decorator with something that IDE better supports to avoid further confusion.
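For intuition, a minimal sketch of what such a decorator does; the real decorator also tracks the created task (e.g. via a task manager), which this sketch omits.

import asyncio
from functools import wraps

def task(func):
    # Turn an async method into a sync method that schedules the coroutine as a
    # background asyncio task and returns immediately.
    @wraps(func)
    def wrapper(*args, **kwargs):
        return asyncio.ensure_future(func(*args, **kwargs))
    return wrapper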

@synctext
Member Author

synctext commented Apr 6, 2023

New PR for torrent checking. After the Tribler 7.13 release we can re-measure the popularity community. ToDo @xoriole. The new code deployment will hopefully fix these issues and improve the general health and efficiency of the popularity info for each swarm.

@synctext
Member Author

synctext commented Sep 4, 2023

The graph below shows the number of torrents received (unique & total), total messages, and peers discovered per day by the crawler running the Popularity Community in observer mode for 95 days. The crawler runs with an extended discovery booster, which leads to discovering more torrents.

Let's focus on the latest 7.13 release and re-measure. Can we find 1 million unique swarms in 50 days? How many long-tail swarms are we finding? Did we achieve the key goal of tracking 100k swarms for {rough} popularity? If we succeeded, we can move on and focus fully on bug fixing, tooling, metadata enrichment, tagging, and semantic search.

@egbertbouman: measure for 24 hours the metadata we get from the network at runtime:

@egbertbouman
Member

I just did a simple experiment with Tribler running idle for 24h to get some idea of which mechanisms in Tribler are primarily responsible for getting torrent information. It turns out that the vast majority of torrents are discovered through the lists of random torrents gossiped within the PopularityCommunity.

torrent_collecting

When zooming in on the first 10 minutes, we see that a lot of torrents are discovered by the GigaChannelCommunity (preview torrents). The effect of preview torrents quickly fades, as they are only collected upon the discovery of new channels.

torrent_collecting

While examining the database after the experiment had completed, it turned out that a little under 80% of all torrents were "free-for-all" torrents, meaning that they did not have a public_key/signature and were put on the network after a user manually requested the torrentinfo.

@ichorid
Contributor

ichorid commented Sep 19, 2023

Also, if you look at the top channels, they were all updated 2-3 years ago. That means their creators basically dumped a lot of stuff and then forgot about it or lost interest in it. But their creations continue to dominate the top, which is now a bunch of 🧟 🧟 🧟

"Free-for-all" torrent gossip is the most basic form of collective authoring, and it trumped the channels in real usage (e.g., search and top torrents), as @egbertbouman demonstrated above. This means that without a collective editing feature, Channels are useless and misleading.

Overall, I'd say Channels must be removed (especially because of the clumsy "Channel Torrent" engine) and replaced with a more bottom-up system, e.g. collectively edited tags.

@synctext
Member Author

synctext commented Oct 3, 2023

😲 😲 😲
650 torrents are collected by Tribler in the first 20 seconds after startup?

When zooming in on the first 10 minutes, we see that a lot of torrents are discovered by the GigaChannelCommunity (preview torrents).

Now we understand a big source of performance loss. GigaChannels is way too aggressive in the first seconds and first minute of startup: no rate control or limit on I/O or networking (32 torrents/second). It should be smoother and first rely on RemoteSearch results. I believe Tribler will become unresponsive on slow computers with this aggressive content discovery. Great work on the performance analysis! We should do this for all our features.
No fix is needed now! Only awareness that we need to monitor our code consistently for performance.

@synctext
Member Author

synctext commented Oct 10, 2023

Web3Discover: Trustworthy Content Discovery for Web3

@xoriole it would be great to write a scientific arXiv paper on this, beyond the traditional developer role, also contributing to the performance evaluation and scientific core. See the IPFS example reporting, tooling, and general pubsub work, plus "GossipSub: Attack-Resilient Message Propagation in the Filecoin and ETH2.0 Networks".

Problem description, design, graphs, performance analysis, and a vital role for search plus relevance ranking.
Timing: finish before Aug 2024 (end of contract).

@grimadas
Contributor

grimadas commented Aug 30, 2024

The problem remains relevant to this day. I see two main issues and possible directions for improvement:

  1. Zero-trust architecture is expensive. If I only trust my own node, I need to check all torrents, which is a costly operation.
    Solution: Enhance the torrent checker with reputation information. Include the source when you receive information about torrents checked by others. When verifying torrent health, cross-check with other reported sources. If the reported information is significantly inaccurate, downgrade the source's reputation. If the information seems "good enough," slightly increase the source's reputation.
    This approach enables us to have torrent information enhanced with network health checks, not just checks from our own node.
    Challenge: Torrent health is always dynamic data. We need a reliable estimate to determine what is significantly inaccurate and what is "good enough" (a rough sketch of such an update rule follows after this list).

  2. Biased health checks are not optimal. Currently, many peers perform redundant work by repeatedly checking the same popular torrents. Can we improve this process? Perhaps we could integrate information about who has already checked the torrents and how much we trust them.
    Simplistic gossip mechanism: Randomly send information about 5 popular, checked torrents. While this could lead to repetition and over-redundancy, we need to demonstrate that this is indeed a problem (using a graph). With baseline metrics, we can identify opportunities for further improvement.
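A rough sketch of the reputation update rule from point 1, with an assumed relative-error tolerance to decide what counts as "good enough"; all names and constants are illustrative.

def update_source_reputation(reputation, source, reported_seeders, checked_seeders,
                             tolerance=0.5, reward=0.05, penalty=0.2):
    # Cross-check a source's reported seeder count against our own check result.
    # Torrent health is dynamic, so the tolerance has to be fairly generous.
    expected = max(checked_seeders, 1)
    relative_error = abs(reported_seeders - checked_seeders) / expected
    current = reputation.get(source, 0.5)
    if relative_error > tolerance:
        reputation[source] = max(0.0, current - penalty)  # significantly inaccurate
    else:
        reputation[source] = min(1.0, current + reward)  # good enough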
