Swarm size community: content popularity #2783
first startup task:
I found that the official libtorrent tutorial is a good reference, though the documentation isn't always consistent with the code.
Can we obtain accurate statistics from libtorrent? We want the raw number of packets and their lengths. Sadly, the total_download field of peer_info only counts payload: "the total number of bytes downloaded from this peer. These numbers do not include the protocol chatter, but only the payload data". There are session statistics with net.sent_payload_bytes and net.sent_ip_overhead_bytes. It seems accurate message counters are kept:
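A minimal sketch of reading these counters from the Python bindings (assuming libtorrent 1.2+ bindings, where session_stats_alert exposes a name-to-value dict via its values attribute; older bindings require mapping indices with lt.session_stats_metrics() instead):

```python
import time
import libtorrent as lt

# Ask the session to emit statistics alerts (assumes libtorrent 1.2+ bindings).
ses = lt.session({"alert_mask": lt.alert.category_t.stats_notification})
ses.post_session_stats()
time.sleep(1)

for alert in ses.pop_alerts():
    if isinstance(alert, lt.session_stats_alert):
        stats = alert.values  # dict: metric name -> counter value
        print("sent payload bytes :", stats["net.sent_payload_bytes"])
        print("sent IP overhead   :", stats["net.sent_ip_overhead_bytes"])
        print("recv payload bytes :", stats["net.recv_payload_bytes"])
        print("recv IP overhead   :", stats["net.recv_ip_overhead_bytes"])
```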
The next step would be to determine the exact number of packets and bytes for downloading various Ubuntu images: the whole chain from magnet link to the first downloaded piece. The goal is single-byte accurate statistics. Then we can calculate optimal strategies for connecting to swarms and for sharing the swarm size handshake and crawling.
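A rough sketch of timing that chain with the Python bindings (the magnet link is a placeholder, and the alert names assume libtorrent 1.2+; this is not the exact measurement code used here):

```python
import time
import libtorrent as lt

ses = lt.session({"alert_mask": lt.alert.category_t.all_categories})
params = lt.parse_magnet_uri("magnet:?xt=urn:btih:...")  # placeholder Ubuntu magnet link
params.save_path = "."
ses.add_torrent(params)

start = time.time()
metadata_at = first_piece_at = None
while first_piece_at is None:
    for alert in ses.pop_alerts():
        if isinstance(alert, lt.metadata_received_alert):
            metadata_at = time.time() - start       # magnet link resolved via the DHT
        elif isinstance(alert, lt.piece_finished_alert):
            first_piece_at = time.time() - start    # first complete piece downloaded
    time.sleep(0.1)

print("metadata after %.1f s, first piece after %.1f s" % (metadata_at, first_piece_at))
```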
Final note: swarm size is not equal to the number of people that have seen an online video. Media consumption (content popularity) or downloads over time are very hard to estimate or measure. You would need to know the average in-swarm time, the average downloading time and the age of the swarm to calculate the media consumption.
Can you please elaborate on how optimal strategies can be calculated based on "single-byte accurate statistics"? Meanwhile, I was reading some literature about how to get (near) real-time info on the number of seeders and leechers and came across "DHT crawling".
Found a BitTorrent DHT crawler example (also in Python) on GitHub: https://github.com/nitmir/btdht-crawler
Two more references: -- Update -- Sun 16 Apr 2017 21:32:22 CEST --
@synctext
The program made in the first startup task can only show the number of seeders and leechers that the peer is currently connected with. It has no knowledge of how large the total swarm is. So I prefer to solve this problem first.
indeed. please ignore DHT crawling, too much spam.
Succeeded in retrieving statistics (see session stats in the screenshot), failed to interpret them. See this issue for more info: arvidn/libtorrent#1946
Next step is to get this operational and count the exact number of packets plus their types. The goal is to measure all packets starting from a magnet link until exactly the first block is completed. When this information from libtorrent is combined with the packet lengths, we get a sufficiently accurate picture. Experimental results for the thesis: do this for 100-ish legal magnet links (e.g. magnet links from all Ubuntu torrents plus other sources).
After some effort it is finally possible to interpret the session statistics.
Most of them can be measured using libtorrent, as we can see in the commit above.
The next steps would be to:
The experiment is running now using torrent files instead of magnet links (temporarily, otherwise it would be too slow).
true. time for dead-torrent filtering as well.
Time measurement is done: https://github.com/MaChengxin/expBT/commit/5a8f3293d1daafb2cfc1b72552bd08d3662e535f
Personal repository for the experiments: https://github.com/MaChengxin/expBT
Just noticed that there is a page describing this issue with many details: https://www.tribler.org/SwarmSize/
It seems to work! Further polish the plots made during the measurements (many more here).
Thoughts: downloading 1 piece gives a good correlation. How good is downloading 16 pieces? Does our content popularity estimation get more accurate if you spend more bandwidth and time within a swarm?
Dispersy community design sketch: create a community in which every node starts a new content popularity check every second. The outcome of this check is shared across the Dispersy overlay with 10 neighbors (fanout). That's it.
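A rough, framework-agnostic sketch of that design (check_swarm_size and send are placeholders for the real swarm check and the Dispersy message dispatch, not existing APIs):

```python
import random

FANOUT = 10  # number of random neighbors each result is gossiped to

def check_swarm_size(infohash):
    """Placeholder for the real tracker/DHT swarm check."""
    return 0, 0

def send(neighbor, message):
    """Placeholder for the real Dispersy message dispatch."""

class SwarmSizeCommunity:
    def __init__(self, known_swarms, neighbors):
        self.known_swarms = known_swarms  # list of infohashes
        self.neighbors = neighbors

    def on_second(self):
        """Called once per second: check one random swarm, gossip the result."""
        infohash = random.choice(self.known_swarms)
        seeders, leechers = check_swarm_size(infohash)
        message = {"infohash": infohash, "seeders": seeders, "leechers": leechers}
        for neighbor in random.sample(self.neighbors, min(FANOUT, len(self.neighbors))):
            send(neighbor, message)
```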
The design sketch of the Dispersy community is almost done: https://github.com/MaChengxin/tribler/tree/swarm_size_community/Tribler/community/swarmsize
Result of the experiment with the swarm size detector simulator: https://jenkins.tribler.org/job/pers/job/SwarmSizeCommunity_Chengxin/32/artifact/output/localhost/node321/00000.out/*view*/
Questions. My task has three parts: 1. distribute the entire work to different nodes (i.e. divide thousands of torrents into smaller sets and assign each set of torrents to nodes). The questions are:
Solution: no scheduling, no division of work and no result aggregation. The proposed architecture in the above design sketch is that each node just randomly checks 1 swarm per second. A total of 10 random content popularity checks are then shared with random neighbors. With this probabilistic approach it is highly likely that duplicate checks are conducted, but the code is simplified. Next steps:
This figure shows how many checks one node has to make in order to cover all the torrents using a random pick-up policy (with replacement). It takes about 800 - 900 checks to cover 162 swarms. More generally, if m denotes the number of torrents, then the expected number of checks required for total coverage is m*Hm, where Hm is the m-th harmonic number (the coupon collector problem; a quick simulation of this estimate is sketched below). This gives us an impression of the order of magnitude of the required checks. We could improve this simple swarm selection policy with the following strategies:
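A quick simulation to sanity-check the m*Hm estimate (the 162-swarm figure comes from the experiment above):

```python
import random

def checks_to_cover(m):
    """Uniform random picks with replacement until all m swarms have been checked once."""
    seen, checks = set(), 0
    while len(seen) < m:
        seen.add(random.randrange(m))
        checks += 1
    return checks

m = 162
harmonic = sum(1.0 / k for k in range(1, m + 1))
print("expected  :", m * harmonic)  # ~918 checks
print("simulated :", sum(checks_to_cover(m) for _ in range(100)) / 100.0)
```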
The new architecture looks like this:

Description: a swarm is first selected from the running pool and its size is measured. After this, a message ({swarm: size}) is created. This message goes to a message handler, which does three things: store the info locally, gossip it to other nodes, and move the swarm from the running pool to pool 1 in the standby pool. Pool 2 in the standby pool is responsible for receiving info about swarms measured by others: upon receiving a new message {swarm: size}, the node stores the info and puts the swarm into pool 2. (So when selecting a swarm to check from the running pool, we first need to check whether it is already in the standby pool.) A rough code sketch of this flow is included below.

Bootstrapping: initially, the two pools are both empty. A node first loads the swarms it already knows about into the running pool. Then the flow described above can start working.

Steady state: since a node measures swarm sizes itself and also receives such info from others, it will eventually know all the swarms in the ecosystem (e.g. Tribler). One interesting question is: if every node uses the same strategy, what is the expected time needed to get every swarm measured at least once? I've already abstracted this question mathematically: https://math.stackexchange.com/questions/2351742/expected-number-of-steps-to-walk-through-points-by-multiple-walkers

@synctext What do you think about this design? Do you see any design flaws?
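The promised sketch of the pool bookkeeping (check_swarm_size and gossip are placeholders; this only illustrates the flow, it is not the Tribler implementation):

```python
from collections import deque

def check_swarm_size(infohash):
    """Placeholder for the real swarm check."""
    return 0, 0

def gossip(message):
    """Placeholder for sending the message to other nodes."""

class SwarmSizeNode:
    def __init__(self, own_swarms):
        self.running_pool = deque(own_swarms)  # bootstrapping: swarms known by this node
        self.standby_pool_1 = set()            # swarms measured by this node
        self.standby_pool_2 = set()            # swarms measured by other nodes
        self.sizes = {}                        # infohash -> (seeders, leechers)

    def check_one(self):
        # Pick the next swarm from the running pool, skipping swarms already in a standby pool.
        while self.running_pool:
            infohash = self.running_pool.popleft()
            if infohash not in self.standby_pool_1 and infohash not in self.standby_pool_2:
                size = check_swarm_size(infohash)
                self.sizes[infohash] = size        # 1. store the info locally
                gossip({infohash: size})           # 2. gossip it to other nodes
                self.standby_pool_1.add(infohash)  # 3. move the swarm to pool 1
                return

    def on_gossip(self, message):
        # Info about swarms measured by others goes into pool 2.
        for infohash, size in message.items():
            self.sizes[infohash] = size
            self.standby_pool_2.add(infohash)
```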
@MaChengxin Please take a look at our existing torrent checker solution: https://github.com/Tribler/tribler/tree/devel/Tribler/Core/TorrentChecker. This checker simply queries the tracker for the size of the swarm. It is compatible with both HTTP and UDP trackers (it also supports DHT queries).
This is the new design for checking swarms. |
Would suggest removing the checking of dead swarms. Additionally, split the checking of healthy and unhealthy swarms. Use a three-strikes algorithm for swarms: a swarm is declared dead and will never be checked again if zero seeders and zero leechers are found three times, with at least 24 hours between measurements.
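A small sketch of that three-strikes rule (whether a non-empty measurement resets the strike count is my assumption, not part of the proposal above):

```python
import time

STRIKE_LIMIT = 3
MIN_INTERVAL = 24 * 3600  # at least 24 hours between counted measurements

class SwarmHealth:
    def __init__(self):
        self.strikes = 0
        self.last_strike = 0.0
        self.dead = False  # once True, the swarm is never checked again

    def record_check(self, seeders, leechers):
        if self.dead:
            return
        if seeders == 0 and leechers == 0:
            now = time.time()
            if now - self.last_strike >= MIN_INTERVAL:
                self.strikes += 1
                self.last_strike = now
                self.dead = self.strikes >= STRIKE_LIMIT
        else:
            self.strikes = 0  # assumption: any sign of life resets the strikes
```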
As shown in the figure, only healthy and unhealthy swarms will be put back in the checking queue. The dead-swarm pool, colored in red, is like a black hole from which no swarm can escape.
It also seems to work if I check the healthy and unhealthy swarms together, and tag them after checking to indicate their status.
This figure shows the experiment result of the initially proposed architecture (with one checking queue and two standby pools). Experiment setup: The results are: The conclusion is that using 10 nodes, each of them only needs to check about 22 swarms to cover (almost) all the swarms. If there were only one node, the theoretical coverage efficiency would be 100% (namely no duplicate checks). However, checking by one node would be time-consuming: by the time 10 nodes have covered almost all the swarms (159), a single node has only checked around 25 swarms. Therefore, we can sacrifice some coverage efficiency to achieve a high checking speed.
First task: spend a few days writing the problem description chapter and the intro of the thesis. Simple model:
Some simple parameters:
Code:
You can use one of the Leaseweb servers for testing purposes; we should disable the exit nodes running on these servers for one or two days. Technical specifications of one of the servers: https://gyazo.com/5eb8aa07d0a767afc7586acdc787ac7f (I would say you can run up to 24 Dispersy instances on one machine, but you could try to scale this up using system load statistics).
According to Wikipedia, the maximum data length of a UDP packet is 65,507 bytes. Where does the 1480-byte figure come from? The advantage of filling an entire UDP packet is that it reduces the cost of communication. However, it also means some data will not be sent out until a full UDP packet is filled. The consequence of such a delay might be that other nodes do a duplicate check because they don't know someone has already done the task.
Can we just gossip the last checks as soon as we have the results instead of forming a group and sharing the info within the group?
1480 bytes: https://stackoverflow.com/a/10239583 (a 1500-byte Ethernet MTU minus the 20-byte IP header leaves 1480 bytes for the UDP datagram; anything larger risks fragmentation).
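A back-of-the-envelope estimate of how many swarm-size entries fit in one unfragmented packet (the 20-byte infohash and 16-bit seeder/leecher counters are assumptions, not the deployed wire format):

```python
UDP_DATAGRAM = 1480  # 1500-byte Ethernet MTU minus 20-byte IP header
UDP_HEADER = 8
ENTRY = 20 + 2 + 2   # infohash + seeders + leechers

print((UDP_DATAGRAM - UDP_HEADER) // ENTRY)  # ~61 entries per packet
```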
Link to my thesis: https://github.com/MaChengxin/msc-thesis
Comments on first thesis draft
Possible first sentence of Chapter 2: "Our aim is to create a media system which is not controlled by a single commercial entity." Possible Chapter 3: within this thesis we expanded an Internet-deployed open source distributed media system with content popularity. We focus on video because of the popularity of YouTube, Netflix, and BitTorrent. Basic graphs of the content popularity gossip community; keep to the default Gumby plots. Screenshots of Android. Plots of Android. Thesis DONE.
ch4.pdf. To be discussed: outline of Chapter 5; experiments to be included in Chapter 5.
For Chapters 1 to 3, I think it's better to revise them after finishing Chapters 4 and 5, so that I know which aspects to emphasize.
Comments:
First prototype deployment, moving work to new issue. |
The popularity of content is essential. Currently we do not have a method to determine whether a magnet link points to a large swarm or a dying swarm.
This work enables novel caching approaches, see "Efficient analysis of caching strategies under dynamic content popularity".
Each peer in the network must contact a swarm to determine its size. Checking 1 swarm every 5 seconds to conserve bandwidth means it takes many hours to check several thousand swarms (5000 swarms at that rate take roughly 7 hours). It is possible to share this information with your neighbors. By gossiping this swarm size information around, each swarm needs to be checked only once and bandwidth is saved.
The swarm size community is a mechanism to collaboratively determine the size of swarms. The step from magnet link to swarm size involves numerous stages. Resolving a magnet link means using the DHT mechanism. Checking a swarm and doing a BitTorrent handshake also takes numerous UDP packets and means dealing with non-responsive IPv4 addresses.
Experiments:
Science: