-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved downloader webseed performance #10715
Conversation
erigon-lib/downloader/downloader.go
Outdated
// if we have created a torrent bit it has no info assume that the | ||
// webseed download either has not been called yet or has failed and | ||
// try again here - otherwise the torrent will be left with no info | ||
if t.Info() == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select {
case <-t.GotInfo():
continue
default:
}
is better than t.Info()
from "mutex contention" perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Info now has its own mutex - so contention is minimal - as its only locked written by set info. So I think that t.Info() should be preferred for call simplicity.
if ok && err == nil { | ||
_, _, err = addTorrentFile(d.ctx, ts, d.torrentClient, d.db, d.webseeds) | ||
if err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Debug the err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -65,7 +65,7 @@ func Default() *torrent.ClientConfig { | |||
// better don't increase because erigon periodically producing "new seedable files" - and adding them to downloader. | |||
// it must not impact chain tip sync - so, limit resources to minimum by default. | |||
// but when downloader is started as a separated process - rise it to max | |||
//torrentConfig.PieceHashersPerTorrent = max(1, runtime.NumCPU()-1) | |||
torrentConfig.PieceHashersPerTorrent = max(1, runtime.NumCPU()-2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because there is no global limiter for hashers - on non-nvme drives it may crash with > 10K threads created
panic. because all threads will wait for overloaded/just_slow disk (and go creating dedicated thread for each IO-blocked goroutine).
m.mu.Lock() | ||
defer m.mu.Unlock() | ||
|
||
tx, err := m.db.BeginRwNosync(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use BeginRw
instead of BeginRwNosync
. otherwise you can loose last committed transaction on power-off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
m.putFlushed(tx, infoHash, flushed) | ||
|
||
tx.Commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if commit returns error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like it. left couple comments. plz address.
show torrent
lib diff plz. maybe i can review a bit there also.
m.mu.Lock() | ||
defer m.mu.Unlock() | ||
//fmt.Println("FLUSH", infoHash) | ||
m.db.Batch(func(tx kv.RwTx) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Batch
- is designed for much parallel calls. but currently it's behind mutex. In this case Update
will perform better.
Or maybe it's possible to do behind mutext only map get/put.
Or just don't use mdbxPieceCompletionBatch
if Update
inside Set
is not a problem anymore.
up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this so now we're using the original unbatched version of the code - with the alterations you suggested.
I have found with sepolia things seems to run ok with default settings, for bor-mainnet I found I need to set DL hashers to 20 in order to avoind memory growth - the hashers can't keep up with the data download. Too may hahsers though and CPU's get maxed out, althouhg hashing iteslf only seems to hake ~25% of the processing time. The rest is copying memory and GC. |
I have resolved all of the DATA races I see for the moment. |
This is resolved - it was caused by a cancelled flag being incorrectly set. I have seen several other panics - which I have investigated and resolved. I think there may be a couple of issue left since I made the lib more active - but think its worth merging and investigating. The reamining issue seem to be mostly associated with peer flow. Which I'll start to test this week. |
This contains fixes, mainly in erigontech/torrent which improve the parallelization of the lib to support downloads speeds over ~25MB per second on a reliable basis.
--torrent.download.rate=256mb --torrent.download.slots=32
should now run a download to completion at a consistent ~250MB/second download rate, assuming64GB
of available memory.If more memory is available
--torrent.download.rate=512mb --torrent.download.slots=48
will around ~400MB/second. Its not yet clear where the loss in performance is but for this version ofthe code it seems 400MB/second is around the maximum it can support.Outstanding Issues
The current version of the code is memory hungry at high bandwidths the reason behind this is the way the http data is dealt with under high load. The buffer model is currently:
http->hashing->mmap
Where both the http connection and the torrent hasher will retain intermediate buffers until they are finally flushed to te memory mapped file. A more memory efficient model would be to get the http connection to write directly to the memory mapped segment which can the be directly hashed. This will require further code modification - which is outside of the scope of this change.
Downloader changes
Along with the torrent lib changes a number of changes have been made to the downloader code. The most significant are:
mdbxPieceCompletion
now has aFlushed(infoHash infohash.T, flushed *roaring.Bitmap)
method which is used to commit the completion status to the db after an asynchronous flush of the mmap files has been made. This means that the completion state will only be confirmed once the data is flushed. (This may lead to re-downloading of peices in the case of a crash.