Insanely fast #4
Highly recommend using the latest turbo fork of par2cmdline; it's way faster than par2cmdline and even beats ParPar. animetosho/par2cmdline-turbo#4
Thanks for the tests. I'm hoping to get a ParPar v0.4.0 release "soonish", which should be more comparable (if you wish, you can test the current dev branch to see how it performs).
Thanks for posting - yeah, ParPar has a few tricks that par2cmdline doesn't implement.
Do you mean that you were measuring the disk read rate whilst it was running? 950MB/s sounds slow for typical NVMe otherwise.
It's a ZFS pool actually (equivalent of RAID 0). And yes, I was watching the disk read rate while it ran. I ran the same ParPar test on a regular SSD; reads fluctuated between 330-380MB/s and it took 1m58s to complete. NVMe definitely had an edge.
Ah, that makes sense. The next bottleneck on your system may be MD5. If the CPU isn't being used fully, maybe setting … would help.
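One quick way to check is to measure raw single-threaded MD5 throughput and compare it against the observed read rate. A standalone Go sketch (this is not how PAR2 itself hashes, just a ballpark measurement):

package main

import (
	"crypto/md5"
	"fmt"
	"time"
)

func main() {
	// Hash 1 GB of zeros to estimate single-threaded MD5 throughput.
	buf := make([]byte, 64<<20) // 64 MB buffer
	const rounds = 16           // 16 x 64 MB = 1 GB total

	h := md5.New()
	start := time.Now()
	for i := 0; i < rounds; i++ {
		h.Write(buf)
	}
	sum := h.Sum(nil)
	elapsed := time.Since(start)
	mb := float64(rounds*len(buf)) / (1 << 20)
	fmt.Printf("hashed %.0f MB in %s (%.0f MB/s), digest %x\n",
		mb, elapsed, mb/elapsed.Seconds(), sum[:4])
}

If the reported MB/s is close to (or below) your disk's read rate, hashing rather than I/O is likely the limiter.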
I've got 64GB of RAM, which is plenty. Indeed, … Here's another test with …
The CPU usage difference there is interesting. Thanks for the benchmarking! I've noticed that the thread scaling isn't so great at the moment (something I still need to work on). I wonder if the lower CPU usage is related to par2cmdline not feeding enough data in to process. Try adding …
Thanks for the testing! The results are interesting. I had a bit of a look and think it could be related to over-eager processing invocations - I've updated ParPar to address that. Hopefully that reduces/eliminates the unnecessarily high CPU usage.
They're both ultimately running the same processing code, so in theory it should be roughly the same.
Source is an 1800-day-old NZB of a typical ~4GB movie that is broken. I gutted sabnzbd 4.x's newsunpacker so I could re-run the same job (verification + repair), record times on various machines, and try to get some good real-world benchmarks. With par2 we parsed out parts of the CLI output so we know when to start/stop. I added some logic so I can try out multipar/par2cmdline/par2cmdline-turbo and 32-bit/64-bit variations, but left everything else that sab calls the same. I don't give it any additional arguments; I wanted to see what a pure drop-in replacement would do. Using Python 3.11.2 on each machine. For the laptops: all machines are fully charged, connected to power, and up to date on updates.
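A rough Go sketch of the timing-wrapper idea (my actual harness is Python inside sabnzbd; the binary name and PAR2 file here are placeholders):

package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Placeholder binary and PAR2 set - swap in par2, par2-turbo, multipar, etc.
	cmd := exec.Command("par2", "r", "movie.vol000+01.par2")
	start := time.Now()
	out, err := cmd.CombinedOutput()
	elapsed := time.Since(start)
	if err != nil {
		fmt.Println("verify+repair failed:", err)
	}
	// The CLI output could be parsed here for finer-grained start/stop markers.
	fmt.Printf("output: %d bytes, verify+repair took %s\n", len(out), elapsed)
}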
Thanks for the benchmarks!
I did a quick benchmark comparing parchive/par2cmdline, animetosho/par2cmdline-turbo, klauspost/reedsolomon, and catid/wirehair.

=== Test 1: 1 GB file with 1100 shards (1000 file shards, 100 recovery shards) ===

…
The results of the test above show that par2cmdline-turbo is the clear winner in terms of both performance and memory usage. The memory usage of wirehair is due to it storing the file in memory as well as creating the encoder: the 1GB file itself takes up 1GB of memory, and the encoder takes up another 1GB, for a total of 2GB.

Note: klauspost/reedsolomon does support progressive encoding, but only for the Regular code (which supports up to 256 shards), not the Leopard code (which supports up to 65536 shards). This means that if you have more than 256 shards you cannot do progressive encoding. As for wirehair, the author said that there is no way to do progressive encoding.

=== Test 2: 1 GB file with 220 shards (200 file shards, 20 recovery shards) ===

Using progressive encoding only (so no wirehair).

par2cmdline: 14 seconds, 0.12 GB
…

Here I used progressive encoding for Klaus Post's implementation, so it used less memory. In this scenario par2cmdline is only a bit slower than reedsolomon while using the same amount of RAM. It seems that in progressive encoding mode klauspost/reedsolomon uses the same amount of RAM as par2cmdline but is also much slower, to the point where it is not much faster than par2cmdline. Also, wow, par2cmdline-turbo is so much faster than par2cmdline it's crazy - about 3 times faster! It does seem to use slightly more RAM than par2cmdline, though that's hard to tell with my by-the-eyeball methods.

Reposted on my blog: https://1f604.blogspot.com/2023/04/comparison-of-par2-reedsolomon-and.html

Edit: Looking back at the results, it surprises me that going from 20 to 100 recovery shards didn't change the time taken at all. I always thought the time taken was proportional to the number of recovery shards?
Thanks for the benchmark + results!
That's interesting that the …
Does it provide file handling functions? I thought that it and Wirehair were just encoding libraries, so it may be worth publishing usable test code just in case someone finds an issue there. Threading may also be a factor - I don't know whether the libraries handle it, whilst par2cmdline does implement multi-threading.
It does indeed use more RAM. par2cmdline's RAM usage is largely the size of the recovery data, plus a block/shard-sized read/write buffer. The ParPar backend also has a 'staging area' where the data is shuffled around to make processing more optimal. The size of this varies, but it currently targets around 24 blocks/shards; this figure could probably use some tuning as well. Note that par2cmdline's memory usage can be adjusted with the -m switch.
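As a back-of-the-envelope check against the 220-shard test above (a sketch using figures from this thread, not a measurement):

package main

import "fmt"

func main() {
	// Figures from the 1 GB / 220-shard test above.
	fileSize := int64(1 << 30) // ~1 GB
	dataShards := int64(200)
	recoveryShards := int64(20)

	shardSize := fileSize / dataShards        // ~5 MB per shard
	recoveryRAM := recoveryShards * shardSize // ~100 MB of recovery data

	// Plus a shard-sized read/write buffer - close to the observed ~0.12 GB.
	fmt.Printf("shard size: ~%d MB, recovery data: ~%d MB\n",
		shardSize>>20, recoveryRAM>>20)
}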
The Reed-Solomon computation is proportional to (input size) × (number of recovery shards). But real-world speed may not follow that strictly, particularly for a low number of recovery shards, where fixed costs such as file I/O and hashing can dominate.
You might be able to see more of a difference if you go for more recovery shards.
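To illustrate why fixed costs can mask the scaling at low recovery counts, here's a toy cost model (both constants are assumptions for illustration, not measured rates):

package main

import "fmt"

func main() {
	// Encode work is roughly proportional to (input size) x (recovery shards);
	// hashing, I/O and setup add a roughly constant overhead on top.
	const inputGB = 1.0
	const overheadSec = 2.5       // assumed fixed cost (hashing, I/O, setup)
	const secPerGBPerShard = 0.02 // assumed compute rate

	for _, recovery := range []int{20, 100, 500} {
		compute := inputGB * float64(recovery) * secPerGBPerShard
		fmt.Printf("%3d recovery shards: ~%.1fs compute + ~%.1fs fixed\n",
			recovery, compute, overheadSec)
	}
}

With numbers like these, going from 20 to 100 recovery shards changes the compute term from ~0.4s to ~2s, which is barely visible under a ~2.5s fixed overhead - consistent with what was observed.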
Thanks @animetosho! I ran par2cmdline-turbo with a range of shard numbers and found that the running time grows faster than linearly (all done with the 1GB file with 10% recovery):

100 shards: 3 seconds
1,000 shards: 4 seconds
…

So below, say, 1,000 shards, it seems the running time is dominated by some kind of constant-time overhead: as we go from 100 to 1,000 shards, the time only grows from 3 seconds to 4 seconds. Maybe some of that is due to my test system having slow IO.
Here's the code I used for testing Klaus Post's reedsolomon progressive encoding. It is basically a modified version of the simple-encoder.go in the examples directory (you can run it by replacing the contents of simple-encoder.go with the below):

//go:build ignore
// +build ignore
// Copyright 2015, Klaus Post, see LICENSE for details.
//
// Simple encoder example
//
// The encoder encodes a single file into a number of shards
// To reverse the process see "simple-decoder.go"
//
// To build an executable use:
//
// go build simple-encoder.go
//
// Simple Encoder/Decoder Shortcomings:
// * If the file size of the input isn't divisible by the number of data shards
// the output will contain extra zeroes
//
// * If the shard numbers aren't the same for the decoder as in the
// encoder, invalid output will be generated.
//
// * If values have changed in a shard, it cannot be reconstructed.
//
// * If two shards have been swapped, reconstruction will always fail.
// You need to supply the shards in the same order as they were given to you.
//
// The solution for this is to save a metadata file containing:
//
// * File size.
// * The number of data/parity shards.
// * HASH of each shard.
// * Order of the shards.
//
// If you save these properties, you should be able to detect file corruption
// in a shard and be able to reconstruct your data if you have the needed number of shards left.
package main
import (
"bufio"
"flag"
"fmt"
"io"
"os"
"path/filepath"
"time"
"github.com/klauspost/reedsolomon"
)
var dataShards = flag.Int("data", 200, "Number of shards to split the data into, must be below 257.")
var parShards = flag.Int("par", 20, "Number of parity shards")
var outDir = flag.String("out", "", "Alternative output directory")
func init() {
flag.Usage = func() {
fmt.Fprintf(os.Stderr, "Usage of %s:\n", os.Args[0])
fmt.Fprintf(os.Stderr, " simple-encoder [-flags] filename.ext\n\n")
fmt.Fprintf(os.Stderr, "Valid flags:\n")
flag.PrintDefaults()
}
}
func main() {
start := time.Now()
// Parse command line parameters.
flag.Parse()
args := flag.Args()
if len(args) != 1 {
fmt.Fprintf(os.Stderr, "Error: No input filename given\n")
flag.Usage()
os.Exit(1)
}
fname := args[0]
// Create encoding matrix.
enc, err := reedsolomon.New(*dataShards, *parShards)
checkErr(err)
fmt.Println("Finished creating reedsolomon encoding matrix.")
// Open the file for reading
f, err := os.Open(fname)
checkErr(err)
defer f.Close()
dir, file := filepath.Split(fname)
if *outDir != "" {
dir = *outDir
}
// calculate the shard size
fi, err := f.Stat()
checkErr(err)
fmt.Printf("The file is %d bytes long\n", fi.Size())
filesize := fi.Size()
if filesize%int64(*dataShards) != 0 {
panic("file size must be a multiple of the number of data shards")
}
// derive the shard size from the flag rather than a hard-coded 200
shardsize := filesize / int64(*dataShards)
fmt.Println("shard size:", shardsize)
// create the parity shard buffers
parity := make([][]byte, *parShards)
for i := range parity {
parity[i] = make([]byte, shardsize)
}
// create the parity shards
buf := make([]byte, shardsize)
index := 0
r := bufio.NewReader(f)
elapsed := time.Since(start)
fmt.Println("time elapsed until starting to encode file:", elapsed)
for {
start := time.Now()
// read in one full shard (a bare bufio Read may return short reads,
// so use io.ReadFull to fill the buffer)
_, readErr := io.ReadFull(r, buf)
if readErr == io.EOF { // no more shards
fmt.Println("all done!")
break
}
checkErr(readErr)
// write it out into a file
outfn := fmt.Sprintf("%s.%d", file, index)
err = os.WriteFile(filepath.Join(dir, outfn), buf, 0644)
checkErr(err)
elapsed := time.Since(start)
fmt.Println("reading + writing time:", elapsed)
start = time.Now()
// encode the chunk
err = enc.EncodeIdx(buf, index, parity)
checkErr(err)
index++
elapsed = time.Since(start)
fmt.Println("encoding time for shard", index, ":", elapsed)
}
start = time.Now()
// Write out parity shards into files.
for i, shard := range parity {
outfn := fmt.Sprintf("%s.%d", file, *dataShards+i)
fmt.Println("Writing to", outfn)
err = os.WriteFile(filepath.Join(dir, outfn), shard, 0644)
checkErr(err)
}
elapsed = time.Since(start)
fmt.Println("time taken to write out parity files:", elapsed)
}
func checkErr(err error) {
if err != nil {
fmt.Fprintf(os.Stderr, "Error: %s", err.Error())
os.Exit(2)
}
}

Output: …
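For reference, a hypothetical invocation following the build note at the top of the file (bigfile.bin stands in for any input whose size is a multiple of the data-shard count):

go build simple-encoder.go
./simple-encoder -data 200 -par 20 bigfile.bin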
Actually, with a 1GB source size, you'll start running into an issue with shards being too small. Internally, shards are broken into chunks and distributed across threads, but if the shard is too small, it's going to be limited in how many threads it can use. I know little about Go, so I can't really critique your code (mostly the suggestion was made for your blog post, as it could make it seem more transparent).
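A rough illustration of the shard-size effect (the 64 KB chunk size here is an assumption for the example, not par2cmdline-turbo's actual figure):

package main

import "fmt"

func main() {
	fileSize := int64(1 << 30)   // 1 GB source
	chunkSize := int64(64 << 10) // assumed 64 KB processing chunk

	for _, shards := range []int64{100, 1000, 10000} {
		shardSize := fileSize / shards
		chunksPerShard := shardSize / chunkSize // upper bound on threads per shard
		fmt.Printf("%5d shards -> ~%d KB per shard, at most ~%d chunks to spread across threads\n",
			shards, shardSize>>10, chunksPerShard)
	}
}

At 10,000 shards each shard is only ~100 KB, so under this assumption there's roughly one chunk per shard and little room for parallelism.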
Just to add: when I tried using the memory option, it didn't seem to help.
Are you setting it to something larger than the default? In theory, larger values shouldn't decrease performance, though you may be experiencing the same issue as @buggsi.
Is there a way to see the default / what it uses? The machine I was testing on has limited RAM (4GB), so I tried setting 2GB and 3GB, but neither helped compared to the default.
Add the … switch. The default policy is half of RAM (or 128MB if it couldn't figure out the amount of RAM available). The memory limit only affects the amount of recovery data that can be held in memory - for create, that's the total amount of recovery being created, whereas for repair, it's the size of the damaged blocks.
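For example, capping par2cmdline at roughly 2GB when creating would look like this (the -m value is in megabytes):

par2 c -m2048 -r5 test.par2 *.rar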
I benchmarked it on 25GB of rar'd data (Ryzen 3700x, Ubuntu 22.04, NVMe).
parpar -s5120000b -r5% -p 50 -o test.par2 *.rar -> 3m10s
par2 c -s5120000 -r5 -l -v test.par2 *.rar -> 8m55s
par2-turbo c -s5120000 -r5 -l -v test.par2 *.rar -> 1m14s

I can't believe how fast it is, it even beats parpar!