
Insanely fast #4

Closed · buggsi opened this issue Mar 25, 2023 · 19 comments

buggsi commented Mar 25, 2023

I benchmarked it on 25GB of rar'd data (Ryzen 3700x, Ubuntu 22.04, NVMe).

parpar -s5120000b -r5% -p 50 -o test.par2 *.rar -> 3m10s
par2 c -s5120000 -r5 -l -v test.par2 *.rar -> 8m55s
par2-turbo c -s5120000 -r5 -l -v test.par2 *.rar -> 1m14s

I can't believe how fast it is, it even beats parpar!

buggsi added a commit to buggsi/usenet-scripts that referenced this issue Mar 25, 2023
Highly recommend to use the latest turbo fork of par2-cmdline, it's way faster than par2-cmdline and even beats parpar animetosho/par2cmdline-turbo#4
animetosho (Owner) commented

Thanks for the tests.
One thing to point out is that the current release of par2cmdline-turbo uses an in-development ParPar v0.4.0 backend, which has performance improvements over the current ParPar v0.3.2 release.

I'm hoping to get a ParPar v0.4.0 release "soonish" which should be more comparable (if you wish, you can test the current dev branch to see how it performs).

buggsi (Author) commented Mar 26, 2023

$ time parpar -s5120000b -r5% -p 50 -o test.par2 *.rar  
Input data:         25.22 GiB (5289 slices from 106 files)
Recovery data:      1293.95 MiB (265 * 5000 KiB slices)
Input pass(es):     2, processing 265 * 2500 KiB chunks per pass
Read buffer size:   4096 KiB * max 8 buffers

AMD Ryzen 7 3700X 8-Core Processor
  Multiply method:  Xor-Jit (AVX2) with 125.5 KiB loop tiling, 16 threads
  Input batching:   12 chunks, 2 batches
  Memory Usage:     705.71 MiB (265 * 2500.5 KiB chunks + 58.61 MiB transfer buffer)
Calculating: 100.00%
PAR2 created. Time taken: 59.988 second(s)

real    1m0.032s
user    12m31.955s
sys     0m36.965s

parpar 0.4.0-dev is 14 seconds faster than par2cmdline-turbo :)
Note that these tests were done on a 2xNVMe raid. The read I/O was around 950MB/s.

animetosho (Owner) commented

Thanks for posting - yeah, ParPar has a few tricks that par2cmdline doesn't implement.

> these tests were done on a 2xNVMe raid. The read I/O was around 950MB/s

Do you mean that you were measuring the disk read rate whilst it was running? 950MB/s sounds slow for typical NVMe otherwise.
PAR2 does have somewhat of an MD5 bottleneck, and 950MB/s seems about right for what your CPU can do on one thread.
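
If you want to sanity-check that figure on your own machine, a rough single-thread micro-benchmark could look like this (a sketch using Go's standard crypto/md5, hashing 1GiB of in-memory zeros so disk I/O is out of the picture):

package main

import (
	"crypto/md5"
	"fmt"
	"time"
)

func main() {
	buf := make([]byte, 64<<20) // 64 MiB of zeros; contents don't affect hashing speed
	h := md5.New()
	const iters = 16 // 16 * 64 MiB = 1 GiB hashed in total
	start := time.Now()
	for i := 0; i < iters; i++ {
		h.Write(buf)
	}
	_ = h.Sum(nil)
	secs := time.Since(start).Seconds()
	fmt.Printf("single-thread MD5: %.0f MB/s\n", float64(len(buf)*iters)/1e6/secs)
}

If that prints a number close to the read rate seen in iotop, hashing is likely the ceiling.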

buggsi (Author) commented Mar 26, 2023

It's a zfs pool actually (equivalent of RAID0). And yes, I was looking at iotop while doing the tests. These are very fast NVMes; I get 2800-3000MB/s in proper benchmarks. Maybe the 950MB/s is because the par2 computations happen while reading the files.

I ran the same parpar test on a regular SSD; reads fluctuated between 330-380MB/s and it took 1m58s to complete. The NVMe definitely had an edge.

animetosho (Owner) commented

Ah, that makes sense.

The next bottleneck on your system may be MD5. If the CPU isn't being used fully, maybe setting --read-hash-queue=2 and/or increasing --read-buffers could increase parallelism when computing MD5.
Also, if you have the spare memory, a higher --memory limit could reduce the input passes to 1.

buggsi (Author) commented Mar 27, 2023

I got 64GB of RAM, which is plenty. Indeed par2cmdline-turbo isn't using the cores fully; the threads fluctuate erratically between 10-50% and not all the threads are used (compared to vanilla par2cmdline, which uses 100% of all the threads and takes forever to finish!), while parpar consistently uses all the threads at around 85-90%. Despite those differences, par2cmdline-turbo and parpar 0.4.0 still manage to create the par2 files in roughly the same time. Could it be that par2cmdline-turbo is more efficient in using the CPU?

Here's another test with parpar 0.4.0 using --read-hash-queue=2 --read-buffers=32 (tried with 1024; that didn't add much). Those tweaks gained a few seconds, but nothing major.

$ time parpar -s5120000b -r5% -p 50 --read-hash-queue=2 --read-buffers=32 -o test.par2 *.rar
Input data:         25.22 GiB (5289 slices from 106 files)
Recovery data:      1293.95 MiB (265 * 5000 KiB slices)
Input pass(es):     2, processing 265 * 2500 KiB chunks per pass
Read buffer size:   4096 KiB * max 32 buffers

AMD Ryzen 7 3700X 8-Core Processor
  Multiply method:  Xor-Jit (AVX2) with 125.5 KiB loop tiling, 16 threads
  Input batching:   12 chunks, 2 batches
  Memory Usage:     705.71 MiB (265 * 2500.5 KiB chunks + 58.61 MiB transfer buffer)
Calculating: 100.00%
PAR2 created. Time taken: 57.337 second(s)

real    0m57.385s
user    11m53.713s
sys     0m36.117s

animetosho (Owner) commented

The CPU usage difference there is interesting. Thanks for the benchmarking!

I've noticed that the thread scaling isn't so great at the moment (something I still need to work on). I wonder if the lower CPU usage is related to par2cmdline not feeding enough data in to process.

Try adding --memory=2G for ParPar or -m2048 for par2cmdline to ensure it processes everything in a single pass.
You could also try with fewer threads to see if the speed goes down by much.
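
For example, reusing the create commands from earlier in this thread (the -t flag to cap par2cmdline's thread count is an assumption here; check your build's help output):

parpar -s5120000b -r5% -p 50 --memory=2G -o test.par2 *.rar
par2 c -s5120000 -r5 -l -v -m2048 -t8 test.par2 *.rar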

buggsi (Author) commented Mar 30, 2023

--memory=2G for ParPar 0.4.0 actually made it slower, down to the same speed as par2cmdline-turbo.
-m2048 for par2cmdline-turbo was interesting; it opened the files instantly (or more like just listed them!?) and started "Processing" immediately, and this time the CPU threads were being used at 90%+, yet the benchmark finished in exactly the same time (1m14s) as without -m2048. Note that without that parameter, the CPU threads aren't used at full capacity, which brings me back to the notion that par2cmdline-turbo can be more efficient in using the CPU.

animetosho (Owner) commented

Thanks for the testing!

The results are interesting. I had a bit of a look and think it could be related to over-eager processing invocations - I've updated ParPar to address that. Hopefully that reduces/eliminates the unnecessarily high CPU usage.

> which brings me back to the notion that par2cmdline-turbo can be more efficient in using the CPU.

They're both ultimately running the same processing code, so in theory it should be roughly the same.

thezoggy commented Apr 3, 2023

Source is an 1800-day-old NZB of a typical ~4GB movie that is broken.
Let's call it "The.Test.2018.iNTERNAL.1080p.WEB.x264-DAMAGED" ;)
This job needs repair before it can be extracted, so it serves as a typical user test case (although only focusing on par2's tasks).

I gutted sabnzbd 4.x's newsunpacker so I could re-run verification+repair on the same job, record times on various machines, and try to get some good real-world benchmarks. With par we parse out parts of the CLI output so we know when to start/stop timing. I added some logic so I can try out multipar/par2cmdline/par2cmdline-turbo and 32-bit/64-bit variations, but left everything else that sab calls the same. I don't give it any additional arguments; I wanted to see what a pure drop-in replacement would do. Using Python 3.11.2 on each machine.

For the laptops, all machines were fully charged + connected to power, and everything was up to date on updates.
A fresh copy of the job was created each time to reduce caching effects. I waited ~30s after job folder creation to let CPU/disk thermals and activity cool down.

[Screenshots: benchmark timing results for each machine tested]

animetosho added a commit that referenced this issue Apr 3, 2023
animetosho (Owner) commented

Thanks for the benchmarks!

1f604 commented Apr 4, 2023

I did a quick benchmark comparing parchive/par2cmdline, animetosho/par2cmdline-turbo, klauspost/reedsolomon, and catid/wirehair.

=== Test 1: 1 GB file with 1100 shards (1000 file shards, 100 recovery shards) ===

  • par2cmdline: 30 seconds, 0.1 GB memory usage
  • par2cmdline-turbo: 5 seconds, 0.12 GB memory usage
  • wirehair: 4 seconds, 2.1 GB memory usage
  • reedsolomon: 4 seconds, 1.5 GB memory usage

The results of the test above show that par2cmdline-turbo is the clear winner once performance and memory usage are considered together.

The memory usage of wirehair is due to the storage of the file in memory as well as the creation of the encoder. The 1GB file itself takes up 1GB of memory, and then the encoder also takes up another 1GB, for a total of 2GB.

Note: klauspost/reedsolomon does support progressive encoding, but only for the Regular code (which supports up to 256 shards) not the Leopard code (which supports up to 65536 shards). This means that if you have more than 256 shards then you cannot do progressive encoding. As for wirehair, the author said that there is no way to do progressive encoding.

=== Test 2: 1 GB file with 220 shards (200 file shards, 20 recovery shards) ===

Using progressive encoding only (so no wirehair).

  • par2cmdline: 14 seconds, 0.12 GB memory usage
  • par2cmdline-turbo: 5 seconds, 0.15 GB memory usage
  • klauspost/reedsolomon: 10 seconds, 0.12 GB memory usage

Here I used progressive encoding for Klaus Post's implementation so it used less memory.

In this scenario it seems that par2cmdline is only a bit slower than reedsolomon while using the same amount of RAM.

It seems that while in progressive encoding mode klauspost/reedsolomon uses the same amount of RAM as par2cmdline, it is also much slower than in its non-progressive mode, to the point where it is not much faster than par2cmdline.

Also wow, par2cmdline-turbo is so much faster than par2cmdline it's crazy! It's like 3 times faster! It does use slightly more RAM than par2cmdline? Hard to tell with my by-the-eyeball methods.

Reposted on my blog: https://1f604.blogspot.com/2023/04/comparison-of-par2-reedsolomon-and.html

Edit: Looking back at the results, it surprises me that going from 20 to 100 recovery shards didn't change the time taken at all. I always thought the time taken was proportional to the number of recovery shards?

buggsi closed this as completed Apr 5, 2023
animetosho (Owner) commented

Thanks for the benchmark + results!

> not the Leopard code (which supports up to 65536 shards). [...] As for wirehair, the author said that there is no way to do progressive encoding.

It's interesting that the O(n log n) algorithms don't allow that - I've always assumed they'd just scale better than O(n^2) algorithms like PAR2's, but I guess that could be a drawback of the approach.

> It seems that while in progressive encoding mode klauspost/reedsolomon uses the same amount of RAM as par2cmdline, it is also much slower than in its non-progressive mode, to the point where it is not much faster than par2cmdline.

Does it provide file handling functions? I thought that and Wirehair were just encoding libraries, so it may be worth publishing usable test code just in case someone finds an issue there.

Threading may also be a factor - I don't know whether the libraries handle it, whilst par2cmdline does implement multi-threading.

> It does use slightly more RAM than par2cmdline?

It does indeed use more RAM.

par2cmdline's RAM usage is largely the size of the recovery data, plus a block/shard sized read/write buffer.
par2cmdline-turbo changes the latter to 4 buffers to enable reading and processing to occur in parallel (4 is an arbitrary number I picked randomly without testing; 2 might work fine and save some memory).

The ParPar backend also has a 'staging area' where the data is shuffled around to make processing more optimal. The size of this varies, but currently targets around 24 blocks/shards. This figure could probably use some tuning as well.

Note that par2cmdline's memory usage can be adjusted with the -m flag. I don't know about the other implementations.
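
As a rough sanity check against your Test 1 (my arithmetic, using the defaults described above): 1GB over 1000 data shards is ~1MiB per shard, so 100 recovery shards hold ~100MiB, plus 4 x ~1MiB read/write buffers and a ~24-shard (~24MiB) staging area, i.e. roughly 130MiB in total, which lines up with the ~0.12GB you observed.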

> I always thought the time taken was proportional to the number of recovery shards?

The Reed-Solomon computation is O(input_size * number_of_recovery_shards), so your thought is correct.
(the GF size (8-bit for klauspost/reedsolomon "Regular code", 16-bit for PAR2) also plays a part, with larger GF being slower to compute)
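
To make that shape concrete, here's a deliberately naive sketch of the core loop (not ParPar's actual kernels, which are JIT-generated SIMD routines, and using GF(2^8) instead of PAR2's GF(2^16) to keep it short):

package rsdemo

// gfMul multiplies two GF(2^8) elements modulo the polynomial 0x11d.
// (PAR2 itself works in GF(2^16) with polynomial 0x1100B.)
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		hi := a & 0x80
		a <<= 1
		if hi != 0 {
			a ^= 0x1d
		}
		b >>= 1
	}
	return p
}

// encode builds numRecovery recovery shards from the input shards.
// coeff stands in for the per-(input, recovery) matrix coefficient.
// Every input byte feeds every recovery shard, which is where the
// O(input_size * number_of_recovery_shards) cost comes from.
func encode(input [][]byte, numRecovery int, coeff func(in, rec int) byte) [][]byte {
	recovery := make([][]byte, numRecovery)
	for r := range recovery {
		recovery[r] = make([]byte, len(input[0]))
	}
	for i, shard := range input {
		for r := 0; r < numRecovery; r++ {
			c := coeff(i, r)
			for off, v := range shard {
				recovery[r][off] ^= gfMul(c, v)
			}
		}
	}
	return recovery
}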

But real world speed may not follow that strictly, particularly for a low number of recovery shards:

  • Various baseline overheads, e.g. reading files, setting stuff up
  • PAR2 also needs to compute 2x MD5 + CRC32 hash, regardless of how much recovery you generate (I don't think the other libraries include hashing)
  • ParPar is somewhat more tuned to higher number of shards, so expends effort to do stuff like shuffle data around for more optimal processing (regardless of how much recovery is generated)
  • CPU caches can be more efficient with more recovery shards
  • 1GB in 5s is ~200MB/s - if your file is from a hard disk, you could be bottlenecked on its read speed

You might be able to see more of a difference if you go for more recovery shards.

1f604 commented Apr 5, 2023

Thanks @animetosho! I ran par2cmdline-turbo with a range of shard counts and found that the running time grows faster than linearly (all done with the 1GB file and 10% recovery):

100 shards: 3 seconds
200 shards: 3.2 seconds
1,000 shards: 4 seconds
1,500 shards: 5 seconds
2,000 shards: 7 seconds
5,000 shards: 24 seconds
10,000 shards: 85 seconds
20,000 shards: 179 seconds

So below, say, 1,000 shards, the running time seems to be dominated by some kind of constant-time overhead: going from 100 to 1,000 shards only grows the time from 3 seconds to 4 seconds. Maybe some of that is due to my test system having slow IO.

> Does it provide file handling functions? I thought that and Wirehair were just encoding libraries, so it may be worth publishing usable test code just in case someone finds an issue there.

Here's the code I used for testing Klaus Post's reedsolomon progressive encoding; it is basically just a modified version of simple-encoder.go in the examples directory (you can run it by replacing the contents of simple-encoder.go with the code below):

//go:build ignore
// +build ignore

// Copyright 2015, Klaus Post, see LICENSE for details.
//
// Simple encoder example
//
// The encoder encodes a single file into a number of shards
// To reverse the process see "simpledecoder.go"
//
// To build an executable use:
//
// go build simple-encoder.go
//
// Simple Encoder/Decoder Shortcomings:
// * If the file size of the input isn't divisible by the number of data shards
//   the output will contain extra zeroes
//
// * If the shard numbers aren't the same for the decoder as for the
//   encoder, invalid output will be generated.
//
// * If values have changed in a shard, it cannot be reconstructed.
//
// * If two shards have been swapped, reconstruction will always fail.
//   You need to supply the shards in the same order as they were given to you.
//
// The solution for this is to save a metadata file containing:
//
// * File size.
// * The number of data/parity shards.
// * HASH of each shard.
// * Order of the shards.
//
// If you save these properties, you should be able to detect file corruption
// in a shard and be able to reconstruct your data if you have the needed number of shards left.

package main

import (
	"bufio"
	"flag"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"time"

	"github.com/klauspost/reedsolomon"
)

var dataShards = flag.Int("data", 200, "Number of shards to split the data into, must be below 257.")
var parShards = flag.Int("par", 20, "Number of parity shards")
var outDir = flag.String("out", "", "Alternative output directory")

func init() {
	flag.Usage = func() {
		fmt.Fprintf(os.Stderr, "Usage of %s:\n", os.Args[0])
		fmt.Fprintf(os.Stderr, "  simple-encoder [-flags] filename.ext\n\n")
		fmt.Fprintf(os.Stderr, "Valid flags:\n")
		flag.PrintDefaults()
	}
}

func main() {
	start := time.Now()
	// Parse command line parameters.
	flag.Parse()
	args := flag.Args()
	if len(args) != 1 {
		fmt.Fprintf(os.Stderr, "Error: No input filename given\n")
		flag.Usage()
		os.Exit(1)
	}
	fname := args[0]

	// Create encoding matrix.
	enc, err := reedsolomon.New(*dataShards, *parShards)
	checkErr(err)
	fmt.Println("Finished creating reedsolomon encoding matrix.")

	// Open the file for reading
	f, err := os.Open(fname)
	checkErr(err)
	defer f.Close()

	dir, file := filepath.Split(fname)
	if *outDir != "" {
		dir = *outDir
	}
	// calculate the shard size
	fi, err := f.Stat()
	checkErr(err) // abort if we couldn't stat the file
	fmt.Printf("The file is %d bytes long\n", fi.Size())

	filesize := fi.Size()
	if filesize%int64(*dataShards) != 0 {
		panic("file size must be a multiple of the data shard count")
	}
	shardsize := filesize / int64(*dataShards)
	println("shard size:", shardsize)

	// create the parity shard buffer
	parity := make([][]byte, *parShards)
	for i := range parity {
		parity[i] = make([]byte, shardsize)
	}

	// create the parity shards
	buf := make([]byte, shardsize)
	index := 0
	r := bufio.NewReader(f)
	elapsed := time.Since(start)
	fmt.Println("time elapsed until starting to encode file:", elapsed)
	for {
		start := time.Now()

		// read in a full shard; io.ReadFull guards against short reads
		_, read_err := io.ReadFull(r, buf)
		if read_err != nil { // io.EOF once the whole file is consumed
			println("all done!")
			break
		}
		// write it out into a file
		outfn := fmt.Sprintf("%s.%d", file, index)
		err = os.WriteFile(filepath.Join(dir, outfn), buf, 0644)
		checkErr(err)

		elapsed := time.Since(start)
		fmt.Println("reading + writing time:", elapsed)
		start = time.Now()

		// encode the chunk
		err = enc.EncodeIdx(buf, index, parity)
		checkErr(err)
		index++

		elapsed = time.Since(start)
		fmt.Println("encoding time for shard", index, ":", elapsed)
	}
	start = time.Now()

	// Write out parity shards into files.
	for i, shard := range parity {
		outfn := fmt.Sprintf("%s.%d", file, *dataShards+i)

		fmt.Println("Writing to", outfn)
		err = os.WriteFile(filepath.Join(dir, outfn), shard, 0644)
		checkErr(err)
	}

	elapsed = time.Since(start)
	fmt.Println("time taken to write out parity files:", elapsed)

}

func checkErr(err error) {
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error: %s", err.Error())
		os.Exit(2)
	}
}

Output:

reedsolomon/examples$ time go run simple-encoder.go ~/Downloads/1GB.zip 
Finished creating reedsolomon encoding matrix.
The file is 1073741800 bytes long
shard size: 5368709
time elapsed until starting to encode file: 65.674005ms
reading + writing time: 8.242166ms
encoding time for shard 1 : 133.489673ms
reading + writing time: 8.663781ms
encoding time for shard 2 : 32.475082ms
reading + writing time: 11.068881ms
encoding time for shard 3 : 39.272467ms
reading + writing time: 15.424516ms
encoding time for shard 4 : 33.573678ms
reading + writing time: 19.12296ms
encoding time for shard 5 : 39.149728ms
reading + writing time: 18.893182ms
encoding time for shard 6 : 38.416607ms
reading + writing time: 10.466169ms
... lines omitted for brevity ...
encoding time for shard 195 : 39.498974ms
reading + writing time: 11.668383ms
encoding time for shard 196 : 35.514783ms
reading + writing time: 17.014156ms
encoding time for shard 197 : 42.856191ms
reading + writing time: 12.128048ms
encoding time for shard 198 : 47.070398ms
reading + writing time: 7.328568ms
encoding time for shard 199 : 39.252697ms
reading + writing time: 8.452414ms
encoding time for shard 200 : 42.781552ms
all done!
time taken to write out parity files: 133.70886ms
total time spent on encoding: 8.088607059s

real    0m11.404s
user    0m7.337s
sys     0m2.276s

animetosho (Owner) commented

Actually, with a 1GB source size, you'll start running into an issue with shards being too small. Internally, shards are broken into chunks and distributed across threads, but if the shard is too small, it's going to be limited in how many threads it can use.
par2cmdline's I/O is also tied to shard size, so making that too small will also decrease I/O efficiency.
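(For instance, 1GB split into 20,000 shards leaves only ~50KiB per shard, so there's very little work to hand to each thread.)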

I know little about Go, so I can't really critique your code (the suggestion was mostly made for your blog post, as it could make the results more transparent).
I did notice that you're copying all the input data to files (the "write it out into a file" part), which is something that PAR2 doesn't do.

thezoggy commented Apr 5, 2023

Just to add: when I tried the -m### option with par2cmdline-turbo, I saw performance decrease.
I just figured the app's defaults are better than whatever memory size I tried to give it.

animetosho (Owner) commented Apr 5, 2023

Are you setting it to something larger than the default? In theory, larger values shouldn't decrease performance, though you may be experiencing the same issue as @buggsi.
There's nothing particularly special or intelligent about the default.

thezoggy commented Apr 6, 2023

> Are you setting it to something larger than the default? In theory, larger values shouldn't decrease performance, though you may be experiencing the same issue as @buggsi. There's nothing particularly special or intelligent about the default.

Is there a way to see the default / what it actually uses? The machine I was testing on has limited RAM (4GB), so I tried setting 2G and 3G, but neither helped compared to the default.

animetosho (Owner) commented Apr 6, 2023

Add the -vv option and it'll print out the limit chosen.

Default policy is half of RAM (or 128MB if it couldn't figure out the amount of RAM available).

The memory limit only affects the amount of recovery data that can be held in memory - for create, that's the total amount of recovery being created, whereas for repair, it's the size of the damaged blocks.
Picking a limit above these amounts makes no difference. In your case, it's possible that your chosen memory limit isn't actually affecting anything.
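
For example, taking the create command from earlier in this thread and swapping -v for -vv:

par2 c -s5120000 -r5 -l -vv test.par2 *.rar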
