
CHD improved compression algorithm #7402

Closed
HashTang opened this issue Oct 27, 2020 · 25 comments
Labels: needs design (implementation details needs to be properly addressed), tools

Comments

@HashTang

HashTang commented Oct 27, 2020

CHD is a streamable compression format for backup images developed by the MAME project.
It's getting traction way beyond MAME, and is quickly becoming the de facto compressed format for the entire emulation community.

CHD cuts the image into smaller blocks, which are compressed individually.
There are, as far as I know, 3 compression algorithms which are accessible: lzma (cdlz), zlib (cdzl), and flac (cdfl).
flac is specific to audio data, while lzma and zlib are general-purpose lossless codecs.
lzma compresses better, but is also slower to decompress.
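As a minimal sketch of the per-block idea (illustrative only, not MAME's actual code): compress each block independently, and fall back to storing it raw when compression doesn't help, much as CHD decides per hunk.

#include <zlib.h>

#include <cstdint>
#include <vector>

// Compress one block on its own; if zlib can't shrink it, keep it
// uncompressed, so a block never grows.
std::vector<uint8_t> compress_block(const std::vector<uint8_t> &block)
{
    uLongf dst_len = compressBound(block.size());
    std::vector<uint8_t> dst(dst_len);
    int rc = compress2(dst.data(), &dst_len, block.data(), block.size(), Z_BEST_COMPRESSION);
    if (rc != Z_OK || dst_len >= block.size())
        return block;  // store the block as-is
    dst.resize(dst_len);
    return dst;
}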

CHD utilities like chdman seem to highly value space, and therefore tend to lean towards lzma. That's fine if one disregards decompression speed, but on many devices, notably ARM smartphones and tablets, decompression speed is a real bottleneck. This basically leaves zlib as the only alternative.

More recently, a new compression algorithm, zstd, has started to become mainstream. Granted, it has been especially successful in the data center and server space, but not only there. It's documented as part of the updated zip and 7-Zip format specifications, has a public IETF RFC (RFC 8878), and can even be used to compress web traffic.

What makes zstd interesting with regard to CHD?
To begin with, it has excellent decompression speed: roughly 16x faster than lzma, and 3-4x faster than zlib.
One would expect to pay for this speed in compression ratio, but that's effectively not the case: at its highest setting, zstd compresses to within a few percent of lzma, with small to negligible differences.
At the very least, compared to zlib, it's an all-around win, and a substantial one.
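To make the API side of this concrete, here's a self-contained round trip with libzstd (a sketch, with error handling trimmed to the essentials). Note that decompression speed is largely independent of the level used to compress:

#include <zstd.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

// Compress at zstd's maximum level, then decompress and return the result.
std::vector<uint8_t> zstd_roundtrip(const std::vector<uint8_t> &src)
{
    std::vector<uint8_t> comp(ZSTD_compressBound(src.size()));
    size_t csize = ZSTD_compress(comp.data(), comp.size(), src.data(), src.size(), ZSTD_maxCLevel());
    if (ZSTD_isError(csize))
        throw std::runtime_error(ZSTD_getErrorName(csize));

    std::vector<uint8_t> out(src.size());
    size_t dsize = ZSTD_decompress(out.data(), out.size(), comp.data(), csize);
    if (ZSTD_isError(dsize) || dsize != src.size())
        throw std::runtime_error("zstd round trip failed");
    return out;
}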

For these properties, zstd is already used for transparent compression in file systems such as squashfs, which serves a use case similar to CHD's.

High compression ratio and blazing-fast decompression speed: could that be an interesting evolution of the CHD format?

From a format perspective, it's probably not a huge deal: just an additional tag for the new codec. I presume the current format already uses tags to distinguish between none, zlib, lzma and flac.
From an ecosystem perspective, though, it can be trickier: decoders will have to support the new codec before any encoder can make use of it.
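To illustrate, a codec tag is just four ASCII characters packed into 32 bits. A hypothetical helper (the tag strings below mirror the codec names discussed in this thread, with 'zstd' as the one a new codec would claim):

#include <cstdint>

// Classic FourCC packing: four characters into one 32-bit tag.
constexpr uint32_t make_tag(char a, char b, char c, char d)
{
    return (uint32_t(uint8_t(a)) << 24) | (uint32_t(uint8_t(b)) << 16)
         | (uint32_t(uint8_t(c)) << 8)  |  uint32_t(uint8_t(d));
}

constexpr uint32_t CODEC_ZLIB = make_tag('z', 'l', 'i', 'b');
constexpr uint32_t CODEC_LZMA = make_tag('l', 'z', 'm', 'a');
constexpr uint32_t CODEC_FLAC = make_tag('f', 'l', 'a', 'c');
constexpr uint32_t CODEC_ZSTD = make_tag('z', 's', 't', 'd');  // the hypothetical new tag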

@cuavas
Member

cuavas commented Oct 27, 2020

It would also cause compatibility issues if people started distributing CHDs using a new compression algorithm that isn’t supported in the applications that are out there now, and it would add another third-party library as a hard dependency. It’s not such a simple decision.

@Robbbert
Contributor

This appears to be the same as issue #7386.

@rb6502
Contributor

rb6502 commented Oct 27, 2020

There was a plan to use https://github.com/aaru-dps/libaaruformat as the basis for a major CHD feature update around now, and introducing additional compression codecs at that time would've been natural, but Claunia's been too busy to work on that library.

@smf-
Member

smf- commented Oct 27, 2020 via email

@rb6502
Contributor

rb6502 commented Oct 28, 2020

That was the intent - have CHD wrap "original" formats like cue/bin and iso and use libaaruformat to decode the formats, with the CHD layer providing transparent (de)compression. It would greatly boost the number of formats supported, make it so problems with our interpretation of the format didn't require remaking the CHD, and put a specialist in such matters (Claunia) in charge of the format decoding.

@claunia

claunia commented Oct 28, 2020

Hi all,

In the design of the aaruformat library I found three problems that will be fixed in the aaruformat v2 design:

  1. Negative sectors: Some media, especially CDs, use them, and supporting them enhances their emulation (while not required).
  2. Memory usage: Currently v1 takes 350MiB of RAM easily, because of LZMA. It is my intention to implement Zstandard in aaruformat v2 so it can be used where memory and speed matter more than size (like on ARM).
  3. Overhead: When imaging big disks (200GiB or higher) the overhead starts to be too big.

The design for V2 was completed in March, when a global pandemic brought everything to a halt.

Because V1 was so intrinsically linked to Aaru (basically being a part of it, not independent), I need to start the move to Aaru 6.0 so I can start implementing V2 in the library and have test images to ensure it works properly.

I'm doing it as fast as I can in my little spare time, as I have not been able to find grants for my work on Aaru that would let me dedicate to it 100%.

@Anuskuss

Anuskuss commented Jan 29, 2021

Apologies if this is not the right place to discuss this, but what is the status of CHDv6? I think it's currently awaiting claunia's Aaru v2, but is there any work being done regardless? Where can I read about progress, the road map, discussions, etc.?
I'm excited about what the future will bring (especially zstd support) because, although I do like CHD, it has some flaws which I hope will be addressed by CHDv6.

  • Proper documentation / a reference implementation: I don't know much about the history of CHD, but I do believe it was created solely as an intermediate format for use in MAME only (that's why you can barely find any information about it). Even in this very issue the OP mentions that there are "3 compression algorithms which are accessible", but there are actually 7 (cdfl, cdlz, cdzl, flac, huff, lzma, zlib). I don't blame them for not knowing that; you basically have to read (and understand) the source code to know about them. But now, with the advent of libchdr and many emulators (including libretro) adopting it, I sure hope more emphasis is put on properly documenting it for other devs.
  • The second biggest (developer) complaint I kept reading was about maintainability (mainly in the new CHD PRs for the PCSX2 and PPSSPP emulators). The developers of these emulators may not be familiar with MAME (and how big the project really is), but I do understand their concerns, because nobody wants to add unknown code and maintain it for the foreseeable future. There are simple things that are troublesome here, like not having an official MAME C(++) library and still using obsolete lzma (which hasn't received an update in years) instead of switching to xz.
  • Another complaint I kept seeing is that the CHD format is a "for piracy" format (which I heavily disagree with), but it's true that most (MAME) CHDs are probably pirated. I would suggest ripping chdman (and the CHD format) out of MAME so it has its own repo, which would make working on it easier (easier management of issues, faster compilation) and could allow for different goals between the projects. This could also allow new features (e.g. new input formats, multisession CDs, SBI support, etc.) to be merged more quickly, as they wouldn't have to rely on the MAME release cycle. Alternatively, you could create a separate repo for a potential libchd and leave chdman in the main repo.
  • Speaking of goals, CHD is/should be about preservation (and you finally achieved that with v5), right? So why are Dreamcast discs stored in the inferior GDI format? I don't want to restart the war, because by now the advantages of the Redump style should be obvious (a more accurate representation of the data on disc, not just what the DC reads; not having the same dump multiple times with the only difference being write offsets; a more familiar CUE+BINs approach; etc.), but I disagree that you guys (and, not to shame anybody, but p1pkin was a loud voice when that decision was made) picked familiarity/compatibility over accuracy/preservation. MAME is a big player and a decision made by you can impact the whole emulation scene; that's why I hope decisions like that will be made with more care for CHDv6.
  • Improve chdman generally: expose more options (like listing the aforementioned encoders), develop a "crush" mode (similar to pngcrush or maxcso), fix outstanding issues (e.g. command-line output in Windows being cut after 74 characters), etc. A separate repo (maybe even for the whole of mame-tools) could help manage that.
  • Lastly (and this is just a personal complaint), chdman currently does not strip the auxiliary data (EDC+ECC) of CDs, which is just lost potential. I learned this just now (because of another discussion); the data can very easily be regenerated, and because it's largely incompressible, we're talking about up to 12% smaller files for the same information (the EDC/ECC area is 288 of every 2352 bytes in a Mode 1 raw sector).

To finally wrap this up, I want to say that I do like claunia's ambitions with Aaru, and I think the aif format is a great concept (with even more emphasis on preservation, e.g. storing the PSX disc wobble), but sadly it hasn't gained any traction. I can't wait to see what a collaboration would look like: the excellent decoding capabilities of Aaru, the familiar, on-the-rise CHD container, and the huge MAME project backing it up. Cheers.

P.S. I'm not demanding anything here, I'm just writing this down so I can get them out of my head.

@lonkelle

lonkelle commented Nov 2, 2022

@Anuskuss Dang, you nailed that write-up. I hope all of that is taken into consideration for V6. 🤞

Personally, I think there's a huge missed compression opportunity in not finding a way to compress 2-4 disc games into a single CHD (compressing two discs into one archive basically makes it the size of a single disc, since so much is reused between discs). Not to mention the benefit for emulators of being able to choose a disc without the user making a manual m3u file to link them.

@claunia

claunia commented Nov 25, 2022

@Anuskuss

The development of AaruFormat V2 is open and you can follow it here, where you can also peep at the official formal specification.
Suggestions there are welcome.

It is going slower than I intended because I'm doing it alone and things in life took a 180-degree turn in 2022.

I have not really made a roadmap, because the specification itself is what we intend to implement, so it works kind of like a roadmap. The only thing V2 will have that is not yet written down is support for Data Position Measurements.

@lonkelle we have that planned for AaruFormat V3, as we need to ensure V2 is working fine before adding such a complex feature.

As for the complaints that emulators are not using AaruFormat: that has never been my target with the format. My target is preservation, and our userbase is quite happy with the format even if no emulator supports it.

I would love for emulators to support it, but that does not depend on me (nor does whether AFV2 becomes CHDv6), and I cannot focus my energy on convincing people to; I just focus on making the format able to preserve any media.

Hope this resolves your doubts. If you want to have specific discussions or ask questions about AFV1 or AFV2, please feel free to drop by the repository linked above.

@mirh

mirh commented May 16, 2023

takes 350MiB of RAM easily, because of LZMA.

That doesn't sound right.
Decompression memory should only depend on the dictionary size you use.

mame/src/lib/util/chdcodec.cpp

Lines 1151 to 1152 in a504bde

LzmaEncProps_Init(&props);
props.level = 8;

If you use a 64MB dictionary, give or take, you shouldn't get anywhere near those numbers (unless you're trying to run more than four parallel streams at once or something).
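For reference, the dictionary is an encoder-side choice; here's a sketch against the LZMA SDK's C API (illustrative; the chdcodec.cpp lines quoted above set only the level):

#include "LzmaEnc.h"  /* LZMA SDK */

/* Cap the dictionary explicitly: the decoder then needs roughly
   dictSize bytes of window plus a small, fixed amount of state. */
static void init_encoder_props(CLzmaEncProps *props)
{
    LzmaEncProps_Init(props);
    props->level = 8;
    props->dictSize = 1 << 26;  /* 64 MiB */
}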

Overhead: When imaging big disks (200GiB or higher) the overhead starts to be too big.

Overhead… as in CPU time required for compression?

like not having an official MAME C(++) library and still using obsolete lzma (which hasn't received an update in years) instead of switching to xz.

*lzma2
xz is the file format that, if anything, packs that codec (not dissimilar to 7z, but without any bells and whistles).
It's odd, though, that this isn't already the case, given that even in 2012 it was already a thing, and it's not like you even need to use another library to switch.

Speaking of which, I'd like to bring attention to a recent discovery I made.
A few years ago the x64 decoder got a noticeable speed bump thanks to some asm voodoo (and recently arm64 did too!)

Lastly (and this is just a personal complaint), chdman currently does not strip the auxiliary data (EDC+ECC) of CDs, which is just lost potential.

I suppose the xbox and wii guys would also like to have a word about that.

p.s. FWIW brotli is also very competitive with zstd

@cuavas
Member

cuavas commented Dec 10, 2023

#11827 adds support for Zstandard compression in CHD files, as well as zip archives. By default, chdman will not enable Zstandard compression, so CHD files will be compatible with existing software.

You can enable Zstandard compression when creating or copying a CHD with the --compression (or -c) option. Remember that chdman prefers higher compression ratios, so if you want Zstandard to be used, you should generally disable LZMA compression.

For CD-ROM media, a good setting to try is --compression cdzs,cdzl,cdfl. For hard disk media, a good setting to try is --compression zstd,zlib,huff,flac. After creating a CHD, you can see statistics on compression algorithms used with chdman info --verbose.
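For example, converting a cue/bin CD image with Zstandard preferred and then checking which codecs were actually used (hypothetical file names):

chdman createcd -i game.cue -o game.chd -c cdzs,cdzl,cdfl
chdman info -i game.chd --verbose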

@cuavas cuavas closed this as completed Dec 10, 2023
@mnadareski

I'm curious, what happens if someone creates a V5 CHD that uses zstd and attempts to use it with another program that supports CHD? Will there be an easy-to-parse error? I'm also a bit curious as to why this didn't end up bumping the CHD version.

@rb6502
Contributor

rb6502 commented Dec 11, 2023

The error handling will depend on what the other program does; we don't control that.

CHDMAN doesn't use Zstandard by default for that reason. The people who wanted it can have it, but we're not forcing people to use it, and we would probably recommend that maintainers of torrent sets or whatever not rush into it.

@cuavas
Member

cuavas commented Dec 11, 2023

I'm curious, what happens if someone creates a V5 CHD that uses zstd and attempts to use it with another program that supports CHD?

If the program is using MAME's chd_file class, opening the CHD file will return chd_file::error::UNKNOWN_COMPRESSION (wrapped as a std::error_condition).

Previous versions of MAME itself will report the image as “not found” when auditing media, as MAME isn’t particularly detailed/friendly when it comes to dealing with invalid, unsupported or corrupt CHD files.

I assume other CHD implementations also return an error on encountering an unsupported codec FourCC on opening a CHD file.

Will there be an easy-to-parse error?

You should always be checking error codes, and there's already an error code that covers this situation.
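For example, with MAME's chd_file class it might look like this (a rough sketch; the exact open() signature here is an assumption, but the comparison works because the error enum is wrapped as std::error_condition as described above):

#include "chd.h"  // MAME's src/lib/util

#include <iostream>
#include <system_error>

int main()
{
    chd_file chd;
    // Assumption: open() takes a filename and returns std::error_condition.
    std::error_condition err = chd.open("game.chd");
    if (err == chd_file::error::UNKNOWN_COMPRESSION)
        std::cerr << "CHD uses a codec this build doesn't support (e.g. Zstandard)\n";
    else if (err)
        std::cerr << "open failed: " << err.message() << '\n';
    return err ? 1 : 0;
}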

I'm also a bit curious as to why this didn't end up bumping the CHD version.

CHD V5 already includes support for adding codecs (in much the same way that the zip file format doesn't need to change when a compression method is added). A V5 CHD file can specify up to four codecs. We don't frequently define new codecs.

As @rb6502 already pointed out, chdman will not enable Zstandard by default when creating CHD files, so compatibility won't be broken unless you choose to use it.

@jumphil

jumphil commented Feb 22, 2024

#11827 adds support for Zstandard compression in CHD files, as well as zip archives. By default, chdman will not enable Zstandard compression, so CHD files will be compatible with existing software.

You can enable Zstandard compression when creating or copying a CHD with the --compression (or -c) option. Remember that chdman prefers higher compression ratios, so if you want Zstandard to be used, you should generally disable LZMA compression.

For CD-ROM media, a good setting to try is --compression cdzs,cdzl,cdfl. For hard disk media, a good setting to try is --compression zstd,zlib,huff,flac. After creating a CHD, you can see statistics on compression algorithms used with chdman info --verbose.

@cuavas Great work! Does this use zstd compression level 19?

@cuavas
Member

cuavas commented Feb 22, 2024

@cuavas Great work! Does this use zstd compression level 19?

Well, you could read the source and see…

It uses whatever ZSTD_maxCLevel() returns.

Since compression isn’t done on-the-fly and decompression speed is largely insensitive to the compression level, it favours higher compression.

@Anuskuss

I'm not salty that none of my suggestions were acknowledged, but has at least my last point been taken care of? I remember that I used to zero out the error correction, which resulted in better compression. Reed–Solomon EDC/ECC could easily be regenerated when extracting, so there's no point in storing that information.

@rb6502
Contributor

rb6502 commented Apr 26, 2024

If you want to change the world, you have to submit pull requests. WindyFairy just added multi-session disc support to CHD, for instance. I'm not super enthusiastic about throwing away data, though, on the off chance that it's important for protected discs or something along those lines.

@Anuskuss

C is sadly above my weight class, but even so, I don't think it's that easy to implement. I mean, the Reed–Solomon part can be copy-pasted, but then you'd have to handle the case where it's deliberately wrong (like you said, maybe for copy-protection detection or something) and then store that somewhere. The end goal would be to get rid of everything that's not user data and only store what differs from the default case (e.g. the submode). Then you could store a MODE2/2352 image with a 2048-byte block size, giving you maximum compression.

@p1pkin
Member

p1pkin commented Apr 26, 2024

I'm not salty that none of my suggestions were acknowledged, but has at least my last point been taken care of? I remember that I used to zero out the error correction, which resulted in better compression. Reed–Solomon EDC/ECC could easily be regenerated when extracting, so there's no point in storing that information.

errm, but ECC/EDC is already zeroed before compression and regenerated during decompression, isn't it?

@Anuskuss

errm, but ECC/EDC is already zeroed before compression and regenerated during decompression, isn't it?

Nah.

$ du -m *.chd
285	noecc.chd
298	normal.chd

# noecc.bin was made from normal.bin by zeroing the last 280 bytes of
# every 2352-byte raw sector before converting both to CHD:
with open('/tmp/normal.bin', 'rb') as o, open('/tmp/noecc.bin', 'wb') as n:
  while b := o.read(2352):
    n.write(b[:-280])   # keep the first 2072 bytes of the sector
    n.write(b'\0'*280)  # zero out the rest

@p1pkin
Member

p1pkin commented Apr 27, 2024

errm, but ECC/EDC is already zeroed before compression and regenerated during decompression, isn't it?

Nah.

Yeah

// clear out ECC data if we can

You can see the code, which does clear the header and ECC if they are "standard".
So can you please stop making statements based on nothing and wasting our time?

@robzorua

robzorua commented Apr 27, 2024

Anuskuss and p1pkin, you guys are both right: it only works partially, so it's incomplete.

nocash documented it.

http://problemkaputt.de/psxspx-cdrom-disk-images-chd-mame.htm

"The ECC-Filter works only for 930h-byte sectors (920h does also contain ECC, but CHD can't filter that, resulting in very bad compression ratio)".

@p1pkin
Member

p1pkin commented Apr 27, 2024

@robzorua I'm not sure that's the problem; actual raw CD sectors are always 2352 (930h) bytes long (plus subcodes), and there is no such thing as a 2336 (920h) byte sector.

@cuavas
Member

cuavas commented Apr 27, 2024

They’re referring to the CD-ROM XA Mode 2, Form 2. It uses the following pattern:

  • 12 sync bytes
  • 3 address bytes
  • 1 mode byte
  • 8 subheader bytes
  • 2324 data bytes
  • 4 CRC bytes

The total size is 12+3+1+8+2324+4 = 2352 bytes per sector. If you assume standard mastering and no data errors you can reconstruct the sync pattern and CRC. That leaves 3+1+8+2324 = 2336 meaningful bytes. Most mastering software supports supplying data files with 2336 bytes per sector for CD-ROM XA Mode 2 tracks.
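The same layout as a struct of byte arrays (sizes straight from the list above; with only byte-sized members there is no padding, so the size works out exactly):

#include <cstdint>

// Raw CD-ROM XA Mode 2, Form 2 sector.
struct xa_mode2_form2_sector
{
    std::uint8_t sync[12];      // sync pattern
    std::uint8_t address[3];    // minute / second / frame
    std::uint8_t mode;          // 2 for XA tracks
    std::uint8_t subheader[8];  // two copies of file/channel/submode/coding
    std::uint8_t data[2324];    // user data
    std::uint8_t edc[4];        // CRC over subheader + data
};
static_assert(sizeof(xa_mode2_form2_sector) == 2352, "raw sector is 2352 bytes");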

However “920h does also contain ECC” is just plain incorrect. The whole point of Mode 2, Form 2 is that it omits the extra in-band ECC data to allow 276 more data bytes per sector. You trade redundancy and error tolerance for space and speed.

But this is getting way off-topic. We aren’t talking about changing the way data is stored inside CHD files here. The issue was just requesting support for Zstandard compression in CHD files, which has been implemented.

@mamedev mamedev locked as off-topic and limited conversation to collaborators Apr 27, 2024