Support different ZFP stream word sizes #133

Open
markcmiller86 opened this issue Jan 17, 2024 · 17 comments

@markcmiller86
Member

Currently, H5Z-ZFP requires that ZFP be configured with 8-bit stream words. It is an outright error (not just a silent skip) to attempt to use H5Z-ZFP if the ZFP library is configured otherwise. This makes it impossible to overlook cases where the ZFP library has an unexpected stream word size.

But ZFP can actually be configured with 16-, 32-, and 64-bit word sizes as well. Larger word sizes mean faster compression/decompression, which is important for in-memory ZFP arrays. Why can't we support these various word sizes in H5Z-ZFP? Well, the answer is that we can, but what do we do when the resulting data is read in a cross-endian context?

If we store additional information in the dataset header about the ZFP stream word size used, we can detect the combination of >8-bit stream word size AND a cross-endian context and fail the read. It may also be that an 8-bit-configured ZFP on the read half of the operation can work with any ZFP stream word size on the write half. So maybe we only need to make sure that when reading ZFP-compressed data, the ZFP library is configured with an 8-bit stream word size.

@lindstro identified a situation in which the current filter behavior of requiring an 8-bit stream word size is helpful: ZFP is installed at a facility, apps use that install, and then someone accidentally installs ZFP with a non-8-bit stream word size. Currently, the mistake would be quite obvious because data writes would fail. If, however, we make the filter silently skip ZFP compression in that case, that could present serious issues. It's just a use case to keep on our radar and to be sure we don't somehow wind up increasing the likelihood of this outcome.

@markcmiller86
Member Author

If we store an additional bit of info in the H5Z-ZFP header, then I think we should design things to require 8-bit streams on writes by default but allow the user to override this default and permit non-8-bit stream words on writes. This would mean adding a property to the properties interface and utilizing a currently unused bit in the generic interface. By default, H5Z-ZFP would behave as it currently does, erroring in H5Dcreate() if ZFP was configured for something other than 8-bit streams.

But we can add the ability for users to disable this behavior, allowing any stream word size.
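A minimal sketch of the proposed default-plus-override policy, assuming a hypothetical user-settable flag (allow_any_word_size); this is not an existing H5Z-ZFP property. stream_word_bits is zfp's configured word size in bits.

```c
#include "zfp.h"   /* stream_word_bits from zfp's bitstream interface */

/* Write-side policy sketch: keep today's hard error for non-8-bit words
 * unless the user has explicitly opted in. */
static int write_word_size_ok(int allow_any_word_size)
{
  if (stream_word_bits == 8)
    return 1;                  /* current behavior: 8-bit streams always allowed */
  return allow_any_word_size;  /* otherwise require the explicit opt-in */
}
```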

On the read end, it is probably best for the ZFP library to always be configured for 8-bit streams. That way, it will read correctly, I think, regardless of the stream word size used in the writer.

We need to understand what will happen, and how we'll detect the situation, when we read non-8-bit-stream compressed data with a non-8-bit-stream reader. Do we know? Will it just sort of work most of the time (e.g., little endian) and fail only in cross-endian situations, or only on big endian, or both?

@lindstro
Member

Maybe I'm being confused by HDF5 lingo, but requiring 8-bit writes while allowing non-8-bit writes seems contradictory.

I would advocate for encoding which word size was used in the cd_vals metadata. When reading, we can decide if zfp has been configured to support that word size or not. If not, we error.

It is not true in general that an 8-bit reader can correctly process streams written as N-bit words, with N > 8. This works only on little endian machines, and there are still issues with alignment. I would propose that H5Z-ZFP perform explicit 64-bit alignment by padding the end of the stream after zfp_compress(), e.g., using stream_wtell() and stream_pad(). This way, a stream written on a little-endian machine is independent of word size. Such padding also should not break old H5Z-ZFP readers (but we should verify that).
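A minimal sketch (not current filter code) of the padding idea: after zfp_compress() in the write-side filter, append zero-bits so the stream ends on a 64-bit boundary, using zfp's bitstream API (stream_wtell(), stream_pad()) mentioned above. Buffer-management details in H5Z-ZFP are omitted.

```c
#include "zfp.h"  /* zfp_compress() and the bitstream API */

/* Compress, then pad the stream to a 64-bit boundary; returns the padded
 * byte count to report to HDF5, or 0 on compression failure. */
static size_t compress_and_pad(zfp_stream* zfp, const zfp_field* field)
{
  size_t zfpsize = zfp_compress(zfp, field);        /* compressed bytes so far */
  if (zfpsize) {
    bitstream* stream = zfp_stream_bit_stream(zfp); /* stream zfp wrote into */
    size_t bits = stream_wtell(stream);             /* bit offset after compression */
    size_t pad  = (64 - bits % 64) % 64;            /* zero-bits needed for alignment */
    if (pad)
      stream_pad(stream, pad);                      /* append the padding bits */
    zfpsize = (bits + pad) / 8;                     /* padded size handed back to HDF5 */
  }
  return zfpsize;
}
```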

On big-endian machines, we cannot mix word sizes. The H5Z-ZFP reader should just fail if libzfp was built with a different word size, unless we have the opportunity to manually byte-swap the compressed data first.

@markcmiller86
Member Author

I think we do have the ability to byte-swap data as desired before (or after) ZFP operates on it.

Would it make sense to always have the filter deliver (for compression on write) little-endian ZFP data, regardless of the host's native endianness? That way, we'd always be living in a world where an 8-bit reader can correctly process N-bit word streams with N > 8. We'd just have to do some additional endian gymnastics when running on big-endian systems.
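A hedged sketch (not existing H5Z-ZFP code) of those gymnastics: on a big-endian host, byte-swap every stream word in place so the stored stream is always little-endian. It assumes the buffer length is a whole number of words; swap_stream_words is an illustrative helper name.

```c
#include <stddef.h>

/* Reverse the bytes of each word_bytes-sized stream word in buf. */
static void swap_stream_words(void* buf, size_t nbytes, size_t word_bytes)
{
  unsigned char* p = (unsigned char*)buf;
  size_t i, j;
  for (i = 0; i + word_bytes <= nbytes; i += word_bytes)   /* one word at a time */
    for (j = 0; j < word_bytes / 2; j++) {                 /* reverse its bytes */
      unsigned char t = p[i + j];
      p[i + j] = p[i + word_bytes - 1 - j];
      p[i + word_bytes - 1 - j] = t;
    }
}
/* Called after zfp_compress() on write and before zfp_decompress() on read,
 * but only when the host is big-endian and word_bytes > 1. */
```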

@lindstro
Member

This seems like a reasonable approach, in particular since big-endian machines are quickly falling out of favor. But let me ask this first: is there any situation where H5Z-ZFP might call zfp_compress() more than once per HDF5 dataset? This would result in incompatible alignment if word sizes differ. One might use this feature to write, say, a vector field as multiple back-to-back scalar fields. Skimming the filter code, this appears not to be an issue.

I would propose that we add a 2-bit code, n, to cd_values that indicates that the word size is 2^n. Is there perhaps an existing pair of bits in the H5Z-ZFP header that is already zero, such that we wouldn't have to change the header format but would still decompress existing and new files correctly? It seems, for example, that cd_values[1] is currently unused. If it's always zero, then we could use two of its bits toward this purpose.

The only change going forward is that we would add some padding to the compressed stream on write. On read, we would ideally check the number of compressed bytes processed to make sure it matches expectations, and we'd then have to be a little careful when dealing with mismatched word sizes between the written file and the compiled filter. It appears that the filter currently checks only if the return value is zero, indicating failure.

@markcmiller86
Member Author

markcmiller86 commented Jan 21, 2024

But let me ask this first: is there any situation where H5Z-ZFP might call zfp_compress() more than once per HDF5 dataset?

Hmm... remember, HDF5 compresses individual blocks (i.e., chunks), so strictly speaking I think you are asking whether it would call zfp_compress() more than once on a block. The answer is YES, FOR SURE, but maybe not in the specific way I think you are asking. It is never the case that, for a given block of data passing through the filter, it would zfp_compress() some of it (say the first quarter) and then call zfp_compress() again for the next part (e.g., the second quarter), concatenating the resulting bytes with the first quarter, etc., building up the full byte sequence returned to HDF5 from the filter for that block of data. I mean, I guess I could have coded it to behave that way, but I didn't. Once we're in the filter, we zfp_compress() everything we've been given... a whole block of data.

However, with partial I/O, you can have a situation where a block is only partially written (the remaining parts of the block are treated as a fill value) and then a later H5Dwrite() call writes more of the data in that block. In that case, the block is decompressed first and then wholly re-compressed by zfp_compress() a second time to arrive at a new block of data stored in the file. This is the read-modify-write aspect of partial I/O during writes. In addition, I suppose a caller could wind up wholly re-writing data in a dataset, causing any existing blocks of data in the file to be replaced by new ones.

Hmmm... now that I hear myself describing that, something just occurred to me having to do with the possibility of compounding loss with lossy compression. If I write a partial block, the whole block (partial data + fill) gets zfp_compress()'ed and stored. A later write that either overwrites that block or writes to the parts of it never written in the past has to start from the decompressed version of it. That means the parts of the block that get written and re-written get compressed/decompressed multiple times. Can ZFP losses compound under those conditions? I mean, can the result drift... I don't think so, but I thought I should ask.

@vasole
Contributor

vasole commented Jan 22, 2024

Can ZFP losses compound under those conditions? I mean, can the result drift...I don't think so but I thought I should ask.

To me it is clear that any lossy compression may have issues when incomplete chunks are allowed to be written to the HDF5 file, and therefore chunks should be written only once.

"Write once, read many" is responsibility of the ZFP user, not the ZFP developers. The HDF5 chunk cache size must be set big enough to prevent uncontrolled flushing when using lossy compression filters.

@lindstro
Member

Can ZFP losses compound under those conditions? I mean, can the result drift...I don't think so but I thought I should ask.

Yes, losses can certainly compound. With some arbitrary fill value (instead of zfp's specialized padding), compression accuracy will in general be negatively impacted in zfp's fixed-rate mode. Those errors will then persist (unless they're canceled by pure luck) over subsequent re-compression calls. Moreover, those errors could also hurt decorrelation and then degrade compression once a block is filled. I don't have a good sense of how severe this problem is, however.

@markcmiller86
Member Author

I will have to inquire with the HDF Group but I would think it should be possible, maybe, for H5Z-ZFP to use ZFP's "specialized padding" in most circumstances.

Also, I may be confused by the word "persist", but I've always thought that once ZFP has compressed some data, there is some loss that can never be recovered, and so in that sense any ZFP compression results in some persistent errors. The question I had is whether they may grow with subsequent ZFP compression calls.

@lindstro
Member

The point I was making about persistent errors is exactly the one you make. In other words, compressing half a block padded with fill values will introduce some irrevocable loss. When you are later given the rest of the block and re-compress, it's not the same as if you compressed the whole block only once, as you've now lost information, and the errors that were introduced in the first compression step persist.

There's also a second form of error that results from injecting "noise" in the data during the first round of compression that could then hurt zfp's "predictor" (really, a transform) and cause compression vs. accuracy to suffer in the second round of compression.

To illustrate this point, consider a simpler compressor that predicts and represents linear functions perfectly. Suppose we have a block of data (2, 4, 6, 8). The predictor would result in a perfect fit to this data. But suppose that we're initially given only a partial block (2, 4, *, *), where * denotes a not-yet transmitted value that is replaced by a fill value of 0. The best linear fit (in the least-squares sense) to this padded block (2, 4, 0, 0) is then (3, 2, 1, 0). When we later receive the last two samples, we're asked to compress (3, 2, 6, 8) instead of (2, 4, 6, 8), because of lossy compression in the first step. The best linear fit to this modified block is then (1.9, 3.8, 5.7, 7.6), even though the original block could be represented exactly. We then have to spend precious additional bits if we want to (perhaps partially) correct this error. Worse yet, we have to make large corrections of values (1.9, 3.8) to (3, 2) that have already been contaminated with error, and those are relatively more costly than the smaller corrections of values (5.7, 7.6) to (6, 8).

A similar issue occurs with zfp when a block is compressed in multiple stages.
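A small standalone check of the arithmetic in the example above, assuming the four samples sit at positions x = 0, 1, 2, 3 (an assumption; the source doesn't state the sample positions).

```c
#include <stdio.h>

/* Simple least-squares line over x = 0..3: xbar = 1.5, sum((x-xbar)^2) = 5. */
static void best_linear_fit(const double y[4], double fit[4])
{
  double ybar = (y[0] + y[1] + y[2] + y[3]) / 4;
  double sxy = 0;
  int i;
  for (i = 0; i < 4; i++)
    sxy += (i - 1.5) * (y[i] - ybar);
  for (i = 0; i < 4; i++)
    fit[i] = ybar + (sxy / 5) * (i - 1.5);
}

int main(void)
{
  double padded[4]   = {2, 4, 0, 0};   /* partial block with fill values */
  double modified[4] = {3, 2, 6, 8};   /* lossy first pass + late samples */
  double f[4];
  best_linear_fit(padded, f);
  printf("%g %g %g %g\n", f[0], f[1], f[2], f[3]);   /* prints 3 2 1 0 */
  best_linear_fit(modified, f);
  printf("%g %g %g %g\n", f[0], f[1], f[2], f[3]);   /* prints 1.9 3.8 5.7 7.6 */
  return 0;
}
```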

@markcmiller86
Member Author

@lindstro...ok thanks for that detailed explanation 💪🏻

I agree with @vasole that maybe we should include some advice on this in the H5Z-ZFP docs... perhaps: avoid combining partial I/O with lossy compression. @vasole, do you happen to have any other refs in the literature about this issue in general?

I also think it is an unusual use case that is unlikely in practice. But I may be biased by my experiences so far.

I do not think we can easily detect partial I/O or blocks with fill values inside the H5Z-ZFP filter functions themselves. We might be able to interrogate HDF5 for those details... I honestly don't know.

@vasole
Contributor

vasole commented Jan 24, 2024

do you happen to have any other refs in the literature about this issue in general?

Perhaps this page: https://en.wikipedia.org/wiki/Generation_loss

I do not know if it is what you are asking for, but the situation most users should be familiar with is the degradation of JPEG images when edited and saved again in JPEG format instead of using a lossless format after editing.

On this web page they call it simply JPEG degradation:

https://imagekit.io/blog/jpeg-image-degradation/

I particularly like their analogy with the "Photocopier Effect": when making a photocopy, something is lost; if you make a photocopy of the photocopy, things degrade further, and so on.

@lindstro
Member

This issue of "generation loss" is one I've pondered for a long time and hypothesized about but never had much time to investigate further. We've conjectured that starting from some arbitrary input x that is compressed-then-decompressed as D(C(x)), another round-trip of compression + decompression should not change the result if the same compression parameters are used. And we've had some reasonable arguments for why that should be the case.

However, I just ran some experiments with real data and very low rates and precisions, e.g., a rate of 22 bits/block in 2D, translating to a rate of 22/16 = 1.375 bits/value. Note that each 2D zfp block requires 12 bits of header to represent the common exponent and whether the block is all-zero, so coefficients of such blocks are allocated only (22 - 12) / 16 = 0.625 bits/value.

In such settings, it seems that drift can occur to the point that repeated application of D(C(x)) not only fails to converge but actually diverges and blows up. This seems to be the case only in fixed-rate or fixed-precision modes, but I have not rigorously verified this.

The gist of it is that a real input value of 1 decompresses to a value somewhat different from 1 and that at extreme compression gets reconstructed and rounded to a value of 2 (recall that the rate is here less than one compressed bit/value, and precision may be even lower than one uncompressed bit/value). Fed back into compression, the only difference between 1 and 2 as input is the exponent, which zfp factors out, so 2 gets reconstructed as 4, 4 gets reconstructed as 8, and so on. Hence, each application of D(C(x)) doubles the input value, until we eventually blow up.

Now, such extremely low precision, where values are not even accurate to a single bit, is of course not practically useful. And if you bump up the precision or rate slightly, e.g., from 22 to 23 bits per block of 16 values, the repeated application of D(C(x)) converges quickly after 1 or 2 iterations. Again, I have only anecdotal data to support this, and it would be useful to analyze this issue more rigorously.

Let me also add that errors occur both in the "conversion" (compression) of IEEE floating-point values to zfp and in conversion in the opposite direction, from zfp to IEEE floating-point values (decompression). This is because the two number systems fundamentally represent different subsets of the reals (or tensors of reals) that have a large intersection but with neither being a subset of the other. In the case I'm describing above, it seems clear that the issue is with lack of zfp precision, i.e., errors are incurred on compression but not decompression. But I thought I'd point out that loss may also be due to "limitations" of IEEE as zfp uses 62 mantissa bits while IEEE double precision uses only 53 mantissa bits.

@markcmiller86
Member Author

markcmiller86 commented Feb 1, 2024

Turns out there is an HDF5 method to control whether the library compresses partial chunks (H5Pset_chunk_opts()). It is up to callers of HDF5 to set this, so we just need to document/highlight the issue in the H5Z-ZFP docs. By default, HDF5 does compress partial chunks, and callers need to take action to prevent that if it is important.
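A minimal sketch of that caller-side knob: ask HDF5 not to run the filter pipeline on partial (edge) chunks via the documented H5Pset_chunk_opts() call. The chunk shape and function name here are illustrative, not H5Z-ZFP code.

```c
#include "hdf5.h"

/* Build a dataset creation property list that chunks the dataset and skips
 * the filter pipeline for partially covered chunks. */
static hid_t make_zfp_dcpl(void)
{
  hsize_t chunk[2] = {256, 256};                 /* illustrative chunk shape */
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 2, chunk);
  H5Pset_chunk_opts(dcpl, H5D_CHUNK_DONT_FILTER_PARTIAL_CHUNKS);
  /* ...then add the H5Z-ZFP filter to dcpl as usual... */
  return dcpl;
}
```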

@lindstro
Member

lindstro commented Feb 1, 2024

@markcmiller86 What are your thoughts on this proposal of mine:

I would propose that we add a 2-bit code, n, to cd_values that indicates that the word size is 2^n. Is there perhaps an existing pair of bits in the H5Z-ZFP header that is already zero, such that we wouldn't have to change the header format but would still decompress existing and new files correctly? It seems, for example, that cd_values[1] is currently unused. If it's always zero, then we could use two of its bits toward this purpose.

Just to clarify, I meant a 2-bit code n to represent a word size of 2^n bytes, so valid values of n are {0, 1, 2, 3} to represent word sizes of {8, 16, 32, 64} bits.
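A hedged sketch of that encoding: stash n (word size = 2^n bytes) in two otherwise-unused bits of cd_values[1]. The choice of cd_values[1] and of bit positions is an assumption for illustration, not the current H5Z-ZFP header format.

```c
#include <stddef.h>

#define WORD_CODE_SHIFT 0u                      /* assumed bit position */
#define WORD_CODE_MASK  (0x3u << WORD_CODE_SHIFT)

/* writer: record the word size (1, 2, 4, or 8 bytes) */
static void encode_word_size(unsigned int cd_values[], size_t word_bytes)
{
  unsigned int n = 0;
  while (((size_t)1 << n) < word_bytes)
    n++;                                        /* n = log2(word_bytes) */
  cd_values[1] = (cd_values[1] & ~WORD_CODE_MASK) | (n << WORD_CODE_SHIFT);
}

/* reader: recover the word size; existing files carry n = 0, i.e., 1 byte */
static size_t decode_word_size(const unsigned int cd_values[])
{
  unsigned int n = (cd_values[1] & WORD_CODE_MASK) >> WORD_CODE_SHIFT;
  return (size_t)1 << n;
}
```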

When dealing with files already generated with n = 0 (8-bit word size), you would not be able to decompress with larger word sizes unless the stream size is a multiple of the word size, even though in practice any stream is overwhelmingly likely to be malloc'ed in multiples of at least 64 bits. But from now on, we'd want the filter to always output streams that are multiples of 64 bits, which, aside from slightly longer streams, would be a backwards-compatible change. You can use stream_pad() to achieve such alignment.

I'm sure that on decompression in H5Z-ZFP we are already given the stream length in bytes, so we can test whether that's a multiple of the word size and bail otherwise. We'd also have to manually do some byte swapping on big-endian machines.

I think we want to prioritize this capability, as zfp's CUDA and HIP support currently requires 64-bit words, so without this fix you can't install zfp with both GPU and HDF5 support. And as mentioned, zfp tests pass only for 64-bit words.

@markcmiller86
Member Author

We talked about this more and decided to focus primarily on little-endian machines/workflows first. To address this, we decided the right things to do are...

  • add logic in the reader/decompressor block to ensure the buffer delivered to zfp_decompress() is (a) aligned to the ZFP word size and (b) a multiple of the ZFP word size. In some cases, we may wind up having to do an extra memory copy to ensure this (see the sketch after this list).
  • add logic to the writer/compressor to always ensure the buffer is a multiple of the maximum ZFP word size (8 bytes). This will have the effect that all future data written with the filter will avoid the above-mentioned memory copy.
  • add logic to detect when running on a big-endian machine using a ZFP word size other than 8 bits and issue a useful error message (also sketched below).
  • Re-check the endian-swapping logic associated with ZFP header decoding... we looked at it on main and it didn't look right. The first 4 bytes are byte-swapped but the remaining header bytes are not.
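A hedged sketch of the first and third bullets: verify that the compressed chunk handed to the decompressor is a whole number of zfp stream words, and refuse cross-word-size reads on big-endian hosts. stream_word_bits is zfp's compile-time word size in bits; H5Z_ZFP_IS_BIG_ENDIAN stands in for whatever endianness test the filter ends up using and is an assumption here.

```c
#include "zfp.h"   /* stream_word_bits from zfp's bitstream interface */

/* Returns 0 if the read-side buffer checks pass, -1 otherwise. */
static int check_decompress_buffer(size_t nbytes)
{
  size_t word_bytes = stream_word_bits / 8;

  /* buffer length must be a multiple of the zfp stream word size;
   * otherwise copy into a padded scratch buffer or fail */
  if (nbytes % word_bytes)
    return -1;

#ifdef H5Z_ZFP_IS_BIG_ENDIAN
  /* big-endian host: only 8-bit words are supported for now */
  if (stream_word_bits != 8)
    return -1;  /* issue a useful error message here */
#endif

  return 0;
}
```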

@lindstro
Member

lindstro commented Oct 3, 2024

@markcmiller86 Thanks for the summary. I think it's safe to assume that the buffer is word aligned if it was produced by malloc and not deliberately advanced to break alignment. It may even be difficult to portably and correctly determine that the pointer is word aligned.

Regarding padding the output stream to a multiple of 8 bytes, it may make sense to provide a compile-time macro to disable such new (but now default) behavior for applications that might be sensitive to such a change.

@markcmiller86
Member Author

@lindstro we were talking about big-endian systems, and that issue came up in a recent YouTube video I watched regarding the Linux kernel.
