[API Proposal]: Add support for Zstandard to System.IO.Compression #59591

@carlossanlop

Description

Zstandard (or Zstd) is a fast compression algorithm that was published by Facebook in 2015 and had its first stable release (v1.0) in August 2016.

Their official repo offers a C implementation. https://github.com/facebook/zstd

Data compression mechanism specification: https://datatracker.ietf.org/doc/html/rfc8478

Features:

  • It is faster than Deflate, especially in decompression, while offering a similar compression ratio.
  • Its maximum compression level is similar to that of lzma and performs better than lza and bzip2.
  • It reached the Pareto Frontier, as it decompresses faster than any other currently-available algorithm with similar or worse compression ratio.
  • It supports multi-threading.
  • It can be saved to a *.zst file.
  • It has a dual BSD+GPLv2 license. We would be using the BSD license.

It's used by:

  • The Linux Kernel as a compression option for btrfs and SquashFS since 2017.
  • FreeBSD for coredumps.
  • AWS RedShift for databases.
  • Canonical, Fedora and ArchLinux for their package managers.
  • Nintendo Switch to compress its files.

We could offer a stream-based class, like we do for Deflate with DeflateStream or GZipStream, but we should also consider offering a stream-less static class, since it's a common request.

API proposal (by @rzikm)

  • The API follows the precedents set by BrotliStream, BrotliEncoder, and BrotliDecoder. New additions are ZstandardDictionary and some additional members on the encoder/decoder/options. So far we don't expose custom dictionary support for Brotli, but there is demand for it: [API Proposal]: Custom dictionary support for Brotli #118784
    • new additions which don't have a precedent are marked with // NEW
  • New API available in .NET 11, ships inbox, no OOB shipment for previous releases
  • New assembly System.IO.Compression.Zstandard
  • Native implementation from facebook/zstd linked to System.IO.Compression.Native (like we do with brotli)

Some notes about Zstandard dictionaries:

  • underlying zstd implementation accepts dictionaries in multiple ways:
    • prepared dictionaries (ZSTD_(C|D)Dict*) - good for reuse, included in this API proposal as ZstandardDictionary; these do not benefit from "EnableLongDistanceMatching"
    • pointers to void* + length - may be more efficient if dictionary is to be used only once, not covered by this proposal
    • as a "prefix" of the data (ZSTD_(C|D)Ctx_refPrefix, accept void* + length) - together with EnableLongDistanceMatching and large enough Window enables use of Zstandard as a diff engine to produce binary patches. - included in this proposal
namespace System.IO.Compression
{
    // NEW
    // represents a prepared dictionary to improve compression, mostly
    // useful when the same dictionary can be reused to compress many small files (<100 kB).
    //
    // Internally wraps a pair of safe handles
    //     ZSTD_CDict* - dictionary for compression
    //     ZSTD_DDict* - dictionary for decompression
    // and initializes both of them from the given data
    // These dictionaries are immutable and thus thread safe and can be reused across encoders/decoders in
    // concurrent processing.
    public sealed partial class ZstandardDictionary : System.IDisposable
    {
        internal ZstandardDictionary() { }
        public void Dispose() { }

        // Creates a new ZstandardDictionary instance. The provided buffer is copied internally.
        // `quality` dictates the quality of the compression and overrides the quality setting on the encoder.
        // The quality parameter has no effect during decompression.
        public static System.IO.Compression.ZstandardDictionary Create(System.ReadOnlySpan<byte> buffer, int quality) { throw null; }

        // like above, but uses default quality
        public static System.IO.Compression.ZstandardDictionary Create(System.ReadOnlySpan<byte> buffer) { throw null; }

        // Alternatively, the Create methods can accept ReadOnlyMemory to which a reference is kept internally,
        // This can avoid the need for copying the buffer, but the dictionary needs to keep a MemoryHandle, pinning
        // the provided memory during the lifetime of the dictionary.

        // optional:
        // `type` is a new enum (not used elsewhere):
        //   - Raw - buffer is treated as raw data, implies some small
        //           processing when loading (processed as if it were a prefix
        //           of the compressed data), any data can be used as raw
        //           dictionary, will fail only on very small buffers.
        //   - Serialized - Assumes serialized version of a preprocessed dictionary 
        //                  (magic bytes, entropy tables, raw data), can fail if
        //                  structure is compromised
        //   - Detect - default behavior, checks for presence of magic bytes, then
        //              loads as either Raw or Serialized
        public static System.IO.Compression.ZstandardDictionary Create(System.ReadOnlySpan<byte> buffer, int quality, ZstandardDictionaryType type) { throw null; }

        // Members below are for ability to create an optimized dictionary based on
        // training data.
        //
        // Note:
        //
        // Dict training API taking more detailed options would be more complicated as
        // there are multiple training algorithms to choose from and each has
        // different tuning parameters. Since it is not clear yet how big a demand
        // there is for the training APIs, we think it better to avoid adding large
        // dictionary training API surface for it now.

        // Create a small dictionary (up to `maxDictionarySize`) that would help efficiently
        // encode given samples.
        // zstd suggests a max dictionary size of 100 kB, and training data
        // ~100x the size of the resulting dictionary.
        //     samples - all training samples concatenated in one large buffer
        //     lengths - lengths of the samples
        // Uses default training parameters in zstd
        public static System.IO.Compression.ZstandardDictionary TrainFromSamples(System.ReadOnlySpan<byte> samples, ReadOnlySpan<int> lengths, int maxDictionarySize) { throw null; }

        // alternative to above, more natural API, but does not match the underlying native API (requires temporary copying of data):
        public static System.IO.Compression.ZstandardDictionary TrainFromSamples(ReadOnlyMemory<ReadOnlyMemory<byte>> samples, int maxDictionarySize) { throw null; }

        // access to raw dictionary bytes (e.g. to be able to store them on disk)
        public ReadOnlyMemory<byte> Data { get { throw null; } }
    }

    // wrapper for multiple compression options, extension point if we decide
    // to expose more options in the future
    // Note 1: zstd distinguishes between sticky and non-sticky parameters. Sticky parameters are not unset by Reset()
    //         and are carried over to the next compressed frame if the encoder instance is reused.
    //         This class is supposed to contain only the sticky parameters, while non-sticky parameters should be set
    //         via individual instance members on the Encoder
    // Note 2: Some parameters are dynamically adjusted according to the other parameters (e.g. according to Quality),
    //         Being able to represent "don't explicitly set anything" is desirable in many cases
    public sealed partial class ZstandardCompressionOptions
    {
        // NEW
        public static int DefaultWindow { get { throw null; } } // 23 
        public static int MinWindow { get { throw null; } } // 10
        public static int MaxWindow { get { throw null; } } // 31

        // NEW
        public static int DefaultQuality { get { throw null; } } // defined by zstd implementation as 3
        public static int MaxQuality { get { throw null; } } // ZSTD_maxCLevel
        public static int MinQuality { get { throw null; } } // ZSTD_minCLevel

        // quality parameter, higher -> slower and better compression
        // name chosen for parity with other compression APIs
        // alternatively, we can call this Level to match zstd terminology
        // 0 = implementation default (3)
        public int Quality { get { throw null; } set { throw null; } }

        // optional custom dictionary. If set, the Quality parameter is ignored and the quality set during
        // dictionary creation takes precedence
        public System.IO.Compression.ZstandardDictionary? Dictionary { get { throw null; } set { throw null; } }

        // NEW (BrotliCompressionOptions does not expose this value yet, but there is ask for it)
        // size of the backreference window *in bits* (same name as for ZLib), actual size is (1 << Window)
        // 0 = "use default window"
        public int Window { get { throw null; } set { throw null; } }

        // below are some more advanced parameters, these are not necessary for MVP, all of which are NEW

        // hint for the size of the blocks that the encoder will output
        // smaller size => more frequent outputs => lower latency when streaming
        // valid range = [1340 .. 1<<17]
        // 0 = no hint, implementation defined behavior
        public int TargetBlockSize { get { throw null; } set { throw null; } }

        // appends a 32-bit checksum at the end of the compressed content. This checksum is verified during
        // decompression, which fails if the data are corrupted.
        // default false
        public bool AppendChecksum { get { throw null; } set { throw null; } }

        // Enable long distance matching. This parameter is designed to improve compression ratio
        // for large inputs, by finding large matches at long distance. It increases memory usage and
        // default window size.
        // Together with ZstandardEncoder.ReferencePrefix(), enables use of zstd for binary diffing of (potentially large) files
        public bool EnableLongDistanceMatching { get { throw null; } set { throw null; } }
    }

    // standalone decoder implementation, closely copies BrotliDecoder design
    public partial struct ZstandardDecoder : System.IDisposable
    {
        private object _dummy;
        private int _dummyPrimitive;

        // Decoder can be also default-initialized => no dictionary is used, default parameters are used
        // public ZstandardDecoder()

        // specify decompression dictionary
        public ZstandardDecoder(System.IO.Compression.ZstandardDictionary dictionary) { throw null; }

        // allows specifying the max window for decompression; decompression requiring a larger window (=> more memory)
        // will fail
        public ZstandardDecoder(int maxWindow) { throw null; }

        // combined ctor for the above
        // Question: There are currently no other stable parameters exposed by ZSTD, do we need ZstandardDecompressionOptions?
        public ZstandardDecoder(System.IO.Compression.ZstandardDictionary dictionary, int maxWindow) { throw null; }

        // Question: how do we access the specific error code in case of InvalidData? e.g. how does user know that data is valid,
        // but would require more memory to decompress and thus larger maxWindow?
        public System.Buffers.OperationStatus Decompress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesConsumed, out int bytesWritten) { throw null; }
        public void Dispose() { }

        // NEW
        // Resets the decoder state so that it can be reused for more frames
        public void Reset() { }

        // NEW
        // sets a dictionary in a prefix mode, exposes ZSTD_DCtx_refPrefix. Internally pins the memory until
        // disposed/GCd or Reset() (the prefix is "non-sticky" parameter which is cleared by Reset)
        public void ReferencePrefix(ReadOnlyMemory<byte> prefix) { }

        // NEW
        // This exposes ZSTD_decompressBound, which
        //   - reads the size from the header, if present, or
        //   - estimates the upper bound based on the information found in header
        // NOTE: malicious input may edit the header to report arbitrary values, but zstd validates this bound set in header during decompression
        // Question: should this return long?
        public static int GetMaxDecompressedLength(System.ReadOnlySpan<byte> data) { throw null; }
        // Alternative to above:
        public static bool TryGetDecompressedLength(System.ReadOnlySpan<byte> data, out int length) { throw null; }

        // one-off decompressing functions
        // note that these don't need maxWindow as the parameter is relevant only during streaming decompression
        public static bool TryDecompress(System.ReadOnlySpan<byte> source, System.IO.Compression.ZstandardDictionary dictionary, System.Span<byte> destination, out int bytesWritten) { throw null; }
        public static bool TryDecompress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten) { throw null; }
    }

    // symmetric API to ZstandardDecoder
    public partial struct ZstandardEncoder : System.IDisposable
    {
        private object _dummy;
        private int _dummyPrimitive;

        public ZstandardEncoder(int quality) { throw null; }
        public ZstandardEncoder(System.IO.Compression.ZstandardDictionary dictionary) { throw null; }
        public ZstandardEncoder(int quality, int window) { throw null; }
        public ZstandardEncoder(System.IO.Compression.ZstandardDictionary dictionary, int window) { throw null; }

        // NEW
        // does not store reference to options, only reads the data, most flexible options that can replace all the above
        public ZstandardEncoder(ZstandardCompressionOptions options) { throw null; }

        public System.Buffers.OperationStatus Compress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesConsumed, out int bytesWritten, bool isFinalBlock) { throw null; }
        public System.Buffers.OperationStatus Flush(System.Span<byte> destination, out int bytesWritten) { throw null; }

        public void Dispose() { }

        // NEW
        // Resets the encoder state so that it can be reused for more frames
        public void Reset() { }

        // NEW (symmetry with ZstandardDecoder)
        // sets a dictionary in a prefix mode, exposes ZSTD_CCtx_refPrefix. Internally pins the memory until
        // disposed/GCd or Reset() (the prefix is "non-sticky" parameter which is cleared by Reset)
        public void ReferencePrefix(ReadOnlyMemory<byte> prefix) { }

        // NEW
        // ZSTD_CCtx_setPledgedSrcSize, sets the size of the data to be compressed (so that it can be written into the header)
        // May be called only before the first Compress call, or after Reset(). Calling Reset() clears the size.
        // The size is validated during compression; not respecting the value causes OperationStatus.InvalidData.
        // QUESTION: Should this accept long?
        public void SetSourceSize(int size) { }

        public static int GetMaxCompressedLength(int inputSize) { throw null; }

        // one-off compression functions
        // note that `quality` and `dictionary` are mutually exclusive
        public static bool TryCompress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten) { throw null; }
        public static bool TryCompress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten, int quality, int window) { throw null; }

        // NEW (dictionary support)
        public static bool TryCompress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten, System.IO.Compression.ZstandardDictionary dictionary, int window) { throw null; }

        // NEW
        // this one is the most flexible, but can be omitted as it is just a wrapper for creating Encoder and single call to Compress
        public static bool TryCompress(System.ReadOnlySpan<byte> source, System.Span<byte> destination, out int bytesWritten, System.IO.Compression.ZstandardCompressionOptions options) { throw null; }
    }

    // Wrapper around ZstandardEncoder/Decoder to provide Stream API
    public sealed partial class ZstandardStream : System.IO.Stream
    {
        // similar ctor members as BrotliStream
        // QUESTION: why don't we make leaveOpen always default to false?
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode) { }
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, bool leaveOpen) { }
        // this ctor is needed to perform decompression with dictionary (but can be achieved by a ctor taking ZstandardDecoder listed below)
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.CompressionMode mode, ZstandardDictionary dictionary, bool leaveOpen = false) { }

        // these imply CompressionMode.Compress
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.CompressionLevel compressionLevel) { }
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.CompressionLevel compressionLevel, bool leaveOpen) { }
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.ZstandardCompressionOptions compressionOptions, bool leaveOpen = false) { }

        // NEW
        // These constructors allow reuse of ZstandardEncoder/Decoder instances
        // Disposing of the stream `Reset()`s the encoder/decoder
        // Note that this works even though Encoder/Decoder are structs.
        // QUESTION: should we add `bool ownsEncoder = false` parameter to allow passing ownership?
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.ZstandardDecoder decoder, bool leaveOpen = false) { }
        public ZstandardStream(System.IO.Stream stream, System.IO.Compression.ZstandardEncoder encoder, bool leaveOpen = false) { }

        // below are usual compression stream members
        public System.IO.Stream BaseStream { get { throw null; } }
        public override bool CanRead { get { throw null; } }
        public override bool CanSeek { get { throw null; } }
        public override bool CanWrite { get { throw null; } }
        public override long Length { get { throw null; } }
        public override long Position { get { throw null; } set { } }
        public override System.IAsyncResult BeginRead(byte[] buffer, int offset, int count, System.AsyncCallback? callback, object? state) { throw null; }
        public override System.IAsyncResult BeginWrite(byte[] buffer, int offset, int count, System.AsyncCallback? callback, object? state) { throw null; }
        protected override void Dispose(bool disposing) { }
        public override System.Threading.Tasks.ValueTask DisposeAsync() { throw null; }
        public override int EndRead(System.IAsyncResult asyncResult) { throw null; }
        public override void EndWrite(System.IAsyncResult asyncResult) { }
        public override void Flush() { }
        public override System.Threading.Tasks.Task FlushAsync(System.Threading.CancellationToken cancellationToken) { throw null; }
        public override int Read(byte[] buffer, int offset, int count) { throw null; }
        public override int Read(System.Span<byte> buffer) { throw null; }
        public override System.Threading.Tasks.Task<int> ReadAsync(byte[] buffer, int offset, int count, System.Threading.CancellationToken cancellationToken) { throw null; }
        public override System.Threading.Tasks.ValueTask<int> ReadAsync(System.Memory<byte> buffer, System.Threading.CancellationToken cancellationToken = default(System.Threading.CancellationToken)) { throw null; }
        public override int ReadByte() { throw null; }
        public override long Seek(long offset, System.IO.SeekOrigin origin) { throw null; }
        public override void SetLength(long value) { }
        public override void Write(byte[] buffer, int offset, int count) { }
        public override void Write(System.ReadOnlySpan<byte> buffer) { }
        public override System.Threading.Tasks.Task WriteAsync(byte[] buffer, int offset, int count, System.Threading.CancellationToken cancellationToken) { throw null; }
        public override System.Threading.Tasks.ValueTask WriteAsync(System.ReadOnlyMemory<byte> buffer, System.Threading.CancellationToken cancellationToken = default(System.Threading.CancellationToken)) { throw null; }
        public override void WriteByte(byte value) { }
    }
}
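
The span-based TrainFromSamples overload above expects all samples concatenated into one buffer plus a parallel array of lengths. As a minimal sketch of how a caller might build that layout from separate buffers, the helper below uses only existing .NET APIs; the names `TrainingSamples` and `Flatten` are hypothetical and not part of the proposal.

```csharp
using System;
using System.Collections.Generic;

static class TrainingSamples
{
    // Hypothetical helper: concatenates the samples into one flat buffer and
    // records each sample's length, in order, in a parallel array.
    public static (byte[] Samples, int[] Lengths) Flatten(IReadOnlyList<byte[]> samples)
    {
        int[] lengths = new int[samples.Count];
        int total = 0;

        for (int i = 0; i < samples.Count; i++)
        {
            lengths[i] = samples[i].Length;
            // checked: fail loudly instead of overflowing past int.MaxValue
            total = checked(total + samples[i].Length);
        }

        byte[] flat = new byte[total];
        int offset = 0;

        foreach (byte[] sample in samples)
        {
            sample.CopyTo(flat, offset);
            offset += sample.Length;
        }

        return (flat, lengths);
    }
}
```

The resulting tuple maps directly onto the `samples` and `lengths` arguments. This copy is the "temporary copying of data" that the ReadOnlyMemory-based alternative would otherwise have to perform internally.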

API Usage

Request decompression in ASP.NET Core

using System.IO.Compression;
using Microsoft.AspNetCore.RequestDecompression;

namespace Example;

// See: https://learn.microsoft.com/en-us/aspnet/core/fundamentals/middleware/request-decompression

sealed class ZstandardDecompressionProvider : IDecompressionProvider
{
    public Stream GetDecompressionStream(Stream stream) => new ZstandardStream(stream, CompressionMode.Decompress, leaveOpen: true);
}

// Register in Program.cs
// ...
builder.Services.AddRequestDecompression(x =>
{
    x.DecompressionProviders.Add("zstd", new ZstandardDecompressionProvider());
});
// ...

Response compression in ASP.NET Core

using System.IO.Compression;
using Microsoft.AspNetCore.ResponseCompression;
using Microsoft.Extensions.Options;

namespace Example;

// See: https://learn.microsoft.com/en-us/aspnet/core/performance/response-compression

sealed class ZstandardCompressionProviderOptions : IOptions<ZstandardCompressionProviderOptions>
{
    public CompressionLevel Level { get; set; } = CompressionLevel.Fastest;

    ZstandardCompressionProviderOptions IOptions<ZstandardCompressionProviderOptions>.Value => this;
}

sealed class ZstandardCompressionProvider : ICompressionProvider
{
    public ZstandardCompressionProvider(IOptions<ZstandardCompressionProviderOptions> options)
    {
        Options = options.Value;
    }

    private ZstandardCompressionProviderOptions Options { get; }

    public Stream CreateStream(Stream outputStream) => new ZstandardStream(outputStream, Options.Level, leaveOpen: true);

    public string EncodingName { get; } = "zstd";
    public bool SupportsFlush { get; } = true;
}

// Register in Program.cs:
// ...
builder.Services.AddOptions<ZstandardCompressionProviderOptions>()
    .Configure(zstd =>
    {
        zstd.Level = CompressionLevel.Optimal;
    });

builder.Services.AddResponseCompression(x =>
{
    x.EnableForHttps = true;
    x.Providers.Add<ZstandardCompressionProvider>();
});

One-shot APIs

byte[] source = new byte[256000];
Random.Shared.NextBytes(source);

int maxLength = ZstandardEncoder.GetMaxCompressedLength(source.Length);
var resultBuffer = new byte[maxLength];

Assert.True(ZstandardEncoder.TryCompress(source, resultBuffer, out int bytesWritten));
Assert.True(maxLength >= bytesWritten);

int decompressedLength = ZstandardDecoder.GetMaxDecompressedLength(resultBuffer.AsSpan(0, bytesWritten));
var decompressedBuffer = new byte[decompressedLength];

Assert.True(ZstandardDecoder.TryDecompress(resultBuffer.AsSpan(0, bytesWritten), decompressedBuffer.AsSpan(), out var bytesDecompressed));
Assert.True(decompressedBuffer.AsSpan(0, bytesDecompressed).SequenceEqual(source.AsSpan()));

Compression using dictionaries

            byte[] originalData = "Hello, World! This is a test string for Zstandard compression and decompression."u8.ToArray();
            byte[] compressedBuffer = new byte[ZstandardEncoder.GetMaxCompressedLength(originalData.Length)];
            byte[] decompressedBuffer = new byte[originalData.Length * 2];

            using ZstandardDictionary dictionary = ZstandardDictionary.Create(CreateSampleDictionary(), quality);

            int bytesWritten;
            int bytesConsumed;

            {
                using var encoder = new ZstandardEncoder(dictionary, ZstandardCompressionOptions.DefaultWindow);
                OperationStatus compressResult = encoder.Compress(originalData, compressedBuffer, out bytesConsumed, out bytesWritten, true);
                Assert.Equal(OperationStatus.Done, compressResult);
            }

            Assert.Equal(originalData.Length, bytesConsumed);
            Assert.True(bytesWritten > 0);
            int compressedLength = bytesWritten;

            {
                using var decoder = new ZstandardDecoder(dictionary);
                OperationStatus decompressResult = decoder.Decompress(compressedBuffer.AsSpan(0, compressedLength), decompressedBuffer, out bytesConsumed, out bytesWritten);

                Assert.Equal(OperationStatus.Done, decompressResult);
            }

            Assert.Equal(compressedLength, bytesConsumed);
            Assert.Equal(originalData.Length, bytesWritten);
            Assert.Equal(originalData, decompressedBuffer.AsSpan(0, bytesWritten));

Opening compressed TAR archives

using FileStream compressedStream = File.OpenRead("/home/dotnet/SourceDirectory/compressed.tar.zst");
using ZstandardStream decompressor = new(compressedStream, CompressionMode.Decompress);
TarFile.ExtractToDirectory(source: decompressor, destinationDirectoryName: "/home/dotnet/DestinationDirectory/", overwriteFiles: false);

Using Encoder/Decoder with Pipes

static async Task CompressPipelineDataAsync(PipeReader reader, PipeWriter writer)
{
    // this code is a bit naive, but illustrates the usage
    using var encoder = new ZstandardEncoder(quality: 6, window: 22);

    while (true)
    {
        var result = await reader.ReadAsync();
        var buffer = result.Buffer;

        if (buffer.IsEmpty && result.IsCompleted)
        {
            var finalMemory = writer.GetMemory(1024);
            encoder.Compress(ReadOnlySpan<byte>.Empty, finalMemory.Span, out _, out int finalBytes, isFinalBlock: true);
            writer.Advance(finalBytes);
            break;
        }

        foreach (var segment in buffer)
        {
            var outputMemory = writer.GetMemory(segment.Length * 2);
            encoder.Compress(segment.Span, outputMemory.Span, out int consumed, out int written, isFinalBlock: false);
            writer.Advance(written);
        }

        reader.AdvanceTo(buffer.End);
        await writer.FlushAsync();
    }

    await reader.CompleteAsync();
    await writer.CompleteAsync();
}
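
The pipeline example above ignores the OperationStatus returned by Compress, so output is silently lost if the destination is too small. Since ZstandardEncoder is only proposed, the sketch below uses the existing BrotliEncoder as a stand-in (the proposal states that it follows the Brotli precedent) to show one way of looping on OperationStatus.DestinationTooSmall. `EncoderLoop` and `CompressAll` are hypothetical names, and whether ZstandardEncoder would report statuses in exactly the same way is an assumption.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Compression;

static class EncoderLoop
{
    // Compresses `source` through a deliberately small scratch buffer,
    // calling Compress again for as long as the encoder has more output.
    public static byte[] CompressAll(ReadOnlySpan<byte> source, int chunkSize = 4096)
    {
        using var output = new MemoryStream();
        byte[] scratch = new byte[chunkSize];

        // BrotliEncoder stands in for the proposed ZstandardEncoder here.
        var encoder = new BrotliEncoder(quality: 5, window: 22);
        try
        {
            OperationStatus status;
            do
            {
                status = encoder.Compress(source, scratch, out int consumed, out int written, isFinalBlock: true);
                if (status == OperationStatus.InvalidData)
                {
                    throw new InvalidDataException("The encoder reported invalid data.");
                }

                // Keep whatever was produced and skip the input already consumed.
                output.Write(scratch, 0, written);
                source = source.Slice(consumed);
            }
            while (status != OperationStatus.Done);

            return output.ToArray();
        }
        finally
        {
            encoder.Dispose();
        }
    }
}
```

The same shape would apply per segment in the pipeline example: request memory from the writer, call Compress, advance by the bytes written, and repeat while the status is DestinationTooSmall.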

Using ReferencePrefix to produce binary patches

    // this is the "base" file, the starting point for the patch
    byte[] fromFileBytes = File.ReadAllBytes(fromFile.FullName);

    ZstandardCompressionOptions options = new ZstandardCompressionOptions()
    {
        Window = (int)Math.Log2(fromFileBytes.Length) + 1, // Allow using entire prefix file as the window
        EnableLongDistanceMatching = true, // needed for efficient diffs of large files
    };

    using ZstandardEncoder encoder = new ZstandardEncoder(options);
    encoder.ReferencePrefix(fromFileBytes);

    using Stream inputStream = inputFile.OpenRead(); // target file
    using Stream outputStream = outputFile.Create(); // patch file

    // pass configured encoder to ZstandardStream
    using ZstandardStream zstandardStream = new ZstandardStream(outputStream, encoder);
    await inputStream.CopyToAsync(zstandardStream);

Applying the patch is similar

    byte[] fromFileBytes = File.ReadAllBytes(fromFile.FullName);

    int window = (int)Math.Log2(fromFile.Length) + 1;

    using ZstandardDecoder decoder = new ZstandardDecoder(window);
    decoder.ReferencePrefix(fromFileBytes!);

    using Stream inputStream = inputFile.OpenRead(); // patch file
    using Stream outputStream = outputFile.Create(); // target file

    using ZstandardStream zstandardStream = new ZstandardStream(inputStream, decoder);
    await zstandardStream.CopyToAsync(outputStream, cancellationToken);

Below are example file sizes when producing a patch between tarballs of the Linux kernel source code. The .patch.zst file is a binary patch that can produce linux-6.17-rc7.tar from linux-6.16.tar:

-rwxr-xr-x   1 rzikm rzikm 1592842240 Sep 26 14:27 linux-6.16.tar
-rwxr-xr-x   1 rzikm rzikm 1600378880 Sep 26 14:27 linux-6.17-rc7.tar
-rw-r--r--   1 rzikm rzikm  214944413 Sep 29 13:58 linux-6.17-rc7.tar.zst
-rwxr-xr-x   1 rzikm rzikm    7387096 Sep 26 14:27 linux-6.17-rc7.tar.patch.zst
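
Both patch snippets above size the window with `(int)Math.Log2(length) + 1`, which goes through floating point and yields one bit more than strictly needed when the length is an exact power of two. A minimal integer-only sketch is shown below; `WindowLog` and `ForLength` are hypothetical names, the default bounds 10 and 31 are taken from the MinWindow/MaxWindow comments in the proposal, and whether zstd wants extra headroom beyond the prefix length is not something this sketch settles.

```csharp
using System;
using System.Numerics;

static class WindowLog
{
    // Returns the smallest w such that (1L << w) >= length,
    // clamped to the [min, max] range of supported window sizes.
    public static int ForLength(long length, int min = 10, int max = 31)
    {
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        // Ceiling of log2(length), computed with integer arithmetic only.
        int log = length <= 1 ? 0 : BitOperations.Log2((ulong)(length - 1)) + 1;
        return Math.Clamp(log, min, max);
    }
}
```

For the 1592842240-byte linux-6.16.tar listed above, `WindowLog.ForLength(1592842240)` evaluates to 31, the same value the `Math.Log2` expression produces and the maximum window noted in the proposal.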

Alternative Designs

Encoder/Decoder types as class instead of structs

Structs chosen for similarity with Brotli.

Separate Compression and Decompression Dictionary

Rename ZstandardDictionary to ZstandardCompressionDictionary

Add ZstandardDecompressionDictionary

    public sealed partial class ZstandardDecompressionDictionary : System.IDisposable
    {
        internal ZstandardDecompressionDictionary() { }
        public void Dispose() { }

        public static System.IO.Compression.ZstandardDecompressionDictionary Create(System.ReadOnlySpan<byte> buffer) { throw null; }

        // optional:
        // `type` is a new enum (not used elsewhere):
        //   - Raw - buffer is treated as raw data, implies some small
        //           processing when loading (processed as if it were a prefix
        //           of the compressed data), any data can be used as raw
        //           dictionary, will fail only on very small buffers.
        //   - Serialized - Assumes serialized version of a preprocessed dictionary 
        //                  (magic bytes, entropy tables, raw data), can fail if
        //                  structure is compromised
        //   - Detect - default behavior, checks for presence of magic bytes, then
        //              loads as either Raw or Serialized
        public static System.IO.Compression.ZstandardDecompressionDictionary Create(System.ReadOnlySpan<byte> buffer, ZstandardDictionaryType type) { throw null; }

        // Optional: access to raw dictionary bytes, for symmetry with compression dictionary
        public ReadOnlyMemory<byte> Data { get { throw null; } }
    }

Duplicate respective constructors on ZstandardStream, adjust ctors on encoder/decoder/options as needed.

Don't copy data passed to ZstandardDictionary

The Create methods in the proposal accept ReadOnlySpan<byte> for flexibility, but that means that an internal copy of the data needs to be created. This could be avoided by accepting ReadOnlyMemory<byte> instead (the memory would be pinned for the lifetime of the dictionary by ReadOnlyMemory.Pin()). This can save some allocations (recommended dictionary size is <100kB) but risks accidentally overwriting the dictionary data after it has been loaded by zstd.

    public sealed partial class ZstandardDictionary : System.IDisposable
    {
        public static System.IO.Compression.ZstandardDictionary Create(System.ReadOnlyMemory<byte> buffer, int quality) { throw null; }

        // like above, but uses default quality
        public static System.IO.Compression.ZstandardDictionary Create(System.ReadOnlyMemory<byte> buffer) { throw null; }
    }
