Rust-based WebAssembly bindings to read and write Apache Parquet data

License

Dual-licensed under Apache-2.0 (LICENSE_APACHE) and MIT (LICENSE_MIT).

kylebarron/parquet-wasm

WASM Parquet

WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow.

This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical JS Arrow library.

With all compression codecs included, the Brotli-compressed WASM bundle is 907 KB.

Install

parquet-wasm is published to npm. Install with:

yarn add parquet-wasm
# or
npm install parquet-wasm

API

Two APIs?

These bindings expose two APIs to users because there are two separate implementations of Parquet and Arrow in Rust.

  • parquet and arrow: These are the "official" Rust implementations of Arrow and Parquet. These projects started earlier and may be more feature complete.
  • parquet2 and arrow2: These are safer (in terms of memory access) and claim to be faster, though I haven't written my own benchmarks yet.

Since these parallel projects exist, why not give the user the choice of which to use? In general, the reading API is identical in both; the write options, however, differ between the two projects.

Choice of bundles

Presumably no one wants to use both parquet and parquet2 at once, so the default bundles split parquet and parquet2 into separate entry points to keep bundle size as small as possible. The following table describes the six available bundles:

| Entry point | Rust crates used | Description |
| --- | --- | --- |
| parquet-wasm | parquet and arrow | "Bundler" build, to be used in bundlers such as Webpack |
| parquet-wasm/node/arrow1 | parquet and arrow | Node build, to be used with require in NodeJS |
| parquet-wasm/esm/arrow1 | parquet and arrow | ESM, to be used directly from the Web as an ES Module |
| parquet-wasm/bundler/arrow2 | parquet2 and arrow2 | "Bundler" build, to be used in bundlers such as Webpack |
| parquet-wasm/node/arrow2 | parquet2 and arrow2 | Node build, to be used with require in NodeJS |
| parquet-wasm/esm/arrow2 | parquet2 and arrow2 | ESM, to be used directly from the Web as an ES Module |
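
As a rough illustration of how the six entry points are organized, the table can be written as a small lookup keyed by implementation and environment. The entry point strings are copied from the table; the `ENTRY_POINTS` object and the `entryPointFor` helper are hypothetical and not part of parquet-wasm.

```javascript
// Entry point strings copied from the table above.
// The lookup object and helper are illustrative only.
const ENTRY_POINTS = {
  arrow1: {
    bundler: "parquet-wasm",
    node: "parquet-wasm/node/arrow1",
    esm: "parquet-wasm/esm/arrow1",
  },
  arrow2: {
    bundler: "parquet-wasm/bundler/arrow2",
    node: "parquet-wasm/node/arrow2",
    esm: "parquet-wasm/esm/arrow2",
  },
};

// Returns the entry point for an implementation ("arrow1" | "arrow2")
// and an environment ("bundler" | "node" | "esm").
function entryPointFor(implementation, environment) {
  const byEnvironment = ENTRY_POINTS[implementation];
  if (typeof byEnvironment !== "object") {
    throw new Error(`Unknown implementation: ${implementation}`);
  }
  const entryPoint = byEnvironment[environment];
  if (typeof entryPoint !== "string") {
    throw new Error(`Unknown environment: ${environment}`);
  }
  return entryPoint;
}
```

In Node this could be combined with require, for example `require(entryPointFor("arrow2", "node"))`. Bundlers generally need a literal import path, so there the string would be written out directly.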

Note that when using the esm bundles, the default export must be awaited. See here for an example.
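
A minimal sketch of that initialization step (not the linked example): the helper below takes a loader function, awaits the module's default export, and only then hands the module back. `loadEsmBundle` and its `loadModule` parameter are hypothetical names, and the sketch assumes the default export is an async initializer that can be called with no arguments.

```javascript
// Hypothetical helper. `loadModule` returns a promise for an esm bundle's
// module namespace, e.g. () => import("parquet-wasm/esm/arrow1").
async function loadEsmBundle(loadModule) {
  const wasmModule = await loadModule();
  // Assumption: the default export is the async initializer. Nothing else
  // from the module should be called until it has resolved.
  await wasmModule.default();
  return wasmModule;
}

// Possible usage inside an ES module:
//   const { readParquet } = await loadEsmBundle(
//     () => import("parquet-wasm/esm/arrow1")
//   );
```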

parquet API

This implementation uses the arrow and parquet Rust crates.

Refer to the API documentation for more details and examples.

parquet2 API

This implementation uses the arrow2 and parquet2 Rust crates.

Refer to the API documentation for more details and examples.

Debug functions

These functions are not present in normal builds to cut down on bundle size. To create a custom build, see Custom Builds below.

setPanicHook

setPanicHook(): void

Sets console_error_panic_hook in Rust, which makes panics easier to debug by producing more informative console.error messages. Call this first if you're getting errors such as RuntimeError: Unreachable executed.

The WASM bundle must be compiled with the console_error_panic_hook feature for this function to exist.
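
Because setPanicHook only exists in builds compiled with that feature, code shared between normal and debug builds may want to guard the call. The following is a minimal sketch; `enablePanicHookIfAvailable` is a hypothetical helper, and `parquetModule` stands for whichever parquet-wasm module object you imported.

```javascript
// Hypothetical helper: installs the panic hook only if this build exposes it.
// Returns true when the hook was installed, false otherwise.
function enablePanicHookIfAvailable(parquetModule) {
  if (typeof parquetModule.setPanicHook !== "function") {
    // Normal builds omit setPanicHook to save bundle size.
    return false;
  }
  parquetModule.setPanicHook();
  return true;
}
```

Calling this once, before any read or write, keeps the same application code working against both kinds of build.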

Example

import { tableFromArrays, tableFromIPC, tableToIPC } from "apache-arrow";
import { readParquet, writeParquet } from "parquet-wasm";

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet
const parquetBuffer = writeParquet(tableToIPC(rainfall, "stream"));

// Read Parquet buffer back to Arrow Table
const table = tableFromIPC(readParquet(parquetBuffer));
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64<MILLISECOND> }>

Compression support

The Parquet specification permits several compression codecs. This library currently supports:

  • Uncompressed
  • Snappy
  • Gzip
  • Brotli
  • ZSTD: supported in arrow1; will be supported in arrow2 once the next version of the upstream parquet2 package is released.
  • LZ4: not yet supported, though work is in progress.
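
The list above can be summarized as a small support table, which may be handy for choosing a codec before writing. This is only a sketch of my reading of the list at the time of writing: it assumes the first four codecs work in both implementations, and `CODEC_SUPPORT` and `isCodecSupported` are hypothetical names, not part of parquet-wasm.

```javascript
// Support status transcribed from the list above; illustrative only.
const CODEC_SUPPORT = {
  uncompressed: { arrow1: true, arrow2: true },
  snappy: { arrow1: true, arrow2: true },
  gzip: { arrow1: true, arrow2: true },
  brotli: { arrow1: true, arrow2: true },
  // arrow2 support is pending the next upstream parquet2 release.
  zstd: { arrow1: true, arrow2: false },
  // No support yet in either implementation.
  lz4: { arrow1: false, arrow2: false },
};

// Returns true if `codec` (case-insensitive) is listed as supported for
// `implementation` ("arrow1" | "arrow2"); unknown names return false.
function isCodecSupported(codec, implementation) {
  const entry = CODEC_SUPPORT[String(codec).toLowerCase()];
  return Boolean(entry && entry[implementation] === true);
}
```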

Custom builds

In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of parquet-wasm to keep bundle size at a minimum. If you install the Rust toolchain and wasm-pack (see Development), you can create a custom build with only the compression codecs you require.

Example custom builds

Reader-only bundle with Snappy compression using the arrow and parquet crates:

wasm-pack build --no-default-features --features arrow1 --features parquet/snap --features reader

Writer-only bundle with no compression support using the arrow2 and parquet2 crates, targeting Node:

wasm-pack build --target nodejs --no-default-features --features arrow2 --features writer

Debug bundle with reader and writer support, targeting Node, using arrow and parquet crates with all their supported compressions, with console_error_panic_hook enabled:

wasm-pack build --dev --target nodejs \
  --no-default-features --features arrow1 \
  --features reader --features writer \
  --features parquet_supported_compressions \
  --features console_error_panic_hook
# Or, since the default features already include several of these, a shorter version:
wasm-pack build --dev --target nodejs --features console_error_panic_hook

Refer to the wasm-pack documentation for more info on flags such as --release, --dev, and --target, and to the Cargo documentation for more info on how to use features.
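
If you script several custom builds, the commands above follow a regular pattern that can be generated. The sketch below assembles the command string from options; `wasmPackCommand` and its option names are hypothetical, and it only emits the flags that appear in the examples above (--dev, --target, --no-default-features, --features).

```javascript
// Hypothetical helper that builds a wasm-pack command line as a string.
// It does not run anything; pass the result to your shell or build script.
function wasmPackCommand({
  target,                 // e.g. "nodejs"; omitted when undefined
  dev = false,            // adds --dev
  defaultFeatures = true, // false adds --no-default-features
  features = [],          // each entry becomes its own --features flag
} = {}) {
  const args = ["wasm-pack", "build"];
  if (dev) args.push("--dev");
  if (target) args.push("--target", target);
  if (!defaultFeatures) args.push("--no-default-features");
  for (const feature of features) {
    args.push("--features", feature);
  }
  return args.join(" ");
}
```

For instance, `wasmPackCommand({ target: "nodejs", defaultFeatures: false, features: ["arrow2", "writer"] })` reproduces the writer-only Node command shown earlier.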

Available features

  • arrow1: Use the arrow and parquet crates
  • arrow2: Use the arrow2 and parquet2 crates
  • reader: Activate read support.
  • writer: Activate write support.
  • parquet_supported_compressions: Activate all supported compressions for the parquet crate
  • parquet2_supported_compressions: Activate all supported compressions for the parquet2 crate
  • parquet compression features. Should only be activated when arrow1 is activated.
    • parquet/brotli: Activate Brotli compression in the parquet crate.
    • parquet/flate2: Activate Gzip compression in the parquet crate.
    • parquet/snap: Activate Snappy compression in the parquet crate.
    • parquet/lz4: Activate LZ4 compression in the parquet crate. WASM-compatible version not yet implemented in the parquet crate.
    • parquet/zstd: Activate ZSTD compression in the parquet crate.
  • parquet2 compression features. Should only be activated when arrow2 is activated.
    • parquet2/brotli: Activate Brotli compression in the parquet2 crate.
    • parquet2/gzip: Activate Gzip compression in the parquet2 crate.
    • parquet2/snappy: Activate Snappy compression in the parquet2 crate.
    • parquet2/lz4: Activate LZ4 compression in the parquet2 crate. WASM-compatible version not yet implemented, pending jorgecarleitao/parquet2#91
    • parquet2/zstd: Activate ZSTD compression in the parquet2 crate. ZSTD should work in parquet2's next release.
  • console_error_panic_hook: Expose the setPanicHook function for better error messages for Rust panics.
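
The pairing rules in this list (parquet/* features only with arrow1, parquet2/* features only with arrow2) are easy to get wrong in a long command line. A minimal sketch of a check follows; `checkFeaturePairing` is a hypothetical helper, and it only inspects the explicit feature list, so it assumes the build uses --no-default-features.

```javascript
// Hypothetical helper: returns a list of problems found in an explicit
// feature list. An empty array means the pairing rules are satisfied.
function checkFeaturePairing(features) {
  const problems = [];
  const usesParquet = features.filter((f) => f.startsWith("parquet/"));
  const usesParquet2 = features.filter((f) => f.startsWith("parquet2/"));

  if (usesParquet.length > 0 && !features.includes("arrow1")) {
    problems.push(`${usesParquet.join(", ")} requires the arrow1 feature`);
  }
  if (usesParquet2.length > 0 && !features.includes("arrow2")) {
    problems.push(`${usesParquet2.join(", ")} requires the arrow2 feature`);
  }
  return problems;
}
```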

Future work

  • More tests 😄

Acknowledgements

The starting point for my work was @my-liminal-space's read-parquet-browser (which is also dual licensed MIT and Apache 2).

@domoritz's arrow-wasm was a very helpful reference for bootstrapping Rust-WASM bindings.